Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 2 min | 221 words

On an otherwise uneventful morning, the life of the operations guy got… interesting.

What was supposed to be a routine morning got hectic because the database refused to operate normally. To be more exact, the database refused to load a file. RavenDB is generally polite when it runs into issues, but this time, it wasn’t playing around. Here is the error it served:

---> System.IO.IOException: Could not set the size of file  D:\RavenData\Databases\Purple\Raven.voron to 820 GBytes

---> System.ComponentModel.Win32Exception (665): The requested operation could not be completed due to a file system limitation

Good old ERROR_FILE_SYSTEM_LIMITATION. I never knew you, because we have never run into this error before.

The underlying reason was simple: we had a large file (820GB) that was too fragmented. At some point, the number of fragments in the file exceeded the file system’s limit.

The KB article about this issue is here. You might be able to move forward more quickly by using the contig.exe tool to defrag a single file.

The root cause here was probably backing up to the same drive as the database, which forced the file system to break the database file into fragments.

Just a reminder that there are always more layers to the system, and that we need to understand all of them when they break.

time to read 4 min | 705 words

I want to comment on the following tweet:

When I read it, I had an immediate and visceral reaction. Because this is one of those things that sound nice, but is actually a horrible dystopian trap. It confused two very important concepts and put them in the wrong order, resulting in utter chaos.

The two ideas are “writing tests” and “producing high quality code”. And they are usually expressed in something like this:

We write tests in order to produce high quality code.

Proper tests ensure that you can make forward progress without having regressions. They are a tool you use to ensure a certain level of quality as you move forward. If you assume that the target is the tests and that you’ll have high quality code because of that, however, you end up in weird places. For example, take a look at the following set of stairs. They aren’t going anywhere, and aside from being decorative, serve no purpose.

[Image: a decorative set of stairs that doesn’t lead anywhere]

When you start considering tests themselves to be the goal, instead of a means to achieve it, you end up with decorative tests. They add to your costs and make it harder to change things, but don’t benefit the project.

There are a lot of things that you are looking for in code review that shouldn’t be in the tests. For example, consider the error handling strategy for the code. We have an invariant that says that exceptions may not escape our code when running in a background thread, because that will kill the process. How would you express something like that in a test? Otherwise, you end up with an error raised from a file write when the disk is full, and it kills the server.
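
To make that invariant concrete, here is a minimal sketch of the kind of guard it implies. This is not RavenDB’s actual code; the BackgroundWork / StartBackground names are made up for illustration:

using System;
using System.Threading;

public static class BackgroundWork
{
    // Hypothetical helper: all background work is funneled through here, so an
    // exception can never escape the thread. In .NET, an unhandled exception on
    // a background thread terminates the entire process.
    public static Thread StartBackground(Action work, Action<Exception> report)
    {
        var thread = new Thread(() =>
        {
            try
            {
                work();
            }
            catch (Exception e)
            {
                // Swallow and report. A failed file write on a full disk should
                // end up in the log, not take the whole server down.
                report(e);
            }
        })
        {
            IsBackground = true
        };
        thread.Start();
        return thread;
    }
}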

Another example is a critical piece of code that needs to safely handle out of memory exceptions. You can test for that, sure, but it is hard and expensive. It also tends to freeze your design and implementation, because you are now testing implementation concerns, and that makes it very hard to change your code.

Reviewing code for performance pitfalls is another major consideration. How do you police allocations using a test? And how do you do that without killing your productivity? For fun, the following code allocates:

Console.WriteLine(1_000_000);

There are ways to monitor and track these kinds of things, for sure, but they are very badly suited for repeatable tests.
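
For context, here is a minimal sketch (my own illustration, not RavenDB’s test code) of what “policing allocations” in .NET tends to look like, and why such assertions are brittle:

using System;

public static class AllocationCheck
{
    // Returns the number of bytes allocated on the current thread while running
    // the given action. Any JIT tiering, inlining or library change can shift
    // this number, which is what makes asserting on it so fragile.
    public static long Measure(Action action)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}

// Example: AllocationCheck.Measure(() => Console.WriteLine(1_000_000)) returns a
// non-zero value, because formatting the int allocates a string.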

Then there are other things that you’ll cover in the review, more tangible items. For example, the quality of error messages you raise, or the logging output.

I’m not saying that you can’t write tests for those. I’m saying that you shouldn’t. That is something that you want to be able to change and modify quickly, because you may realize that you want to add more information in a certain scenario. Freezing the behavior using tests just means that you have more work to do when you need to make the changes. And reviewing just the test code is an even bigger problem when you need to consider interactions between features and their impact on one another. Not in terms of correctness, you absolutely need to test that, but in terms of behavior.

The interleaving of internal tasks inside of RavenDB was carefully designed to ensure that we’ll be biased in favor of user facing operations, starving background operations if needed. At the same time, it means that we need to ensure that we’ll give the background tasks time to run. I can’t really think about how you would write a test for something like that without basically setting in stone the manner in which we make that determination. That is something that I explicitly don’t want to do, because it will make changing how and what we do harder. But that is something that I absolutely consider in code reviews.

time to read 3 min | 404 words

Valgrind is an essential tool for anyone who is working with native code, especially if you are running C or C++ code. I have a codebase that is about 15,000 lines of C code, and Valgrind is absolutely essential for me to check my work. It has caught quite a few of my slips.

I recently switched systems, and when running the same code under Valgrind, I started to get annoying warnings like this:

==16896==
--16896-- WARNING: Serious error when reading debug info
--16896-- When reading debug info from /tmp/test.txt:
--16896-- can't read file to inspect ELF header
==16896==

The key issue is that this is, as you can imagine, a data file. Why is Valgrind attempting to read ELF details from it?

It took me a while to narrow things down, but I found that I could reproduce this easily with the following code:

If you run this code with the following command, you should see the warning:

clang a.c && valgrind ./a.out

Note that this is with clang 10.0.0-4ubuntu1 and valgrind-3.16.1. I decided to check what Valgrind is doing using strace, which gave the following output:

Digging a little deeper, let’s highlight the root cause of this:

openat(AT_FDCWD, "test.txt", O_RDWR|O_CREAT|O_TRUNC|O_DSYNC|O_DIRECT|O_CLOEXEC, 0600) = 3
mmap(0x4a4d000, 262144, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0) = 0x4a4d000
pread64(3, 0x1002ea98a0, 1024, 0) = -1 EINVAL (Invalid argument)

I’m opening the test.txt file using the O_DIRECT flag, which limits the kind of things that you can do with the file. In particular, it means that all access should use page aligned memory. The pread64() call is not using a page aligned buffer to read from the file.

What is interesting is that my code isn’t issuing any such call; this is coming from inside of Valgrind itself. In particular, I believe that this is the offending piece of code: di_notify_mmap is called whenever we map code, and is complex. The basic issue is that it does not respect the limits of files opened with O_DIRECT, and that causes the pread() call to fail. At this point, Valgrind outputs the warning.

A brief look at the code indicates that this should be fine. This is a data mapping, not an executable mapping, but it still makes the attempt. Debugging into Valgrind is beyond the scope of what I want to do. For now, I changed things so that any mmap() won’t use the file descriptor opened with O_DIRECT, and that resolved things for me.

time to read 2 min | 273 words

I’m very happy to announce that the TypeScript / Node.js client API for RavenDB was recently updated to 5.0. This release updates the API to support the Time Series API and bulk insert. Beyond the new API and functionality, we have also put a lot of effort into the ergonomics of this release.

One of the major changes that was made was to the way you use indexes in the API. Thanks are due to Yawar Jamal, for suggesting this improvement and sending the initial PR. What does this mean? Well, here is an index definition in the new version:

The actual index definition isn’t that interesting. You can see a longer explanation of exactly what I’m doing in this post. What is really interesting is that I can define this using code, with no messing about with strings. This is checked by the compiler and is going to give you a developer experience similar to using Linq in .NET.

I also mentioned ergonomics, right? Let’s look at some of the other features that you now get from the client. It’s funny, because this has nothing to do with code execution, but is very important to just Getting Things Done.

Take a look at this:

[Image: editor screenshot showing intellisense for the query string]

Even though we are passing a string to the query, we have intellisense to assist us and warn about typos.

That applies all over the API, so you don’t really have to make an effort; It Just Works.

[Image: another editor screenshot showing intellisense elsewhere in the API]

time to read 2 min | 224 words

RavenDB Subscriptions allow you to create a query and subscribe to documents that match the query. They are very useful in many scenarios, including backend processing, queues and more.

Subscriptions allow you to define a query on a document, and get all the documents that match this query. The key here is that “all documents” doesn’t refer to just the documents that exist now, but also to future documents that match the query. That is what the subscription part is all about.

The subscription query operates on a single document at a time, which leads to open questions when we have complex object graphs. Let’s assume that we want to handle via subscriptions all Orders that are managed by an employee residing in London. There isn’t a straightforward way of doing this. One option would be to add EmployeeCity to the Orders document, but that is a decidedly inelegant solution. Another option is to use the full capabilities of RavenDB. For Subscription queries, we actually allow you to ask questions about other documents, like so:
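
The original query isn’t reproduced here, but a subscription along these lines could look roughly like the following sketch, using the .NET client and assuming Northwind-style Orders and Employees documents (the query text and names are my own illustration, not the post’s original):

using Raven.Client.Documents;
using Raven.Client.Documents.Subscriptions;

using var store = new DocumentStore
{
    Urls = new[] { "http://localhost:8080" },   // assumed local server
    Database = "Northwind"
};
store.Initialize();

// Filter each Order by loading the related Employee document.
await store.Subscriptions.CreateAsync(new SubscriptionCreationOptions
{
    Name = "OrdersManagedFromLondon",
    Query = @"declare function managedFromLondon(o) {
                  var employee = load(o.Employee);
                  return employee.Address.City === 'London';
              }
              from Orders as o
              where managedFromLondon(o)"
});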

Now we’ll only get the Orders whose employee is in London. Simple and quite elegant.

It does have a caveat, though. We will only evaluate this condition whenever the order changes, not when the employee changes. So if the employee moves, old orders will not be matched against the subscription, but new ones will.

time to read 4 min | 742 words

This is part of the same issue as the previous post. I was tracking a performance regression between RavenDB 4.1 and RavenDB 4.2. There was a roughly 8% performance difference between the two (favoring the older version), which was reported to us. The scenario involved very large and complex documents (hundreds of objects in a document, > 16KB of JSON each).

The two code bases have had a lot of (small) changes between them, so it was hard to figure out exactly what the root cause of the regression was. Eventually I found something utterly bizarre. One of the things that we have to do when you save a document is check if the document has been modified. If so, we need to save it; otherwise, we can skip it. Here is the relevant piece of code in 4.1:

[Image: profiler trace of the document comparison in 4.1, about 0.5 ms]

So this costs 0.5 ms (for very large documents), which seems perfectly reasonable. But when looking at this on 4.2, we have:

[Image: profiler trace of the same comparison in 4.2]

This costs six times as much. What the hell?! To clarify, Blittable is the name of the document format that RavenDB uses. It is a binary JSON format that is highly efficient. You can think about this as comparing two JSON documents, because that is what it is doing.

I mentioned that there are differences between these versions, right? There have been quite a few (thousands of commits’ worth), but this particular bit of code hadn’t changed in years. I just couldn’t figure out what was going on. Then I looked deeper. Here are the costs of these calls. Here is the 4.1 version:

[Image: call costs for the 4.1 version]

And here is the 4.2 version:

[Image: call costs for the 4.2 version]

There are a few interesting things here. First, we can see that we are using Enumerable.Contains and that is where most of the time goes. But much more interestingly, in 4.1, we are calling this method a total of 30,000 times. In 4.2, we are calling it 150,000 times!!! Note that CompareBlittable is recursive, so even though we call it on 10,000 documents, we get more calls. But why the difference between these versions?

I compared the code for these two functions, and they were virtually identical. In 4.2, we mostly changed some error message texts, nothing major, but somehow the performance was very different. It took a while to figure out that there was another difference. In 4.1, we checked the changes in the documents in the order of the properties on the document, but in 4.2, we optimized things slightly and just used the index of the property. A feature of the blittable format is that properties are lexically sorted.

Here is the document in question. In our test, we are modifying Property6, as you can see here:

[Image: the test document, with Property6 highlighted]

There are a total of 40 properties in this document. And much nesting. In this case, in 4.2, we are scanning for changes in the document using the lexical sorting, which means:

[Image: the lexical scan order over the document’s properties]

The CompareBlittable() function will exit on the first change it detects, and in 4.1, it will get to the changed Property6 very quickly. In 4.2, it will need to scan most of the (big) document before it finds a change. That is a large part of the cost difference between these versions.
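
To make the difference concrete, here is a greatly simplified sketch of an early-exit comparison. The real CompareBlittable() works on the binary blittable format, not on dictionaries; the types and names here are stand-ins:

using System.Collections.Generic;

static class DocCompare
{
    // Walk the properties in a given order and bail out on the first difference.
    // If the changed property comes early in that order, very little work is done;
    // if it sorts late, most of the document is scanned before the change is found.
    public static bool HasChanges(
        IReadOnlyList<string> propertyOrder,
        IReadOnlyDictionary<string, object> oldDoc,
        IReadOnlyDictionary<string, object> newDoc)
    {
        foreach (var name in propertyOrder)
        {
            if (!Equals(oldDoc[name], newDoc[name]))
                return true; // early exit on the first change
        }
        return false;
    }
}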

Now that I know what the issue is, we have to consider which behavior is better for us. I decided to use the order of inserted properties, instead of the lexical order. The reasoning is simple. If a user cares about that, they can much more easily change the order of properties in the document than the names of the properties. In C#, you can just change the order in which the properties show up in the class definition.

I have to say, this was much harder to figure out than I expected, because the change happened in a completely different location and was very much non-obvious in how it worked.

time to read 2 min | 323 words

The title of this post is a reference to a quote by Leslie Lamport: “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable”.

A few days ago, my blog was down. The website was up, but it was throwing errors about being unable to connect to the database. That is surprising; the database in question is running on a triply redundant system and has survived quite a bit of abuse. It took some digging to figure out exactly what was going on, but the root cause was simple. Some server that I never even knew existed was down.

In particular, the crl.identrust.com server was down. I’m pretty familiar with our internal architecture, and that server isn’t something that we rely on. Or at least so I thought. CRL stands for Certificate Revocation List. Let’s see where it came from, shall we? Here is the certificate for this blog:

[Image: the certificate for this blog]

This is signed by Let’s Encrypt, like over 50% of the entire internet. And the Let’s Encrypt certificate has this interesting tidbit in it:

[Image: the CRL distribution point in the Let’s Encrypt certificate, pointing at crl.identrust.com]

Now, note that this CRL is only used for the case in which a revocation was issued for Let’s Encrypt itself. Which is probably a catastrophic event for the entire internet (remember > 50%).

When that server was down, the RavenDB client could not verify that the certificate chain was valid, so it failed the request. That was not expected, and it is something that we are considering disabling by default. Certificate Revocation Lists aren’t really used that much today. It is more common to see OCSP (Online Certificate Status Protocol), and even that has issues.
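
For context, and to be clear this is not RavenDB’s actual configuration, the .NET knob involved looks roughly like this; it controls whether an HTTPS client consults CRL/OCSP endpoints such as crl.identrust.com at all:

using System.Net.Http;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

// Illustration only: Online means the chain build may reach out to CRL/OCSP
// servers (and can fail if they are down); NoCheck skips revocation entirely.
var handler = new SocketsHttpHandler
{
    SslOptions = new SslClientAuthenticationOptions
    {
        CertificateRevocationCheckMode = X509RevocationMode.Online
    }
};
using var client = new HttpClient(handler);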

I would appreciate any feedback you have on the matter.

time to read 2 min | 393 words

RavenDB is a document database; as such, it stores data in JSON format. We have had a few cases of users that wanted to use RavenDB as the backend of various blockchains. I’m not going to touch on their reasoning. I think that a blockchain is a beautiful construct, but one that is still searching for a good problem to solve.

The reason for this post, however, is that we need to consider one of the key problems that you have to deal with in a blockchain: how to compute the signature of a JSON document. That is required so we’ll be able to build a Merkle tree, which is at the root of all blockchains.

There are things such as JWS and JOSE to handle that, of course. And rolling your own signature scheme is not advisable. However, I want to talk about a potentially important aspect of signing JSON, and that is that there isn’t really a proper canonical form of JSON. For example, consider the following documents:

All of those documents produce identical output when parsed. Admittedly, you could argue about the one using multiple Rating properties, but in general, they are the same. But if we look at the byte level representation, that is very far from the case.

A proper way to sign such messages would require that we:

  • Minify the output to remove any extra whitespace.
  • Error on multiple properties with the same key. That isn’t strictly required, but it is going to make everything easier.
  • Output the properties in a sorted order.
  • Normalize the string encoding to a single format.
  • Normalize the numeric encoding (for example, whether you support only double precision floats or arbitrary sized numbers).

Only then can you actually compute the signature over the raw bytes. That also means that you can’t just pipe the data to sha256() and call it a day.

Another alternative is to ignore all of that and decide that the only thing that we actually care about in this case is the raw bytes of the JSON document. In other words, we’ll validate the data as raw binary, without caring about the semantic differences. In this case, the output of all the documents above will be different.

Here is a simple example of cleaning up a JSON object to return a stable hash:
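
The original snippet isn’t reproduced here; the following is a sketch of the same idea, assuming Newtonsoft.Json. It handles minification and sorted property output; rejecting duplicate keys and normalizing numeric and string encodings from the list above are left as notes in the comments:

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class StableJsonHash
{
    public static string Compute(string json)
    {
        // A fuller implementation should also error on duplicate property names
        // and normalize numeric and string encodings, per the list above.
        var normalized = Normalize(JToken.Parse(json));
        var canonical = normalized.ToString(Formatting.None); // minified output
        using var sha = SHA256.Create();
        return Convert.ToHexString(sha.ComputeHash(Encoding.UTF8.GetBytes(canonical)));
    }

    private static JToken Normalize(JToken token) => token switch
    {
        // Rebuild objects with their properties in a stable, sorted order.
        JObject obj => new JObject(obj.Properties()
                                      .OrderBy(p => p.Name, StringComparer.Ordinal)
                                      .Select(p => new JProperty(p.Name, Normalize(p.Value)))),
        JArray arr => new JArray(arr.Select(Normalize)),
        _ => token
    };
}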

That answers the above criteria and is pretty simple to run and work with, including from other platforms and environments.

time to read 3 min | 533 words

At the beginning of the year, we ran into a problematic query. The issue was the use of an IN clause vs. a series of OR clauses. You can see the previous investigation results here. We were able to pinpoint the issue pretty well, very deep in the guts of Lucene, our query engine.

Fast Query (OR): 1 – 2 ms
Slow Query (IN): 60 – 90 ms
The key issue for this query was simple. There are over 600,000 orders with the relevant statuses, but there are no orders for CustomerId “customers/100”. In the OR case, we would evaluate the query lazily. First checking the CustomerId, and given that there were no results, short circuiting the process and doing no real work for the rest of the query. The IN query, on the other hand, would do things eagerly. That would mean that it would build a data structure that would hold all 600K+ documents that match the query, and then would throw that all away because no one actually needed that.

In order to resolve that, I have to explain a bit about the internals of Lucene. At its core, you can think of Lucene in terms of sorted lists inside dictionaries. I wrote a series of posts on the topic, but the gist of it is:

 

Note that the ids for documents containing a particular term are sorted. That is important for a lot of optimizations in Lucene, and it is also a major problem for the IN query. The problem is that each component in the query pipeline needs to maintain this invariant. But when we use an IN query, we need to go over potentially many terms. And then we need to get the results in the proper order to the calling code. I implemented a tiered approach. If we are using an IN clause with a small number of terms in it (under 128), we will use a heap to manage all the terms and effectively do a merge sort on the results.

When we have more than 128 terms, that stops being very useful, however. Instead, we’ll create a bitmap for the possible results and scan through all the terms, filling the bitmap. That can be expensive, of course, so I made sure that this is done lazily by RavenDB.
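
Here is a conceptual sketch of that tiered strategy. It is not RavenDB’s actual Lucene integration; posting lists are simplified to sorted int arrays, and the small-terms branch uses LINQ where the real code uses a heap-based merge:

using System.Collections.Generic;
using System.Linq;

static class InClauseMatcher
{
    // Each term’s postings are a sorted array of doc ids; the caller expects the
    // matching ids back in sorted order, which is the invariant discussed above.
    public static IEnumerable<int> Match(List<int[]> postingsPerTerm, int maxDocId)
    {
        if (postingsPerTerm.Count < 128)
        {
            // Few terms: merge the sorted posting lists (conceptually a k-way
            // merge driven by a heap; LINQ stands in for that here).
            return postingsPerTerm.SelectMany(p => p).Distinct().OrderBy(id => id);
        }

        // Many terms: fill a bitmap, which naturally yields ids in sorted order.
        var bitmap = new bool[maxDocId];
        foreach (var postings in postingsPerTerm)
            foreach (var id in postings)
                bitmap[id] = true;

        return Enumerable.Range(0, maxDocId).Where(id => bitmap[id]);
    }
}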

The results are in:

                     OR Query           IN Query
Invalid CustomerId   1.39 – 1.5 ms      1.33 – 1.44 ms
Valid CustomerId     17.5 ms            12.3 ms

For the first case, this is now pretty much a wash. The numbers are slightly in favor of the IN query, but it is within the measurement fluctuations.

For the second case, however, there is a huge performance improvement for the IN query. For that matter, the cost is going to be more noticeable the more terms you have in the IN query.

I’m really happy about this optimization, it ended up being quite elegant.

time to read 2 min | 333 words

I had a task for which I needed to track a union of documents and then iterate over them in order. It is actually easier to explain in code than in words. Here is the rough API:

As you can see, we initialize the value with a list of streams of ints. Each of the streams can contain any number of values in the range [0 … maxId). Different streams can contain the same or different ids.

After initialization, we have to allow querying the result, to test whether a particular id was stored, which is easy enough. If this was all I needed, we could make do with a simple HashSet<int> and mostly call it a day. However, we also need to support iteration; more interestingly, we have to support sorted iteration.

A quick solution would be to use something like SortedList<int,int>, but that is going to be massively expensive to do (O(N log N) to insert). It is also going to waste a lot of memory, which is important. A better solution would be to use a bitmap, which will allow us to use a single bit per value. Given that we know the size of the data in advance, that is much cheaper, and the cost of insert is O(N) in the number of ids we want to store. Iteration, on the other hand, is a bit harder on a bitmap.

Luckily, we have Lemire to provide a great solution. I have taken his C code and translated that to C#. Here is the result:

I’m using BitOperations.TrailingZeroCount, which will use the compiler intrinsics to compile this to very similar code to what Lemire wrote. This allows us to iterate over the bitmap in large chunks, so even for a large bitmap, if it is sparsely populated, we are going to get good results.
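
Since the translated snippet isn’t reproduced above, here is a reconstruction of the technique being described, under the assumption that the bitmap is stored as an array of 64-bit words:

using System.Collections.Generic;
using System.Numerics;

static class BitmapIterator
{
    // Yields the positions of all set bits, in ascending order. Each 64-bit word
    // is consumed by repeatedly finding and clearing its lowest set bit, which is
    // the core of Lemire’s decoding loop.
    public static IEnumerable<int> SetBits(ulong[] bitmap)
    {
        for (int i = 0; i < bitmap.Length; i++)
        {
            ulong word = bitmap[i];
            while (word != 0)
            {
                int bit = BitOperations.TrailingZeroCount(word);
                yield return i * 64 + bit;
                word &= word - 1; // clear the lowest set bit
            }
        }
    }
}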

Depending on the usage, a better option might be a Roaring Bitmap, but even there, dense sections will likely use something similar for optimal results.
