Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,524
|
Comments: 51,150
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 764 words

I wanted to test low-level file-system behavior in preparation for a new feature for RavenDB. Specifically, I wanted to look into hole punching - where you can give low-level instructions to the file system to indicate that you’re giving up disk space, but without actually reducing the size of the file.

This can be very helpful in space management. If I have a section in the file that is full of zeroes, I can just tell the file system that, and it can skip storing that range of zeros on the disk entirely. This is an advanced feature for file systems. I haven't actually used that in the past, so I needed to gain some expertise with it.

I wrote the following code for Linux:


int fd = open("test.file", O_CREAT | O_WRONLY, 0644);
lseek(fd, 128 * 1024 * 1024 - 1, SEEK_SET); // 128MB file
write(fd, "", 1);
fallocate(fd, // 32 MB hole from the 16MB..48MB range
    FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 
    16 * 1024 * 1024, 32 * 1024 * 1024); 
close(fd);

The code for Windows is here if you want to see it. I tested the feature on both Windows & Linux, and it worked. I could see that while the file size was 128MB, I was able to give back 16MB to the operating system without any issues. I turned the code above into a test and called it a day.

And then the CI build broke. But that wasn’t possible since I tested that. And there had been CI runs that did work on Linux. So I did the obvious thing and started running the code above in a loop.

I found something really annoying. This code worked, sometimes. And sometimes it just didn’t.

In order to get the size, I need to run this code:


struct stat st;
fstat(fd, &st);
printf("Total size: %lld bytes\n",
    (long long)st.st_size);
printf("Actual size on disk: %lld bytes\n", 
    (long long)st.st_blocks * 512);

I’m used to weirdness from file systems at this point, but this is really simple. All the data is 4KB aligned (in fact, all the data is 16MB aligned). There shouldn’t be any weirdness here.

As you can see, I’m already working at the level of Linux syscalls, but I used strace to check if there is something funky going on. Nope, there was a 1:1 mapping between the code and the actual system calls issued.

That means that I have to debug deeper if I want to understand what is going on. This involves debugging the Linux Kernel, which is a Big Task. Take a look at the code in the relevant link. I’m fairly certain that the issue is in those lines. The problem is that this cannot be, since both offset & length are aligned to 4KB.

I got out my crystal ball and thinking hat and meditated on this. If you’ll note, the difference between the expected and actual values is exactly 4KB. It almost looks like the file itself is not aligned on a 4KB boundary, but the holes must be.

Given that I just want to release this space to the operating system and 4KB is really small, I can adjust that as a fudge factor for the test. I would love to understand exactly what is going on, but so far the “file itself is not 4KB aligned, but holes are” is a good working hypothesis (even though my gut tells me it might be wrong).

If you know the actual reason for this, I would love to hear it.

And don't get me started on what happened with sparse files in macOS. There, the OS will randomly decide to mark some parts of your file as holes, making any deterministic testing really hard.

time to read 4 min | 765 words

I’m currently deep in the process of modifying the internals of Voron, trying to eke out more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t even been looked at for a long time.

In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking.

We have enough tests that I’ve great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.

The process is straightforward:

  • Change something.
  • Verify that it works better than before.
  • Run the entire test suite (upward of 30K tests) to see if there are any breaks.

The last part can be frustrating because it takes a while to run this sort of test suite. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any access to that memory would result in an access violation.

I fixed those in the code, of course, but we have a lot of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions.

One such test writes garbage to the RavenDB file, using read-write memory. The idea is to verify that the checksum matches on read and abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn’t just result in a test failure, it takes the whole process down.

I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed.

And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow up a pointer to an invalid location. That is not supposed to happen, obviously.

Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid.

My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.

RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage.

It was also quite frustrating to track because any small-stage test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.

I started practicing for my Fields medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually, I tracked it down to registering to the TransactionCommit event when we open a new file.

The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally.

As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the data but would end up with still encrypted data (looking like random fluff).

The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.

time to read 1 min | 122 words

I’m trying to pay a SaaS bill online, and I run into the following issue. I have insufficient permissions to pay the invoice on the account. No insufficient funds, which is something that you’ll routinely run into when dealing with payment processing. But insufficient permissions!

 

Is… paying something an act that requires permissions? That something that happens? Can I get more vulnerabilities like that? When I get people to drive-by pay for my bills?

I can’t think of a scenario where you are prevented from paying to the provider. That is… weird.

And now I’m in this “nice” position where I have to chase after the provider to give them money, because otherwise they’ll close the account.

time to read 2 min | 369 words

RavenDB is a .NET application, written in C#. It also has a non trivial amount of unmanaged memory usage. We absolutely need that to get the proper level of performance that we require.

With managing memory manually, there is also the possibility that we’ll mess it up. We run into one such case, when running our full test suite (over 10,000 tests) we would get random crashes due to heap corruption. Those issues are nasty, because there is a big separation between the root cause and the actual problem manifesting.

I recently learned that you can use the gflags tool on .NET executables. We were able to narrow the problem to a single scenario, but we still had no idea where the problem really occurred. So I installed the Debugging Tools for Windows and then executed:

 &"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\gflags.exe" /p /enable C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe

What this does is enable a special debug heap at the executable level, which applies to all operations (managed and native memory alike).

With that enabled, I ran the scenario in question:

PS C:\Work\ravendb-6.0\test\Tryouts>  C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe
42896
Starting to run 0
Max number of concurrent tests is: 16
Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.
         To attach debugger to test process (x64), use proc-id: 42896. Url http://127.0.0.1:51595
Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.  License limits: A: 3/32. Total utilized cores: 3. Max licensed cores: 1024
http://127.0.0.1:51595/studio/index.html#databases/documents?&database=Should_correctly_reduce_after_updating_all_documents_1&withStop=true&disableAnalytics=true
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
    at Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Encode(System.ReadOnlySpan`1<Byte>, System.Span`1<Byte>)
    at Sparrow.Server.Compression.HopeEncoder`1[[Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], Sparrow.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]].Encode(System.ReadOnlySpan`1<Byte> ByRef, System.Span`1<Byte> ByRef)
    at Voron.Data.CompactTrees.PersistentDictionary.ReplaceIfBetter[[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593],[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]](Voron.Impl.LowLevelTransaction, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Voron.Data.CompactTrees.PersistentDictionary)
    at Raven.Server.Documents.Indexes.Persistence.Corax.CoraxIndexPersistence.Initialize(Voron.StorageEnvironment)

That pinpointed things so I was able to know exactly where we are messing up.

I was also able to reproduce the behavior on the debugger:

image (3)

This saved me hours or days of trying to figure out where the problem actually is.

time to read 9 min | 1666 words

I just completed a major refactoring of a piece of code inside RavenDB that is responsible for how we manage sorted queries. The first two tiers of tests all passed, which is great. Now was the time to test how this change performed. I threw 50M records into RavenDB and indexed them. I did… not like the numbers I got back. It makes sense, since I was heavily refactoring to get a particular structure, I could think of a few ways to improve performance, but I like doing this based on profiler output.

When running the same scenario under the profiler, the process crashed. That is… quite annoying, as you can imagine. In fact, I discovered a really startling issue.

If I index the data and query on it, I get the results I expect. If I restart the process and run the same query, I get an ExecutionEngineException. Trying to debug those is a PITA. In this case, I’m 100% at fault, we are doing a lot of unsafe things to get better performance, and it appears that I messed up something along the way. But my only reproduction is a 50M records dataset. To give some context, this means 51GB of documents to be indexed and 18 GB of indexing. Indexing this in release mode takes about 20 minutes. In debug mode, it takes a lot longer.

Trying to find an error there, especially one that can only happen after you restart the process is going to be a challenging task. But this isn’t my first rodeo. Part of good system design is knowing how to address just these sorts of issues.

The indexing process inside RavenDB is single-threaded per index. That means that we can rule out a huge chunk of issues around race conditions. It also means that we can play certain tricks. Allow me to present you with the nicest tool for debugging that you can imagine: repeatable traces.

Here is what this looks like in terms of code:

In this case, you can see that this is a development only feature, so it is really bare-bones one. What it does is capture the indexing and commit operations on the system and write them to a file. I have another piece of similarly trivial code that reads and applies them, as shown below. Don’t bother to dig into that, the code itself isn’t really that interesting. What is important is that I have captured the behavior of the system and can now replay it at will.

The code itself isn’t much, but it does the job. What is more important, note that we have calls to StopDatabase() and StartDatabase(), I was able to reproduce the crash using this code.

That was a massive win, since it dropped my search area from 50M documents to merely 1.2 million.

The key aspect of this is that I now have a way to play around with things. In particular, instead of using the commit points in the trace, I can force a commit (and start / stop the database) every 10,000 items (by calling FlushIndexAndRenewWriteTransaction). When using that, I can reproduce this far faster. Here is the output when I run this in release mode:

1 With 0
2 With 10000
3 With 10000
4 With 10000
5 With 10000
6 With 10000
7 With 10000
8 With 10000
9 With 10000
10 With 10000
11 With 10000
Fatal error. Internal CLR error. (0x80131506)

So now I dropped the search area to 120,000 items, which is pretty awesome. Even more important, when I run this in debug mode, I get this:

1 With 0
2 With 10000
Process terminated. Assertion failed.
   at Voron.Data.Containers.Container.Get(Low...

So now I have a repro in 30,000 items, what is even better, a debug assertion was fired, so I have a really good lead into what is going on.

The key challenge in this bug is that it is probably triggered as a result of a commit and an index of the next batch. There is a bunch of work that we do around batch optimizations that likely cause this sort of behavior. By being able to capture the input to the process and play with the batch size, we were able to reduce the amount of work required to generate a reproduction from 50M records to 30,000 and have a lead into what is going on.

With that, I can now start applying more techniques to narrow down what is going on. But by far the most important aspect as far as I’m concerned is the feedback cycle. I can now hit F5 to run the code and encounter the problem in a few seconds.

image

It looks like we hit a debug assertion because we keep a reference to an item that was already freed. That is really interesting, and now I can find out which item and then figure out why this is the case. And at each point, I can simply go one step back in the investigation, and reproduce the state, I have to hit F5 and wait a bit. This means that I can be far more liberal in how I figure out this bug.

This is triggered by a query on the indexed data, and if I follow up the stack, I have:

image

This is really interesting, I wonder… what happens if I query before I restart the database? With this structure, this is easy to do.

image

This is actually a big relief. I had no idea why restarting the database would cause us to expose this bug.

Another thing to note is that when I ran into the problem, I reproduced this on a query that sorted on a single field. In the test code, I’m testing on all fields, so that may be an asset in exposing this faster.

Right now the problem reproduces on the id field, which is unique. That helps, because it removes a large swath of code that deals with multiple terms for an entry. The current stage is that I can now reproduce this issue without running the queries, and I know exactly where it goes wrong.

image

And I can put a breakpoint on the exact location where this entry is created:

image

By the way, note that I’m modifying the code instead of using a conditional breakpoint. This is because of the performance difference. For a conditional breakpoint, the debugger has to stop execution, evaluate the condition and decide what to do. If this is run a lot, it can have a huge impact on performance. Easier to modify the code. The fact that I can do that and hit F5 and get to the same state allows me to have a lot more freedom in the ergonomics of how I work.

This allows me to discover that the entry in question was created during the second transaction. But the failure happens during the third, which is really interesting. More to the point, it means that I can now do this:

image

With the idea that this will trigger the assert on the exact entry that cause the problem. This is a good idea, and I wish that it worked, but we are actually doing a non-trivial amount of work during the commit process, so now we have a negative feedback and another clue. This is happening in the commit phase of the indexing process. It’s not a big loss, I can do the same in the commit process as well. I have done just that and now I know that I have a problem when indexing the term: “tweets/1212163952102137856”. Which leads to this code:

image

And at this point, I can now single step through this and figure out what is going on, I hope.

When working on complex data structures, one of the things that you need to do is to allow to visualize them. Being able to manually inspect the internal structure of your data structures can save you a lot of debugging. As I mentioned, this isn’t my first rodeo. So when I narrowed it down to a specific location, I started looking into exactly what is going on.

Beforehand, I need to explain a couple of terms (pun intended):

  • tweets/1212163952102137856 – this is the entry that triggers the error.
  • tweets/1212163846623727616 – this is the term that should be returned for 1679560

Here is what the structure looks like at the time of the insert:

image

You can notice that the value here for the last page is the same as the one that we are checking for 1679560.

To explain what is going on will take us down a pretty complex path that you probably don’t care about, but the situation is that we are keeping track of the id in two locations. Making sure to add and remove it in both locations as appropriate. However, at certain points, we may decide to shuffle things around inside the tree, and we didn’t sync that up properly with the rest of the system, leading to a dangling reference.

Now that I know what is going on, I can figure out how to fix it. But the story of this post was mostly about how I figured it out, not the bug itself.

The key aspect was to get to the point where I can reproduce this easily, so I can repeat it as many times that is needed to slowly inch closer to the solution.

time to read 4 min | 741 words

I’m doing a pretty major refactoring inside of RavenDB right now. I was able to finish a bunch of work and submitted things to the CI server for testing. RavenDB has several layers of tests, which we run depending on context.

During development, we’ll usually run the FastTests. About 2,300 tests are being run to validate various behaviors for RavenDB, and on my machine, they take just over 3 minutes to complete. The next tier is the SlowTests, which run for about 3 hours on the CI server and run about 26,000 tests. Beyond that we actually have a few more layers, like tests that are being run only on the nightly builds and stress tests, which can take several minutes each to complete.

In short, the usual process is that you write the code and write the relevant tests. You also validate that you didn’t break anything by running the FastTests locally. Then we let CI pick up the rest of the work. At the last count, we had about 9 dedicated machines as CI agents and given our workload, an actual full test run of a PR may complete the next day.

I’m mentioning all of that to explain that when I reviewed the build log for my PR, I found that there were a bunch of tests that failed. That was reasonable, given the scope of my changes. I sat down to grind through them, fixing them one at a time. Some of them were quite important things that I didn’t take into account, after all. For example, one of the tests failed because I didn’t account for sorting on a dynamic numeric field. Sorting on a numeric field worked, and a dynamic text field also worked. But dynamic numeric field didn’t. It’s the sort of thing that I would never think of, but we got the tests to cover us.

But when I moved to the next test, it didn’t fail. I ran it again, and it still didn’t fail. I ran it in a loop, and it failed on the 5th iteration. That… sucked. Because it meant that I had a race condition in there somewhere. I ran the loop again, and it failed again on the 5th. In fact, in every iteration I tried, it would only fail on the 5th iteration.

When trying to isolate a test failure like that, I usually run that in a loop, and hope that with enough iterations, I’ll get it to reproduce. Having it happen constantly on the 5th iteration was… really strange. I tried figuring out what was going on, and I realized that the test was generating 1000 documents using a random. The fact that I’m using Random is the reason it is non-deterministic, of course, except that this is the code inside my test base class:

image

So this is properly initialized with a seed, so it will be consistent.

Read the code again, do you see the problem?

image

That is a static value. So there are two problems here:

  • I’m getting the bad values on the fifth run in a consistent manner because that is the set of results that reproduce the error.
  • This is a shared instance that may be called from multiple tests at once, leading to the wrong result because Random is not thread safe.

Before fixing this issue so it would run properly, I decided to use an ancient debugging technique. It’s called printf().

In this case, I wrote out all the values that were generated by the test and wrote a new test to replay them. That one failed consistently.

The problem was that it was still too big in scope. I iterated over this approach, trying to end up with a smaller section of the codebase that I could invoke to repeat this issue. That took most of the day. But the end result is a test like this:

As you can see, in terms of the amount of code that it invokes, it is pretty minimal. Which is pretty awesome, since that allowed me to figure out what the problem was:

image

I’ve been developing software professionally for over two decades at this point. I still get caught up with things like that, sigh.

time to read 5 min | 916 words

I’m working on improving the performance of Corax, RavenDB’s new search engine. Along the way, I introduced a bug, a fairly nasty one. At a random location, while indexing a ~50 million documents corpus, we are getting an access violation exception. That means that I messed something up.

That makes sense, given that my changes were mostly about making things lower-level. Working directly with pointers and avoiding length checks. At our speed, even the use of Span can be a killer for performance, and we want to be as close to the raw metal as possible. The particular changeset that I was working on was able to improve the indexing speed from 90,000 per second to 120,000 per second. That is a change that I absolutely want to keep, so I started investigating it.

I mentioned that it is a fairly nasty problem. A truly nasty problem would be heap corruption that is discovered after the fact and is very hard to trace. In this case, it was not consistent, which is really strange. One of the important aspects of Corax is that it is single-threaded, which means that a lot of complexity is out the window. It means that for the same input, we always have the same behavior. If there is any variance, such as not crashing all the time, it means that there are external factors involved.

At any rate, given that it happened at least half the time, I was able to attach WinDBG to the process and wait for the exception to happen, this is what I got:

(5e20.1468): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
Corax!Corax.IndexWriter.AddEntriesToTermResultViaSmallPostingList+0x953:
00007ffa`24dcea53 c4e261902411    vpgatherdd xmm4,dword ptr [rcx+xmm2],xmm3 ds:0000026d`516514e7=????????

Now, look at the last line, that is an interesting one, we use the VPGATHERDD assembly instruction. It is gathering packed DWORD values, in C#, this is generated using the Avx2.GatherVector128() method. We are using that to do some bit packing in this case, so this makes a lot of sense.

Next, let’s see what we get from the exception:

0:074> .exr -1
ExceptionAddress: 00007ffafc2bfe7c (KERNELBASE!RaiseException+0x000000000000006c)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000080
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000026d51650000
Attempt to read from address 0000026d51650000

All of this points to an out-of-bounds read, but why is that? The call we have for GatherVector128() is used inside a method named: ReadAvx2(). And this method is called like this:

private unsafe static ulong Read(int stateBitPos, byte* inputBufferPtr, int bitsToRead, int inputBufferSize, out int outputStateBit)
{
    if ((stateBitPos + bitsToRead) / 8 >= inputBufferSize)
        throw new ArgumentOutOfRangeException();
    if ( Avx2.IsSupported)
    {
        return ReadAvx2(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
    }
    return ReadScalar(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
}

It is an optimized approach to read some bits from a buffer, I’ll skip the details on exactly how this works. As you can see, we have a proper bounds check here, ensuring that we aren’t reading past the end of the buffer.

Except…

That we aren’t actually checking this. What we are doing is checking that we can access the bytes range, but consider the following scenario:

image

We have a memory page and a buffer that is located toward the end of it.  We are now trying to access the last bit in the buffer, using ReadAvx2(). If we’ll check the actual bytes range, it will pass, we are trying to access the last byte.

However, we are going to call GatherVector128(), which means that we’ll actually access 16 bytes(!), and only the first byte is in the valid memory range, the rest is going to be read from the next page, which isn’t mapped.

This also explains why we are not always failing. If the next page is valid (which is subject to the decisions of the operating system allocator), it will pass. So that is why we didn’t have 100% reproduction. In fact, this is the sort of bug that is very easy to hide for a very long time in the system, given that it is dependent on the actual memory structure of the application.

Once we figured out what was going on, it was pretty easy to understand, but the fact that the AVX instructions will read after the end of the buffer was really confusing. Because even when we used Span, and its range checks, it would be completely ignored. Makes total sense, given that those aren’t really methods, but compiler intrinsics that are translated to direct machine instructions.

Amusingly enough, now that we found the problem, we ran into something very similar a long while ago. Then it was the wrong instruction being used (loading a word, instead of a byte), that would fail, but the same overal setup. It will sometimes fail, depending on the state of the next page in the memory.

We actually built some tooling around managing that, we call that electric fence memory. We allocate memory so any out-of-band access would always hit invalid memory, stopping us in our tracks. That means that I can get easy reproduction of those sorts of issues, and once we have that, the rest isn’t really that interesting, to be honest. It’s just a normal bug fix. It’s the hunt for the root cause that is both incredibly frustrating and quite rewarding.

time to read 2 min | 393 words

A customer reported a scenario where RavenDB was using stupendous amounts of memory. In the orders of tens of GB on a system that didn’t have that much load.

Our first suspicion was that this is an issue with reading the metrics, since RavenDB will try to keep as much of the data in memory, which sometimes leads users to worry. I spoke about this at length in the past.

In this case, that wasn’t the case. We were able to drill down into the exact cause of the memory usage and we found out that RavenDB was using an abnormally high amount of memory. The question was why that was, exactly.

We looked into the common operations on the server, and we found a suspicious query, it looked something like this:

from index 'Sales/Actions'
where endsWith(WorkflowStage, '/Final')

The endsWith query was suspicious, so we looked into that further. In general, endsWith requires us to scan all the unique terms for a particular field, but in most cases, there aren’t that many unique values for a field. In this case, however, that wasn’t the case, here are some of the values for WorkflowStage:

  • Workflows/3a1af12a-b5d2-4c96-9348-177ebaacab6c/Step-2
  • Workflows/6aacc86c-2f28-4b8b-8dee-1024314d5add/Final

In total, there were about 250 million sales in the database, each one of them with a unique WorflowStage value.

What does this mean, in terms of RavenDB query execution? Well, the fields are indexed, but we need to effectively do:

This isn’t the actual code, but it will show you what is going on.

In other words, in order to process this query, we have to scan (and materialize) all 250 million unique terms for this field. Obviously that is going to consume a lot of memory.

But what is the solution to that? Instead of doing an expensive endsWith query, we can move the computation from the query time to the index time.

In other words, instead of indexing the WorkflowStage field  as is, we’ll extract the information we want from it. The index would have one of those:

IsFinalWorkFlowStage = doc.WorkflowStage.EndsWith(“/Final”),

WorkflowStagePostfix = doc.WorkflowStage.Split(‘/’).Last()

The first one will check whether the value is final or not, while the second just gets the (one of hopefully a few) postfixes for the field. We can then query using equality instead of endsWith, leading to far better performance and greatly reduced memory usage, since we don’t need to materialize any values during the query.

time to read 2 min | 277 words

image A user of ours called us, quite frantic. They are running a lot of systems on RavenDB, and have been for quite some time.

However, very recently they started to run into severe issues. RavenDB would complain that there isn’t sufficient memory to run.

The system metrics, however, said that there are still gobs of GBs available (I believe that this is the appropriate technical term).

After verifying the situation, the on-call engineer escalated the issue. The problem was weird. There was enough memory, for sure, but for some reason RavenDB would be unable to run properly.

An important aspect is that this user is running a multi-tenant system, with each tenant being served by its own database. Each database has a few indexes as well.

Once we figured that out, it was actually easy to understand what is going on.

There are actually quite a few limits that you have to take into account. I talked about them here. In that post, the issue was the maximum number of tasks defined by the system. After which, you can no longer create new threads.

In this case, the suspect was: vm.max_map_count.

Beyond just total memory, Linux has a limit on the number of memory mappings that a process may have. And RavenDB uses Voron, which is based on mmap(), and each database and each index typically have multiple maps going on.

Given the number of databases involved…

The solution was to increase the max_map_count and add a task for us, to give a warning to the user ahead of time when they are approaching the system's limits.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Challenge (75):
    01 Jul 2024 - Efficient snapshotable state
  2. Recording (14):
    19 Jun 2024 - Building a Database Engine in C# & .NET
  3. re (33):
    28 May 2024 - Secure Drop protocol
  4. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
  5. Production Postmortem (51):
    12 Dec 2023 - The Spawn of Denial of Service
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}