Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,322 | Comments: 50,640

Privacy Policy Terms
filter by tags archive
time to read 3 min | 592 words

A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.

Typical reasons for replication to stall usually involve connectivity issues, but in this case, we could see that there was no such sign of that. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn’t be the case, however. If we have nothing to send, there shouldn’t be a batch sent over the wire. That was the only hint of something wrong.

We also looked into what information RavenDB could tell us about the system, and noticed that we have a performance hint about big documents. Some of them exceeded 32MB in size, which is… quite a lot.  That doesn’t really relate so much to replication, however. It would surely slow it down, but it should work.

Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents, that was… strange. Cross referencing the log statement with the source code revealed that RavenDB decided that it is sending too much in the batch and aborted it. But… it isn’t sending anything in the batch.

What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted in both disk and memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold that in locked memory (so it wouldn’t be paged to disk). As a result of that, we try to limit the size of transactions in encrypted databases. When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate and send them to the other side.

There is a small caveat here, each node in an encrypted database can use a different encryption key, so we aren’t sending the encrypted data, but the plain text. Of course, the communication itself is encrypted, so nothing can peek into the data in the middle.

By default, we’ll stop a replication batch in an encrypted database after we locked more than 64 MB of memory. A replication batch of 64 MB is plenty big enough, after all. However… we didn’t take into account a scenario where a single document may cause us to consume more than 64 MB in locked memory. And we put the check to close the replication batch very early in the process.

The sequence of operations was then:

  • Start a replication batch
  • Load the first document to send
  • Realize that we locked too much memory and close the batch
  • Send a zero length batch

Rinse and repeat, since we can’t make any forward progress.

The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer has. At that point, there was forward progress again in the system and the replication was completed successfully.

We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.

time to read 4 min | 759 words

imageA typical production postmortem story is a tale of daring dives deep into the guts of your system. It is a journey into the intricacies of dependencies between multiple components, the delicate balance of distributed processes that got just the wrong level of alignment to cause some havoc. A production postmortem is a toil of mystery that can last for weeks. This isn’t one of those tales, however. In this one, the entire thing was wrapped up within fifteen minutes. So what was the issue?

The initial premise was pretty straightforward. A customer was running RavenDB in production, but due to their topology, their RavenDB instances are not exposed to the outside world directly. Instead, they route the connection to Azure Web Application Firewall and through Azure Front Door. I have no comment on the actual decision to route through those firewalls. The problem the customer had was that Azure Front Door doesn’t support web-sockets, RavenDB studio makes extensive use of them for a bunch of reasons and there are certain features that are also dependent on them (such as aggressive caching, the Changes() API, etc).

The customer wanted everything to work, and asked if RavenDB can support a long polling method, to avoid the issue entirely.

This is an XY Problem.

There was much confusion to be had, between our support team, yours truly and the technical people of the customer. Here is the issue, the problem the customer experienced is simply not possible. There is absolutely no way that they can run into this issue.

Here is the deal:

  • RavenDB is a secured-by-default database, which assumes that it is always running in a hostile environment.
  • For security, RavenDB uses TLS 1.2 or higher to safeguard the data in transit.
  • For authentication, RavenDB uses mutual authentication for both client & server using X509 certificates.

Take those three together and you’ll realize that the very design of RavenDB forces you to do SSL termination (here I’m using TLS & SSL as interchangeable terms) at the RavenDB process directly. We have to do it in this manner, since otherwise we wouldn’t be able to validate the certificate from the client.

The customer in this case was running in a secured mode, but was completely unable to use web sockets.

Again, that is not possible. Let me explain why.

If RavenDB is the entity that does SSL termination (in this case, doing the cryptographic handshake, authentication, etc) then anything in the middle between RavenDB and the client is dealing with an encrypted stream of bytes that are indistinguishable from random noise.

In other words, there shouldn’t be a way to not support web sockets, since any proxy in the middle shouldn’t be able to tell what the content of the request is.

This design by RavenDB also prevents you from forwarding requests, since the SSL stream must reach to RavenDB directly (as-is). Otherwise, RavenDB will not be able to authenticate the client certificate.

When we looked at the actual server in question, it quickly became apparent what the issue was. The customer was accessing RavenDB using HTTPS, as is proper. However, RavenDB itself was not configured to run in a secured manner. In other words, the client was accessing RavenDB using HTTPS, but the proxies in the middle will then connect to RavenDB itself using HTTP (no security). That means that RavenDB talks to the proxy with no encryption and the proxy is able to see into the requests. That leads, of course, to the situation where the supported feature set of the proxy impacts what capabilities RavenDB can utilize.

This is a broken setup, I want to point out. It is also a highly misleading setup, because RavenDB is running in unsecured mode, but you are using HTTPS to access it. We intend to make this configuration setup raise an alert and block this from deployments. RavenDB goes to great lengths to ensure that you won’t have those pitfalls to stumble into. I have to admit that we have never actually considered this sort of setup as a scenario. I am strongly reminded of this.

RavenDB is amenable to running behind a proxy, of course. The key to doing so successfully is that the proxy is responsible for TCP traffic only, never interfering with the (encrypted) content that goes over the wire. As a result of this requirement, we don’t need to worry about the capabilities of the various proxies. As long as it is able to support TCP connections, all features of RavenDB will work.

time to read 3 min | 457 words

We run into a really interesting bug. For some reason, the system was behaving in a totally unexpected manner for some parts of the data. For pretty much the same input, we would get the wrong result, and we couldn’t figure out why.

Here is our source data:

This is some metrics data about servers, and you’ll note that we report the CPU load for each core on the instance and that the results are sorted based on the actual load. Here is what this looks like, the image on the left is wrong while the image on the right is right (pun intended).

image

Why do we have this behavior?

Well, let’s look at the actual data, shall we?

image

They are… the same. Exactly the same, in fact. We can throw that into diff engine and they will tell me that they are identical (except for the document id).

What is going on here?

Well, here is the issue, what you see is not what you get. Look at the JSON text that I have above, and compare that to the documents we see in the images. RavenDB shows the documents in a nicely formatted manner, and along the way, it messed up something pretty important.

In our case, we used an object to hold the various details about the instances. And we relied that the insertion sort order for the properties would stay the same when reading the document. That is actually the case, and RavenDB goes to great lengths to ensure that this is the case. However…

In order to prettify the document, we call to JSON.parse() and JSON.stringify() (on the client side), which give us nicely formatted output. Along the way, however, we run into JavaScript and its “ideas” about how things should work. In particular, JavaScript threats properties whose key is a number in a different way than other values. All the numeric properties will be sorted according to their integer value, while non numeric values will be sorted using insertion order.

That only applies to documents that were modified in the studio, however. The RavenDB server and client API are keeping the properties in insertion order. Only if you modified the document using the Studio will you get this. But because we always show the documents in the same manner, it was invisible to us for a long while.

For that matter, it took an embarrassingly long time of debugging this problem, because (naturally) whenever we viewed the data, we did that with formatting, which meant that we never actually saw the differences between the raw versions of the documents.

time to read 2 min | 294 words

I read this post and it took me very little time to spot a pretty nasty bug. Here is the relevant section:

The bug is on line #4. The code assumes that a call to read() will return less than the requested number of bytes only at the end of the file. The problem with that approach is that this is explicitly documented to not work this way:

It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.

This is a super common error in many cases. And in the vast majority of the cases, that would work. Except when it wouldn’t.

The underlying implementation of File::read() will call read() or ReadFile(). ReadFile() (Windows) is documented to read as much as you requested, unless you hit the end of file. The read() call, on Unix, is documented to allow returning less than requested:

It is not an error if this number is smaller than the number of bytes requested

Aside from signals, the file system is free to do a partial read if it has some of the data in memory and some not. I’m not sure if this is implemented in this manner, but it is allowed to do so. And the results for the code above in this case are absolutely catastrophic (decryption will fail, encryption will emit partial information with no error, etc).

I’m writing this blog post because reading the code made the error jump at me. Was bitten by this assumption too many times.

time to read 8 min | 1507 words

The topic of this post is a bug in RavenDB, a pretty serious one. The end result is that a user reported that they got an error from RavenDB that they are unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn’t able to start up if that document had this behavior.

As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:

 Dictionary mismatch on Dic #375
   at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)

This is related to RavenDB’s document compression behavior. In order to get a great compression ratio from our documents, we train RavenDB on the recent documents that you have and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary it will (by default) generate an id from that document that is mostly based on the xxhash64 of its content, rounded to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we’ll detect the wrong dictionary.

So now we know what is going on, but we don’t understand why.

When we wrote this feature, we were quite aware that we’ll not be able to make any sort of sense from the documents if we don’t have the right dictionary. For that reason, we store the dictionaries three times. Once inside of RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet, we had run into that in production, so we have to dig deeper still.

The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents, if we aren’t able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates. As your documents change, RavenDB will detect that and realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky, we first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations, otherwise, we need to discard it.

I suspected that the issue was somewhere around that area, we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this isn’t something that obvious.

After spending quite a bit of time on the issue, I decided that the code is perfect, it handled everything properly and taken into account all the right behaviors.

Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea, what if the problem wasn’t during the training process, but afterward?

Well, that doesn’t really matter, does it? RavenDB is a transactional database, if we had a failure after the training process, we’ll have to discard some of the data, for sure, but that would be about it. Unless, what if we have some state that wasn’t transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don’t want to do that often. As such, we’ll do that for only about 1K document changes where we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? Well, we have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn’t part of the transactional state. If the transaction is aborted, the value is still incremented.  The actual value doesn’t matter, mind, only that it is moving forward, so that isn’t an issue.

I mentioned the dictionary id before, but I should clarify that this is the zstd’s dictionary id. Internally, RavenDB uses a different value. That value is simply the sequence number of the dictionary, RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we rollback a transaction, we’ll use the same dictionary id. But that doesn’t matter, of course.

When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work that is involved in that, there is memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance).

Given all of this information, here is the sequence of events that we need to get the error in question:

  1. The user enabled documents compression.
  2. The user runs a transaction with at least four commands, which needs to satisfy the following conditions.
  3. A document write as the first action.
  4. Then a write to document whose compression ratio exceeded the expected ratio by over 10%, as a result, RavenDB tried to train a new compression dictionary.
  5. That dictionary had a better compression ratio and was accepted as the new default compression dictionary.
  6. RavenDB persisted the new dictionary and used that to compress the new document.
  7. Another command (in the same transaction) had stored a document in the same collection, now RavenDB will read the new dictionary and store that in a cache.
  8. A third command runs, but this one throws an error (such as optimistic concurrency violation).

At this point, RavenDB will rollback the entire transaction and return the error to the user. Let’s say the user has chosen to submit the same two documents again, shall we?

For the first command, we’ll again discover that the compression ratio (of the old compression dictionary) is insufficient. We will not generate a new compression dictionary, why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we’ll need to wait for another 1K documents for the stars to properly align for us. That doesn’t impact correctness in any way, shape or form, however.

At this stage, the stage is set, but everything is still okay. The problem will happen on the next time that we’ll trigger a new dictionary. At that point, we’ll again scan the most recent documents, build a dictionary, etc. However, the dictionary id that RavenDB will use will be identical to the dictionary id that we previously discarded. The data that dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy, the new document that we wrote will use the new compression dictionary and we are perfectly fine.

The next write for this collection, however, will run into a problem. It will need to use the current (the new one) dictionary when we want to make a write. In order to do that, it will load the value using the cache, but there is already a value for that dictionary in the cache, the same dictionary that was discarded. At this point, RavenDB will start compressing documents using the in memory dictionary while the on disk dictionary is different.

If you’ll try to access the document which triggered the new dictionary, you’ll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course.

On restart, we’ll read the dictionary from disk, where we wrote the new dictionary, at this point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact, you need to have a dictionary training as part of a multi act transaction which failed after the dictionary training has been successful and wrote additional documents. In a year and a half of production usage and very heavy load, that happened only a couple of times, it seems.

The issue has been fixed, of course and we’ll be rolling it out to both users and cloud customers. We’ll now rollback such in memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn’t the code itself that caused a problem, but a sequence of independent operations and failure modes that we never even considered about this.

time to read 1 min | 84 words

I run into the following code during code review and had an immediate and visceral reaction.

This is a (bad) attempt to add thread safety, because you are getting a value through a read only interface, but there is still the mutable instance to work with at the source, and now you have someone that observes the instance while it is being mutated, outside the lock.

The proper way to handle this is to copy the list (under the lock) and return a distinct copy.

time to read 1 min | 144 words

I asked about a slow piece of code and why it is so slow. The answer is pretty simple, I messed up, take a look at this piece of code:

image

When the Value is an int, I’m creating a span from the address of a long variable (initialized to zero). That means that I have a lot of hash collisions, which means that adding to the dictionary turned into a sequential operation, which means that the actual cost here is O(N**2).

Interestingly enough, you can’t write this code without the use of unsafe. I actually didn’t know that the scope of the variable was even valid here to have its address taken. That was very hard to debug, because the problem was hidden away somewhere very far from where we looked at.

time to read 1 min | 81 words

The following code takes a long time to run. In fact, I’m writing this blog post while this is running, and I’m not sure how long that will take.

Update: This took over a minute to complete on my machine (which is pretty nice).

The killer is that this code is correct, it does the right thing, but it can be slow. I stripped a much larger scenario to ~50 lines of code, can you see what is going on? And why?

time to read 3 min | 422 words

Following my previous posts, about the use after free bug that was found in a pull request, I put a lot of effort into testing the validity of the fix. As it turned out, my code was right and the fix worked properly. However, I uncovered a race condition in the .NET MemoryCache implementation. Here is what I got when I put the system under load:

Unhandled exception. System.ArgumentException: Unable to sort because the IComparer.Compare() method returns inconsistent results. Either a value does not compare equal to itself, or one value repeatedly compared to another value yields different results. IComparer: 'System.Comparison`1[Microsoft.Extensions.Caching.Memory.CacheEntry]'.
   at System.Collections.Generic.ArraySortHelper`1.Sort(Span`1 keys, Comparison`1 comparer) in System.Private.CoreLib.dll:token 0x60066cd+0x1d
   at System.Collections.Generic.List`1.Sort(Comparison`1 comparison) in System.Private.CoreLib.dll:token 0x600688b+0x3
   at Microsoft.Extensions.Caching.Memory.MemoryCache.g__ExpirePriorityBucket|27_0(Int64& removedSize, Int64 removalSizeTarget, Func`2 computeEntrySize, List`1 entriesToRemove, List`1 priorityEntries) in Microsoft.Extensions.Caching.Memory.dll:token 0x6000061+0x21
   at Microsoft.Extensions.Caching.Memory.MemoryCache.Compact(Int64 removalSizeTarget, Func`2 computeEntrySize) in Microsoft.Extensions.Caching.Memory.dll:token 0x600005b+0xff
   at Microsoft.Extensions.Caching.Memory.MemoryCache.OvercapacityCompaction(MemoryCache cache) in Microsoft.Extensions.Caching.Memory.dll:token 0x6000059+0xad
   at System.Threading.ThreadPoolWorkQueue.Dispatch() in System.Private.CoreLib.dll:token 0x6002b7c+0x110
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() in System.Private.CoreLib.dll:token 0x6002c66+0x190

There are a few interesting things here. First, we can see that this killed the process, this isn’t just an error, this is an error from a background thread that ended up unhandled and killed everything. That is a nope, for server applications. The second issue is that the error is strange. What exactly is going on here?

Here is the relevant piece of code that throw the error, inside the MemoryCache:

priorityEntries.Sort((e1, e2) => e1.LastAccessed.CompareTo(e2.LastAccessed));

This is a really interesting line, because of what it does. The priorityEntries is a local list of cache entries, which we need to sort by the last access, to figure out what we can evict. What can go wrong here?

Well, the MemoryCache is a naturally concurrent instance, of course. What happens when we have an access to the entry in question? We’ll update the LastAccessed value. And if we do this just right, we may give the sort function different values for the same cache entry, leading to this problem.

The bug in question was in place for as long as I can track the MemoryCache codebase. In the best case scenario, it will cause evictions of data that shouldn’t be evicted. Not a major issue, and unlikely to be noticed. But if it fail in this manner, it will kill the process, so very likely to be noticed.

My test scenario had a lot of concurrency and a very small cache (leading to a lot of evictions / compactions), I’m guessing that is why I’m able to trigger this so easily.

time to read 2 min | 239 words

After presenting the issue of how to return items to the array pool without creating a use after free bug, I asked you how you would fix that. There are several ways to try to do that, you can use reference counting scheme, or try to use locking, etc. All of those options are expensive, what is worse, they are expensive on a routine basis, not just for the free the buffer code path.

Instead, I changed the way we are handling returning the buffer. Take a look at the following code:

This may require some explanation. I’m using a ConditionaWeakTable here, that was added to the runtime to enable dynamic properties on objects. Basically, it creates a table that you can lookup by an object to get a key. The most important feature is that the runtime ensures that the associated reference lifetime match the key object lifetime. In other words, when we add the buffer in the eviction callback, we ensure that the ReturnBuffer we register will live at least as long as the buffer.

That means that we can let the GC do the verification job. We’ll now return the buffer back to the pool only after the GC has ensured that there are no outstanding references to it. Not a lot of code, and an elegant solution. This also ensures that we are only paying the code on eviction (likely rare), and not all the time.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Challenge (66):
    06 May 2022 - Spot the optimization–solution
  2. Production postmortem (37):
    29 Apr 2022 - Deduplicating replication speed
  3. Recording (3):
    11 Apr 2022 - Clean Architecture with RavenDB
  4. Answer (10):
    07 Apr 2022 - Why is this code broken?
  5. Request for comments (2):
    10 Mar 2022 - Removing graph queries from RavenDB
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats