Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,386 | Comments: 50,790

Privacy Policy Terms
filter by tags archive
time to read 5 min | 969 words

One of the big changes in RavenDB is the new search engine, Corax. We want to replace Lucene with a purpose built search engine, capable of doing everything that we can do with Lucene, but far faster.

In this post, I want to discuss a particular detail of the implementation. Managing the posting list. In information retrieval, the idea of a posting list is that we have a list of document ids that match a particular term. I’m ignoring the details of how we create or manage the list. The basic idea is that we have a need to store a potentially large number of document ids, update them on the fly as well query them. Lucene manages that by creating immutable segments, but that design decision doesn’t match the way we want to do things in Corax. The problem with immutable segments is that you’ll eventually need to run compaction, which can be really expensive.

As a database, we already have a really good data structure that matches most of those requirements, it turns out. A B+Tree can do a pretty good approximation of a posting list, but it does have a problem. It’s not efficient in terms of space usage. A document id is a 64 bits integer, and we can make a number of assumptions about it. Therefore, I create a dedicated B+Tree like structure to hold the posting list. This B+Tree is called a Set inside of Voron, and it can hold any number of uint64 values. Internally, this is managed as two separate types of pages. The Branch pages (which are fairly standard B+Tree branches) and the Leaf pages.

Here is the first version of the Set Leaf Page implementation:

image

Let’s figure out what is going on here. The page has a notion of a base. That means all the documents IDs that have the same upper 33 bits. Basically, inside a single page, all the IDs are in the same 2 billion range. That means that even though the document ids are 64 bits, in the context of a single page, we can treat them as 32 bits integers. That turns out to be important, since most integer compression routines work on 32 bits integers.

How does that work? We have a section in the page that is dedicated to raw values, and we insert values into that section until it is full. Whenever that happens, we compress the raw values section using PForDelta compression. The page will then contain multiple compressed segments and a single raw values segment. Each compressed segment is non-overlapping with the other compressed segments (but may overlap with the raw values). PForDelta is really good in compressing integers, especially if it is able to figure out patterns in your data. And documents IDs in Corax are explicitly designed to have common patterns so it will be able to take advantage of this behavior. When we read the entries from the page, we merge the compressed segments with the raw values and return the full details.

The code isn’t particularly simple, but has a fairly straightforward approach to the problem once you understand the design.

One thing that I haven’t touched is the notion of removals. That is an important concept, and we handle that in an interesting way. Remember that I said that the baseline for the page is the upper 33 bits? That is because the numbers inside the page are 31 bits in length, we reserve the top bit to mark a value as a tombstone marker.  In other words, when we write to the raw values, we can have either a new value or a removed value, distinguished using the uppermost bit.

When we fill the raw values segment, we compress it alongside the relevant compressed segments. At that point, we’ll filter out the removed values. This is very similar to the segments that Lucene is using, but we do that on a page boundary (8KB), not across all values.

We are able to push a lot of values into a single page. We see typically thousands to tens of thousands of documents IDs fitting into a single 8KB page. That is pretty amazing, since even if you have a posting list that has millions of entries, the actual disk reads are minimal.

The design was with us throughout most of the development of Corax, but we ran into a serious issue with the way it works when we started to build the query optimizer for Corax.

That is an interesting statement, I think you’ll agree. What is the relation between a (very) low-level design of the on-disk data format and the result of the query optimizer?

Well, it turns out that one of the things that we need to know for the query optimizer is: “How big is this posting list?”

This question is actually really important to be able to generate optimal queries. And given the structure we have, we can provide a good answer to that, most of the time, but not always.

Why is that?

The problem is what happens when we want to remove a value from the set, or add an existing value. If the value already exists inside a compressed segment, we don’t open the compressed segement (which will require re-writing it from scratch), so we record an addition that is actually spurious. Conversely, if we try to remove a value that isn’t in the set, we’ll wrongly decrement the number of entries in the posting list, leading to issues with a mismatch between the record number of entries and the actual number we have in the posting list.

That was very confusing to figure out, I’ll admit. It was also really hard to fix, but I’ll address that in the next post in the series.

time to read 2 min | 337 words

A database indexing strategy is a core part of achieving good performance. About 99.9% of all developers have a story where adding an index to a particular query cut the runtime from seconds or minutes to milliseconds. That percentage is 100% for DBAs, but the query was cut from hours or days to milliseconds.

The appropriate indexing strategy is often a fairly complex balancing act between multiple competing needs. More indexing means more I/O and cost on writes, but faster reads. RavenDB has a query optimizer engine that will analyze your queries and generate the appropriate set of indexes on the fly, without you needing to think much about it. That means that RavenDB will continuously respond to your operational environment and changes in it. The end result is an optimal indexing strategy at all times.

This automatic behavior applies only to automatic indexes, however. RavenDB also allows you to define your own indexes and many customers run critical business logic in those indexes. RavenDB now has a feature that aims to help you manage/organize your indexes by detecting redundant definitions & unqueried indexes which can be removed or merged.

The index cleanup feature is now exposed in the Studio (since build 5.4.5):

image

When you select it, the Studio will show you the following options:

image

You can see that RavenDB detected that two indexes can be merged into a single one, and additionally there are some indexes that haven’t been used in a while or have been completely superseded by other indexes.

RavenDB will even go ahead and suggest the merged index for you:

image

The idea is to leverage RavenDB’s smarts so you won’t have to spend too much time thinking about index optimization and can focus on the real value-added portions of your system.

time to read 3 min | 457 words

A customer called us with a problem. They set up a production cluster successfully, they could manually verify that everything is working, except that it would fail when they try to connect to it via the client API.

The error in question looked something like this:

CertificateNameMismatchException: You are trying to contact host rvn-db-72 but the hostname must match one of the CN or SAN properties of the server certificate: CN=rvn-db-72, OU=UAT, OU=Computers, OU=Operations, OU=Jam, DC=example, DC=com, DNS Name=rvn-db-72.jam.example.com

That is… a really strange error. Because they were accessing the server using: rvn-db-72.jam.example.com, and that was the configured certificate for it. But for some reason the RavenDB client was trying to connect directly to rvn-db-72. It was able to connect to it, but failed on the hostname validation because the certificates didn’t match.

Initially, we suspected that there is some sort of a MITM or some network appliance that got in the way, but we finally figured out that we had the following sequence of events, shown in the image below. The RavenDB client was properly configured, but when it asked the server where the database is, the server would give the wrong URL, leading to this error.

image

This deserves some explanation. When we initialize the RavenDB client, one of the first things that the client does is query the cluster for the URLs where it can find the database it needs to work with. This is because the distribution of databases in a cluster doesn’t have to match the nodes in the cluster.

Consider this setup:

image

In this case, we have three nodes in the cluster, but the “Orders DB” is located only on two of them. If we query the rvn-db-72 database for the topology of “Orders DB”, we’ll get nodes rvn-db-73 and rvn-db-74. Here is what this will look like:

image

Now that we understand what is going on, what is the root cause of the problem?

A misconfigured server, basically. The PublicServerUrl for the server in question was left as the hostname, instead of the full domain name.

This configuration meant that the server would give the wrong URL to the client, which would then fail.

This is something that only the client API is doing, so the Studio behaved just fine, which made it harder to figure out what exactly is going on there. The actual fix is trivial, naturally, but figuring it out took too long. We’ll be adding an alert to detect and resolve misconfigurations like that in the future.

time to read 6 min | 1157 words

RavenDB has a really nice feature, it allows you to index data from related documents. Consider the following document structure:

image

We have tickets, vehicles, and users, and we want to issue a search on all the tickets issued to Joe. Leaving aside whether this is the proper way to handle this, here is what the index would look like:

What we are doing here is walk the reference graph and index data from related documents. So far, so good. The cool thing about this feature is that RavenDB is in charge of ensuring that if we update the owner of the vehicle or the name of the user, the Right Thing will happen.

Of course, I wouldn’t be writing this blog post if we didn’t run into a problem in this scenario.

The way it works, for each collection referenced by the index, RavenDB maintains a list of the last document that was chceked for changes in the collection. That way, on modification of a related document, we can tell that we need to re-index a particular document.

This looks something like this:

In other words, for each document that was loaded by another during indexing, we keep a list of the referencing documents.

Let’s say that we update document vehicles/200. That would be written to the storage with a new etag, and the index would wake up. It would ask to get all the documents in the Vehicles collection after etag 456, get vehicles/200 and then check the ReferencedBy and find that the document tickets/100 loaded it. At this point, it will re-index tickets/100 to ensure we have the latest values.

There is quite a bit more to this process, of course, I’m skipping on a lot of optimizations and detail work. For the purpose of this post, we don’t need any of that.

A customer reported that (very rarely), an index similar to the one above would “miss” on updates. That should not be possible. As much as I love this feature, conceptually, it is a very simple one, there isn’t much here that can fail. And yet, it did. Figuring out what was happening required us to look very deeply into the exact series of steps that were taken to produce this output. It turns out that our approach had a hole in it.

We assume that the writes would always happen in an orderly fashion. In other words, that the writes would be consistent. But there is no actual requirement for that.

Consider what happens if I write just the ticket document to the database:

  • RavenDB will index the ticket document
  • It will attempt to load the associated vehicle, figure out that there is no such document and move on
  • The related user document, of course, is not known at this point (since there is no vehicle document)

The end result is that we have the following data internally:

That is fine, when we’ll add the vehicle and the user, we’ll do the appropriate wiring, no?

In almost all cases, that is exactly what will happen. However, consider the metadata above. We are concerned here with tickets/100, but there is also tickets/20, whose references exist properly. So the structure we have right now in terms of reference tracking is:

image

It’s important to note that the references are always kept from the initial 'tickets' document. So even though the path from tickets/20 to users/99 goes through vehicles/19, the relationship is a direct association.

What will happen if we’ll insert just the users/300 document now? Well, there is no reference to this document, so we’ve no reason to do anything with it. But that isn’t a problem. When vehicles/200 is inserted, this will be fixed.

On the other hand, if we add just vehicles/200 to the database (with users/300 not being present), that is a change in a tracked document, which will cause us to index the referencing document (tickets/100) again and move us to this state:

image

When we will then add users/300, document tickets/100 will have the record of this reference and we’ll re-index it.

In other words, we are covered on both sides. Except, that there is still this pesky (and impossible) problem that the user is seeing.

Now, consider the following state of affairs, we are back in the initial state, both vehicles/200 and users/300 are missing in the database and tickets/20, vehicles/19 and users/99 are there.

We add vehicles/200 to the database, and there is a re-indexing process going on. At the same time that we re-index tickets/100 because of the new vehicles/200 document, we are adding the users/300 document in a separate transaction.

That means that during the indexing of tickers/100, we’ll see document vehicles/200 but not the users/300 document (even though it exists).

That is still not a problem, we’ll write the referencing record and on the next batch, detect that we have a user that we haven’t seen and re-index the document again.

Except… what if we didn’t update just the users/300 document in this case, what if we also updated users/99 at the same transaction (and after we insert document users/300).

Depending on the exact timings, we may end up missing document users/300 (because there was no reference to it at the time) but will notice that document users/99 was updated (we already had it referenced). Since users/99 was modified after users/300, we’ll record that we observed all the changes in the Users collection before users/99. That, crucially, also includes the users/300 that we never noticed.

This is confusing, I’ll freely admit. In order to reproduce this bug you need a non-standard pattern for creating references, a chain of at least two references, multiple independent references with different states, and an unlucky draw from Murphy with the exact timing of transactions, indexing and order of operations.

The root cause was that we recorded the newly added document reference in memory, and only updated them when the entire indexing batch was completed. During that time, there may have been multiple transactions that modified the documents. But because we didn’t sync the state until the end of the batch, we would end up missing this case. Solving the problem once we knew what was going on involved moving a single line of code from the outer loop to an inner one, basically.

Writing a reproducible test case was actually far harder, since so many things had to fall just so this would happen. I have to admit that I don’t have any strong conclusions about this bug. It isn’t something systematic or an issue that we missed. It is a sequence of unfortunate events with a very low probability of occurring that we  never actually considered.

The really good thing about this issue is that it is the first one in this particular area of the code in quite some time. That means that this has been quite stable for many scenarios.

time to read 4 min | 680 words

I mentioned earlier that B+Trees are a gnarly beast to implement properly. On the face of it, this is a really strange statement, because they are a pretty simple data structure. What is so complex about the implementation? You have a fixed size page, you add to it until it is full, then you split the page, and you are done. What’s the hassle?

Here is a simple scenario for page splits, the following page is completely full. We cannot fit another entry there:

image

Now, if we try to add another item to the tree, we’ll need to split the page, and the result will be something like this (we add an entry with a key: users/050):

image

How did we split the page? The code for that  is really simple:

As you can see, since the data is sorted, we can simply take the last half of the entries from the source, copy them to the new page and call it a day. This is simple, effective, and will usually work just fine. The key word here is usually.

Given a B+Tree that uses variable size keys, with a page size of 4KB and a maximum size of 1 KB for the keys. On the face of it, this looks like a pretty good setup. If we split the page, we can be sure that we’ll have enough space to accommodate any valid key, right? Well, just as long as the data distribution makes sense. It often does not. Let’s talk about a concrete scenario, shall we? We store in the B+Tree a list of public keys.

This looks like the image below, where we have a single page with 16 entries and 3,938 bytes in use, and 158 bytes that are free. Take a look at the data for a moment, and you’ll notice some interesting patterns.

image

The data is divided into two distinct types, EdDSA keys and RSA keys. Because they are prefixed with their type, all the EdDSA keys are first on the page, and the RSA keys are last. There is a big size difference between the two types of keys. And that turns out to be a real problem for us.

Consider what will happen when we want to insert a new key to this page. We still have room to a few more EdDSA keys, so that isn’t really that interesting, but what happens when we want to insert a new RSA key? There is not enough room here, so we split the page. Using the algorithm above, we get the following tree structure post split:

image

Remember, we need to add an RSA key, so we are now going to go to the bottom right page and try to add the value. But there is not enough room to add a bit more than 512 bytes to the page, is there?

What happens next depends on the exact implementation. It is possible that you’ll get an error, or another split, or the tree will attempt to proceed and do something completely bizarre.

The key here (pun intended) is that even though the situation looks relatively simple, a perfectly reasonable choice can hide a pretty subtle bug for a very long time. It is only when you hit the exact problematic circumstances that you’ll run into problems.

This has been a fairly simple problem, but there are many such edge cases that may be hiding in the weeds of B+Tree implementations. that is one of the reasons that working with production data is such a big issue. Real world data is messy, it has unpredictable patterns and stuff that you’ll likely never think of. It is also the best way I have found to smoke out those details.

time to read 5 min | 861 words

I love B+Trees, but they can be gnarly beasts, with the number of edge cases that you can run into. Today’s story is about a known difficult place, page splitting in the tree. Consider the following B+Tree, showing a three-level tree with 3 elements on each page.

image

Consider what will happen when we want to insert a new value to the tree, the value: 27. Given the current state of the tree, that should go on the page marked in red:

image

But there is no place for the new value on this page, so we have to split it. The tree will then look like so, we split the page and now we need to add the new page to the parent, but that one also doesn’t have room for it:

image

So we are now in a multi-level split process. Let’s see what this looks like when we go up the tree. This is the final state of the tree when we are done doing all the splits:

image

The reason for all of this is that we need to add 27 to the tree, and we haven’t done that yet. At this stage, we got the tree back in order and we can safely add the new value to the tree, since we made sure we have enough space.

However, note that the exact same process would apply if we were adding 27 or 29. The page that we’ll add them to, however, is different.

This can be quite complex to keep track of, because of the recursive nature of the process. In code, this looks something like this:

I am skipping on some details, but that is the gist of it. So we do the split (recursively if needed) and then after we wired the parent page properly, we find the right location for the new value.

An important aspect here is the cursor. We use that to mark our current location in the tree, so the cursor will always contain all the parent pages that we are currently searching upon. A lot of the work that we are doing in the tree is related to the cursor.

Now, look at the code and consider the behavior of this code when we insert the value 29. It will correctly generate this page:

image

However.. what happens if we’ll insert 27?

Well, when we split the page, we went up the tree. And then we had another split, and then we went down another branch. So as written, the result would be adding the 27 to the same page as we would the 29. This would look like this:

image

Look at the red markers. We put entry 27 on the wrong page.

Fixing this issue is actually pretty hard, because we need to keep track of the values as we go up and down the tree. For fun, imagine what happens in this exact scenario, but when you have 6 levels in the tree and you end up in a completely different location in the tree.

I spent a lot of time struggling with this issue, including getting help from some pretty talented people, and the final conclusion we got was “it’s complicated”.

I don’t want complications here, I need it to be as simple as possible, otherwise, we can’t make any sort of sense here. I kept spinning more and more complex systems to resolve this, when I realized that I just looked at the problem in the wrong manner all along.

The issue was that I was trying to add the new value to the tree after I sorted out the structure of the tree, but there was actually nothing that forced me to do that. Given that I already split the page at this stage, I know that I have sufficient space to add the key without doing anything else.  I can first add the key to the right page, then write the split page back to the tree. In this case, I don’t need to do any sort of backtracking or state management .

Here is what this looks like:

And with this change, the entire class of problems related to the tree structure just went away.

I’m very happy with this result, even if it is a bit ironic. Like the problem at hand, a lot of the complexity was there because I had to backtrack the implementation decisions and go on a new path to solve this.

Also, I just checked, the portion that controls page splits inside Voron has had roughly 1 change a year for the past 5 years. Given our scope and usage, that means that it has been incredibly stable in the face of everything that we could throw at it.

time to read 3 min | 559 words

There is a great article discussing how SQLite is handling transactions at fly.io. Which led me to the great documentation on the WAL mode for SQLite. And that led me to think about the differences between how SQLite does it and how Voron does it.

Both SQLite and Voron share asame behavior, they use Copy on Write and make the modifications for the pages in the database on a copy of the data. That means that readers can continue to operate with zero overhead, reading the stable snapshot of their data.

However, SQLite works by copying the data to the WAL file directly and modifying it there. Voron doesn’t use this approach. Instead, we have the notion of scratch space where this is done. Look at the figure below, which showcase the difference between the databases:

image

In SQLite, any modifications are written to the WAL file and operated on there. When you want to commit a transaction in SQLite, you’ll compute the checksum of all the pages modified in the transaction and write a commit record to the disk, at which point you’ll need to issue an fsync() call.

Voron, on the other hand, will copy the data that is modified in the transaction into scratch space (essentially, just some memory we allocated). On commit, it will not write the data to the WAL. Instead, it will take the following actions:

  • Compute a diff of the current state of the page compared to its initial state, writing only the modifications.
  • Compress the resulting output.
  • Compute a checksum of all the pages that were modified.
  • Write the compressed output and the checksum as a single write call.

Voron opens the file with O_DIRECT | O_DSYNC, the idea is that we don’t need to call fsync() at any stage, and we significantly reduce the number of system calls we have to make to commit a transaction.

Other transactions, at the same time, will use the data in the scratch space, not the WAL, to read the current state of pages. Voron also supports MVCC, so you may have multiple copies of the data in memory at once (for different transactions).

Voron is able to significantly reduce the total amount of I/O we have to use for writes, because we only write the changes in the data between page versions and even that is compressed. We typically can safely trade off the additional CPU work in favor of I/O costs and still come up ahead.

Another reason we went with this route is that we use memory mapped files, and on Windows, those aren’t coherent with file I/O calls. That means that mixing reading via mmap() and writing via file I/O (which is what we want to do to avoid fsync() calls) wouldn’t really work. Voron also benefits from not having to deal with multiple processes running at the same time, since it is typically deployed from within a single process.

Finally, the fact that we use scratch spaces separately from the WAL means that we put that somewhere else. You can have a fast empheral disk (common on the cloud) for scratch files, very fast (but small) disk for the WAL journal and standard disk for the actual data. This sort of configuration gives you more choices on how to deal with the physical layout of your data.

time to read 6 min | 1086 words

I love trees. Not the ones that produce oxygen, I mean, I guess they are nice too, but I’m talking about trees in software. To be rather more exact, I’m fascinated by B+Trees and all their permutations. In a very real sense, most databases are all about B+Trees. To give you some context, Voron alone, at this time, has the following:

  • Tree – the core used to map an arbitrary byte string key to an arbitrary sized value. This supports many operations, including compressed values, overflow pages, etc. Typically used to implement storage indexes over non-numeric data.
  • Fixed Size Tree – a more compact implementation when more data is known, with int64 key and a fixed size value (which may be zero). Typically used to implement storage indexes over numeric data.
  • Multi Tree – a tree whose keys are arbitrary byte string keys and whose value is itself a tree (with no value) that can contain multiple values for a single entry. Used to implement non-unique storage indexes.
  • Compact Tree – a tree for arbitrary byte string keys to an int64 value. The keys are compressed and we can pack a lot of values in a single page. Used to handle the field terms in Corax.
  • Set – a tree used to efficiently store int64 values in an ordered manner without duplicates. The data is heavily compressed and is used to handle large posting lists in Corax.

And I’m probably forgetting a few. Those are all various permutations on a B+Tree. And together they create a really interesting set of capabilities for Voron. Today I want to talk to you about an issue we ran into when building Corax, in particular, this one:

image

Before we dive into exactly what is going on here, I feel that I need to give some background. Corax is an indexing engine, and at its core is an inverted index. When we need to delete an entry, we need to remove references to it from all the terms that it is a part of. The code producing the above is implemented roughly like this:

That particular piece accounts for almost 80% of the indexing commit time (and a significant fraction of the total indexing time). That is… just the way it is, I’m afraid. We have to process things in this manner. Lucene does things differently, it will mark the entry as deleted (and then ignore it afterward) eventually merging the results away. I really didn’t want to go in that route. One of the key design principles for Corax is that we want to avoid deferred payments, we want to pay everything in “cash” right now.

If we look deeper into the DeleteCommit() function, we can see some interesting details:

image

The DeleteIdFromExactTerm() is the piece that is doing the majority of the work, accounting for over 60% of the runtime for processing deletes. Let’s dig deeper here:

image

And we can see that this is actually really fast. The total cost for invoking this method is 5.5 μs. The problem is that we invoke it 4 million times.

And looking into the actual costs, what is the issue? It is the TryGetValue() and FindPageFor() calls, those scan through the tree structure to find the relevant place from which to remove the value.

As a reminder, here is a tree structure, (it isn’t a B+Tree, mind, because we don’t need that to understand what is going on). What we have is a lot of calls to Remove() on this tree, and we always need to start searching from the root. It turns out that this is quite expensive:

image

Consider what happens if I want to remove 15 and 20. I have to compare 25, 14, 19, 16 and15. Then I have to go back to the root and compare 25, 14, 19, 22 and 20. There are a lot of commonalities here that I can take advantage of. If I can operate locally, I can probably do better, no?

We rewrote the delete code to be more like this:

The idea is that we do the deletion of terms in two stages. First, we gather all the ids we need to delete for all the terms from all the entries that are being deleted. Then we sort those values, and then we invoke a batch delete method.

Unlike the individual RemoveValue() calls, we can now take advantage of the structure of the tree. In the case of wanting to remove [15,20], we can scan the tree (25,14, 19, 16, 15)  to get to the first item that we remove. Then we proceed using the tree’s own structure. So deleting 20 means comparing (16, 19, 22, 20). In this case, we saved one operation, which isn’t that meaningful. But B+Tree’s most beautiful property is that they are dense. In this case, we are removing values from posting lists, which may contain millions of entries, and it isn’t uncommon to be able to pack thousands of entries per page. That means that the savings are huge.

This particular optimization was sufficient to ensure that we just don’t need to care about deletes any longer.

image

The ability to apply the operation in a batch fashion meant that we could amortize all of the setup costs, which lead to a far greater reaction than I actually expected, to be honest.

The code in question can make a lot of assumptions about what is going on (the input is sorted, values are present in the tree, etc). That makes things so much easier.

As a result, Corax can keep paying all costs upfront and still be so much faster than Lucene (who defers things to the background). To give some context, here are the relevant costs for Lucene:

image

Lucene spends more time processing deletes than Corax does for the whole commit process Smile. Note that this is just a small part of the costs that deletes have for Lucene, there is an excellent blog post describing those from Elastic. The costs range between 20% – 46% slowdown in queries! That is an important reason to avoid deferring work, I think we can agree.

time to read 4 min | 698 words

I read this blog post from Discord, which presents a really interesting approach to a problem that they ran into. Basically, in the cloud you have the choice between fast and ephemeral disks and persistent (and much slower) disks.

Technically speaking you can try to get really fast persistent disks, but those are freaking expensive. Ultra disks on Azure and io2 disks on AWS and Extreme persistent disks for GCP. The cost for a single 1TB disk with 25K IOPS would be around 1,800 USD / month, for example. That is just for the disk!

To give some context, an r6gd.16xlarge instance with 64 cores and 512 GB (!) of RAM as well as 3.8 TB of NVMe drive will cost you less than that. That machine can also do 80K IOPS, for context.

At Discord scale, they ran into the limitations of I/O in the cloud with their database and had to come up with a solution for that.

My approach would be to… not do anything, to be honest. They are using a database (ScyllaDB) that is meant to run on commodity hardware. That means that losing a node is an expected scenario. The rest of the nodes in the cluster will be able to pitch in and recover the data when the node comes back or is replaced. That looks like the perfect scenario for running with fast ephemeral disks, no?

There are two problems with ephemeral disks. First, they are ephemeral Smile, wich means that a hardware problem can make the whole disk go poof. Not an issue, that is why you are using a replicated database, no? We’ll get to that.

Another issue that is raised by Discord is that they rely on disk snapshots for backups and other important workflows. The nice thing about snapshotting a cloud disk is that it is typically fast and cheap to do. Otherwise you need to deal with higher level backup systems at the database layer. The issue is that you probably do want that, because restoring a system from a set of disk snapshots that were taken at different times and at various states is likely to be… challenging.

Discord's solution to this is to create a set of ephemeral disks that are mirrored into persistent disks. Reads are going to the fast disks and writes are spread around. A failure on the ephemeral disks will lead to a recovery of the data from the network disk.

Their post has the gory details and a lot more explanation aside. I wanted to write this post for a simple reason. As a database guy, this system scares me to pieces.

The issue is that I don’t know what kind of data consistency we can expect in such an environment. Do I get any guarantees about the equivalence of data between the fast disks and the network disk? Can they diverge from one another? What happens when you have partial errors?

Consider a transaction that modifies three different pages that end up going to three different fast disks + the network disk. If one of the fast disks has an error during write, what is written to the persistent disk?

Can I get an out-of-date read from this system if we read from the network disk for some reason? That may mean that two pages that were written in one transaction are coming back as different versions. That will likely violate assumptions and invariants and can lead to all sorts of… interesting problems.

Given that this system is meant to handle the failure modes, it sounds really scary because it is an additional layer of complexity (and one that the database is unaware of) to deal with.

Aside from the snapshots, I assume that the reason for this behavior is to avoid the cost of rebuilding a node when there is a failure. I don’t have enough information to say what is the failure rate and the impact on the overall system, but the solution provided is elegant, beautiful, and quite frankly, pretty scary to me.

There have been quite a few unknowns that we had to deal with in the realm of storage. But this adds a whole new layer of things that can break.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Recording (6):
    17 Nov 2022 - RavenDB in a Distributed Cloud Environment
  2. RavenDB Indexing (2):
    20 Oct 2022 - exact()
  3. Production postmortem (45):
    03 Oct 2022 - Do you trust this server?
  4. Webinar recording (15):
    26 Aug 2022 - Modeling Relationships and Hierarchies in a Document Database
  5. re (32):
    16 Aug 2022 - How Discord supercharges network disks for extreme low latency
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats