Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 580 words

rodentia-icons_fsguard-plugin-urgent-300pxI talked about the details of Voron in the previous posts, how it handles journaling, MVCC and cleaning up after itself. In this post, I want to focus on another aspect that needs to be considered, the various costs of running  Voron on production systems. In particular, the competing I/O requirements.

So what do we have with Voron?

  • A (potentially very large) memory mapped data file. Buffered writes and fsync once every 1 minute / 2GB.
  • Scratch files (small memory mapped files) marked as temporary and delete on close.
  • Journal files requiring durable writes.

In terms of priorities, we want to give high priority to the journal files, then to writing to the data file (so it will happen all the time, not just when we call fsync). Scratch files should only be written to disk under memory pressure, and we should strive to avoid that if possible.

On both Windows and Linux, there are ways to ask the system to start flushing the data to disk (Windows uses FlushViewOfFile, Linux uses sync_file_range), but in practice, when we flush the data to disk we need to also ensure durability, so we call FlushViewOfFile + FlushFileBuffers on Windows and msync(MS_SYNC) on Linux to ensure that. Technically speaking, we could do this in two stages, allowing the system some time to do this lazily, then calling FlushFileBuffers / fsync, but we haven’t found that to be advantageous in terms of complexity, and sync_file_range documentation is scary.

Another aspect that we need to consider is the fact that we are not along out there. A typical RavenDB database will have multiple Voron instances running, and a typical RavenDB server will have multiple RavenDB databases running. So we are talking about typically having dozens or more Voron instances in a single process. We need to avoid a conflict between all of those instance, each of which is trying to make use of all the system resources by itself. This kind of disharmony can kill the performance of the server, all the while giving the best performance in any benchmark where you are running a single instance.

We solved this by having a single actor responsible for scheduling the flushing of all the Voron instances inside a process. It accept flush requests and make sure that we aren’t loading the I/O system too much. This means that we might actually defer flushing to disk under load, but in practice, reducing the I/O competition is going to improve throughput anyway, so that is likely to be better in the end. At the same time, we want to take advantage of the parallelism inherit in many high end systems (RAID, cloud, etc) which can handle a lot of IOPS at the same time. So the policy is to give a certain number of Voron instance the chance to run in parallel, with adjustments depending on the current I/O load on the system.

Journal writes, however, happen immediately, have high priority and should take precedent over data file writes, because they have immediate impact on the system.

We are also experimenting with using the operation system I/O priorities, but that is a bit hard, because most of those are about reducing the I/O priorities. Which we sort of want, but not that much.

time to read 4 min | 751 words

imagebot

In the previous post, I talked about the usage of scratch files to enable MVCC and the challenges that this entails. In this post, I want to talk about the role the transaction journal files play in all of this. I talked a lot about how to ensure that transaction journals are fast, what goes into them, etc. But this post is how  they are used inside Voron.

The way Voron stores data inside the transaction journal is actually quite simple. We have a transaction header, which contains quite a bit of interesting information, and then we have all the pages that were modified in this transaction, compressed.

image

The fact that we are compressing pages can save on a lot of the amount of I/O we write. But the key aspect here is that a transaction is considered committed by the Voron when we complete the write of the entire thing to stable storage. See the post above to a complete discussion on why it matters and how to do this quickly and with the least amount of pain.

Typically, the transaction journal is only used during recovery, so it is write only. We let the journal files to grow to about 64MB in size, then we create new ones. During database startup, we check what is the last journal file and journal file position that we have synced (more on that later), and we start reading from there. We read the transaction header and compare its hash to the hash of the compressed data. If they match (as well as a bunch of other checks we do), then we consider this to be a valid commit, and then we decompress the data into a temporary buffer and we have all the dirty pages that were written in that transaction.

We can then just copy them to the appropriate location in the data file. We continue doing so until we hit the end of the last file or we hit a transaction which is invalid or empty. At that point we stop, consider this the end of the valid committed transactions, and complete recovery.

Note that at this point, we have written a lot of stuff to the data file, but we have flushed it. The reason is that flushing is incredibly expensive, especially during data recovery where we might be re-playing a lot of data. So we skip it.  Instead, we rely on the normal flushing process to do this for us. By default, this will happen within 1 minute of the database starting up, in the background, so it will reduce the interruption for regular operations. This gives us a very fast startup time. And our in memory state let us know where is the next place we need to flush from the log, so we don’t do the same work twice.

However, that does mean that if we fail midway through, there is absolutely no change in behavior. In recovery, we’ll write the same information to the same place, so replaying the journal file become idempotent operation that can fail and recover without a lot of complexity.

We do need to clear the journal files at some point, and this process happens after we synced the data file. At that point, we know that the data is safely stored in the data file, and we can update our persistent state on where we need to start applying recovery the next time. Once those two actions are done, we can delete the old (and now unused) journal files. Note that at each part of the operation, the failure mode is to simply retry the idempotent operation (copying the pages from the journal to the data file), and there is no need for complex recovery logic.

During normal operation, we’ll clear the journal files once it has been confirmed that all the data it has was successfully flushed to the disk and that this action has been successfully recorded in stable storage. So in practice, database restarts after recovery are typically very fast, only needing to reply the last few transactions before we are ready for business again.

time to read 5 min | 1000 words

In my previosus post, I talked about how Voron achieves MVCC. Instead of modifying data in place, we copy the page or pages we want to modify to a scratch buffer and modify that. When the write transaction completes, we are updating a Page Translation Table so any reference to the pages that were modified would go to the right place in the scratch file.

Note, Voron uses mmap files as scratch buffers. I use the term scratch buffer / scratch file to refer to the same thing.

That is all well and good, and if you are familiar with how virtual memory works, this is exactly the model. In effect, every transaction get a snapshot of the entire database as it was when it was opened. Read transactions don’t modify the data, and are ensured to have a stable snapshot of the database. The write transaction can modify the database freely, without worrying about locking or stepping over other transactions.

This is all pretty simple, and the sole cost that we have when committing the transaction is flushing all the dirty pages to disk, and then making an atomic pointer swap to update the Page Translation Table.

However, that is only part of the job, if all the data modifications happens on the scratch buffer, what is going on with the scratch files?

Voron has a background process that monitor the database activity, and based on certain policy (size, time, load factor, etc) it will routinely write the data from the scratch files to the data file. This is a bit of an involved process, because we can’t just do this blindly.

Instead, we start by seeing what is the oldest active transaction that is currently operating. We need to find that out to make sure that we aren’t writing any page that this transaction might visit (thus violating the snapshot isolation of the transaction). Once we have the oldest transaction, we gather all the pages from the Page Translation Table that came from older transactions and write them to the data file. There are a couple of tricks that we use here. It is very frequent for the same page to be modified multiple times (maybe we updated the record several times in different transactions), so we’ll have multiple copies of it. But we don’t actually need to copy all of them, we just need to copy the latest version (up to the oldest active transaction).

The process of copying all the data from the scratch file to the data file can happen concurrently with both read and write transactions. After the flush, we need to update the PTT again (so we open a very short write transactions to do that), and we are done. All the pages that we have copied from the scratch buffer are marked as free and are available for future transactions to use. 

Note, however, that we haven’t called fsync on the data file yet. So even though we wrote to the data file, we made a buffered write, which is awesome for performance, but not so much for safety. This is done intentionally, for performance reasons. In my next post, I’ll talk about recovery and safety at length, so I’ll just mention that we fsync the data file once a minute or one once every 2GB or so. The idea is that we give the OS the time to do the actual flush on the background, before we just in and demand that this will happen.

Another problem that we have with the scratch buffer is that, like any memory allocation routine, it has issues. In particular, it has to deal with fragmentation. We use power of two allocator to reduce fragmentation as much as possible, but certain workloads can fragment the memory in such a way that it is hard / impossible to deal with it. In order to deal with that issue, we keep track on not just the free sections in the scratch buffer, but also on the total amount of used memory. If a request cannot be satisfied by the scratch buffer because of fragmentation, but there is enough free space available, we’ll create a new scratch file and use that as our new scratch. The old one will eventually be freed when all read transactions are over and all the data has been flushed away.

Scratch files are marked as temporary and delete of close, so we don’t actually incur a high I/O cost when we create new ones, and it typically only when we have very high workload of both reads and writes that we see the need to create new scratch files.This tend to be drastically cheaper than trying to do compaction, and it actually work in all cases, while compaction can fail in many cases.

You might have noticed an issue with the whole system. We can only move pages from the scratch file to the data file if it was modified by a transaction that is older than the oldest current transaction. That means that a long running read transaction can stall the entire process. This typically is only a problem when we are seeing very high write usage as well as very long read transactions, which pushes the envelope on the size of the scratch buffer but at the same time doesn’t allow to clean it.

Indeed, using Voron, you are typically aware on the need to close transactions in a reasonable timeframe. Within RavenDB, there are very few places where a transaction can span a long time (streaming is pretty much the only case in which we’ll allow it, and it is documented that if you have a very long streaming request, that push memory usage on the server up because we can’t clean the transaction). In practice, even transactions that takes multiple minutes are fine under moderate write load, because there is enough capacity to handle it.

time to read 5 min | 804 words

I talked about the different aspects of building a database engine in detail in the past month or so. But I tried to talk about each topic independently, so it will make sense. The problem is that in the real world, there are actually quite a lot of related stuff that impact on one another. This series of posts is meant to tie everything together, so you’ll have a better understanding how the design decisions in one place being affected by the requirement in somewhere that seems utterly unrelated.

Before we can talk about the implementation details, let us see what we are trying to achieve. Voron is:

  • High performance.
  • Single write, multiple readers (MVCC)
  • Fully ACID

In this post, I’m not going to talk about the data model, or how we sort it, or anything like that. No, we are at a much lower level than that. We are at how we access the raw data pages and manage them.

There are actually multiple players involved here. We have the journal for durability of writes, we have the data file to store the data, the scratch file to implement Multi Versioning Concurrency Control and the Page Translation Tables to provide snapshot isolation for concurrent transactions.

The  design of Voron is immensely simplified by the fact that we choose to go with a single writer model. We share this design decision with other databases engines such as LMDB, LevelDB, RocksDB, etc. Concurrent write transactions are much more complex and require a lot more effort, and you still have the serialization at the journal level, although I explored multiple ways around it. With Voron, we decided to go with a single write transaction for the simplicity, and then implemented transaction merging on top of that, which give us a tremendous performance boost in high load scenarios.

But let us talk about MVCC. The idea is that we have concurrent versions of the data, so each transaction has a snapshot of the entire database and can operate on that without fear of write transactions modifying data while it is running. Let us explore how this works when the database starts.

The key to that is the notion of the page translation table, from now on, known as the PTT. When the database starts, we have an empty PTT, and the data file itself. We open a read transaction, which has the following data:

ReadTx-1:

  • PTT: [ /* no entries */
  • Data file

Whenever the read transaction need to read a page, it consults the PTT, find that there is nothing there, and read the page from the data file. We keep the read transaction open, and open a new write transaction. It also gets a PTT and the data file, but it also needs to keep track of a few other things:

WriteTx-2:

  • PTT: [/* no entries */]
  • Data file
  • Dirty pages

Now, we want to make a change to the database, which happens to fall on Page #3. Here we have problem, we can’t modify the data file directly, ReadTx-1 is still running, and it might want to read the data in Page #3 at some point. Instead of modifying the data directly, we copy the page into the scratch file.

The scratch file is just a temporary file that we use to store data copies. After we copy the data, we update the PTT. Now when we search for Page #3, we’ll find the location of the page in the scratch file. As far as the write transaction is concerned, this doesn’t matter. A page is a page is a page, and it doesn’t matter where it is at.

Committing the transaction means taking all of the dirty pages in the write transaction and writing them to the log. After which we atomically set the PTT for the write transaction as the global PTT. Now, all future read transactions will have the new PTT and when they will ask for Page #3, they will get the page from the scratch file.

A new write transaction that needs to (again) modify Page #3, will create another copy of the Page inside the scratch file.This ends up looking like this:

image

We have three copies of Page #3. One for the original read transaction (in the data file), one for the current read transactions (yellow in the scratch file) and the current modified page (orange in the scratch file) that we are writing to.

When the write transaction completes, we again flush the dirty pages to the journal and then publish our PTT so all future transactions can see the changes.

Of course, that is just one side of it, in my next post, I’ll discuss how we clear the scratch file and move data back to the data file.

time to read 3 min | 522 words

I mentioned that maintaining physical ids is important for performance reasons in my previous post, but I skipped on exactly why. The short answer is that if I have a physical ids, it is much easier to implement locality and much easier to implement parallel locality.

Let us imagine a database whose size is about 100GB, running on a machine that has 6 GB of RAM. You need to do run some sort of computation that traverse the graph, but doing so naively will likely cause us to trash quite a lot, as we page memory in and out of the disk, only to jump far away in the graph, paging even more, and effectively killing all your performance.

Instead, we can do something like this, let us imagine that you have a machine with 4 cores on it, and the previous mention setup. And then you start 4 threads (each marked with a different color on the image, and start processing nodes.

image

However, there is a trick here, each thread has a queue, and only ids that fall without the area of responsibility of the thread will arrive there. But we aren’t done, inside a thread we define additional regions, and route requests to process each region into each own queue.

Finally, within each thread, we process one region at a time. So the idea is that while we are running over a region, we may produce work that will need to run on other regions (or even other threads), but we don’t care, we queue that work and continue emptying the work that exists on our own region. Only once once we have completed all work in a particular region will we move to the next one. The whole task complete when, in all threads, there are no more regions with work to be done.

Note that the idea here is that each thread is working on one region at a time, and that region maps to a section of the database file that was memory mapped. So we keep that are of the page cache alive and well.

When we move between regions, we can hint to the memory manager that we are going to need the next region, etc. We can’t escape the need to process the same region multiple times, because processing in one region may lead us to processing in another, and then back, but assuming we run the regions using least recently accessed, we can take advantage on the stuff remaining in the page cache (hopefully) from the previous run and using that.

Which is why the physical location on disk is important.

Note that the actual query that we run is less important. Typical graph queries are fall into one of two categories:

  • Some sort of Breadth First Search or Depth First Search and walking through the graph. 
  • Finding a sub-graph in the larger graph that matches this criteria.

In both cases, we can process such queries using the aforementioned process, and the reduction in random work that the database has to do is big.

time to read 4 min | 647 words

A few posts ago, I talked about the problem of having unstable ids, in particular, ids that can be reused. That leads to quite a lot of complexity, as anyone who ever had to deal with Lucene documents ids knows.

So we are willing to pay something toward stable ids, the questions is what?

One way of doing that is to just store the physical id (unstable) and a virtual id (stable) in a B+Tree (actually, a pair of them, since you’ll need to refer to them back and forth). That means that for the most part, internally to the engine, we’ll use the physical id (with its nice property of having O(1) access time), but externally we’ll expose the stable virtual id (probably sequential numbering, since that is easiest).

Note that I still want to use the physical ids, I’ll discuss exactly why that is important in my next post, for now, let us just say that it is an important component of ensuring high performance for large datasets.

The problem with using B+Tree is that the cost of finding the virtual <—> physical id mapping is O(logN), which for 10 million nodes and 100 million edges is 23 & 24 respectively. Except that this isn’t the real cost function for B+Tree.

Assuming that we have 255 items per page, we actually would need to do 4 page lookups, and a total of 54 comparisons to find the right value. For the edges, we would need 5 page look ups and over 60 comparisons.  Note that this isn’t an issue on its own, but it is an issue when we are talking about having this kind of cost in the hot path of the application. And this is very likely going to be in the hot path.

Oh, there are ways around it, we can only translate back and forth at the edges of the database, so internally we’ll always use the physical address, and only translate it out when we are done. But that is hard to actually do properly, since you need the virtual address for a whole lot of stuff all over the place.

We can steal the idea of page translation tables from the processor. Something like this:

image

Effectively, we’ll lazy allocate segments of pages and pull them together into a hierarchy. So finding out the physical address of id 84 would involve looking at the root, finding the next page down with mod operation, and so forth until we find the right value and check there. This has the advantage of being simple, O(1) and obvious. It is also pretty good in terms of space saving, since the virtual id can be “stored” without taking any space (it is the position of the physical id in the “array” we created.

This has one drawback, there is no way to recover space. Because the indexer into this data structure is meaningful, we can’t just compact things. Once space is allocated, that is it.  Now, to be fair, the cost in size here for all 100 million edges is about 0.75 GB, so not meaningful in the long run, but if we have a busy database that always write and delete, we have no way to recover the space.

The “proper” answer, by the way, is to implement an external hash table. That has the property of O(1), can grow and shrink as the amount of data changes. I’m not presenting it here mostly because it is something that we haven’t yet had the need to implement in Voron, so it isn’t something that I can just show and move on. Beside, it is fun to explore all the wrong ways of doing something.

time to read 2 min | 247 words

I talked about high level and low level data operations. So far, all we have imageseen are very low level operations (get node, get edges for, etc).

Let us see how we’ll deal with a bigger challenge. In this case, we want to implement a classic graph operation, doing a depth first search, filtering by both nodes and edges.

Here is how we can implement this:

In the real world, we’ll need quite a bit more. On each node (and edge) we’ll need to decide if to return it from the query, or just traverse through it, etc. And that is just to start with.

But I think this demonstrate the point of how to layer behavior on top of the lower level database operations.

There is one thing that we need to talk about still, this code will actually use a lot of individual transactions, one for each independent operation. That is quite expensive, we can open a single transaction and pass it to the functions we call, so there is just a single cost for the entire duration of the operation.

Other things we can do is to explicitly designate specific scenarios as important and change the design so we can answer them very quickly (as in the O(1) cost for accessing nodes/edge data).

time to read 2 min | 396 words

image

I  keep calling this a toy database, and it is true for more reasons than the code existing mostly as unconnected snippets. When defining the storage layer, I breezed through quite a lot of stuff because they didn’t really matter for the point I was trying to make.

We’ll start with talking about node IDs. When we save a node to the database, we get an int64 ID back. What is that? We know that it gives us an O(1) access time to the node (or the edge data), but that’s about it. Typically, we don’t expose the internal table ID in this manner. Because the ID we get back from the Insert corresponds to the location on the disk where that data exists. So the node ID is something that is arbitrary and doesn’t correlate to anything. It isn’t a sequential number, or something that the user defines, or anything like that.

That can lead to issues. In particular, if you want to look up a node by some property, you need to have a way to do so that will retrieve its ID, and only then you can do graph operations from there. The common solution is to use a Lucene index for searching by the properties and finding the root node(s) and proceed from there.

But what about deletes? Deleting a node is possible, and when you do that, the space that was reserved for that node will be freed, and can be reused, so you’ll have a different node with the same ID. This leads to some awkwardness (you can see that with the Lucene document IDs, which have the same problem, although for very different reasons).

Updates also pose a problem, because if you need to extend the size of the node, it might be relocated, which changes the ID. Deleting is a challenge, because you need to remove the deleted node ID from all the edges that reference it, instead of cleaning it up on the fly.

This leads us to either abstract the physical ID with a virtual one (and pay the O(logN) cost for resolving it) or find a way to deal with the above inside your API.

time to read 2 min | 368 words

image

So far we looked into how we can store the nodes and the edges, and we explored some interesting data structures inside Voron. Now, let’s see how we can traverse the graph.

So getting the node is pretty easy, and remember that to access the table by ID gives us an O(1) cost.

Now, what about finding all the edges from a node?

This is a lot heftier, but let’s try to break it into individual pieces. First, we find the tree holding all the edges of that particular type, then we access (by the ID of the source node) the fixed size tree that holds all the connected nodes and the edge data ID.

Finally, we access the edges data (if it exists) to get the data about the edge itself, after which, we return it all to the caller.

Unlike the previous method, here we can’t claim to have O(1) costs. Instead, our costs are composed of:

  1. O(logN) – where N is the number of edges types (typically, very few), to search for the right edges tree.
  2. O(logN) – where N is the number of source nodes (typically, very large, but the fixed size trees are very good at being fast, and they don’t have to do much).
  3. O(N) – where N is the number of connection between the source node and the destination node.

I’m excluding the cost of loading the edge data, since this is an O(1) operation and is directly relevant to the iteration over all nodes connected to the source node.

Ideally, we can find a way to turn the 2nd operation into an O(1) cost, but that should be more than good enough for the moment.

Now, this just gives us traversal of the nodes, but going from here to Breadth First Search / Depth First Search and doing them well is fairly trivial, although there are a few things that we'll need to talk about, but that would serve as a separate post.

time to read 3 min | 559 words

We are going to be using Voron to actually handle the storage of the data. Voron is a low level storage engine, which provide, among other things, fully ACID, MVCC and high performance.

For today, what'll look at are the following operations:

  • Add a node.
  • Add an edge between two nodes.
  • Traverse from a node to all its edges (cheaply)
  • Traverse from an edge to the other node (cheaply).

Here is the interface code that we will need to build:

I tried to make it as simple as it possible could be. I’m using NameValueCollection because it is a trivial property bag to serialize / deserialize without bringing any additional dependencies.

Let us focus on the initialization of this class. We need to create a StorageEnvironment in Voron, and setup some structures.

Voron has several different storage options for data, and in this case, we are using a table. A table (which is very much unlike a RDMBS table) it a way to store records that has some indexes, but in this case, we are doing something strange. We are defining a table that has no indexes (this is so strange that we need to explicitly allow it). This is useful because tables manages indexes for us automatically, but in this case we will use them strictly for storage. We’ll see why in just a bit.

 

The code here is a bit strange. Tables are actually meant to hold multiple values, and define indexes on this. So using them to store a single value is something that you wouldn’t normally do. Except that the id that we get back from the table has a very interesting property. It has O(1) cost of access. So given a node id, I can access it directly, regardless of how big my database is. That is a property that I want, because effectively random access of nodes is something that happens quite a lot and is highly desirable.

Now, let us see how we can connect two edges, shall we. This code ignores all error handling, missing nodes, etc. It is meant to explain the core stuff, this is a toy database, after all.

Here we continue doing strange stuff. We already know that we use the empty schema table to have an O(1) access to data. And we store the edge’s data there. But then we run into some new stuff. We create a B+Tree called “Edges_”+type, which hold all of the edges of a particular type. But the content of this B+Tree is not simple. Instead, it is using fixed size trees. Those are also B+Trees, but they have well known size both for keys (which must be long) and the value (which must be small < 256 bytes). Because they are very compact, we can pack quite a lot of data into them, and work with them efficiently.

The end result is that we are now storing the node data and access it at O(1) cost. We also store a B+Tree full of fixed size tree (whose name is the source node id) and whose keys are the destination nodes, and the values are the edge data.

Confusing yet? Not much different than Dictionary<SourceNodeId, Dictionary<DestinationNodeId, EdgeDataId>>. That is quite enough for today, tomorrow I’ll show how we can traverse the data, and what kind of costs are associated with it.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}