Ayende @ Rahien


Playing with the StackOverflow datasets : All the wrong ways

time to read 4 min | 603 words

As I mentioned, given a 45 GB file containing all the StackOverflow questions since 2008, I wanted to transform the flat, per-row entries into a format where each question is grouped together with all of its answers.

This turned out to be rather tricky to do. The file is sorted by the row id, but there are a few complications:

  • You can’t assume that all the answers to a question are located near the question itself.

Ideally, we could keep a dictionary of the questions and their answers until we have all the answers to a question, then write it all out. The problem here is that a question might be answered years after it was asked, which means that the answer id would place it very far from the question in the file, so we would have to keep the question in memory for a very long time. In practice, this happens quite frequently, and it plays havoc with trying to limit the memory utilization of this approach.

  • While you would expect that answers will always have a row id that is higher than their questions, that isn’t always the case.

Let us take this question, about Haskell monads, which has the following answer.

The answer id is: 2388

The question id is: 44965

That means that if you scan the file sequentially, you are going to find the answers before their questions, sometimes a lot before (I’m guessing that this happens when a question is edited, so a popular question might get edited long after it was asked & had answers).

The next thing I tried was to group all the questions together, then process them. Because I can’t do that in memory, I tried writing them to disk directly:

  • /questions/44965/q
  • /questions/44965/2388

The idea is that I can write them out to the file system, and then traverse the file system to gather all of them. The problem is that we are talking about over 32 million questions & answers, and that isn’t really going to work for us on most normal file systems.

So I tried writing it all to a zip file, which was successful, but resulted in a zip file that was 21GB in size and was very good at ensuring that nothing could open it before I ran out of patience.

Eventually I just gave up and used Voron to handle it. Here are the relevant sections:

(screenshot of the relevant code)

Remember that Voron stores everything in a B+Tree and allows ordered operations, so once this is done, all I need to do is traverse the tree, get all the records that have the same parent id, and voila, I have the grouping I wanted.
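To make that concrete, here is a minimal sketch of the key scheme, shown with a SortedDictionary instead of the actual Voron tree API; the Add helper and the sample values are illustrative only:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Zero-padded "questionId/answerId" keys keep a question and all of its answers
    // adjacent in an ordered store, no matter where they showed up in the source file.
    // The question itself gets an empty suffix, so it sorts before its answers.
    var tree = new SortedDictionary<string, string>(StringComparer.Ordinal);

    void Add(long questionId, long? answerId, string xml) =>
        tree[$"{questionId:D10}/{(answerId?.ToString("D10") ?? "")}"] = xml;

    Add(44965, null, "<question .../>");   // the Haskell monads question
    Add(44965, 2388, "<answer .../>");     // its answer, despite the lower row id

    // A single ordered scan now yields each question followed by all of its answers.
    foreach (var group in tree.GroupBy(kvp => kvp.Key.Substring(0, 10)))
        Console.WriteLine($"question {group.Key}: {group.Count()} entries");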

If I wanted to do it without the use of a database (or Voron), I would probably go about it in the following manner:

  • Read each entry from the source data, and write it to another file, noting its location.
  • Write “parent id, id, location” to a separate index file
  • Sort the index file.
  • Start reading from the index file as long as I have the same parent id, and merge those questions.

Why a separate file? Because XmlReader does buffering, so it is really hard to get the exact position in the file from just reading; by writing the entries out ourselves, we have the proper position to seek to.

Note that this will require a lot of random I/O, but that is pretty much unavoidable here.
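For completeness, here is a rough sketch of that approach. The file names and the row reading/writing helpers (ReadRows, WriteRow) are assumptions for illustration, not code I actually used:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    // Pass 1: copy each row to entries.dat, noting "parent id, id, position".
    var index = new List<(long ParentId, long Id, long Position)>();
    using (var output = File.Create("entries.dat"))
    {
        foreach (var row in ReadRows("Posts.xml"))        // hypothetical streaming reader
        {
            index.Add((row.ParentId ?? row.Id, row.Id, output.Position));
            WriteRow(output, row);                         // hypothetical raw writer
        }
    }

    // Pass 2: sort by (parent id, id), then read each group back with Seek.
    index.Sort();
    using (var input = File.OpenRead("entries.dat"))
    {
        foreach (var group in index.GroupBy(e => e.ParentId))
        {
            foreach (var entry in group)
            {
                input.Seek(entry.Position, SeekOrigin.Begin);  // the random I/O mentioned above
                // read the row at this position and merge it into the grouped question
            }
        }
    }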

RavenDB Retrospective: Unbounded result sets

time to read 4 min | 750 words

We spent some time recently looking into a lot of our old design decisions. Some of them make very little sense today (json vs. blittable is a good example), but made perfect sense at the time, and were essential to actually getting the product out.

Some of those design decisions, however, are still something that I very firmly believe in.  This series of posts is going to explore those decisions, their background and how they played out in the real world. So, without further ado, let us talk about unbounded result sets.

The design of RavenDB was heavily influenced by my experience as That NHibernate Guy (I got started with NHibernate over a decade ago, if you can believe that), where I saw the same types of errors repeated over and over again. I then read Release It!, and I suddenly discovered that I wasn’t alone in fighting those kinds of demons. When I designed RavenDB, I set out explicitly to prevent as many of those as I possibly could.

One of the major issues that I wanted to address was Unbounded Result Sets. Simply put, this is when you have:

SELECT * FROM OrderLines WHERE OrderID = 1555

And you don’t realize that this order has three million line items (or, even worse, that most of your orders have a few thousand line items, so you are generating a lot of load on the database, only to throw most of that data away).

In order to prevent this type of issue, RavenDB has the notion of mandatory page sizes.

  • On the client side, if you don’t specify a limit, we’ll implicitly add one (by default set to 128).
  • On the server side, there is a database wide maximum page size (by default set to 1024). The server will trim all page sizes to the max if they are larger.
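For illustration, this is roughly what it looks like from the client side; a minimal sketch that assumes an existing document store and an OrderLine class:

    using (var session = store.OpenSession())
    {
        // No Take() here: the client adds an implicit page size (128 by default),
        // so a runaway result set never comes over the wire by accident.
        var firstPage = session.Query<OrderLine>()
                               .Where(x => x.OrderId == "orders/1555")
                               .ToList();

        // Explicit paging: say exactly how much you want, and page through the data.
        var page = session.Query<OrderLine>()
                          .Where(x => x.OrderId == "orders/1555")
                          .Skip(0)
                          .Take(128)
                          .ToList();
    }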

I think that this is one of the more controversial decisions in the design of RavenDB, and one that got a lot of heated discussion. But I still think that this is a good idea, because I have seen what happens when you don’t do it. The arguments are mostly along the lines of “RavenDB should trust developers to know what they are doing”, and one particularly irate guy called me while I was out shopping to complain about how I broke the sacred contract of Linq with regards to “queries should return everything by default, even if that is ten billion results”. I pointed out that this is actually configurable, and that if he wanted to set the default to any size he wanted, he could do that, but apparently it is supposed to be a “shoot my own foot first, then think” kind of deal.

Even though I still think that this is a really good idea, we have added some features over the years to make it easy for people to access the entire dataset when they need it. Streaming has been around since 2.5 or so, giving you a dedicated API to stream unbounded results. Streams were built to make it efficient to process large sets of data, and they allow both client & server to process the data in parallel, instead of batching huge responses on the server, then consuming ridiculous amounts of memory on the client before giving you the full result set. Instead, you get each result as soon as it arrives from the server, and you can process it and pass it on.

In 4.0, we are going to change the behavior of the paging limits as follows:

  • If you don’t specify a limit, we’ll supply a limit clause of 25 items. If there are more than 25 items, we’ll throw an exception (unless you asked otherwise in the conventions).
  • If you supply a limit explicitly, it will work as expected and page through the data.

The idea is that we want to reduce surprises for users, and give them that experience to draw upon early on. Another thing that we’ll do is make sure that the operations folks can also change that, likely with an environment variable or something similar. If you need to modify the conventions on the fly, you usually don’t have time to deploy a new version, and immediate action is needed.

In this manner, we can help users avoid expensive requests to the server, and they can be explicit with what they need to do.

Debug & Operations as a feature: Tracking allocations costs

time to read 2 min | 310 words

One of the things that we have learned from supporting RavenDB in production is that, by default, everything is a black box into which you have exactly zero visibility. And in order to figure out what the problems are, you need to use expert tools (WinDbg or VMMap, for example) that are typically more focused on developers, and not usually available in production.

In RavenDB 4.0, we have started from the get go with the notion that everything we do must be exposed, tracked and monitored. Here are the results of the latest effort in that direction.

(screenshots of the new memory tracking debug endpoint)

There are several important things here. First, you can see that we are tracking the managed and unmanaged allocations that are happening in the system. More than that, we are now able to track down exactly which part of the system is responsible for that.

In the screenshots above, you can see that the UsageIpAndQuantity index has allocated about 65 MB of unmanaged memory, and that we have a few memory mapped files storing the data for index #3.

The idea is that we can now glance at this endpoint and tell very quickly what is going on. And this is something that can be done in production. In fact, it is something that we’ll expose in the studio so you can watch those values change over time.

We are also waiting for the CoreCLR to expose the managed allocations on a per-thread basis, which will give us even better metrics.

Stack arena memory allocation context

time to read 4 min | 705 words

Controlling memory allocations is something that any performant service needs to do. In RavenDB 4.0 we have taken that to heart, and most of our memory usage is managed manually, instead of relying on the GC.

Now, there are a lot of really good reasons to want to avoid manual memory management. We have decades of usage telling us that this is a hard problem, with some nasty issues associated with it. But when we are talking about performance differences of orders of magnitude, I’ll accept that burden.

The first thing that we did when making this decision was to see whether we could change the way we approach unmanaged memory management, to make it easier to avoid all of the common issues. We made several decisions:

  • All native allocations must go through a context.
  • A context is a unit of work that is single threaded.
  • A context is bounded in scope.
  • A context is reusable.

By placing those sorts of limits on the native memory usage, we managed to gain a lot of simplicity. To start with, because the context is single threaded, we don’t have to implement an allocation strategy that is thread safe. Instead, we can just allocate a bunch of memory in the context, and parcel out the memory from there. It looks like this:

(diagram of the arena with its next-allocation pointer)

The arena has allocated a certain size from the operating system. Allocating from that point is just a matter of bumping the next allocation pointer and handing that memory to the caller.

The nice thing about this is that we don’t have the possibility of memory leaks: when the context scope ends, we can reset the next allocation pointer to the start, and we have “freed” all that memory. And we can start using it again, without having to release it to the OS and get it back. There are a whole bunch of edge cases (what happens when the arena is completely full, etc.), but they are pretty obvious.

Not having to deal with multiple threads also presents an interesting optimization. Once we are done with a piece of memory, we can return it to the arena, but instead of maintaining free lists, etc., we do the following: if the returned memory is at the top of the arena (the last thing allocated), then we just bump the next allocation pointer back down, and the next allocation can reuse that memory. If that isn’t the case, we do nothing, and rely on the context scope to handle it.

That, in turn, means that if I’m careful to free in the reverse order of allocations, I can get, within a single scope, memory reuse very cheaply. That is important, as we’ll see in a bit.
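Here is a stripped-down sketch of that allocation strategy. This is not the actual RavenDB code; it uses Marshal.AllocHGlobal as a stand-in for the real native allocation and requires compiling with unsafe:

    using System;
    using System.Runtime.InteropServices;

    public unsafe sealed class ArenaSketch : IDisposable
    {
        private readonly byte* _start;
        private readonly int _size;
        private int _used;

        public ArenaSketch(int size)
        {
            _size = size;
            _start = (byte*)Marshal.AllocHGlobal(size);
        }

        public byte* Allocate(int size)
        {
            if (_used + size > _size)
                throw new OutOfMemoryException("arena full - the real code grows or chains arenas");
            var ptr = _start + _used;       // bump the next-allocation pointer, nothing more
            _used += size;
            return ptr;
        }

        public void Return(byte* ptr, int size)
        {
            // Only reclaim if this was the last thing allocated; otherwise do nothing
            // and let Reset() reclaim everything at the end of the scope.
            if (ptr + size == _start + _used)
                _used -= size;
        }

        public void Reset() => _used = 0;   // "free" everything, keep the memory for reuse

        public void Dispose() => Marshal.FreeHGlobal((IntPtr)_start);
    }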

The context is also scoped to a certain operation, such as processing a request or an index batch. Because of that, even if we need to grow the arena, the growth is limited to however many operations we have to do inside that scope. And once the scope is over, we aren’t disposing of the context and releasing the memory; instead, we pool the contexts so the next action will be able to use them. So a single OS level allocation is used across many actions, and the cost of freeing memory is just setting a pointer.

But some scopes need to do a lot of work; consider an index that is processing a big batch. In order to reduce the amount of memory that is being used, that is where the ability to free the top allocation comes in handy, because even inside the same scope, we can get memory reuse.

All in all, the allocation strategy was chosen to play to our strengths, and allows us to avoid doing multi-threaded work in the hot paths. It also means that we get quite a lot of memory reuse, both within a scoped context and across different scopes, which reduces the overall amount of memory that we need to allocate. And the really great thing is that it doesn’t come with a big pile of time spent in GC.

N+1 queries are hardly a feature

time to read 3 min | 531 words

I ran into this article, talking about N+1 issues from a “fresh” perspective. In the article, there is a quote by DHH that goes like this:

If you have N+1 query it means you’re executing one SQL query per element so if you have 50 emails in an inbox, that’d be 50 SQL calls, right? That sounds like a bug. Well in a Russian doll caching setup, it’s not a bug, it’s a feature. The beauty of those individual calls are that they’re individually cached, on their own timeline, and that they’re super simple.

Now, I have been involved with OR/Ms and databases for over a decade now, and I have spent large parts of my career working to resolve performance problems in database driven applications.

In a word, the idea that having a larger number of simpler queries is better is nonsense. In particular, it completely ignores the cost of going to the database. Sure, a more complex query may require the database to do additional work, and if you are using caching, then you won’t have the data in the cache in neat “cache entry per row” form. But in practice, this leads to applications doing hundreds of queries per page view, absolute reliance on the cache and tremendous cost at startup.

In RavenDB, most queries are so fast that when we measured, the network component was the largest cost we had to face.

Let us do the numbers, shall we? Let us assume that we are talking about the best case, we have the database machine and the web machine in the same datacenter. Cost of roundtrip in the same datacenter is 0.5 ms. Now, let us go back to the 50 emails example above, shall we? We need to send 50 queries to the database. We’ll assume that we have a perfect database that takes no time at all to answer. That is still 25 ms wasted just on roundtrip times.

And the problem is that this is usually a lot more than 50 queries per page when you adopt this kind of silliness. You typically see hundreds or thousands of them, and the database isn’t really able to answer you in no time, so expect to see much bigger delays in practice.

But wait, I hear you say, this is all about caching, you are ignoring that part. Well, no, I’m not. If you are using a distributed cache, the costs over the network are exactly the same. So this is only relevant if you are using a cache on the same server, but then you run into issues when different caches on different machines hold different information. Not to mention that you are now in the wonderful world of having to worry about cache invalidation strategies and aligning them with the business requirements.

And for fun, what happens when one of your cache nodes goes down? You got it, you just created a DoS attack on your own database.

On the other hand, you can actually create proper queries, get the data in as few roundtrips as possible and trust the database engine to do its work.
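To make the roundtrip math concrete, here is the shape of the two approaches in plain ADO.NET; the connection, the ids and the schema are assumptions for the example:

    using System.Data.SqlClient;

    // N+1: one query per email. With 50 emails and ~0.5ms per roundtrip, that is
    // ~25ms of pure network latency before the database does any work at all.
    foreach (var emailId in emailIds)
    {
        using var perEmail = new SqlCommand("SELECT * FROM Emails WHERE Id = @id", connection);
        perEmail.Parameters.AddWithValue("@id", emailId);
        using var reader = perEmail.ExecuteReader();
        // ... materialize a single email
    }

    // One roundtrip: ask for the whole inbox at once and let the database do its job.
    using var whole = new SqlCommand("SELECT * FROM Emails WHERE InboxId = @inbox", connection);
    whole.Parameters.AddWithValue("@inbox", inboxId);
    using var wholeReader = whole.ExecuteReader();
    // ... materialize all 50 emails from a single response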

Voron Internals: Reducing the journal

time to read 7 min | 1285 words

We spend a lot of time trying to reduce our sync I/O cost with Voron, namely, the actual journal write to the disk. This is very expensive, because we have to hit the actual disk, forgoing any buffering.

So anything that can reduce that cost is a really good idea. We spent some time looking at dynamic compression ratio heuristics, to see if it is worth it. Basically, we tried to figure out which option to use:

(screenshot of the compression options we considered)

The idea is that based on the speed of the hard disk in use, we can decide whether or not it is worth it to spend more time compressing the journal entry before saving it. We tested a system where the I/O duration would be balanced against compression speed and size, adjusting automatically.

It failed, horribly. Basically, even on the fastest drives we could find, it was almost always better to compress at the highest level, because the cost of going to disk is so high.

There is another aspect of this, however. The cost of going to disk isn’t linear to the size you are writing. I used the example of putting your groceries in the trunk. The fuel cost of the trip is not really going to be dominated by the weight of the groceries. After writing this statement, I fact checked myself. According to Auto Blog, each 100 pounds (about 45 KG) of added weight will increase the fuel consumption by about 1%. What is going to dominate the cost, however, is how far you have to drive.

In the same manner, writing to the disk is impacted by the amount you write, but writing 4KB or 20KB has roughly the same cost anyway. Writing 2 MB takes longer, but not as much longer as you would expect. Note that all of those numbers assume no buffering all the way to disk, and using DMA.

We then tried to see what would happen if we just avoided compressing small writes. Anything smaller than 64KB is going to be compressed to less than 64KB, but the actual cost of writing to disk isn’t going to change, so we can save the compression costs. That actually improved performance a little bit on fast drives, but it hurt us on slow ones.

I had an interesting discussion with Alex on the usage of diff compression in the journal. This can take advantage of the fact that in many cases, we don’t modify full pages, so we can write just the changes out to disk. He was kind enough to include a few implementations of that for us to look at; those are RLE0 (Zero Run Length Encoding) implementations, and I’ll use RLE to refer to them from now on.

Reducing I/O is always good, and this promised to give a substantial boost, but the actual design details that cropped up are really interesting. Diff compression can be simple, like the RLE0 in this link, effectively outputting something like:

... [x bytes unchanged][y bytes changed][byte 1 .. y][z bytes unchanged] ...

Or it can be much more complex, like bsdiff or xdelta. RLE handles the scenario where some bytes change nicely, but it fails badly if there is a single added byte (since it simply checks for equality, we’ll see all the bytes after it as different). Algorithms like bsdiff or xdelta can handle much more complex differences, but they are drastically more expensive. For my purposes, bsdiff has runtime complexity of O( 2N * logN ) and memory utilization of 17N. In other words, to get the diff of 4 pages, we’ll need 272KB and about 230K operations.

Algorithms like that are usually meant for distribution. In other words, they are meant for cases where you can spend as much time as you want generating the diff, and you benefit from reduced download times. A modern usage of those is the Courgette project, for reducing the size of Chrome updates. It doesn’t matter if generating the update takes 3 hours, since it will be downloaded millions of times, and 600KB saved in this manner will pay for itself many times over.

But those kinds of costs are not something that we can pay. Analysis of our memory usage patterns also showed that in many cases, we are using mostly fixed addressing. In other words, we’ll typically change only small parts of the page, and we don’t tend to have moving writes. When we do (typically during defrag), we do them on a page boundary, so an RLE implementation should generate good savings.

We have an implementation that we are currently testing, but while you can read the code, what is more interesting are the assumptions that we are making.

We scan the original and modified buffers using longs. We can safely assume that the buffers we scan are always sized in pages, so we don’t need to worry about buffers whose size isn’t divisible by sizeof(long), which makes the code much simpler. We also don’t bother to encode the identical parts; instead, we record the (start, count, raw bytes) differences from the original. There is a small optimization there for long runs of zeros (to make it cheaper to delete data), but beyond that, we do very little. I’ll have a separate post to dive into the actual implementation details and considerations that drove it, but that is for later.
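In the meantime, here is a condensed sketch of that scan, without the zero-run optimization (illustrative only, not the production code):

    using System;
    using System.Collections.Generic;
    using System.Runtime.InteropServices;

    public static unsafe class DiffSketch
    {
        // Compare two page-sized buffers a long at a time and record only the
        // modified runs as (start offset, raw modified bytes).
        public static void Diff(byte* original, byte* modified, int size,
                                List<(int Start, byte[] Data)> output)
        {
            var o = (long*)original;
            var m = (long*)modified;
            var longs = size / sizeof(long);   // buffers are whole pages, so this divides evenly

            for (var i = 0; i < longs; )
            {
                if (o[i] == m[i]) { i++; continue; }   // unchanged: skip, we never record it

                var start = i;
                while (i < longs && o[i] != m[i])      // gather the whole modified run
                    i++;

                var byteStart = start * sizeof(long);
                var data = new byte[(i - start) * sizeof(long)];
                Marshal.Copy((IntPtr)(modified + byteStart), data, 0, data.Length);
                output.Add((byteStart, data));
            }
        }
    }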

An important reason why we don’t keep track of the unmodified data is that we don’t need it, and that we can’t actually trust the original data. Consider the case where we actually need to use the journal to recover. We do that by running through all of the transactions and applying the diffs to the data. The problem is that we may fail midway through the recovery process, so the state of the data is not known. When applying a diff, if we use the original data, we might actually see data from a later transaction (which was applied, but we don’t know about it, since we crashed before we could make a note of that). Because of this, we only use the modified data, which is safe to apply multiple times. Note that this assumes that modifying a page cannot corrupt the page. In other words, if I have a 4 KB page and I write a value to the 3rd byte, it isn’t going to cause any change to any other byte. Aside from that, we don’t require that the bytes that we modified will be there on restart, because we’ll overwrite them until we are sure that we have properly synced them.

Another aspect of the diff operation is that we aren’t actually all that worried about the output size (which is interesting, since we really want to reduce it). The reason for that is that we are going to throw all the diffed data into the compressor anyway. The idea is that even after the diff, we are still likely to find data to compress among the modifications in the transaction.

Currently, the steps to write a transaction to disk are:

  • Get all the modified pages.
  • For each of those, compute the difference between it and the previous version of that page.
  • Compress all the diffs of all the pages.
  • Write the compressed data to disk in a safe manner.

Implementing Omni Search

time to read 3 min | 573 words

We have run a UX study on the RavenDB studio, and we have learned some interesting things about our own software. I’ll probably blog about it more in the future; in this post, I want to focus on one of the issues that was raised in the UX study: the search function. Here is how it looks now:

(screenshot of the current search box)

It is a pretty simple feature. Given a prefix of a document id, show all the matches, and allow the user to go directly to the document.

In the UX study, the users utterly ignored the help text when the search box is empty and tried to put index names there to quickly find the relevant index.


This behavior makes… absolute sense. Of course they would assume that this is something that you can do.

So now we have new requirements for the search box:

  1. Allow searching for indexes, transformers or documents.
  2. Allow searching using contains, rather than just starts with.
  3. Allow searching for functionality inside the studio.

This will allow the user to use the search box as the go-to location to quickly do things in RavenDB.

The first item is pretty easy to explain, right? I can search for UsersIndex or Users/1. The second is a bit more problematic. In particular, given a database with several million documents, doing a contains query on the id is not practical. Oh, we can do a whole bunch of tricks around ngrams, preparing ahead of time, etc. But they aren’t worth it. This is a small feature, and we can’t have it costing us a lot during normal operations.

So we came up with the following design for how the search will work:

  • Searching for indexes and transformers will use contains, because there are usually very few of those and it is cheap to do.
  • Searching for documents will first:
    • Try to find using the exact prefix (“users/” will find “users/1”, “users/2”, etc.)
    • Then get the collection names and prepend them to the search term (so “123” will find “users/123”, “companies/123”, etc)
    • Then get the collection names and prepend them to the search term and do a prefix query (so “123” will find “users/1234”, “companies/12345”)

The idea is that all of those are pretty cheap to do, and we can do them without running into high costs all over the place. And it will give you a good way to jump around in your database and find the relevant stuff easily.
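In sketch form, the document lookup ends up looking something like this; the id lookup helpers (DocumentIdsStartingWith, DocumentExists) are hypothetical stand-ins for the cheap operations the server already supports:

    IEnumerable<string> FindDocumentMatches(string term, IEnumerable<string> collections)
    {
        // 1. Exact prefix on the raw term: "users/" finds users/1, users/2, ...
        foreach (var id in DocumentIdsStartingWith(term))
            yield return id;

        foreach (var collection in collections)
        {
            var candidate = collection + "/" + term;

            // 2. Exact match with the collection name prepended: "123" finds users/123
            if (DocumentExists(candidate))
                yield return candidate;

            // 3. Prefix with the collection name prepended: "123" finds users/1234
            foreach (var id in DocumentIdsStartingWith(candidate))
                yield return id;
        }
    }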

Finally, we have the 3rd requirement. What is that about?

Well, one of the things that we have found is that RavenDB has enough features that navigating through them has become a problem. For example, if you want to do a db restore, you need to go to Manage Your Server, then Restore. And it is something that users need to hunt for. The idea is that they can just put “backup” in the search box, and the option that will pop up is the backup screen. So you can skip the hunting through screens and the “who moved my cheese” moments.

Voron internals: I/O costs analysis

time to read 3 min | 580 words

I talked about the details of Voron in the previous posts, how it handles journaling, MVCC and cleaning up after itself. In this post, I want to focus on another aspect that needs to be considered, the various costs of running Voron on production systems. In particular, the competing I/O requirements.

So what do we have with Voron?

  • A (potentially very large) memory mapped data file. Buffered writes and fsync once every 1 minute / 2GB.
  • Scratch files (small memory mapped files) marked as temporary and delete on close.
  • Journal files requiring durable writes.

In terms of priorities, we want to give high priority to the journal files, then to writing to the data file (so it will happen all the time, not just when we call fsync). Scratch files should only be written to disk under memory pressure, and we should strive to avoid that if possible.

On both Windows and Linux, there are ways to ask the system to start flushing the data to disk (Windows uses FlushViewOfFile, Linux uses sync_file_range), but in practice, when we flush the data to disk we need to also ensure durability, so we call FlushViewOfFile + FlushFileBuffers on Windows and msync(MS_SYNC) on Linux to ensure that. Technically speaking, we could do this in two stages, allowing the system some time to do this lazily, then calling FlushFileBuffers / fsync, but we haven’t found that to be advantageous in terms of complexity, and sync_file_range documentation is scary.

Another aspect that we need to consider is the fact that we are not alone out there. A typical RavenDB database will have multiple Voron instances running, and a typical RavenDB server will have multiple RavenDB databases running. So we are talking about typically having dozens or more Voron instances in a single process. We need to avoid a conflict between all of those instances, each of which is trying to make use of all the system resources by itself. This kind of disharmony can kill the performance of the server, all the while giving the best performance in any benchmark where you are running just a single instance.

We solved this by having a single actor responsible for scheduling the flushing of all the Voron instances inside a process. It accepts flush requests and makes sure that we aren’t loading the I/O system too much. This means that we might actually defer flushing to disk under load, but in practice, reducing the I/O competition is going to improve throughput anyway, so that is likely to be better in the end. At the same time, we want to take advantage of the parallelism inherent in many high end systems (RAID, cloud, etc.) which can handle a lot of IOPS at the same time. So the policy is to give a certain number of Voron instances the chance to run in parallel, with adjustments depending on the current I/O load on the system.
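A minimal sketch of that single-actor idea; the names and the concurrency limit are assumptions, not the actual implementation:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public sealed class GlobalFlushScheduler
    {
        private readonly SemaphoreSlim _concurrentFlushes;

        public GlobalFlushScheduler(int maxConcurrentFlushes = 4)
        {
            _concurrentFlushes = new SemaphoreSlim(maxConcurrentFlushes);
        }

        // Every Voron instance in the process asks here instead of flushing on its own.
        public async Task FlushAsync(Action flushDataFile)
        {
            await _concurrentFlushes.WaitAsync();   // under load, this is where flushes get deferred
            try
            {
                await Task.Run(flushDataFile);      // the FlushViewOfFile+FlushFileBuffers / msync work
            }
            finally
            {
                _concurrentFlushes.Release();
            }
        }
    }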

Journal writes, however, happen immediately, have high priority and take precedence over data file writes, because they have an immediate impact on the system.

We are also experimenting with using the operating system I/O priorities, but that is a bit hard, because most of those are about reducing the I/O priority, which we sort of want, but not that much.

Voron internals: The transaction journal & recovery

time to read 4 min | 751 words


In the previous post, I talked about the usage of scratch files to enable MVCC and the challenges that this entails. In this post, I want to talk about the role the transaction journal files play in all of this. I have talked a lot about how to ensure that transaction journals are fast, what goes into them, etc. But this post is about how they are used inside Voron.

The way Voron stores data inside the transaction journal is actually quite simple. We have a transaction header, which contains quite a bit of interesting information, and then we have all the pages that were modified in this transaction, compressed.


The fact that we are compressing pages can save a lot on the amount of I/O we write. But the key aspect here is that a transaction is considered committed by Voron only when we complete the write of the entire thing to stable storage. See the post above for a complete discussion on why that matters and how to do it quickly and with the least amount of pain.

Typically, the transaction journal is only read during recovery, so in normal operation it is write only. We let the journal files grow to about 64MB in size, then we create new ones. During database startup, we check what is the last journal file and journal file position that we have synced (more on that later), and we start reading from there. We read the transaction header and compare its hash to the hash of the compressed data. If they match (along with a bunch of other checks we do), then we consider this to be a valid commit, and then we decompress the data into a temporary buffer, giving us all the dirty pages that were written in that transaction.

We can then just copy them to the appropriate location in the data file. We continue doing so until we hit the end of the last file or we hit a transaction which is invalid or empty. At that point we stop, consider this the end of the valid committed transactions, and complete recovery.
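The recovery loop, in sketch form (the types and helper calls are illustrative, not the actual Voron classes):

    // Replay the journal from the last synced position until we hit an invalid
    // or partially written transaction.
    while (TryReadTransactionHeader(journal, out var header))
    {
        var compressed = ReadBytes(journal, header.CompressedSize);

        if (ComputeHash(compressed) != header.Hash)
            break;                              // torn write: everything before this point is valid

        foreach (var page in Decompress(compressed))
            CopyPageToDataFile(page);           // buffered write; fsync happens later, in the background

        lastValidTransaction = header.TransactionId;
    }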

Note that at this point, we have written a lot of stuff to the data file, but we have not flushed it. The reason is that flushing is incredibly expensive, especially during data recovery, where we might be re-playing a lot of data, so we skip it. Instead, we rely on the normal flushing process to do this for us. By default, this will happen within 1 minute of the database starting up, in the background, so it will reduce the interruption to regular operations. This gives us a very fast startup time. And our in-memory state lets us know the next place we need to flush from the log, so we don’t do the same work twice.

However, that does mean that if we fail midway through, there is absolutely no change in behavior. In recovery, we’ll write the same information to the same place, so replaying the journal file becomes an idempotent operation that can fail and recover without a lot of complexity.

We do need to clear the journal files at some point, and this process happens after we synced the data file. At that point, we know that the data is safely stored in the data file, and we can update our persistent state on where we need to start applying recovery the next time. Once those two actions are done, we can delete the old (and now unused) journal files. Note that at each part of the operation, the failure mode is to simply retry the idempotent operation (copying the pages from the journal to the data file), and there is no need for complex recovery logic.

During normal operation, we’ll clear the journal files once it has been confirmed that all the data they hold was successfully flushed to the disk and that this action has been successfully recorded in stable storage. So in practice, database restarts that require recovery are typically very fast, only needing to replay the last few transactions before we are ready for business again.

Voron internals: Cleaning up scratch buffers

time to read 5 min | 1000 words

In my previous post, I talked about how Voron achieves MVCC. Instead of modifying data in place, we copy the page or pages we want to modify to a scratch buffer and modify that. When the write transaction completes, we update a Page Translation Table, so any reference to the pages that were modified goes to the right place in the scratch file.

Note, Voron uses mmap files as scratch buffers. I use the term scratch buffer / scratch file to refer to the same thing.

That is all well and good, and if you are familiar with how virtual memory works, this is exactly the model. In effect, every transaction gets a snapshot of the entire database as it was when it was opened. Read transactions don’t modify the data, and are guaranteed a stable snapshot of the database. The write transaction can modify the database freely, without worrying about locking or stepping over other transactions.

This is all pretty simple, and the sole cost that we have when committing the transaction is flushing all the dirty pages to disk, and then making an atomic pointer swap to update the Page Translation Table.
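A minimal sketch of the Page Translation Table idea, with illustrative types rather than the actual Voron code:

    using System.Collections.Generic;

    public sealed record ScratchLocation(int ScratchFileNumber, long PositionInScratch);

    public sealed class PageTranslationTableSketch
    {
        // page number -> where the latest committed version of that page lives in scratch
        private volatile Dictionary<long, ScratchLocation> _table = new();

        // A read transaction grabs the current table once when it opens; because the
        // table is never mutated in place, that reference is a stable snapshot.
        public IReadOnlyDictionary<long, ScratchLocation> OpenSnapshot() => _table;

        // A write transaction builds an updated copy and publishes it with a single
        // atomic reference swap - readers never observe a half-updated table.
        public void Commit(Dictionary<long, ScratchLocation> newTable) => _table = newTable;
    }

    // Resolving a page: use the scratch copy if the snapshot has one,
    // otherwise read the page from the data file.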

However, that is only part of the job. If all the data modifications happen in the scratch buffer, what is going on with the scratch files?

Voron has a background process that monitors the database activity, and based on certain policies (size, time, load factor, etc.) it will routinely write the data from the scratch files to the data file. This is a bit of an involved process, because we can’t just do it blindly.

Instead, we start by seeing what is the oldest active transaction that is currently operating. We need to find that out to make sure that we aren’t writing any page that this transaction might visit (thus violating the snapshot isolation of the transaction). Once we have the oldest transaction, we gather all the pages from the Page Translation Table that came from older transactions and write them to the data file. There are a couple of tricks that we use here. It is very frequent for the same page to be modified multiple times (maybe we updated the record several times in different transactions), so we’ll have multiple copies of it. But we don’t actually need to copy all of them, we just need to copy the latest version (up to the oldest active transaction).

The process of copying all the data from the scratch file to the data file can happen concurrently with both read and write transactions. After the flush, we need to update the PTT again (so we open a very short write transaction to do that), and we are done. All the pages that we have copied from the scratch buffer are marked as free and are available for future transactions to use.

Note, however, that we haven’t called fsync on the data file yet. So even though we wrote to the data file, we made a buffered write, which is awesome for performance, but not so much for safety. This is done intentionally, for performance reasons. In my next post, I’ll talk about recovery and safety at length, so I’ll just mention that we fsync the data file once a minute or once every 2GB or so. The idea is that we give the OS time to do the actual flush in the background, before we jump in and demand that it happen.

Another problem that we have with the scratch buffer is that, like any memory allocation routine, it has to deal with fragmentation. We use a power of two allocator to reduce fragmentation as much as possible, but certain workloads can fragment the memory in such a way that it is hard or impossible to deal with it. In order to deal with that issue, we keep track of not just the free sections in the scratch buffer, but also the total amount of used memory. If a request cannot be satisfied by the scratch buffer because of fragmentation, even though there is enough free space available, we’ll create a new scratch file and use that as our new scratch. The old one will eventually be freed when all read transactions are over and all the data has been flushed away.
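The power of two sizing mentioned above boils down to something like this (a generic sketch, not the actual allocator):

    // Round every allocation request up to the next power of two, so freed blocks
    // come back in a small number of standard sizes that later requests can reuse.
    static int RoundUpToPowerOfTwo(int size)
    {
        var v = size - 1;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        return v + 1;
    }

    // RoundUpToPowerOfTwo(3 * 1024) == 4 * 1024
    // RoundUpToPowerOfTwo(9 * 1024) == 16 * 1024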

Scratch files are marked as temporary and delete on close, so we don’t actually incur a high I/O cost when we create new ones, and it is typically only when we have a very high workload of both reads and writes that we see the need to create new scratch files. This tends to be drastically cheaper than trying to do compaction, and it actually works in all cases, while compaction can fail in many cases.

You might have noticed an issue with the whole system. We can only move pages from the scratch file to the data file if they were modified by a transaction that is older than the oldest current transaction. That means that a long running read transaction can stall the entire process. This is typically only a problem when we are seeing very high write usage as well as very long read transactions, which pushes the envelope on the size of the scratch buffer but at the same time doesn’t allow us to clean it.

Indeed, when using Voron, you typically need to be aware of the need to close transactions in a reasonable timeframe. Within RavenDB, there are very few places where a transaction can span a long time (streaming is pretty much the only case in which we’ll allow it, and it is documented that a very long streaming request pushes memory usage on the server up, because we can’t clean the transaction). In practice, even transactions that take multiple minutes are fine under moderate write load, because there is enough capacity to handle it.
