Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969




How the debugger lied to my face with no shame at all

time to read 1 min | 155 words

Take a look at the following code:

[image]

As you can see, something pretty freaky is going on here.

We put a value in there, and then when we try to get it out, we get a totally different value.

The real repro happened in a pretty complex piece of code, and because the old value was different from the value in the debugger, we were certain that we had something changing data behind our back. Only after we marked the memory as read only and still saw that behavior did we realize that this is actually a debugger eval bug. Instead of reading the Id property as a long, it reads it as a byte, leading to this misleading value.
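Since the original screenshot isn’t reproduced here, here is a minimal, stand-alone sketch (not the actual repro) of how reading a long through a byte-sized view produces exactly this kind of misleading value:

```csharp
using System;

class DebuggerLieSketch
{
    static void Main()
    {
        long id = 8_589_934_593;            // 0x2_0000_0001
        byte asByte = unchecked((byte)id);  // only the lowest byte survives

        Console.WriteLine(id);      // 8589934593 -- the value we actually stored
        Console.WriteLine(asByte);  // 1 -- the kind of "totally different value" a byte-sized eval shows
    }
}
```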

Only cost me most of my hair.

RavenDB 3.5 RC2 is released

time to read 2 min | 224 words

You can now get RavenDB 3.5 RC2 on the download page. This is going to be the last RC before the RTM release, which we expect in the next week or two.

Please take it for a spin, we appreciate any feedback that you have. While the RC build is out there, we are running internal tests, stressing RavenDB under load in various configurations, to see if we can break it.

The changes since the last RC are:

  • Reduced the number of threads being used in the system.
  • Support for dynamic documents in More Like This queries.
  • Display the number of alerts in the studio.
  • Reduced studio load times.
  • Warn about configuration values that are suspect.
  • Reduced allocations when reading a large number of documents.
  • Better optimistic concurrency control on the client; it can now be specified on a per-document basis.
  • Better handling for clusters, resolved a race condition on the client deciding who the leader is, and added better logging and operational support.
  • Fixed index & side by side index replication in master-master scenarios.
  • Alert if the paging file is on an HDD and there is an SSD on the system.
  • Clusters now only select an up to date node as the leader.
  • Fixed a perf issue when starting up and loading a large number of databases concurrently.
  • Better CSV import / export semantics.

Voron Internals: Reducing the journal

time to read 7 min | 1285 words

We spend a lot of time trying to reduce our sync I/O cost with Voron, namely, the actual journal write to the disk. This is very expensive, because we have to hit the actual disk, forgoing any buffering.

So anything that can reduce that cost is a really good idea. We spent some time looking at dynamic compression ratio heuristics, to see if they were worth it. Basically, we tried to figure out which option to use:

[image]

The idea is that, based on the speed of the hard disk in use, we can decide whether or not it is worth spending more time compressing the journal entry before saving it. We tested a system where the I/O duration would be balanced against compression speed and size, adjusting automatically.
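As a rough illustration only, here is a hedged sketch of the kind of heuristic we were evaluating; the names, parameters and compression levels are made up for the example, not the code we actually tested:

```csharp
public static class CompressionHeuristicSketch
{
    // Estimate whether the extra compression time is paid back by the smaller write.
    // All inputs are assumed to be measured at runtime; levels 1..9 are illustrative.
    public static int ChooseCompressionLevel(
        double diskBytesPerSec,      // measured sequential write speed of the journal drive
        double compressBytesPerSec,  // measured throughput of the compressor at the highest level
        double expectedRatio,        // expected compressed size / original size at the highest level
        long payloadSizeInBytes)
    {
        double writeRawTime = payloadSizeInBytes / diskBytesPerSec;

        double compressThenWriteTime =
            payloadSizeInBytes / compressBytesPerSec +               // time spent compressing
            (payloadSizeInBytes * expectedRatio) / diskBytesPerSec;  // time writing the smaller payload

        // In practice the disk is so slow relative to the CPU that the highest
        // level wins almost always, which is why this heuristic was dropped.
        return compressThenWriteTime < writeRawTime ? 9 : 1;
    }
}
```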

It failed, horribly. Basically, even on the fastest drives we could find, it was almost always better to compress at the highest level, because the cost of going to disk is so high.

There is another aspect of this, however. The cost of going to disk isn’t linear to the size you are writing. I used the example of putting your groceries in the trunk. The fuel cost of the trip is not really going to be dominated by the weight of the groceries. After writing this statement, I fact checked myself. According to Auto Blog, each 100 pounds (50 KG) of added weight will increase the fuel utilization by about 1%. What is going to dominate the cost, however, is how much you have to drive.

In the same manner, writing to the disk is impacted by the amount you write, but writing 4KB or 20KB has roughly the same cost anyway. Writing 2 MB takes longer, but not as much as you would expect. Note that all of those numbers assume no buffering all the way to disk, and using DMA.

We then tried to see what happens if we just avoid compressing small writes. Anything smaller than 64KB is going to be compressed to less than 64KB, but the actual cost of writing it to disk isn’t going to change, so we can save the compression cost. That actually improved performance a little bit for fast drives, but it hurt us on slow ones.

I had an interesting discussion with Alex on the usage of diff compression in the journal. This can take advantage of the fact that in many cases, we don’t modify full pages, so we can write just the changes out to disk. He was kind enough to include a few implementations of that for us to look at; those are RLE0 (Zero Run Length Encoding) implementations, and I’ll use RLE to refer to them from now on.

Reducing I/O is always good, and this promised to give a substantial boost, but the actual design details that cropped up are really interesting. Diff compression can be simple, like the RLE0 in this link, effectively outputting something like:

... [x bytes unchanged][y bytes changed][byte 1 .. y][z bytes unchanged] ...

Or they can be much more complex, like bsdiff or xdelta. RLE handles the scenario where some bytes change nicely, but it fails badly if there is a single added byte (since it simply checks for equality, we’ll see that all the following bytes are different). Algorithms like bsdiff or xdelta can handle much more complex differences, but they are drastically more expensive. For my purposes, bsdiff has a runtime complexity of O( 2N * logN ) and memory utilization of 17N. In other words, to get the diff of 4 pages, we’ll need 272KB and about 230K operations.

Algorithms like that are usually meant for distribution. In other words, they are meant for cases where you can spend as much time as you want generating the diff, and you benefit from reduced download times. A modern usage of those is the Courgette project, for reducing the size of Chrome updates. It doesn’t matter if generating the update takes 3 hours, since it will be downloaded millions of times, and 600KB saved in this manner will pay for itself many times over.

But those kinds of costs are not something that we can pay. Analysis of our memory usage patterns also showed that in many cases, we are using mostly fixed addressing. In other words, we’ll typically change only small parts of the page, and we don’t tend to have moving writes. When we do (typically on defrag), we do them on a page boundary, so an RLE implementation should generate good savings.

We have an implementation that we are currently testing, but while you can read the code, what is more interesting is the assumptions that we are making.

We scan the original and modified buffers using longs. We can safely assume that the buffers we scan are always sized in pages, so we don’t need to worry about buffers whose size isn’t divisible by sizeof(long), which makes the code much simpler. We also don’t bother to encode identical parts; instead, we record the (start, count, raw bytes) differences from the original. There is a small optimization there for long runs of zeros (to make it cheaper to delete data), but beyond that, we do very little. I’ll have a separate post to dive into the actual implementation details and considerations that drove it, but that is for later.
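To make the idea concrete, here is a hedged, managed sketch of that approach (compare in long-sized steps, record only the changed runs). The real Voron code is unsafe, pointer based and includes the zero-run optimization mentioned above, so treat this only as an illustration:

```csharp
using System;
using System.Collections.Generic;

public struct DiffEntry
{
    public int Start;     // byte offset into the page buffer
    public int Count;     // number of modified bytes
    public byte[] Bytes;  // the modified bytes themselves; unchanged runs are not stored
}

public static class PageDifferSketch
{
    // Assumes both buffers are the same length and page-sized,
    // i.e. the length is always divisible by sizeof(long).
    public static List<DiffEntry> Diff(byte[] original, byte[] modified)
    {
        var entries = new List<DiffEntry>();
        int i = 0;

        while (i < modified.Length)
        {
            // Skip identical long-sized runs.
            if (BitConverter.ToInt64(original, i) == BitConverter.ToInt64(modified, i))
            {
                i += sizeof(long);
                continue;
            }

            // Found a difference; extend it until the buffers match again.
            int start = i;
            while (i < modified.Length &&
                   BitConverter.ToInt64(original, i) != BitConverter.ToInt64(modified, i))
            {
                i += sizeof(long);
            }

            var bytes = new byte[i - start];
            Array.Copy(modified, start, bytes, 0, bytes.Length);
            entries.Add(new DiffEntry { Start = start, Count = bytes.Length, Bytes = bytes });
        }

        return entries;
    }
}
```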

An important reason why we don’t keep track of the unmodified data is that we don’t need it, and that we can’t actually trust the original data. Consider the case where we actually need to use the journal to recover. We do that by running through all of the transactions and applying the diffs to the data. The problem is that we may fail midway through the recovery process, so the state of the data is not known. When applying a diff, if we used the original data, we might actually see data from a later transaction (which was applied, but we don’t know about it since we crashed before we could make a note of that). Because of this, we only use the modified data, which is safe to apply multiple times. Note that this assumes that modifying a page cannot corrupt the page. In other words, if I have a 4 KB page and I write a value to the 3rd byte, it isn’t going to cause any change to any other byte. Aside from that, we don’t require that the bytes we modified will still be there on restart, because we’ll overwrite them until we are sure that we have properly synced them.

Another aspect of the diff operation is that we aren’t actually all that worried about the data size (which is interesting, since we really want to reduce it). The reason for that is that we are going to throw all the diffed data into the compressor anyway. The idea is that even after the diff, we are still likely to find data to compress among the modifications in the transaction.

Currently, the steps to write a transaction to disk are:

  • Get all the modified pages.
  • For each of those, compute the difference between it and the previous version of that page.
  • Compress all the diffs of all the pages.
  • Write the compressed data to disk in a safe manner.

A different view on creation

time to read 2 min | 262 words

The following code shows several increasingly complex ways to create a shared instance of an object:
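The original snippet was an image; here is a hedged reconstruction of the five variants discussed below. The _requestExecuters field and the RequestExecuter type come from the post; everything else (method names, the extra dictionary for the lazy variants) is an assumption made to keep the sketch self-contained:

```csharp
using System;
using System.Collections.Concurrent;

public class RequestExecuter
{
    public RequestExecuter(string url) { /* assumed to hold per-url resources */ }
}

public class DocumentStoreSketch
{
    private readonly ConcurrentDictionary<string, RequestExecuter> _requestExecuters =
        new ConcurrentDictionary<string, RequestExecuter>();

    private readonly ConcurrentDictionary<string, Lazy<RequestExecuter>> _lazyRequestExecuters =
        new ConcurrentDictionary<string, Lazy<RequestExecuter>>();

    // 1. Allocates (and usually discards) a RequestExecuter on every single call.
    public RequestExecuter Get1(string url) =>
        _requestExecuters.GetOrAdd(url, new RequestExecuter(url));

    // 2. The factory only runs when the key is missing, but it captures `url`,
    //    so a closure and a delegate are allocated on every call. Under concurrent
    //    load two RequestExecuters may still be created; only one is kept.
    public RequestExecuter Get2(string url) =>
        _requestExecuters.GetOrAdd(url, key => new RequestExecuter(url));

    // 3. Same as 2, but the lambda uses its own parameter instead of capturing,
    //    so the compiler can cache a single delegate instance.
    public RequestExecuter Get3(string url) =>
        _requestExecuters.GetOrAdd(url, key => new RequestExecuter(key));

    // 4. Lazy removes the race on creating the RequestExecuter itself, but a new
    //    Lazy<RequestExecuter> (and its closure) is still allocated on every call.
    public RequestExecuter Get4(string url) =>
        _lazyRequestExecuters.GetOrAdd(url,
            new Lazy<RequestExecuter>(() => new RequestExecuter(url))).Value;

    // 5. Lazy plus a non-capturing outer lambda: the delegate is cached and the
    //    inner closure is only allocated the rare time the factory actually runs.
    public RequestExecuter Get5(string url) =>
        _lazyRequestExecuters.GetOrAdd(url,
            key => new Lazy<RequestExecuter>(() => new RequestExecuter(key))).Value;
}
```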

What are the differences? The _requestExecuters value is a concurrent dictionary. And the major difference is the kind of allocations that happen in each call.

In the first scenario, we’ll create a new RequestExecuter each time that we call this line. We’ll still use only a single instance, of course, but we still create (and discard) an instance per call.

In the second scenario, we are passing a delegate, so we’ll only create the RequestExecuter once. Or so it seems. The problem is that under concurrent load, it is possible that we’ll have two RequestExecuters created, only one of which will be used. If we have any unmanaged resources in the RequestExecuter, that can cause a leak. Another issue is that we are using the method parameter, which forces the compiler to capture it, and allocate a new delegate instance per call.

The third scenario is the same as the second one, but we aren’t capturing the parameter, so the compiler will not need to create a new delegate instance per call.

The fourth one is using a lazy value. This way, we avoid the race in creating the RequestExecuter, but we still create a new lazy instance per call.

And the fifth one is using a lazy instance and a cached delegate version, so there are no extra allocations there. There is still the race to create the lazy instance, but that should happen rarely, and it isn’t holding any expensive resources.

Meeting the Joel Test 2.0

time to read 7 min | 1216 words

I ran into this post, which updates the (by now) venerable Joel Test for our modern age. I remember reading the Joel Test (as well as pretty much anything by Joel) at the beginning of my career, and I’m pretty sure that it influenced the way I chose employers and designed software. Seeing this post, I decided to see how Hibernating Rhinos would rank on this test today. I’ve put both the original and updated versions below, and my comments follow.

  Original → Updated (questions without an updated version are unchanged)

  1. Do you use source control?
  2. Can you make a build in one step? → Can you build and deploy your software in one step?
  3. Do you make daily builds? → Do you build on every commit?
  4. Do you have a bug database?
  5. Do you fix bugs before writing new code?
  6. Do you have an up-to-date schedule? → Do you measure your progress in terms of value delivered?
  7. Do you have a spec? → Do you have a runnable spec?
  8. Do programmers have quiet working conditions? → Does your environment foster collaboration?
  9. Do you use the best tools money can buy?
  10. Do you have testers? → Is testing everyone's responsibility?
  11. Do new candidates write code during their interview?
  12. Do you do hallway usability testing?

  • Source control – Yes. Thankfully, I think that the days of anyone not using source control for anything but a scratch project are behind us. Now the arguments are about which source control.
  • Build & deploy in one step – Yes, the build process runs on a Team City server, and while it sometimes requires some TLC (I’m looking at you, nuget), it pretty much runs without us having to pay much attention to it.
  • Build & verify on every commit – No. But yes. What we do is have the build server run the full suite on every Pull Request, which is our unit of integration. Commits are far less important, because we break them apart to make them easier to review.
  • Bug database – Yes, but see also the next topic. We mostly use it for bugs we find, and features / improvements, not so much for customer bugs.
  • Do you fix bugs before writing new code – No. But yes. The reason this is complex to answer is how you define bugs. A customer issue is typically handled from A to Z on the spot. We have a rotating support engineer role that handles such scenarios, and whoever is on rotation prioritizes that over their routine work.
  • Do you have a schedule / do you measure progress in terms of value – We have a rough schedule, with guidelines about what is a hard deadline and what is a nice deadline. Hard deadlines are about meeting outside commitments, typically. Nice deadlines are about things we would like to do, but we won’t kill ourselves doing them. We do have a sense of what is important and what isn’t. By that I mean that we have criteria for “we should be chasing after this” and “this is for when we run out of things to do”.
  • Do you have a (runnable) spec? – Yes, we have a spec. It isn’t runnable, and I’m not sure what a runnable spec for a database would be. The spec outlines things like the data format, how we do data fetches for indexes, architectural considerations and rough guidelines on where we are going. It isn’t detailed to the point of being runnable, and I don’t like the idea very much.
  • Developers have quiet working conditions / environment encourages collaboration – The typical setup we have is a separate office for every two developers. I typically see people move around the offices and collaborate on all sorts of stuff. If it bugs the other dev in the room, they usually have headphones to deal with it, but that isn’t happening enough to be a major problem. A common issue for people who leave their workstation unattended and use headphones is that by the time they get back and put the headphones on, the music has been changed to something suitably amusing, such as this one.
  • Best tools that money can buy – Procurement in Hibernating Rhinos is a process: it involves sending an email with “I need this tool”, and you must include the link to the tool. Then you have to wait anything between 15 minutes and 24 hours (depending on when you asked), and you’ll get the license. I have seen far too many stupid decisions of “oh, we don’t have a budget for this $200 tool, but we have no problem paying the $2,000 that it would cost us in time” to suffer that.
  • Testers / everyone is responsible – Yes. Every single dev is writing tests, and every single PR is sent after tests have been run locally, and then on the build server.
  • Candidates write code in interview – Yes, oh yes they do.
  • Hallway usability testing – See below, too complex to answer here.

RavenDB has multiple levels of “user interface”. The most obvious one is the RavenDB studio, but the one that we spend the most time on is the external (and internal) APIs. For the API, we have a review process in place to make sure that we are consistent and make sense. Most of the time we are doing things that follow the same design line as before, so there is not much to think about. For big things, we typically also solicit feedback from the community, to make sure that we aren’t looking at it through colored glasses.

For our actual user interface, the Studio, we used to just have the other devs look at the new functionality. That led to a lot of stuff that worked, but the amount of attention we actually paid to the UI was really variable. Some features we would iterate over for multiple weeks, getting them just right (the most common operations, as we see them). But other stuff was just “we need to expose this functionality, let us do this”, which led to an almost one-to-one mapping of the server side concept to the UI, which isn’t always helpful for the users.

We have started a full UX study of the RavenDB Studio, and we are going to be doing a full design analysis on each of our views, with an eye to improving them significantly by 4.0.

Implementing Omni Search

time to read 3 min | 573 words

We have run a UX study on the RavenDB studio, and we have learned some interesting things about our own software. I’ll probably blog about it more in the future; in this post, I want to focus on one of the issues that was raised in the UX study: the search function. Here is how it looks now:

[image]

It is a pretty simple feature. Given a prefix of a document id, show all the matches, and allow the user to go directly to the document.

In the UX study, the users utterly ignored the help text when the search box is empty and tried to put index names there to quickly find the relevant index.

[image]

This behavior makes… absolute sense. Of course they would assume that this is something that you can do.

So now we have new requirements for the search box:

  1. Allow searching for indexes, transformers, or documents.
  2. Allow searching using contains, rather than starts with.
  3. Allow searching for functionality inside the studio.

This will allow the user to use the search box as the go-to location to quickly do things in RavenDB.

The first item is pretty easy to explain, right? I can search for UsersIndex or Users/1. The second is a bit more problematic. In particular, given a database with several million documents, doing a contains query on the id is not practical. Oh, we can do a whole bunch of tricks around ngrams, preparing ahead of time, etc. But they aren’t worth it. This is a small feature, and we can’t have it costing us a lot during normal operations.

So we came up with the following design for how the search will work:

  • Searching for indexes and transformers will use contains, because there are usually very few of those and it is cheap to do.
  • Searching for documents will first:
    • Try to find using the exact prefix (“users/” will find “users/1”, “users/2”, etc)
    • Then get the collection names and prepend them to the search term (so “123” will find “users/123”, “companies/123”, etc)
    • Then get the collection names and prepend them to the search term and do a prefix query (so “123” will find “users/1234”, “companies/12345”)

The idea is that all of those are pretty cheap to do, and we can do them without running into high costs all over the place. And it will give you a good way to jump around in your database and find the relevant stuff easily.
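For illustration, here is a hedged sketch of that lookup order, with the data access stubbed out as in-memory lists; it is not the actual studio code, and the type and method names are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class OmniSearchSketch
{
    private readonly List<string> _documentIds;
    private readonly List<string> _collections;

    public OmniSearchSketch(List<string> documentIds, List<string> collections)
    {
        _documentIds = documentIds;
        _collections = collections;
    }

    public List<string> SearchDocuments(string term)
    {
        var results = new List<string>();

        // 1. Exact prefix on the raw term ("users/" finds "users/1", "users/2", ...)
        results.AddRange(_documentIds.Where(id =>
            id.StartsWith(term, StringComparison.OrdinalIgnoreCase)));

        foreach (var collection in _collections)
        {
            var prefixed = collection + "/" + term;

            // 2. Exact id with the collection name prepended ("123" finds "users/123")
            results.AddRange(_documentIds.Where(id =>
                string.Equals(id, prefixed, StringComparison.OrdinalIgnoreCase)));

            // 3. Prefix query with the collection name prepended ("123" finds "users/1234")
            results.AddRange(_documentIds.Where(id =>
                id.StartsWith(prefixed, StringComparison.OrdinalIgnoreCase)));
        }

        return results.Distinct().ToList();
    }
}
```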

Finally, we have the 3rd requirement. What is that about?

Well, one of the things that we have found is that RavenDB has enough features that navigating through them has become a problem. For example, if you want to do a db restore, you need to go to Manage Your Server, then Restore. And it is something that users need to hunt for. The idea is that they can just put “backup” in the search box, and the option that will pop up is the backup screen. So you can skip hunting through screens and the “who moved my cheese” moments.

Benchmarking with Go

time to read 3 min | 469 words

Every now and then, you really need to get out of your comfort zone. I decided that what I want to do is play a bit with Go, which I haven’t done yet. Oh, I have read Go code, quite a lot of it, but it isn’t the same as writing and actually using it.

We are doing a lot of performance work recently, and while some of that is based entirely on micro benchmarks and focused on low level details such as the number of retired instructions at the CPU level, we also need to see the impact of larger changes. We have been using WRK to do that, but it is hard to get it running on Windows and we had to do some nasty things in Lua scripting to get what we wanted.

I decided that I’ll take the GoBench tool and transform it into a dedicated tool for benchmarking RavenDB.

Here is what we want to test:

  1. Read document by id
  2. Query documents by index
  3. Query map/reduce results
  4. Write new documents (single document)
  5. Write new documents (multiple documents in tx)
  6. Update existing documents

This is intended to be a RavenDB tool, so we won’t be trying to do anything generic, we’ll be writing specialized code.

In terms of the interface, gobench is using command line parameters to control itself, but I think that I’ll pass a configuration file instead. I started to write about the format of the configuration file, when I realized that I’m being absolutely stupid.

I don’t need to do that; I already have a good way to specify what I want to do in the code. It is called the code. The actual code to run the HTTP requests is here. But this is basically just getting a configuration object and using it to generate requests.

Of far more interest to me is the code that actually generates the requests themselves. Here is the piece that tests read requests:

We just spin off a number of goroutines, each of which does a portion of the work. This gives us concurrent clients and the ability to hammer the server. And the amount of code that we need to write for this is minimal.

To compare, here is the code for writing to the databases:

And then we are left with just deciding on a particular benchmark configuration. For example, here we are running a simple load test for both reads and writes.

[image]

I think that this matches what I wanted quite well: low configuration overhead, readability and a high degree of flexibility.

Voron Internals: I/O costs analysis

time to read 3 min | 580 words

I talked about the details of Voron in the previous posts, how it handles journaling, MVCC and cleaning up after itself. In this post, I want to focus on another aspect that needs to be considered, the various costs of running Voron on production systems. In particular, the competing I/O requirements.

So what do we have with Voron?

  • A (potentially very large) memory mapped data file. Buffered writes and fsync once every 1 minute / 2GB.
  • Scratch files (small memory mapped files) marked as temporary and delete on close.
  • Journal files requiring durable writes.

In terms of priorities, we want to give high priority to the journal files, then to writing to the data file (so it will happen all the time, not just when we call fsync). Scratch files should only be written to disk under memory pressure, and we should strive to avoid that if possible.

On both Windows and Linux, there are ways to ask the system to start flushing the data to disk (Windows uses FlushViewOfFile, Linux uses sync_file_range), but in practice, when we flush the data to disk we need to also ensure durability, so we call FlushViewOfFile + FlushFileBuffers on Windows and msync(MS_SYNC) on Linux to ensure that. Technically speaking, we could do this in two stages, allowing the system some time to do this lazily, then calling FlushFileBuffers / fsync, but we haven’t found that to be advantageous in terms of complexity, and sync_file_range documentation is scary.
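For reference, here is a minimal sketch of the Windows side of that flush sequence via P/Invoke (push the dirty mapped pages with FlushViewOfFile, then FlushFileBuffers for durability); the wrapper class and its error handling are my own simplification, not the Voron code. On Linux, the equivalent is a single msync(MS_SYNC) call.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

public static class DurableFlushSketch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FlushFileBuffers(SafeFileHandle hFile);

    public static void Flush(IntPtr mappedBaseAddress, UIntPtr byteCount, SafeFileHandle fileHandle)
    {
        // Ask the OS to write the dirty mapped pages to the file...
        if (!FlushViewOfFile(mappedBaseAddress, byteCount))
            throw new Win32Exception(Marshal.GetLastWin32Error());

        // ...and then make sure they actually reached stable storage.
        if (!FlushFileBuffers(fileHandle))
            throw new Win32Exception(Marshal.GetLastWin32Error());
    }
}
```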

Another aspect that we need to consider is the fact that we are not alone out there. A typical RavenDB database will have multiple Voron instances running, and a typical RavenDB server will have multiple RavenDB databases running. So we are talking about typically having dozens or more Voron instances in a single process. We need to avoid a conflict between all of those instances, each of which is trying to make use of all the system resources by itself. This kind of disharmony can kill the performance of the server, all the while giving the best performance in any benchmark where you are running a single instance.

We solved this by having a single actor responsible for scheduling the flushing of all the Voron instances inside a process. It accepts flush requests and makes sure that we aren’t loading the I/O system too much. This means that we might actually defer flushing to disk under load, but in practice, reducing the I/O competition is going to improve throughput anyway, so that is likely to be better in the end. At the same time, we want to take advantage of the parallelism inherent in many high end systems (RAID, cloud, etc) which can handle a lot of IOPS at the same time. So the policy is to give a certain number of Voron instances the chance to run in parallel, with adjustments depending on the current I/O load on the system.
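A simplified sketch of that idea is below: one process-wide scheduler that limits how many instances may flush concurrently. The names and the fixed concurrency limit are assumptions; the real scheduler also adjusts to the current I/O load.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class GlobalFlushSchedulerSketch
{
    // Allow only a handful of concurrent flushes, to avoid I/O competition
    // between the dozens of Voron instances in a single process.
    private readonly SemaphoreSlim _concurrentFlushes = new SemaphoreSlim(4);

    public async Task ScheduleFlush(Func<Task> flushDataFile)
    {
        await _concurrentFlushes.WaitAsync();
        try
        {
            // Under heavy load this effectively defers flushes, which reduces
            // I/O competition and tends to improve overall throughput.
            await flushDataFile();
        }
        finally
        {
            _concurrentFlushes.Release();
        }
    }
}
```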

Journal writes, however, happen immediately, have high priority and should take precedence over data file writes, because they have an immediate impact on the system.

We are also experimenting with using the operating system I/O priorities, but that is a bit hard, because most of those are about reducing the I/O priorities. Which we sort of want, but not that much.

Voron Internals: The transaction journal & recovery

time to read 4 min | 751 words


In the previous post, I talked about the usage of scratch files to enable MVCC and the challenges that this entails. In this post, I want to talk about the role the transaction journal files play in all of this. I have talked a lot about how to ensure that transaction journals are fast, what goes into them, etc. But this post is about how they are used inside Voron.

The way Voron stores data inside the transaction journal is actually quite simple. We have a transaction header, which contains quite a bit of interesting information, and then we have all the pages that were modified in this transaction, compressed.

[image]
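To give a rough idea of the layout, here is a hedged sketch of what such a transaction header might contain; the real Voron TransactionHeader has more fields and different names, so this is only an illustration of the concept:

```csharp
using System.Runtime.InteropServices;

// Conceptual only -- field names and sizes are assumptions, not the Voron on-disk format.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct TransactionHeaderSketch
{
    public long TransactionId;    // monotonically increasing transaction number
    public int PageCount;         // number of modified pages recorded in this entry
    public int UncompressedSize;  // size of the page data before compression
    public int CompressedSize;    // size of the compressed payload that follows the header
    public ulong Hash;            // hash of the compressed payload, verified during recovery
}
```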

The fact that we are compressing pages can save a lot of the I/O we write. But the key aspect here is that a transaction is considered committed by Voron when we complete the write of the entire thing to stable storage. See the post above for a complete discussion on why it matters and how to do this quickly and with the least amount of pain.

Typically, the transaction journal is only used during recovery, so it is write only. We let the journal files grow to about 64MB in size, then we create new ones. During database startup, we check what is the last journal file and journal file position that we have synced (more on that later), and we start reading from there. We read the transaction header and compare its hash to the hash of the compressed data. If they match (as well as a bunch of other checks we do), we consider this to be a valid commit, decompress the data into a temporary buffer, and we have all the dirty pages that were written in that transaction.

We can then just copy them to the appropriate location in the data file. We continue doing so until we hit the end of the last file or we hit a transaction which is invalid or empty. At that point we stop, consider this the end of the valid committed transactions, and complete recovery.
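Here is a hedged sketch of that recovery loop; the reader, hashing and decompression helpers are stand-ins supplied by the caller, not the actual Voron types:

```csharp
using System;
using System.Collections.Generic;

public struct RecoveredPage
{
    public long PageNumber;
    public byte[] Bytes;
}

public interface IJournalReader
{
    // Returns false when we hit the end of the last journal file.
    bool TryReadNext(out ulong storedHash, out byte[] compressedPayload);
}

public interface IDataFile
{
    void WritePage(long pageNumber, byte[] bytes);
}

public class RecoverySketch
{
    private readonly Func<byte[], ulong> _computeHash;
    private readonly Func<byte[], IEnumerable<RecoveredPage>> _decompressPages;

    public RecoverySketch(Func<byte[], ulong> computeHash,
                          Func<byte[], IEnumerable<RecoveredPage>> decompressPages)
    {
        _computeHash = computeHash;
        _decompressPages = decompressPages;
    }

    public void Recover(IJournalReader journal, IDataFile dataFile)
    {
        while (journal.TryReadNext(out var storedHash, out var compressed))
        {
            // A transaction only counts as committed if the stored hash matches
            // the compressed payload (other validity checks are omitted here).
            if (storedHash != _computeHash(compressed))
                break; // end of the valid committed transactions

            // Copy every recovered page to its place in the data file. This is
            // idempotent, so failing mid-recovery and re-running is safe.
            foreach (var page in _decompressPages(compressed))
                dataFile.WritePage(page.PageNumber, page.Bytes);
        }

        // Note: no fsync here -- the normal background flush takes care of
        // durability within about a minute of startup.
    }
}
```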

Note that at this point, we have written a lot of stuff to the data file, but we have not flushed it. The reason is that flushing is incredibly expensive, especially during data recovery, where we might be re-playing a lot of data, so we skip it. Instead, we rely on the normal flushing process to do this for us. By default, this will happen within 1 minute of the database starting up, in the background, so it will reduce the interruption for regular operations. This gives us a very fast startup time. And our in-memory state lets us know the next place we need to flush from the log, so we don’t do the same work twice.

However, that does mean that if we fail midway through, there is absolutely no change in behavior. In recovery, we’ll write the same information to the same place, so replaying the journal file becomes an idempotent operation that can fail and recover without a lot of complexity.

We do need to clear the journal files at some point, and that process happens after we have synced the data file. At that point, we know that the data is safely stored in the data file, and we can update our persistent state on where we need to start applying recovery the next time. Once those two actions are done, we can delete the old (and now unused) journal files. Note that at each part of the operation, the failure mode is to simply retry the idempotent operation (copying the pages from the journal to the data file), and there is no need for complex recovery logic.

During normal operation, we’ll clear a journal file once it has been confirmed that all the data it holds was successfully flushed to the disk and that this action has been successfully recorded in stable storage. So in practice, database restarts after recovery are typically very fast, only needing to replay the last few transactions before we are ready for business again.
