Ayende @ Rahien



When I gave up on pointers

time to read 2 min | 372 words

I started programming with that orange turtle (I think it was supposed to be green, but we had bad CRT screens) by drawing stuff on the screen. I think that I was in fifth grade or so. I later graduated to VB (IIRC, that was VB3 or VB4), but my first formal programming education was done in Pascal. And I was pretty good (for a high school kid who merely dabbled), but I just couldn’t figure out pointers. I mean, they made absolutely no sense whatsoever.

Take the example of an “infinite” size stack; that was the example we were given during class, and I just couldn’t follow it. We started from a plain fixed size stack, like so:
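
Here is the shape of the fixed size stack we started from, as a minimal sketch (in C#, not the Pascal we actually wrote in class, and with made up names):

using System;

// Illustrative reconstruction of the class example, not the original code.
class BoundedStack
{
    const int STACK_SIZE = 3;                  // "surely big enough", right?
    private readonly int[] _items = new int[STACK_SIZE];
    private int _top;                          // index of the next free slot

    public void Push(int value)
    {
        if (_top >= STACK_SIZE)
            return;                            // full: the value is silently ignored
        _items[_top++] = value;
    }

    public int Pop()
    {
        if (_top == 0)
            throw new InvalidOperationException("The stack is empty");
        return _items[--_top];
    }
}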

You might notice that this code is limited: if you are storing more than 3 items, the value will be silently ignored, which is probably not what you want. High school me would agree that this is bad, and therefore increase the STACK_SIZE constant to a ridiculously high value, such as 100. That would surely be big enough for everything, right?

I remember really struggling with the concept of dynamic memory management, not so much because of the API, but because I couldn’t make any sort of sense of what I was supposed to do there.
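
What the lesson was building toward, of course, is that with dynamic allocation the stack has no fixed limit; every push just allocates another node and links it in. A rough sketch of that idea (again in C#, where the pointer becomes a reference):

// Illustrative sketch of the "infinite" stack: each push allocates a new node,
// so the only limit is the available memory.
class LinkedStack
{
    private class Node
    {
        public int Value;
        public Node Next;                      // in Pascal, this would be the pointer
    }

    private Node _top;

    public void Push(int value)
    {
        _top = new Node { Value = value, Next = _top };
    }

    public bool TryPop(out int value)
    {
        if (_top == null)
        {
            value = 0;
            return false;
        }
        value = _top.Value;
        _top = _top.Next;
        return true;
    }
}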

After high school, I took a high end course in C++. That took about a year, and I highly recommend the course (even though I don’t think they still run it); it taught me a lot about basic stuff such as how things actually work. We started with low level C in DOS, and built on top of that all the way to MFC and ATL. And at some point, the instructor introduced dynamic memory management. And it was so blindingly obvious that I never actually realized that I was learning the same concept that gave me so much grief in the past.

I have had that experience several times since then. I try to learn something, and I just bounce, hard. A while later, I do the same thing, or fight something slightly different, and get it. But I’m not sure how I go from “what the hell” to “oh, this is obvious”.

The randomly failing test

time to read 2 min | 349 words

We made a low level change in how RavenDB is writing to the journal. This was verified by multiple code reviews and a whole battery of tests and production abuse. And yet, once in a blue moon, we’ll have a test failure. Utterly non reproducible, and only happening once every week or two (out of hundreds or thousands of test runs). That was worrying, because this test was checking the behavior of RavenDB when it crashed midway through a transaction, which is kind of an important metric for us.

It took a long while to finally figure out what was going on there. The first thing that we ruled out was non reproducibility because of threading. This test was single threaded, and nothing could inject any concurrency into the code.

The format of the test was something like this:

  • Write 1000 random fixed size values to the database.
  • Close the database.
  • Corrupt the last page of the journal.
  • Start the db again and note that all the values in the last transaction are not in the db.

So far, awesome. So why would it fail?

The underlying reason was obvious, once we looked at it. The only thing that differs from test to test is the random call. But we are using fixed size buffers to write, so that shouldn’t change anything. The data itself is meaningless.

As it turned out, the data is not quite meaningless. As part of the commit process, we compress the data before we write it to the journal. As it turns out, different patterns of random buffers have different compression characteristics. In other words, a buffer of 100 random bytes may compress to 90 bytes or 102 bytes. And that mattered. If the test got enough random inputs to create a new journal file, we would still corrupt the last page on that journal, but since we had already moved to a new journal, that last page hadn’t been used yet, so the transaction wouldn’t become corrupt and we would still have the data in the database, effectively failing the test.
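
You can see the effect in isolation with a tiny standalone snippet. The compressor here is DeflateStream, purely for illustration; it isn’t necessarily the one Voron uses:

using System;
using System.IO;
using System.IO.Compression;

class CompressionVariance
{
    static void Main()
    {
        var rnd = new Random();
        for (int i = 0; i < 5; i++)
        {
            var buffer = new byte[100];
            rnd.NextBytes(buffer);                         // fixed size input...

            using (var ms = new MemoryStream())
            {
                using (var deflate = new DeflateStream(ms, CompressionMode.Compress))
                    deflate.Write(buffer, 0, buffer.Length);

                // ...variable size output, depending on the bit patterns
                Console.WriteLine($"100 random bytes -> {ms.ToArray().Length} compressed bytes");
            }
        }
    }
}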

Low level Voron optimizations: Primitives & abstraction levels

time to read 5 min | 858 words

One of the things that I noticed with the recent spate of work we have been doing is that we are doing things that we have already tried before, and failed at. But suddenly we are far more successful. What is the difference?

Case in point, transaction merging and early lock release. Those links both go to our initial implementation that was written in 2013. That is four years ago. Yes, today I can tell you that transaction merging was able to give us a two orders of magnitude improvement and early lock release gave us a 45% boost in performance. But looking at the timeline, we rolled back early lock release in early 2014.

The complexity of the feature is certainly non trivial, but the major point that led to its removal in 2014 was that it wasn’t worth it. That is, it didn’t pay enough to be worth the complexity it brought. When we sat down to design Voron for RavenDB 4.0, one of the first areas that we sought to eliminate was the transaction merging. We wanted Voron to be single threaded, by design.

And I still very much stand by those decisions. So how can I reconcile both statements? The core difference between them is where those are located, and what this means.

Transaction merging is now not done by Voron; instead, it is something that RavenDB does on top of Voron. But why?

When we had transaction merging in Voron, it meant that we had to submit transactional work to Voron in a format that it could understand. And Voron is a very low level library, so it doesn’t really understand much. This gave us a very small “vocabulary” to work with. More than that, it also meant that we had to deal with such features as explicit concurrency (at the Voron level, on top of the concurrency primitives exposed by RavenDB). Let us take the simplest example. We have two threads that want to write to the same document.

That means that we have to build the buffer we want to write in memory, then submit it to Voron, with the right concurrency setting at the Voron level. This is after we have already checked the concurrency semantics at the RavenDB level, paying twice to ensure that concurrency conflicts at all levels are properly handled. From the point of view of Voron, that meant merged transactions failing much more often (which kills performance) and much higher complexity overall when using it. Alongside that, we also had much higher memory usage, because we had to allocate buffers to hold the data we needed to write, then submit it to Voron, so the rate of allocations was much higher.

We still saw a performance improvement over not using it, but nothing as major as the two orders of magnitude that we see today. Another aspect of this is that when we built Voron, we built it to fit our existing architecture (which was built on top of Esent), so it reflects a lot of design decisions coming from there.

With RavenDB 4.0, we took a few giant steps back and decided to design the whole system as a single integrated piece. In fact, that meant that any attempt to do concurrency at the Voron level was abandoned. That meant that the moment you had a write transaction, you were safe from concurrency; you didn’t have to worry about anyone modifying the data you were looking at. There was no need to allocate special buffers and hold them, because we are always writing directly to Voron, instead of buffering in memory.

This was a dramatic simplification of the API and its usage, and it meant that the code is much more approachable and easy to understand, work with and make performant. Of course, it also meant that we had a serial lock, which is where the transaction merger became such a huge deal. But the point here is that this kind of transaction merging isn’t done at the Voron level, but at the RavenDB level, and instead of submitting primitive operations we can submit full fledged work items, including logic. So writing a document is now done by the request thread parsing the document, preparing a MergedPutCommand class and submitting it to the transaction merger.
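
In very rough terms, and with hypothetical signatures rather than the real RavenDB types, the shape of it is something like this:

using System.Collections.Generic;

// Rough sketch only; none of these are the actual RavenDB / Voron classes.
class WriteTransaction
{
    private readonly Dictionary<string, byte[]> _store = new Dictionary<string, byte[]>();
    public void Put(string id, byte[] document) => _store[id] = document;
}

interface IMergedCommand
{
    // Runs on the transaction merger thread, inside the single write transaction.
    void Execute(WriteTransaction tx);
}

class MergedPutCommand : IMergedCommand
{
    private readonly string _id;
    private readonly byte[] _document;     // prepared by the request thread, before merging

    public MergedPutCommand(string id, byte[] document)
    {
        _id = id;
        _document = document;
    }

    public void Execute(WriteTransaction tx)
    {
        // No concurrency checks here: the merger guarantees there is a single writer.
        tx.Put(_id, _document);
    }
}

The request thread parses the document, builds the command and hands it to the merger; it only gets an answer once a merged transaction containing that command has committed.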

The transaction merger will then execute the command under a write transaction, and it will directly manipulate Voron. This means that we get both high concurrency and safety from concurrency issues at the same time. Early lock release plays into that as well; we had to modify Voron to allow it, but what we did was build low level primitives that can be used by higher levels, without making assumptions about their usage.

On the Voron side of things, we just have the notion of an async commit (with a list of requirements that happens to exactly fit what is going on in the transaction merging portion of RavenDB), and the actual transaction lock handoff / early lock release is handled at a higher layer, with a lot more information about the system.

Low level Voron optimizations: Transaction lock handoff

time to read 6 min | 1006 words

There are some features that, on completion, just made my day/week/month. This is one of them. I’ve only just started recovering from the marathon of building this feature (this is written on Friday, the feature was completed on Sunday, around 11 AM, after about 12 hours or so of work).

Why am I so excited, and why did it merit such effort? Transaction lock handoff is a more accurate name for our version of early lock release, which is a feature that I have been wanting for over four years.

Let me try to explain why this is important. Voron is a single writer storage engine, which means that there can only be a single write transaction at any given point in time. A lot of that is mitigated by transaction merging, which means that we can do a lot of the preparation work ahead of time, and only send the processed work to be done as part of the transaction. But it does mean that there is a single write transaction, and under load, we have the following pattern:

[Image: writes per second over time, oscillating up and down]

The wavy line is # of writes / sec, and you can see that it is going up & down like crazy. The reason for that is that whenever we actually need to commit a transaction, we can’t continue processing requests. They have to wait for the next transaction to start. And that means that the old one has to complete, which requires us to finish doing a write all the way to the disk.

So basically, the drops in performance happen whenever we have to wait for I/O. But we have to wait for the transaction to complete before we can start the next one, so we are effectively bottlenecked.

Early lock release is a technique which alleviates the problem. In effect, instead of waiting for the I/O to complete before starting the next transaction, we start it immediately, in parallel with the I/O work required to commit the previous transaction. The key part here is that we don’t report success on the first transaction until the commit has been successful, and that the 2nd transaction may fail because the first one did (this sounds bad, until you realize that failure to write to disk is pretty much always catastrophic for a database).

If you look at the previous post (from Jan 2014!) about this, you’ll see that we actually implemented that at the time, and rolled it back because it wasn’t doing much for us. I’ll have another post to explain what we are doing differently now that allows us to take full advantage of this.

The idea with early lock release is that the transaction will free its lock as soon as it is done, allowing the next transaction to take that lock while the first one is still waiting for its I/O. This isn’t actually what we have done.

The idea of transaction merging is deeply rooted in the design of RavenDB 4.0, and it isn’t something that we can (or want to) change. About 98% of all write work in RavenDB will always go through the transaction merger. That means that just releasing the lock isn’t really going to do much for us. The transaction merger thread will be busy waiting for the I/O to complete and then start a new transaction (re-acquiring the lock), so there isn’t actually any benefit here.

Instead, we implemented a different system. When a transaction (let’s call it tx #1) is over, it checks whether there is additional work pending, and if there is, tx #1 generates a new transaction (tx #2). The second transaction has the same in memory state as tx #1, including all the modifications that tx #1 has made. More crucially, tx #1 also hands off all of the locks that it holds to tx #2, and then triggers the async process of writing tx #1’s data to the journal.

In the meantime, tx #2 gets to run and operate (and doesn’t have to compete for any locks). Tx #2 will process work until tx #1 has completed its I/O work. At that point, tx #2 will call back into tx #1, letting it complete its commit process, and then the cycle repeats: if there is even more work pending, tx #2 will generate tx #3, transfer the lock to it and initiate an async process of writing to the journal. Tx #3 will run until tx #2 is done with its I/O, and so forth.
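
Stripped of all the details, the cycle looks roughly like the following pseudo code (the method names here are illustrative, not the actual Voron API):

// Illustrative pseudo code of the handoff cycle, not the real implementation.
var tx = BeginWriteTransaction();
while (true)
{
    ExecutePendingMergedCommands(tx);              // work done under the current write tx

    if (HasNoMorePendingWork())
    {
        tx.Commit();                               // nothing is waiting, commit normally
        tx = BeginWriteTransaction();
        continue;
    }

    // Hand off: the next tx inherits this tx's in memory state and its locks,
    // while this tx's journal write starts asynchronously.
    var nextTx = tx.BeginAsyncCommitAndStartNewTransaction();

    ExecutePendingMergedCommands(nextTx);          // runs while the previous tx's I/O is in flight

    tx.EndAsyncCommit();                           // wait for the journal write, finish the commit
    tx = nextTx;                                   // and the cycle repeats
}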

Here is what this looks like:

[Image: two timelines side by side, the transaction merger thread and the async write process]

The thread on the left is the transaction merger, processing incoming write requests. The thread on the right is the one doing the async write process. It is interesting to note that while we call it an async write process, the actual time we spend writing to disk is relatively low; we spend most of our time actually preparing to write. That involves running diffs against the old version, compressing the data, etc.

The end result is that we get several very important properties:

  • We split the transaction processing work and the writes.
  • We get automatic adjustment of the system based on actual load (if the disk is slow, we’ll try to do more work and have larger merged transactions, for example).
  • The transaction merger doesn’t have to compete for the transaction lock.
  • We have managed to increase parallelism in a previously highly serial process.

The details of the change are gnarly, because we had to make sure that pieces of the code that assumed that we are running in a serial fashion can run concurrently, but the performance boost is over 45% under heavy load, and the behavior will auto adjust to handle the specific circumstances at hand, trying to keep all pieces of the system running at full throttle.

Low level Voron optimizations: Recyclers do it over and over again

time to read 5 min | 884 words

One of the key rules in optimization work is that you want to avoid work as much as possible. In fact, any time that you can avoid doing work, that is a great help to the entire system. You can do that with caching, buffering, pooling or many other such common patterns.

With Voron, one of our most common costs is related to writing to files. We are doing quite a lot of work around optimizing that, but in the end, this is file I/O and it is costly.

A big reduction in the cost of doing such I/O is to pre-allocate the journal files. That means that instead of each write extending the file, we ask the operating system to allocate it to its full expected size upfront. This saves time and also ensures that the OS has a chance to allocate the entire file in as few fragments as possible.

However, كل كلب له يومه (every dog has its day), and eventually a journal has outlived its usefulness, which means that it is time to make a hotdog. Or, as the case may be, delete the now useless journal file.

Of course, eventually the current journal file will be full, and we’ll need a new journal file, in which case we’ll ask the OS to allocate us a new one, and pay the cost of doing all of this I/O and the cost of file allocations.

Hm… that seems pretty stupid, doesn’t it, when you think about the whole system like that…

Instead, we now reuse those journals. We rely on the fact that a file rename is atomic on both Windows and Posix, and so we can avoid the expensive allocation calls and reuse the existing files.
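
A minimal sketch of the idea (illustrative file naming and management, not the actual Voron code) looks something like this:

using System.IO;

// Illustrative sketch: park finished journals in a recycle pool and rename them into
// place instead of allocating new files. File.Move is an atomic rename on both Windows
// and Posix when the source and destination are on the same volume.
class JournalRecyclerSketch
{
    private readonly string _dir;

    public JournalRecyclerSketch(string dir)
    {
        _dir = dir;
    }

    public string GetJournalFile(long journalNumber, long sizeInBytes)
    {
        var target = Path.Combine(_dir, journalNumber.ToString("D19") + ".journal");

        foreach (var recyclable in Directory.GetFiles(_dir, "*.recyclable-journal"))
        {
            File.Move(recyclable, target);         // atomic rename, no allocation cost
            return target;                         // note: may still hold old, now obsolete, transactions
        }

        // Nothing to reuse: pay the allocation cost once, up front.
        using (var fs = new FileStream(target, FileMode.CreateNew, FileAccess.Write))
            fs.SetLength(sizeInBytes);             // pre-allocate to the full expected size
        return target;
    }

    public void Retire(string journalPath)
    {
        // Instead of deleting the journal, park it for reuse.
        File.Move(journalPath, journalPath + ".recyclable-journal");
    }
}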

Here is what this looks like when running a heavy write benchmark:

[Image: heavy write benchmark showing journal files being reused]

It is important to note that we also have to do some management here (to only keep pending journals around for a period of time if they aren’t being used), but we also need to handle a very strange case. Because we are now reusing a valid journal file, we have a case where we might read valid transactions, but ones that are obsolete. This means that we need to be aware that beyond just garbage, we might encounter data that looks valid but is actually stale. That made us tighten our journal validation routine by quite a bit.

There is another advantage to this approach: it also plays very well with the underlying hardware. The reuse of the already allocated files means that the disk has to do a lot less work; it reduces fragmentation and allows much faster responses overall. According to research papers, the difference can be a factor of 4 on modern SSD drives. This is a really good thing, since it means that this approach has wide applicability across mass storage devices (SSD, HDD, etc). I actually had a meeting with a storage company to better understand the low level details of how a disk manages the bits, and some of this behavior is influenced by those discussions.

I’m ignoring a lot of previous work that we have done around that (aligned writes, fixed sizes, pre-allocation, etc) of course, and just focusing on the new stuff.

Some of that only applies to that particular manufacturer’s disks, but a lot of it has broader applicability. In short, the idea is that if we can keep the writes we do to a few hot spots, the disk can recognize that and organize things so this is optimized. You can read a bit more about this here, where it discusses the notion of multiple internal storage tiers inside a disk. The idea is that we provide the disk with an easily recognizable pattern of work that it can optimize. We looked at using the disk’s low level options to tell it directly what we expect from it, but that is both hard to do and will only work for specific brands of disks. In particular, with cloud storage, it is very common to just lose all such notions of being able to pass hints to the disk itself, even when the underlying storage could handle it. (In the previous presentation, this is called I/O tagging and latency / priority hints.)

Instead, by intentionally formatting our I/O in an easily recognizable pattern, we have much higher applicability and ensure that the Right Thing will happen. Sequential writes in particular (the exact case for journals) will typically hit a non volatile buffer and stay there for a while, letting the disk optimize its I/O behavior even further.

Another good read on this is here, where it talks about StableBuffer (you can ignore all the other stuff about decomposing and reordering I/O); just the metrics about how much a focused write like that can help are very good.

Other resources also indicate that this is an optimal data access pattern, preserving the most juice from the drive and giving us the best possible performance.

Low level Voron optimizations: High data locality

time to read 3 min | 592 words

After talking about increasing the Voron page size, let us talk about another very important optimization: high data locality. The importance of locality comes up again and again in performance. The cost of getting the next bit of data can be so prohibitively expensive that it dominates everything else, including standard Big O time complexity metrics. A fun discussion of that is here.

Remember that Voron actually stores all data in pages, and that means that it needs some way to allocate new pages. By default, whenever you allocate a page, you get one from the end of the file. In certain scenarios (pure sequential inserts), that generates a pretty good allocation pattern, but even there it can cause issues. Let us consider what the database file looks like after a while:

[Image: database file with pages belonging to the same B+Tree scattered across it]

Imagine that the green sections are all pages that belong to the same B+Tree inside Voron. Traversing the B+Tree now means that we have a very high probability of having to jump around in the file a lot. Since we are memory mapped, we wouldn’t typically feel this, since we aren’t actually hitting the disk that often, but it has several interesting implications:

  • Startup time can increase rapidly, since we need to issue many I/O requests to different places in the file
  • Flush / sync time is also increased, because it needs to touch more of the disk

Trees are typically used for indexes in Voron, and a document collection would typically have a few different storage indexes (lookup by etag, lookup by name, etc). Because they store different data, they have different growth patterns, so they are going to allocate pages at different rates, which means that the scattering of the pages across the data file is even more severe.

The change we just finished implementing is going to do several important things all at once:

  • Pages for all the storage indexes of a collection are going to be pre-allocated, and when they run out, be allocated again in batches.
  • The indexes will ask the storage to allocate pages near the sibling page, to increase locality even further.
  • All indexes will use the same pre-allocation buffer, so they all reside in roughly the same place.
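
In very rough terms (an illustrative sketch, not the actual Voron allocator), the allocation side of this looks like:

using System.Collections.Generic;

// Illustrative sketch of section based page allocation, not the actual Voron code.
class CollectionPageAllocatorSketch
{
    private const int SectionSizeInPages = 64;     // illustrative batch size
    private readonly Queue<long> _freePages = new Queue<long>();
    private long _endOfFile;                       // next page number at the end of the data file

    // All the storage indexes of a collection allocate from the same pre-allocated
    // section, so their pages end up next to each other in the data file.
    public long AllocatePage(long nearbyPage)
    {
        if (_freePages.Count == 0)
            PreallocateSection();

        // A fuller implementation would prefer a page close to nearbyPage (the sibling
        // page); here we simply hand out the next page from the current section.
        return _freePages.Dequeue();
    }

    private void PreallocateSection()
    {
        // Reserve a contiguous run of pages from the end of the file, in one batch.
        for (int i = 0; i < SectionSizeInPages; i++)
            _freePages.Enqueue(_endOfFile + i);
        _endOfFile += SectionSizeInPages;
    }
}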

That also gives us some really interesting optimization opportunities. Since indexes are typically orders of magnitude smaller than the data they cover, it is possible to ask the operating system to prefetch the sections that we reserved for indexes for each collection in advance, leading to far less paging in the future and improving the startup time.

It also means that the operating system can issue a lot more continuous reads and writes, which is perfectly in line with what we want.

The new allocation strategy ends up looking like this:

[Image: database file with index pages grouped into pre-allocated sections]

In this case, we have enough data to fill the first pre-allocated section, and then we allocate a new one. So instead of 4 operations to load things, we can do this in 2.

Even without explicit prefetching on our end, this is going to be great because the operating system is going to be able to recognize the pattern of access and optimize the access itself.

Leaning on the compiler with intentional compilation errors in refactoring

time to read 2 min | 393 words

Recently I had to make a major and subtle change in our codebase. We have a highly used object that does something (allows us to read structured data from a buffer). That object is created at least once per document load, and it is very widely used. It also has a very short lifetime, and it is passed around a lot between various methods. The problem we have is that we don’t want to pay the GC cycles for this object, since that takes away from real performance.

So we want to make it a structure, but at the same time, we don’t want to pass it by value to methods, because it isn’t a small struct. The solution was to make sure that it is a struct that is always passed by reference. But I did mention that it is widely used, right? Just changing it to a struct is going to have a high cost of copying it, and we want to make sure that we don’t swap one cost for another.

The problem is that there is no difference in the code between passing a class argument to a function and passing a struct argument. So the idea is that we’ll lean on the compiler. We’ll change the object’s name to be ObjStruct, and then start going over each usage scenario, fixing it in turn. At the same time, because of things like “var” automatic variables and the lack of ref returns in the current version of C#, we make sure that we change a method like this:

TableValueReader GetCurrentReaderFor(IIterator it);

Into:

void GetCurrentReaderFor(IIterator it, out TableValueReader result);

And that means that the callers of this method will break as well, and so on and so forth until we reach all usage locations. That isn’t a particularly fun activity, but it is pretty straightforward to do, and it allows you to make large scale changes easily and safely.
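
Here is what that looks like on a toy version of the change (illustrative types only; the real TableValueReader is obviously more involved):

// Illustrative example of leaning on the compiler; these are not the actual RavenDB types.
struct TableValueReaderStruct                      // hypothetical name after the rename
{
    public long Id;
    public int Size;
    // ... enough fields that copying it by value is a real cost ...
}

interface IIterator
{
    long CurrentId { get; }
}

static class Example
{
    // Before: returning the value meant every call site compiled silently, whether it
    // copied the struct or not:
    //     TableValueReader GetCurrentReaderFor(IIterator it);

    // After: the out parameter changes the signature, so every existing caller fails to
    // compile until it has been visited and fixed.
    public static void GetCurrentReaderFor(IIterator it, out TableValueReaderStruct result)
    {
        result = new TableValueReaderStruct { Id = it.CurrentId, Size = 0 };
    }

    public static void Caller(IIterator it)
    {
        // var reader = GetCurrentReaderFor(it);    // old call site: no longer compiles
        TableValueReaderStruct reader;
        GetCurrentReaderFor(it, out reader);        // fixed call site, no hidden copy
        // ... use reader ...
    }
}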

Note that this requires two very important features:

  • Reasonable compilation timeframes – you probably wouldn’t use this approach in C++ if your build time was measured in multiple minutes
  • An actual compiler – I don’t know how you do large scale refactoring in dynamic languages. Tests can help, but the feedback cycle is much faster and more straightforward when you deliberately let the compiler know what you want it to do.

Why you should avoid graceful error handling like the plague that it is

time to read 3 min | 536 words

A while ago I was reviewing a pull request by a team member and I realized that I’m looking at an attempt to negotiate graceful termination of a connection between two nodes. In particular, the code in question was invoked when one node was shutting down or had to tear down the connection for whatever reason.

That code was thrown out, and it made a very graceful arc all the way to the recycle bin.

But why? The underlying reason for this was to avoid needless error messages in the logs, which can trigger support calls and cost time & effort to figure out what is going on. That is an admirable goal, but at the same time, it is a false hope and a dangerous one at that.

Let us consider what it means that a node is shutting down. It means that it now needs to notify all its peers about this. It is no longer enough to just tear down all connections; it needs to talk to them, and that means that we have introduced network delays into the shutdown procedure. It also means that we now have to deal with error handling when we are trying to notify a peer that this node is shutting down, and that way lies madness.

On the other hand, we have the other node, which needs to also handle its peer getting up in the middle of the conversation and saying “I’m going away now” mid sentence. For that matter, since the shutdown signal (which is the common case for this to be triggered) can happen at any time, we now need to have thread safety on shutdown so we can send a legible message to the other side, and the other side must be ready to accept the shutdown message at any time. (A “Do you have any new documents for me” request that expects a “There are N messages for you” response now also needs to handle a “G’dbye world” notification.)

Doing this properly complicates the code at every level, and you still need to handle the rude shutdown scenario.

Furthermore, what is the other side supposed to do with the information that this node is shutting down the connection voluntarily? Is it supposed to not connect to it again? If so, what policy should it use to decide whether the other side is down for valid reasons or actually unavailable?

Assuming that there is actually a reason why there is a TCP connection between the two nodes, any interruption in service, for whatever reason, is not a valid state.

And if we ensure that we are always ending the connection in the same rude manner, we also gain a very valuable feature. We make sure that the error handling portion of the code gets exercised on a regular basis, so if there are any issues there, they will be discovered easily.

As for the original issue of reducing support calls because of transient / resolved errors: that can be solved by not logging the error immediately, but waiting a bit to verify that the situation actually warrants writing to the operations log (writing to the info log should obviously happen regardless).

What facts should be resident in a developer’s head?

time to read 2 min | 374 words

I just finished reading this reddit thread about a guy that cheated to get a job (the original post doesn’t really shed a good light on anyone there) and the comments related to that.

That got me thinking about our own interview process and what kind of things I expect people to have in their head, for interviews, but also in general. In particular, this comment:

Honestly I wouldn't hire someone who couldn't code a search or sort algorithm on a whiteboard and tell me how efficient or inefficient it is.

Another example I heard of at a conference is literally examining people on Knuth. To the point where if you didn’t know chapter & verse for Oscillating Sort (literally, in what chapter is that covered), for example, you are obviously worthless and there is no point in continuing the interview process.

For myself, I don’t think that I bother keeping any actual details about algorithms in my head. Much more important is to know about the algorithms, and to be able to have just enough information to recognize that it might be a viable option for my use case and then read up on the details.

In interviews, and in their work, I would expect people to know about big O complexity and be able to tell what kind of complexity a specific piece of code has, as well as whether that is good or not.

As for implementing a sorting algorithm, I sort of remember the details of how quicksort works, for example. But actually writing it from scratch? There are far too many details for that to actually be a viable option:

  • Lomuto or Hoare partitioning?
  • What about choice of pivot?
  • How about parallelism?
  • How do you handle repeated elements?

I just pulled those from the wikipedia page, mostly because I remembered this post about how to answer such interview questions.

Implementation details shouldn’t matter (beyond understanding costs & complexities), so they shouldn’t take up valuable mental space in your head; that is why we can search for them. The important thing is that you know that you need to look up additional information, as well as know how to do so.

Scaffolding code as sign of maturity

time to read 3 min | 548 words

One of the clearest signs of maturity that I’m looking for when reading code is the scaffolding that was used.

Just like in physical construction, it is often impossible to start by building everything properly from the get go, so you start by building temporary scaffolding to get you going. In some cases, those can be things that you need in order to actually build the software, but I have found that scaffolding is mostly useful for debugging and understanding issues.

For example, if I’m working on a complex data structure, it would be very useful to be able to dump it into a human readable format, so I can visually inspect it and understand how it works.

In the recent low level trie example, I have a file dedicated to doing just that; it contains some code to print the trie as well as validate the structure, and it was very useful for figuring out certain things.
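
To give a trivial, self contained example of the kind of scaffolding I mean (this is not the trie code, just an illustration with a toy binary search tree):

using System;
using System.Text;

// Toy binary search tree with the two scaffolding pieces: a human readable dump and an
// invariant check. The data structure itself is beside the point.
class TreeNode
{
    public int Key;
    public TreeNode Left, Right;
}

static class TreeScaffolding
{
    // Dump the tree sideways so it can be eyeballed while debugging.
    public static string Dump(TreeNode node, int depth = 0)
    {
        if (node == null)
            return string.Empty;
        var sb = new StringBuilder();
        sb.Append(Dump(node.Right, depth + 1));
        sb.Append(new string(' ', depth * 4)).Append(node.Key).AppendLine();
        sb.Append(Dump(node.Left, depth + 1));
        return sb.ToString();
    }

    // Validate the ordering invariant; cheap enough to run from tests on every build.
    public static void Validate(TreeNode node, long min = long.MinValue, long max = long.MaxValue)
    {
        if (node == null)
            return;
        if (node.Key <= min || node.Key >= max)
            throw new InvalidOperationException("Key " + node.Key + " violates the tree invariant");
        Validate(node.Left, min, node.Key);
        Validate(node.Right, node.Key, max);
    }
}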

If the project is big enough, and mature enough, the scaffolding takes on a permanent nature. In RavenDB, for example, there are debug endpoints and live visualizations that can help an administrator figure out exactly what is going on with the system. The more complex the system, the more important the scaffolding becomes.

One very important consideration is what kind of scaffolding is being built. For example, if you throw a bunch of printf calls all over the place while you are debugging, that is helpful, but it isn’t something that will remain over time. In many cases, the second or third time that you find yourself having to add code to help you narrow down a problem, you want to make this sort of code a permanent part of your codebase.

One of the more complex pieces in building Voron was the B+Tree behavior, in particular when dealing with page splits and variable size data. We built a lot of code that verifies the structure and invariants for us, and it is running as part of our CI builds to ensure that stuff doesn’t sneak in.

One of the things that we teach our new hires is that we need to have not just working code, but also all of the surrounding infrastructure: everything that I need to work with, diagnose and manage the system in production over long periods of time. I distinctly remember a debugging session where we had to add a whole bunch of diagnostics code to figure out some really gnarly issue. I was pairing with another dev on that on his machine, and we were working on it for a long time. We finally managed to figure out what the problem was and fix it, and I left and got the final PR with the fix later in the day.

None of the diagnostics code was in it, and when I asked why the answer was: “We fixed the problem, and we didn’t need it, so I removed it.”

That is not the kind of thing that you remove, that is the kind of thing that you keep, because you can bet your bottom dollar that we’ll need it again, when the next problem shows up.

