Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 11 | Comments: 40

filter by tags archive

Concurrent max value

time to read 2 min | 324 words

For a feature in RavenDB, I need to figure out the maximum number of outputs per document an index has. Now, indexing runs in multiple threads at the same time, so we ended up with the following code:

var actualIndexOutput = maxActualIndexOutput;
if (actualIndexOutput > numberOfAlreadyProducedOutputs)
{
    // okay, now let verify that this is indeed the case, in thread safe manner,
    // this way, most of the time we don't need volatile reads, and other sync operations
    // in the code ensure we don't have too stale a view on the data (beside, stale view have
    // to mean a smaller number, which we then verify).
    actualIndexOutput = Thread.VolatileRead(ref maxActualIndexOutput);
    while (actualIndexOutput > numberOfAlreadyProducedOutputs)
    {
        // if it changed, we don't care, it is just another max, and another thread probably
        // set it for us, so we only retry if this is still smaller
        actualIndexOutput = Interlocked.CompareExchange(
            ref maxActualIndexOutput, 
            numberOfAlreadyProducedOutputs,
            actualIndexOutput);
    }
}

The basic idea is that this code path is hit a lot, once per document indexed per index. And it also needs to be thread safe, so we first do an unsafe operation, then a thread safe operations.

The idea is that we’ll quickly arrive at the actual max number of index inputs,  but we don’t have to pay the price of volatile reads or thread synchronization.

Optimizing I/O throughput

time to read 7 min | 1277 words

We got a customer request about performance issues they were seeing on startup on a particular set of machines.

Those machine run in a cloud environment, and they have… peculiar, one might say deviant, I/O characteristics. In particular, the I/O pipeline on those machines is wide, but very slow. What do I mean by that? I meant that any particular I/O operation on those is likely to be slow, but the idea is that you can get much better performance if you issue concurrent I/O. The system is supposed to be able to handle that much better, and overall you’ll see the same relative performance as elsewhere.

This is pretty big issue for us, because for many things, we really do care about serial I/O performance. For example, if we are committing a transaction, we really have no other way to handle it except to wait until the I/O is fully completed.

That said, the particular scenario where we had the problem was startup. If the database was under heavy load at the time it shut down, the recovery logs would be full, and the database would need to replay the recent actions that happened. Note that shutdown performance is important, because it many cases we are running in an environment where shutdown comes with a ticking clock (in IIS or as a Windows Service).

At startup, we usually have more time, and it is expected that we’ll take a while to get up to speed. If nothing else, just bringing enough of the database to memory is going to take time, so on large databases, startup time is expected to be non trivial.

That said, the startup time on those set of machines was utterly atrocious. To figure out what is going on, I pulled out Process Monitor and looked at the File I/O. We go this:

image

We are reading from a journal, and that is pretty serial I/O (in the image, I’m running of a remote network drive, to simulate slow responses). Note that we need to read the log in a serial fashion, and the way the OS reads things, we read 32Kb at a time.

Remember, we are reading things in a serial fashion, and that means that we have a lot of page faults, and we have a slow I/O system, and we execute them serially.

Yes, that is a killer for perf. By the way, when I’m talking about slow I/O system, I’m talking about > 0.5 MS per disk read for most requests (ideally, we would have latency of 0.05 – 0.15). And we have quite a few of those, as you can imagine.

Since I know that we are going to be reading the whole journal, I used the PrefetchVirtualMemory() method and passed it the entire file (it is a maximum of 64MB, and we are going to need to read it all anyway). This let the OS have the maximum amount of freedom when reading the data, and it generate big, concurrent I/O. Here is how this looks like:

image

This also give the wide I/O bandwidth a chance to play. We load the I/O subsystem with a lot of stuff that it can try to do in an optimized fashion.

The next part that was expensive was that we need to apply the data from the journal files to the data file, and sync it.

The performance of syncing a file is related to the size of the file, unfortunately. And the file in question was large, over 45GB. Especially on such a system, we saw a lot of latency here, as in multiple minutes. One obvious optimization was to not sync per journal file, but sync once per the whole recovery process. That helped, but it was still too expensive.

Next, we tried pretty much everything we could think about.

  • Switching to WriteFile (from using mmap and then calling FlushViewOfFile)
  • Using async I/O (WriteFileEx)
  • Using scatter / gather I/O with no buffering (saves the need to do sync in the end)
  • Completion ports
  • Asking a 4 months old baby girl what she think about it (she threw up on the keyboard, which is what I wanted to do at the time, then she cried, and I joined her)

Nothing seems to have worked. The major issue was that in this workload, we have a large file (45GB, as I said) and we are writing 4KB pages into it in effectively random places. In the workload we were trying to work with, there were roughly 256,000 individual 4KB writes (most of them weren’t consecutive, so we couldn’t get the benefit of that). That is about 1 GB of writing to do.

And nothing we could do would get us beyond 3MB/sec or so. Saturating the I/O subsystem with hundreds of thousands of small writes wouldn’t work, and we were at a loss. Note that a small test we made, just copying data around manually has resulted in roughly 10MS/sec peek performance on those machines. This is a very lame number, so there isn’t much that we can do.

Then I thought to ask, why are we seeing this only during startup? Surely this happens also on a regular basis. Why didn’t we notice?

The reason for that is pretty simple, we didn’t notice because we amortize the cost. Only on startup did we had to actually sit and wait for it to complete. So we dropped that requirement. We used to read all the journals, apply them to the data file, sync the data files and then delete the journals. Now we read the journals, apply them (via a memory map) to the data file, and only remember what is the last journal file we applied in memory.

There is a background process running that will take care of syncing the data file (and deleting the old journals). If we crash again, we’ll just have to replay the logs that we aren’t sure were synced before. This saves even more time.

But we still have another issue. Writing to memory mapped file require the OS to page the relevant pages into memory. And again, we are on slow I/O, and the OS will only page the stuff that we touch, so this is again a serial process that this time require us to load to memory about 1GB of data at 3MB/sec. That is… not a good place to be at. So the next step was to figure out all the addresses we’ll be writing to, and letting the OS know that we’ll be fetching them. We do some work to make sure that we load those values (and neighboring pages) to memory, then we can write to them without paging for each page individually.

A nice side effect of this is that because this is running on the latest changes in the system, this has the effect of preloading to memory the pages that are likely to be in used after the database has started.

That is a lot of work, but to be perfectly frank, this is mostly optimizing in a bad environment. The customer can’t walk away form their current machine easily, but the I/O rates those machines have would make any database sit in a corner and cry.

Unsafe operations are required in the real world

time to read 2 min | 363 words

There is a pretty interesting discussion in the Raft mailing list, about clarifying some aspects of the Raft protocol. This led to some in depth discussion on the difference between algorithms in their raw state and the actual practice that you need in the real world.

In case you aren’t aware, Raft is a distributed consensus protocol. It allows a group of machines to reach a decision together (a gross over simplification, but good enough for this).

In a recent post, I spoke about dedicated operations bypasses. This discussion surfaced another one. One of the things that make Raft much simpler to work with is that it explicitly handles topology changes (adding / removing nodes). Since that is part of the same distributed consensus algorithm, it means that it is safe to add or remove a node at runtime.

Usually when you build a consensus algorithm, you are very much focused on safety. So you make sure that all your operations are actually safe (otherwise, why bother?). Except that you must, in your design, explicitly allow the administrator to make inherently unsafe operations. Why?

Consider the case of a three node cluster, that has been running along for a while now. A disaster strikes, and two of the nodes die horriblyh. This puts our cluster in a situation where it cannot get a majority, and nothing happen until at least one machine is back online. But those machines aren’t coming back. So our plucky admin wants to remove the two dead servers from the cluster, so it will have one node only, and resume normal operations (then the admin can add additional nodes at leisure). However, the remaining node will refuse to remove any node from the cluster. It can’t, it doesn’t have a majority.

If sounds surprisingly silly, but you actually have to build into the system the ability to make those changes with explicit admin consent as unsafe operations. Otherwise, you might end up with a perfectly correct system, that breaks down horribly in failure conditions. Not because it buggy, but because it didn’t take into account that the real world sometimes requires you to break the rules.

The “you broke the build!” game

time to read 2 min | 344 words

Recently I pulled some code from a colleague, and tried to test it. It worked, which was fine, so I let it run the tests, and went out to lunch.

When I came back, I was surprised to discover that the build has failed, not because of some test failing, but because it couldn’t compile. To be rather more exact, we go the following error:

 [optimized-build] Using type script compiler: C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.4\tsc.exe
 [optimized-build]
 [optimized-build] System.ComponentModel.Win32Exception thrown:
 [optimized-build] --------------------------
 [optimized-build] The filename or extension is too long
 [optimized-build] --------------------------

That was strange. I checked several times, and we had no such thing. No one had a veryVeryLongFileNameThatNeededToBeVeryExplicitAboutWhatItWasDoingAndDidNotCareAboutLength.ts.

And the tsc.exe location was in its normal place. This is from a part in our build process that gather all the TypeScript files and merge them into a single optimized bundle. And it suddenly failed. Now, on the colleague machine, it worked. The previous commit before I merged it, it worked. The merge was a clean one, and very obvious that nothing was going on there.

It took me a while, but I finally figured out that the error occurred because my colleague has added a new TypeScript file.

How can adding a file break the build?

As it turns out, the code we were calling did something like this:

tsc.exe file1.ts file2.ts file3.ts

And at some point, the size of the command line we were passing to the process has exceeded 32KB. And that is when everything broke loose.

Luckily, we were able to write those things to a file and ask the compiler to load them directly, instead of passing them via the command line:

tsc.exe @arguments.txt

And everything was okay again…

Dedicated operations road bypasses

time to read 2 min | 398 words

We care very deeply about the operations side of RavenDB. Support calls are almost never about “where are you? I want to send you some wine & roses”, and they tend to come at unpleasant timing. One of the things that we had learnt was that when stuff breaks, it tend to do so in ways that are… interesting.

Let me tell you a story…

A long time ago, a customer was using an index definition that relied on the presence of a custom assembly to enrich the API available for indexing. During the upgrade process from one major version of RavenDB to the next, they didn’t take into account that they need to also update the customer assembly.

When they tried to start RavenDB, it failed because of the version mismatch, since they weren’t actually using that index anyway. The customer then removed the assembly, and started RavenDB again. At this point, the following sequence of events happened:

  • The database started, saw that it is using an old version of the storage format, and converted to the current version of the storage format.
  • The database started to load the indexes, but the index definition was invalid without the customer assembly, so it failed. (Index definitions are validated at save time, so the code didn’t double check that at the time).

The customer was now stuck, the database format was already converted, so in order to rollback, they would need to restore from backup. They could also not remove the index from the database, because the database wouldn’t start to let them do so. Catch 22.

At this point, the admin went into the IndexDefinitions directory, and deleted the BadIndex.index-definition file, and restarted RavenDB again. The database then recognized that the index definition is missing, but the index exists, deleted the index from the server, and run happily ever after.

Operations road bypass is our terminology for giving administrators a path to changing internal state in our system using standard tools, without requiring the system to be up and functioning. The example with the index definition is a good one, because the sole reason we keep the index definition on disk is to allow administrators the ability to touch them without needing RavenDB in a healthy state.

What do you do in your system to make it possible for the admin to recover from impossible situations?

Repeatable random tests

time to read 3 min | 460 words

Testing our software is something that we take very serious. And in some cases, we want to go beyond testing stuff that we know. We want to test random stuff. For example, if we add 10,000 documents, then remove every 17th, what happens? Is there any differences in behavior, performance, etc?

It is easy to do random stuff, of course. But that leads to an interesting case. As long as the tests are passing, you can pat yourself on the back: “We have done good, and everything works as it should”.

But when something fails… well, the only thing that you know is that something did fail. You don’t have a way to reproduce this. Because the test is… random.

In order to handle that, we wrote the following code:

 [AttributeUsage(AttributeTargets.Method, AllowMultiple = true)]
 public class InlineDataWithRandomSeed : DataAttribute
 {
 
     public InlineDataWithRandomSeed(params object[] dataValues)
     {
         this.DataValues = dataValues ?? new object[] {null};
     }
 
     public object[] DataValues { get; set; }

     public override IEnumerable<object[]> GetData(MethodInfo methodUnderTest, Type[] parameterTypes)
     {
         var objects = new object[DataValues.Length+1];
         Array.Copy(DataValues,0,objects,0, DataValues.Length);
         objects[DataValues.Length] = Environment.TickCount;
         yield return objects;
         
     }
 }

This is using XUnit, which gives us the ability to add a seed to the test. Let us see how the test looks:

image

And this is what this looks like when it runs:

image

When we have a failure, we know what the seed is, and we can run the test with that seed, and see what exactly happened there.

Presenting, Highly Available & Scalable Solutions at GOTO Copenhagen

time to read 1 min | 168 words

I’ll be presenting at the GOTO Copenhagen conference in Oct 7 – 8 this year. The full session summary is:

Presentation: Highly Available & Scalable Solutions with RavenDB

Track: Solutions Track 1 / Time: Monday 13:20 - 14:10 / Location: Rosenborg

RavenDB is a 2nd generation document database, with built-in load distribution, seamless replication, disaster recovery and data-driven sharding.

In this session, we are going to explore how RavenDB deals with scaling under load and remain highly available even under failure conditions.

We'll see how RavenDB's data-driven sharding allows to increase the amount of the data in our cluster without giving up the benefits of data locality.

We are are going to execute complex distributed map-reduce queries on a sharded cluster, giving you lightning-fast responses over very large data volumes.

Hibernating Rhinos will also be presenting at a booth, and we’ll have a few members of the core team there to talk about RavenDB and the cool things that you can do with it.

Code reading: Wukong full-text search engine

time to read 11 min | 2033 words

I like reading code, and recently I was mostly busy with moving our offices, worrying about insurance, lease contracts and all sort of other stuff that are required, but not much fun. So I decided to spend a few hours just going through random code bases and see what I’ll find.

I decided to go with Go projects, because that is a language that I can easily read, and I headed out to this page. Then I basically scanned the listing for any projects that took my fancy. The current one is Wukong, which is a full text search engine. I’m always interested in those, since Lucene is a core part of RavenDB, and knowing how others are implementing this gives you more options.

This isn’t going to be a full review, merely a record of my exploration into the code base. There is one problem with the codebase, as you can see here:

image

All the text is in Chinese, which I know absolutely nothing about, so comments aren’t going to help here. But so far, the code looks solid. I’ll start from the high level overview:

image

We can see that we have a searcher (of type Engine), and we add documents to it, then we flush the index, and then we can search it.

Inside the engine.Init, the first interesting thing is this:

image 

This is interesting because we see sharding from the get go. By default, there are two shards. I’m sure what the indexers are, or what they are for yet.

Note that there is an Indexer and a Ranker for each shard. Then we have a bunch of initialization routines that looks like so:

image

So we create two arrays of channels (the core Go synchronization primitive), and initialize them with a channel per shard. That is a pretty standard Go behavior, and usually there is a go routine that is spinning on each of those channels. That means that we have (by default) two “threads” for adding documents, and two for doing lookups. No idea what either one of those are yet.

There is a lot more like this, for ranking, removing ranks, search ranks, etc, but that is just more of the same, and I’ll ignore it for now. Finally, we see some actions when we start producing threads to do actual works:

image

As you can see, spin off go routines (which will communicate with the channels we already saw) to do the actual work. Note that per shard, we’ll have as many index lookup and rank workers as we have CPUs.

There is an option to persist to disk, using the kv package, this looks like this:

image

On the one hand, I really like the “let us just import a package from github” option that go has, on the other hand. Versioning control has got to be a major issue here for big projects.

Anyway, let us get back to the actual hot path I’m interested in, indexing documents. This is done by this function:

imagea

Note that we have to supply the docId externally, instead of creating it ourselves. Then we just hash the docId and the actual content and send the data to the segmenter to do work. Currently I think that the work of the segmenter is similar to the work of the analyzer in Lucene. To break apart the content to discrete terms. Let us check…

image

this certainly seems to be the case here, yep. This code run on as many cores as the machine has, each one handling a single segmentation request. Once that work is done, it is sent to another thread to actually do the indexing:

image

Note that the indexerRequest is basically an array of all the distinct tokens, their frequencies and the start positions in the original document. There is also handling for ranking, but for now I don’t care about this, so I’ll ignore this.

Sending to the indexing channel will end up invoking this code:

image

And the actual indexing work is done inside AddDocument. A simplified version of it is show here:

image

An indexer is protected by a read/write mutex (which explains why we want to have sharding, it gives us better concurrency, without having to go to concurrent data structures and it drastically simplify the code.

So, what is going on in here? For each of the keywords we found on the document, we create a new entry in the table’s dictionary. With a value that contains the matching document ids. If there are already documents for this keyword, we’ll search for the appropriate position in the index (using simple binary search), then place the document in the appropriate location. Basically, the KeywordIndices is (simplified) Dictionary<Term : string , SortedList<DocId : long>>.

So that pretty much explains how this works. Let us look at how searches are working now…

The first interesting thing that we do when we get a search request is tokenize it (segmentize it, in Wukong terminology):

image

Then we call this code:

image

This is a fairly typical Go code. We create a bounded channel (that has a capacity as the same number of the shards), and we send it to all the shards. We’ll get the reply from all of the shards, then do something with the results from all the shards.

I like this type of interaction because it is easy to model concurrent interactions with it, and it is pervasive in Go. Seems simpler than the comparable strategies in .NET, for example.

Here is a simple example of how this is used:

image

This is the variant without the timeout. And we are just getting the results from all the shards, note that we don’t have any ordering here, we just add all the documents into one big array. I’m not sure how/if Wukong support sorting, there was a lot of stuff about ranking earlier in the codebase, but that doesn’t seem to be apparent in what I saw so far, I’ll look at it later. For now, let us see how a single shard is handling a search lookup request.

What I find most interesting is that there is rank options there, and document ids, which I’m not sure what they are used for. We’ll look at that later, for now, the first part of looking up a search term is here:

image

We take a read lock, then look at the table. We need to find entries for all the keywords that we have in the query to get a result back.

This indicates that a query to Wukong has an implicit AND between all the terms in the query. The result of this is an array with all the indices for each keyword. It then continues to perform set intersection between all the matching keywords, to find all the documents that appear in all of them. Following that, it will compute the BM25 (a TF-IDF function that is used to compute ranking).  After looking at the code, I found where it is actual compute the ranking. It is doing that after getting the results from all the shards, and then it is going to sort them according to their overall scores.

So that was clear enough, and it makes a lot of sense. Now, the last thing that I want to figure out before we are done, is how does Wukong handles deletions?

It turns out that I actually missed part in the search process. The indexer will just find the matching documents, but their BM25 score. It is the ranker (which is sent from the indexer, and then replying to the engine) that will actually sort them. This gives the user the chance to add their own scoring mechanism. Deletion is actually handled as a case where you have nothing to score with, and it gets filtered along the way as an uninteresting value. That means that the memory cost of having a document index cannot be alleviated by deleting it. The actual document data is still there and is kept.

It also means that there is no real facility to update a document. For example, if we have a document whose content used to say Ayende and we want to change it to Oren. We have no way of going to the Ayende keyword and removing it from there. We need to delete the document and create a new one, with a new document id.

Another thing to note is that this has very little actual functionality. There is no possibility of making complex queries, or using multiple fields. Another thing that is very different from how Lucene works is that is runs entirely in memory. While it has a persistent option, that option is actually just a log of documents being added and removed. On startup, it will need to go through the log and actually index all of them again. That means that for large systems, it is going to be a prohibitly expensive startup cost.

All in all, that is a nice codebase, and it is certainly simple enough to grasp without too much complexity. But one need to be aware of the tradeoffs associated with actually using it. I expect it to be fast, although the numbers mentioned in the benchmark page (if I understand the translated Chinese correctly) are drastically below what I would expect to see. Just to give you some idea, 1,400 requests a second are a very small number for an in memory index. I would expect something like 50,000 or so, assuming that you can get all cores to participate. But maybe they are counting this through the network ?  

Turning bugs into features

time to read 1 min | 186 words

I’m using R# for over a decade now, and it has gotten to the point where I’m actually able to utilize R# bugs to get things working better for me.

In this case, the scenario is using the Find Usages as a refactoring aid. I have a tricky refactoring to do, which require me to touch several pieces of code. In order to handle this properly, I start by Finding Usages on the relevant item. In this case, ByView.

Then I go to each of those code locations and make the require modifications.

image

So far, so good, but on complex scenarios, it is hard to remember which portions I have done and which portions are still left undone. In order to handle this, I Ctrl+X, Ctrl+Z the line I find. R# detect this as a change to the code that invalidate the found usage, and suddenly, I got a nice todo list with a strikethrough for completed tasks.

FUTURE POSTS

  1. Production postmortem: The case of the memory eater and high load - 15 minutes from now
  2. Production postmortem: The case of the lying configuration file - about one day from now
  3. Production postmortem: The industry at large - 2 days from now
  4. The insidious cost of allocations - 3 days from now
  5. Find the bug: The concurrent memory buster - 4 days from now

And 4 more posts are pending...

There are posts all the way to Sep 10, 2015

RECENT SERIES

  1. Find the bug (5):
    20 Apr 2011 - Why do I get a Null Reference Exception?
  2. Production postmortem (10):
    14 Aug 2015 - The case of the man in the middle
  3. What is new in RavenDB 3.5 (7):
    12 Aug 2015 - Monitoring support
  4. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats