Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Optimizing I/O throughput

time to read 7 min | 1277 words

We got a customer request about performance issues they were seeing on startup on a particular set of machines.

Those machines run in a cloud environment, and they have… peculiar, one might say deviant, I/O characteristics. In particular, the I/O pipeline on those machines is wide, but very slow. What do I mean by that? Any individual I/O operation on them is likely to be slow, but you can get much better performance if you issue concurrent I/O. The system is supposed to handle that much better, and overall you'll see the same relative performance as elsewhere.

This is a pretty big issue for us, because for many things we really do care about serial I/O performance. For example, if we are committing a transaction, we really have no other way to handle it except to wait until the I/O is fully completed.

That said, the particular scenario where we had the problem was startup. If the database was under heavy load at the time it shut down, the recovery logs would be full, and the database would need to replay the recent actions. Note that shutdown performance is important, because in many cases we are running in an environment where shutdown comes with a ticking clock (in IIS or as a Windows Service).

At startup, we usually have more time, and it is expected that we’ll take a while to get up to speed. If nothing else, just bringing enough of the database to memory is going to take time, so on large databases, startup time is expected to be non trivial.

That said, the startup time on that set of machines was utterly atrocious. To figure out what was going on, I pulled out Process Monitor and looked at the file I/O. We got this:


We are reading from a journal, and that is pretty serial I/O (in the image, I'm running off a remote network drive, to simulate slow responses). Note that we need to read the log in a serial fashion, and the way the OS reads things, we read 32KB at a time.

Remember: we are reading serially, each read triggers a page fault, and on a slow I/O system those faults are served one at a time.

Yes, that is a killer for perf. By the way, when I'm talking about a slow I/O system, I'm talking about > 0.5 ms per disk read for most requests (ideally, we would have latency of 0.05 – 0.15 ms). And we have quite a few of those reads, as you can imagine.

Since I know that we are going to be reading the whole journal, I used the PrefetchVirtualMemory() method and passed it the entire file (it is a maximum of 64MB, and we are going to need to read it all anyway). This lets the OS have the maximum amount of freedom when reading the data, and it generates big, concurrent I/O. Here is how this looks:
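To get the same effect outside of Windows, the POSIX madvise(MADV_WILLNEED) hint plays the same role as PrefetchVirtualMemory(). A minimal Python sketch, where the temp file stands in for a journal:

```python
import mmap, os, tempfile

# Write a small file to stand in for a journal.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (1 << 20))  # 1 MB of data
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Hint that we will need the whole mapping soon. The kernel is then
    # free to issue large, concurrent reads, instead of serving one
    # small read-ahead window per page fault.
    if hasattr(mmap, "MADV_WILLNEED"):  # POSIX only; not on Windows
        mm.madvise(mmap.MADV_WILLNEED)
    data = mm[:]  # reading now mostly hits pages already in cache
    mm.close()
os.unlink(path)
```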


This also gives the wide I/O bandwidth a chance to play. We load the I/O subsystem with a lot of work that it can try to do in an optimized fashion.

The next part that was expensive was that we need to apply the data from the journal files to the data file, and sync it.

The performance of syncing a file is, unfortunately, related to the size of the file. And the file in question was large, over 45GB. Especially on such a system, we saw a lot of latency here, as in multiple minutes. One obvious optimization was to not sync per journal file, but sync once for the whole recovery process. That helped, but it was still too expensive.
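The "sync once for the whole recovery" idea can be sketched like this. The journal format here (a list of offset/bytes writes) is made up for illustration:

```python
import os, tempfile

def recover(data_path, journals):
    """Apply every journal, then sync the data file once.

    Illustrative sketch: each journal is a list of (offset, bytes)
    writes. Syncing once at the end amortizes the size-dependent cost
    of fsync over the whole recovery, instead of paying it per journal.
    """
    with open(data_path, "r+b") as f:
        for journal in journals:
            for offset, page in journal:
                f.seek(offset)
                f.write(page)
        f.flush()
        os.fsync(f.fileno())  # one sync for the whole recovery

# Hypothetical usage with two tiny journals.
fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 4096 * 4)
os.close(fd)
recover(path, [[(0, b"A" * 4096)], [(4096, b"B" * 4096)]])
with open(path, "rb") as f:
    content = f.read()
os.unlink(path)
```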

Next, we tried pretty much everything we could think about.

  • Switching to WriteFile (from using mmap and then calling FlushViewOfFile)
  • Using async I/O (WriteFileEx)
  • Using scatter / gather I/O with no buffering (saves the need to do sync in the end)
  • Completion ports
  • Asking a 4 month old baby girl what she thinks about it (she threw up on the keyboard, which is what I wanted to do at the time, then she cried, and I joined her)

Nothing seems to have worked. The major issue was that in this workload, we have a large file (45GB, as I said) and we are writing 4KB pages into it in effectively random places. In the workload we were trying to work with, there were roughly 256,000 individual 4KB writes (most of them weren’t consecutive, so we couldn’t get the benefit of that). That is about 1 GB of writing to do.
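To see why the random placement hurts, here is a sketch of the only merging available to us: coalescing writes that happen to land on adjacent pages into a single larger I/O. In this workload, almost nothing merged:

```python
def coalesce(offsets, page_size=4096):
    """Merge 4KB page writes into contiguous runs.

    Illustrative sketch: only adjacent pages can be combined into one
    larger I/O, and in the workload described almost none of the
    ~256,000 writes were adjacent, so nearly every write stayed a
    separate 4KB operation.
    """
    runs = []
    for off in sorted(offsets):
        if runs and off == runs[-1][0] + runs[-1][1]:
            # Extends the previous run: grow it instead of adding a new I/O.
            runs[-1] = (runs[-1][0], runs[-1][1] + page_size)
        else:
            runs.append((off, page_size))
    return runs
```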

And nothing we could do would get us beyond 3MB/sec or so. Saturating the I/O subsystem with hundreds of thousands of small writes wouldn't work, and we were at a loss. Note that in a small test we made, just copying data around manually resulted in roughly 10MB/sec peak performance on those machines. This is a very lame number, so there isn't much that we can do.

Then I thought to ask, why are we seeing this only during startup? Surely this happens also on a regular basis. Why didn’t we notice?

The reason for that is pretty simple: we didn't notice because we amortize the cost. Only on startup did we have to actually sit and wait for it to complete. So we dropped that requirement. We used to read all the journals, apply them to the data file, sync the data file and then delete the journals. Now we read the journals, apply them (via a memory map) to the data file, and only remember in memory which journal file we applied last.

There is a background process running that will take care of syncing the data file (and deleting the old journals). If we crash again, we’ll just have to replay the logs that we aren’t sure were synced before. This saves even more time.
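The deferred-sync arrangement can be sketched as follows; the names and callbacks are hypothetical, not the actual RavenDB code:

```python
class BackgroundSyncer:
    """Sketch of deferring the expensive data-file sync.

    Recovery just records the last applied journal; a background
    process syncs the data file and only then deletes journals up to
    that point. After a crash, journals that were not yet synced are
    simply replayed again.
    """
    def __init__(self, sync_fn, delete_journal_fn):
        self._sync = sync_fn
        self._delete = delete_journal_fn
        self.last_applied = None
        self.last_synced = None

    def journal_applied(self, journal_id):
        self.last_applied = journal_id  # recovery only remembers this

    def sync_now(self):  # would normally run on a background thread/timer
        pending = self.last_applied
        if pending is not None and pending != self.last_synced:
            self._sync()  # the expensive, size-dependent sync
            first = 0 if self.last_synced is None else self.last_synced + 1
            for j in range(first, pending + 1):
                self._delete(j)  # safe: their effects are now durable
            self.last_synced = pending
```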

But we still have another issue. Writing to a memory mapped file requires the OS to page the relevant pages into memory. And again, we are on slow I/O, and the OS will only page in what we touch, so this is again a serial process, this time requiring us to load about 1GB of data to memory at 3MB/sec. That is… not a good place to be. So the next step was to figure out all the addresses we'll be writing to, and let the OS know that we'll be fetching them. We do some work to make sure that we load those values (and neighboring pages) to memory, and then we can write to them without paging each page in individually.
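Figuring out the addresses ahead of time can be sketched like this: collect the pages we are about to write (plus their neighbors) and merge them into a few large ranges to hand to the OS for prefetching. An illustrative sketch, not the actual implementation:

```python
def prefetch_ranges(write_offsets, page_size=4096, neighbors=1):
    """Turn the set of pages we are about to write into merged
    prefetch ranges, each widened by `neighbors` pages on both sides,
    so the OS can load them with large concurrent reads instead of
    faulting 4KB at a time. Returns (offset, length) pairs in bytes.
    """
    pages = set()
    for off in write_offsets:
        page = off // page_size
        for p in range(max(0, page - neighbors), page + neighbors + 1):
            pages.add(p)
    # Merge adjacent pages into [start, end) runs.
    runs = []
    for p in sorted(pages):
        if runs and p == runs[-1][1]:
            runs[-1] = (runs[-1][0], p + 1)
        else:
            runs.append((p, p + 1))
    return [(start * page_size, (end - start) * page_size)
            for start, end in runs]
```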

A nice side effect of this is that because this is running on the latest changes in the system, it has the effect of preloading to memory the pages that are likely to be in use after the database has started.

That is a lot of work, but to be perfectly frank, this is mostly optimizing in a bad environment. The customer can't walk away from their current machines easily, but the I/O rates those machines have would make any database sit in a corner and cry.

Dedicated operations road bypasses

time to read 2 min | 398 words

We care very deeply about the operations side of RavenDB. Support calls are almost never about "where are you? I want to send you some wine & roses", and they tend to come at unpleasant times. One of the things that we have learnt is that when stuff breaks, it tends to do so in ways that are… interesting.

Let me tell you a story…

A long time ago, a customer was using an index definition that relied on the presence of a custom assembly to enrich the API available for indexing. During the upgrade process from one major version of RavenDB to the next, they didn't take into account that they also needed to update the custom assembly.

When they tried to start RavenDB, it failed because of the version mismatch. Since they weren't actually using that index anyway, the customer removed the assembly and started RavenDB again. At this point, the following sequence of events happened:

  • The database started, saw that it was using an old version of the storage format, and converted it to the current version of the storage format.
  • The database started to load the indexes, but the index definition was invalid without the custom assembly, so it failed. (Index definitions are validated at save time, so the code didn't double check them at load time.)

The customer was now stuck: the database format was already converted, so in order to roll back, they would need to restore from backup. They also could not remove the index from the database, because the database wouldn't start to let them do so. Catch 22.

At this point, the admin went into the IndexDefinitions directory, deleted the BadIndex.index-definition file, and restarted RavenDB again. The database then recognized that the index definition was missing but the index existed, deleted the index from the server, and ran happily ever after.

Operations road bypass is our terminology for giving administrators a path to changing internal state in our system using standard tools, without requiring the system to be up and functioning. The example with the index definition is a good one, because the sole reason we keep index definitions on disk is to allow administrators to touch them without needing RavenDB in a healthy state.

What do you do in your system to make it possible for the admin to recover from impossible situations?

Presenting, Highly Available & Scalable Solutions at GOTO Copenhagen

time to read 1 min | 168 words

I’ll be presenting at the GOTO Copenhagen conference on Oct 7 – 8 this year. The full session summary is:

Presentation: Highly Available & Scalable Solutions with RavenDB

Track: Solutions Track 1 / Time: Monday 13:20 - 14:10 / Location: Rosenborg

RavenDB is a 2nd generation document database, with built-in load distribution, seamless replication, disaster recovery and data-driven sharding.

In this session, we are going to explore how RavenDB deals with scaling under load and remains highly available even under failure conditions.

We'll see how RavenDB's data-driven sharding allows us to increase the amount of data in our cluster without giving up the benefits of data locality.

We are going to execute complex distributed map-reduce queries on a sharded cluster, giving you lightning-fast responses over very large data volumes.

Hibernating Rhinos will also be presenting at a booth, and we’ll have a few members of the core team there to talk about RavenDB and the cool things that you can do with it.

Production postmortem: The case of the man in the middle

time to read 3 min | 553 words

One of the most frustrating things when you are dealing with production issues is when the problem is not in your product, but elsewhere. In particular, this post is dedicated to the hard work done by many anti virus products to make our life harder.

Let us take a look at the following quote, taken from the ESET NOD32 Anti Virus knowledge base (emphasis mine):

By default, your ESET product automatically detects programs that are used as web browsers and email clients, and adds them to the list of programs that the internal proxy scans. This can cause loss of internet connectivity or other undesired results with applications that use network features but are not web browsers/email clients.

Yes, it can. In fact, it very often does.

Previously, we looked at a similar issue with Anti Virus slowing down I/O enough to cause us to slowly die. But in this case, the issue is a lot more subtle.

Because it is doing content filtering, it tends to put a much higher overhead on system resources, which means that as far as the user is concerned, RavenDB is slow. We actually developed features specifically to handle this scenario. The traffic watch mode will tell you how much time you spend on the server side, and we have added a feature that will make RavenDB account for the internal work each query is doing, so we can tell where the actual cost is.

You can enable that by issuing:

GET databases/Northwind/debug/enable-query-timing

And once that is set up, you can get a good idea about what is costly in the query, as far as RavenDB is concerned. Here is an example of a very slow query:


You can see that the issue is that we are issuing a very wide range query, so most of the time is spent inside Lucene. Other examples might be ridiculously complex queries, which result in high parsing time (we have seen queries in the hundreds of KB range). Or loading a lot of big documents, or… you get the drift. If we see that the server thinks that a query is fast, but the overall time is slow, we know to blame the network.
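Putting that rule into code, the server-reported timing lets you split the client-observed latency into a RavenDB part and an everything-else part. A hypothetical helper, not part of the RavenDB client API:

```python
def network_overhead_ms(client_elapsed_ms, server_reported_ms):
    """If the server-side timing (from enable-query-timing) says the
    query was fast but the client saw a slow request, the difference
    was spent outside RavenDB: on the wire, or in whatever is sitting
    in the middle of it.
    """
    return max(0.0, client_elapsed_ms - server_reported_ms)
```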

But an even more insidious issue is that this would drop requests, consistently and randomly (and yes, I know those are contradictions: it was consistently dropping requests in a random pattern that seemed explicitly designed to thwart figuring out what is going on). That led to things breaking, and to escalated support calls. “RavenDB is broken” leads to a lot of headache, and to a burning desire to hit something when you figure out that not only isn't it your fault, but the underlying reason is actively trying to prevent you from figuring it out (I assume it is to deal with viruses that try to shut it off), which leads to really complex fact finding sessions.

It is even more annoying because it seems that the issue was a bug in the AV product in question: it wasn't respecting keep alive sessions for authenticated requests under some scenarios. Absolutely not fun!

What is new in RavenDB 3.5: Monitoring support

time to read 2 min | 333 words

The final monitoring feature in RavenDB 3.5 is SNMP support. For those of you who aren’t aware, SNMP stands for Simple Network Management Protocol. It is used primarily for monitoring network services, and with RavenDB 3.5, we have full support for it. We even registered our own root OID for all RavenDB work (if anyone cares at this stage). We have also set up a test server where you can look at the results of SNMP support in RavenDB 3.5 (login as guest to see details).

But what is this about?

Basically, a lot of monitoring features that we looked at boiled down to re-implementing enterprise monitoring tools that are already out there. Using SNMP gives all those tools direct access to the internal details of RavenDB, and allows you to plot and manage them using your favorite monitoring tools, from Zabbix to OpenView to MS MOM.

We expose a long list of metrics, from the loaded databases to the number of indexed items per second to the ingest rate to the number of queries to how much storage space each database takes to…

Well, you can just go ahead and read the whole list and go over it.

We are still going to put effort into making it possible to figure out what is going on with RavenDB directly from the studio, but as customers start running large numbers of RavenDB instances, it becomes impractical to deal with each of them individually. That is why using a monitoring system that can watch many servers is preferable. You can also set it up to send alerts when a certain threshold is reached, and… those are now features that aren't RavenDB features; those are your monitoring system's features.

Being able to just offload all of those features is great, because we can expose the values to the monitoring tools and go on to focus on other stuff, rather than have to do the full monitoring work: UI, configuration, alerts, etc.

What is new in RavenDB 3.5: Monitoring active I/O operations

time to read 2 min | 265 words

RavenDB 3.5 has just a few major monitoring features (although wait for the next one, it is a biggie), but this one is pretty important.

This feature allows RavenDB to track, at a very detailed level, all the I/O work that is done by the server, and give you accurate information about what exactly is going on with the system.

Take a look at this report:


As you can see, this shows one minute of usage, with writes going on and some indexing work along the way.

The idea here is that you can narrow down any bottlenecks that you have in the system. Not only by looking at the raw I/O stats that the OS provides, but by actually narrowing it down to a particular database and a particular action inside that database. For users with multi-tenant databases, this can be a very useful tool in figuring out what is actually going on in their system.

The mechanics behind this report are actually interesting. We are using ETW to capture the I/O rates, but since we are capturing kernel events, that requires admin privileges. Typically, RavenDB isn't run with those privileges. To work around that, the admin runs the Raven.Monitor.exe process in an elevated context. That gives us access to the kernel events, and we then process the information and show it to the user in the studio.
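The aggregation itself is straightforward once the events are in hand. The event shape below is an assumption for illustration; the real data comes from the ETW kernel events:

```python
from collections import defaultdict

def attribute_io(events):
    """Roll raw I/O events up to (database, action) pairs, the way the
    report above breaks usage down. Each event here is a hypothetical
    (database, action, bytes) tuple standing in for a kernel event
    already attributed to a database file.
    """
    totals = defaultdict(int)
    for db, action, size in events:
        totals[(db, action)] += size
    return dict(totals)
```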

What is new in RavenDB 3.5: Filters & transformers with RavenDB Replication

time to read 1 min | 166 words

In the previous post, I introduced RavenDB Collection Specific Replication. This allows you to filter which collections you’ll get to replicate.

The next step is to apply filters and transformers along the way. For example, like so:


As you can see, the transformation script allows us to modify the outgoing data, in this case, to hide the email address.
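The idea behind such a transformation script, sketched here in Python as a stand-in for the actual script language configured on the replication destination:

```python
def transform_outgoing(doc):
    """Strip the email address before the document leaves for the
    destination. Illustrative stand-in for the replication
    transformation script; the real feature runs a script on the
    server, this just shows the shape of the transformation.
    """
    out = dict(doc)        # leave the source document untouched
    out.pop("Email", None)  # hide the sensitive field
    return out
```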

This feature is primarily intended for data replication back to staging / development environments, where you need to have the data but can't expose some of it outside.

It can also be used to modify details going to slave databases, so we'll have per-database values (for example, stripping details that are not relevant for a particular tenant).

Like Collection Specific Replication, this replication destination will not be considered to be a failover target.

What is new in RavenDB 3.5: Collection Specific Replication

time to read 2 min | 349 words

With RavenDB 3.5, we added a really cool feature to RavenDB Replication. Actually, I'm not sure how much of a "feature" this is, because it actually takes away capabilities.

As the name suggests, this allows you to select specific collections and replicate only those to a specific destination. For example, here we are only replicating the Categories, Companies and Employees collections, instead of replicating the entire database.


Why is this important?

Because it opens up new ways of managing your data. It allows you to use RavenDB replication (high throughput, reliable and error resilient) to manage data distribution.

Let us imagine for a moment that we have a web ordering system, with multiple tenants. And we have some common information that needs to be shared among all the tenants. For example, the baseline pricing information.

We can setup replication like so:


The Shared database contains a lot of information, but only the pricing information is replicated. This means that you can change it once, and it will propagate to all the end destinations on its own.
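The filtering decision itself is simple; a sketch (the parameter names are illustrative, not the actual RavenDB configuration schema):

```python
def should_replicate(doc_collection, specified_collections):
    """Sketch of collection specific replication: a document goes to
    the destination only if its collection is on the list. With no
    filter (None), everything replicates, as in ordinary full
    replication.
    """
    if specified_collections is None:
        return True
    return doc_collection in specified_collections
```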

Common scenarios for such shared data include:

  • Users /logins
  • Base data for local modifications (product catalog that each tenant can override)
  • Rules

Note that because we are using collection specific replication, this does not make the destination database into a duplicate of the source. As such, it will not take part in failover configuration for the source database.

You can mix and match: a single database can replicate to a failover destination (full replication) and a partial one (only specific collections). And the clients will know how to fail over to the right node if something bad happens.

Production postmortem: Reading the errors

time to read 4 min | 776 words

One of the hardest practices to learn as a developer is to actually sit down and read the information that you have. Ever since we became a product company, we have had to deal with partial information, because some data wasn't captured, or because it isn't possible to capture more information in a production environment.

That means that we have to do the most with the information we have available. Here are a few recent cases of having information, and ignoring it. A customer complained that a query was timing out. He provided the following error:

EventTimeStamp=String&SortHint-EventSequence_Range=Long","Error":"System.OperationCanceledException: The operation 
was canceled.\r\n   at System.Threading.CancellationToken.ThrowIfCancellationRequested()\r\n   at Raven.Database.
Indexing.Index.IndexQueryOperation.<Query>d__5d.MoveNext()\r\n   at Raven.Database.Util.ActiveEnumerable`1..ctor(
IEnumerable`1 enumerable)\r\n   at Raven.Database.Actions.QueryActions.DatabaseQueryOperation.Init()\r\n   at Raven
.Database.Server.Controllers.StreamsController.SteamQueryGet(String id)\r\n   at lambda_method(Closure , Object , 
Object[] )\r\n   at System.Web.Http.Controllers.ReflectedHttpActionDescriptor.ActionExecutor.<>c__DisplayClass10.<
GetExecutor>b__9(Object instance, Object[] methodParameters)\r\n   at System.Web.Http.Controllers.
ReflectedHttpActionDescriptor.ExecuteAsync(HttpControllerContext controllerContext, IDictionary`2 arguments, 
CancellationToken cancellationToken)\r\n--- End of stack trace from previous location where exception was thrown 
---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.
CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at System.Runtime.
CompilerServices.TaskAwaiter`1.GetResult()\r\n   at System.Web.Http.Controllers.ApiControllerActionInvoker.<
InvokeActionAsyncCore>d__0.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown 
---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.
CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.
HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter`1.
GetResult()\r\n   at System.Web.Http.Controllers.ActionFilterResult.<ExecuteAsync>d__2.MoveNext()\r\n--- End of 
stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.
ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task 
task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task 
task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()\r\n   at System.Web.Http.Controllers.

As you can see, that is a big chunk of text, and it requires careful parsing. I'm actually quite proud of this; even though the error is obtuse, it is giving us all the information that we need to figure out what the problem is.

In particular, there are two pieces of information that we need to see here:

  • OperationCanceledException is thrown when we have a timeout processing the query on the server side.
  • start=9134999 is a request to do deep paging of about 10 million records, and then start reading.

Those two combined tell us that the issue is that we are doing deep paging, which causes us to time out internally before we start sending any data.

But why are we timing out on deep paging? What is expensive about that?

Turns out that RavenDB de-dups results during queries. So if you got events/1 in page #1, you won't see it in page #2, even if the index has entries for that document in page #2. If you care to know more, the issue is related to fanout: one document having many index entries pointing to it. But the major issue here is that we need to scan 10 million docs to avoid duplicates before we can return any value to the user. There is a flag you can send that would avoid this, but it wasn't used in this case.
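A sketch of why de-duplication makes deep paging so expensive: duplicates must be skipped before counting, so serving a page that starts at 9,134,999 means scanning roughly that many index entries first. Illustrative, not the actual query code:

```python
def page_results(index_entries, start, page_size):
    """Return one page of unique document ids from a stream of index
    entries, where a fanout index can yield the same document more
    than once. Note that we must scan and de-dup everything before
    `start` before we can emit the first result.
    """
    seen = set()
    unique = []
    for doc_id in index_entries:
        if doc_id in seen:
            continue  # already returned on an earlier page
        seen.add(doc_id)
        unique.append(doc_id)
        if len(unique) >= start + page_size:
            break
    return unique[start:start + page_size]
```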

Another issue that was solved by reading the message from the customer was:

Version store out of memory (cleanup already attempted)
Microsoft.Isam.Esent.Interop.EsentVersionStoreOutOfMemoryException: Version store out of memory (cleanup already attempted) 
at Raven.Database.Storage.Esent.StorageActions.DocumentStorageActions.RemoveAllBefore(String name, Etag etag) in c:\Builds\RavenDB-Stable-3.0\Raven.Database\Storage\Esent\StorageActions\Lists.cs:line 74
at Raven.Database.Smuggler.SmugglerEmbeddedDatabaseOperations.<>c__DisplayClass12.<PurgeTombstones>b__11(IStorageActionsAccessor accessor) in c:\Builds\RavenDB-Stable-3.0\Raven.Database\Smuggler\SmugglerEmbeddedDatabaseOperations.cs:line 219
at Raven.Storage.Esent.TransactionalStorage.ExecuteBatch(Action`1 action, EsentTransactionContext transactionContext) in c:\Builds\RavenDB-Stable-3.0\Raven.Database\Storage\Esent\TransactionalStorage.cs:line 799 at Raven.Storage.Esent.TransactionalStorage.Batch(Action`1 action) in c:\Builds\RavenDB-Stable-3.0\Raven.Database\Storage\Esent\TransactionalStorage.cs:line 778
at Raven.Abstractions.Smuggler.SmugglerDatabaseApiBase.<ExportData>d__2.MoveNext() in c:\Builds\RavenDB-Stable-3.0\Raven.Abstractions\Smuggler\SmugglerDatabaseApiBase.cs:line 187 --- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

Version store out of memory happens when the size of the uncommitted transaction is too large, and we error. Usually that indicates some unbounded operation that wasn't handled well.

Those are almost always batch operations of some kind that do various types of cleanup. In order to handle them, we pulse the transaction: effectively committing it and starting a new one. That frees the uncommitted values.
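Transaction pulsing can be sketched like this; the begin/commit callbacks and the threshold are hypothetical stand-ins for the storage engine's own machinery:

```python
class PulsingBatch:
    """Sketch of 'pulsing' a transaction: commit and start a fresh
    transaction every `threshold` operations, so the uncommitted
    version store never grows without bound.
    """
    def __init__(self, begin, commit, threshold=1000):
        self._begin, self._commit = begin, commit
        self._threshold = threshold
        self._ops = 0
        self._begin()

    def maybe_pulse(self):
        self._ops += 1
        if self._ops >= self._threshold:
            self._commit()  # frees the uncommitted version store
            self._begin()   # carry on in a fresh transaction
            self._ops = 0

    def finish(self):
        self._commit()
```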

The problem was that this method had this behavior:


So the investigation was focused on why MaybePulseTransaction didn't work properly. There are some heuristics there to avoid being too costly, so it was very strange.

Can you see the error? We were looking at the wrong overload; this is the one that takes a time, while the one that was called takes an etag. Indeed, in the version the customer had, that call wasn't there. In the current version, it has been added.

Knowing how to read errors and logs, and actually paying attention to what they are saying, is crucial when it is time to fix a problem.

Reducing parsing costs in RavenDB

time to read 5 min | 801 words

Note, this is something that we are considering for 4.0.

RavenDB uses JSON natively to handle pretty much everything. Which makes sense, for a JSON Document Database. JSON is easy to parse, human readable and, for the most part, has no real complexity associated with it.

However, while JSON is easy to parse, there is some computational overhead around parsing it. In fact, in our perf testing, we spend quite a lot of time just serializing and deserializing JSON. It is one of the major costs we have to deal with, mostly because it is something that happens all the time. We have previously used BSON to store documents internally, but profiling has actually shown that it is cheaper to hold the data as JSON text and parse it. So that is what we are currently doing.

However, even faster parsing is still parsing, and something that we would like to avoid. There is also an issue with how this is actually represented in memory. Let us consider the following JSON:


In memory, this is represented as (highly simplified):

  • Dictionary<string,object> instance
    • string[] for the keys
      • “FirstName” string
      • “LastName” string
    • object[] for the values
      • “John” string
      • “Smith” string

In other words, especially for large documents, there are a lot of small objects being created. That doesn't impact the immediate parsing cost, but those objects do need to be collected after the fact, and that is something that we would like to avoid.

We are currently considering using Flat Buffers for internal document storage. The good thing about Flat Buffers is that there is no intermediate parsing step. You get a block of memory that you can immediately access. That has two major advantages. Loading the documents to memory would mean just reading a buffer from disk, with no extra cost. But it would also mean that the act of releasing a document would be much cheaper: we would only need to collect the buffer again, not potentially tens of thousands of small objects.

Another advantage is that we usually need to load documents to be indexed, and usually indexing requires only very few fields from the documents. By avoiding the cost of parsing, and only paying the price for the objects that we actually touch, we are in a much better position to reduce the indexing costs.

A rough scratch schema using flat buffers would be:

union AnyValueUnion { Document, Array, Value }

table AnyValue {
    Value: AnyValueUnion;
}

table Value {
    Type: byte;
    String: string;
    Bytes: [byte];
}

table Array {
    Values: [AnyValue];
}

table FieldValue {
    Name: string;
    Value: AnyValue;
}

table Document {
    Values: [FieldValue];
}

The Values inside the documents are sorted by field name, so we can search a field using binary search.
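Since the FieldValue entries are sorted by name, a field lookup over the flat layout is a plain binary search, with no dictionary to materialize. A sketch using parallel arrays to stand in for the buffer:

```python
import bisect

def get_field(names, values, field):
    """Field lookup over the flat layout sketched above: `names` is
    the sorted list of field names and `values` the matching values,
    standing in for the FieldValue entries in the buffer.
    """
    i = bisect.bisect_left(names, field)
    if i < len(names) and names[i] == field:
        return values[i]
    return None  # field not present in this document
```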

Nitpicker corner: Yes, we probably would want to use a hash here, but this is a rough draft to test things out; it will come later.

We need to optimize the schema, and using it will not be a lot of fun, but the speed and memory improvements should be significant.

Another alternative would be a schema such as this:

table FieldValue {
    Name: int;
    Value: AnyValue;
}

table Document {
    Values: [FieldValue];
}

table Root {
    Fields: [string];
    Document: Document;
}

Here we store all the field names once, then we refer to the field name using its index in the root object. This should have the advantage of reducing repeated string names.
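A sketch of that second schema's effect: field names are stored once in a root table and documents refer to them by index, so repeated names across documents cost an int instead of a string. Illustrative only:

```python
def intern_fields(docs):
    """Encode documents against a shared, sorted field-name table.

    Returns (fields, encoded) where each encoded document is a list of
    (field_index, value) pairs, standing in for the Root/Document
    split in the schema above.
    """
    fields = sorted({name for doc in docs for name in doc})
    index = {name: i for i, name in enumerate(fields)}
    encoded = [[(index[k], v) for k, v in sorted(doc.items())]
               for doc in docs]
    return fields, encoded
```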

