Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 589 words

Also known as: Please check the subtitle of this blog.

This post is in response to this one. Kelly took offence with this post about Voron performance. In particular, it appears that the major issues are:

This benchmark doesn’t actually provide much useful information. It is too short and compares fully featured DBMS systems to storage engines. I always stress very much that people never make decisions based on benchmarks like this.

These paint the fully featured DBMS systems in a negative light that isn’t a fair comparison. They are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.

This isn’t a benchmark. This is just an interim step along the way of developing Voron. It is a way for us to see where we stand and where we need to go. A benchmark include full details about what you did (machine specs, running environment, full source code, etc). This is just us putting stress on our machine and comparing where we are at. And yes, we could have done it in isolation, but that wouldn’t really give us any major advantage. We need to see how we compare to other database.

And yes, we compare apples to oranges here when we compare a low level storage engine like Voron to SQL Server. I am well aware of that. But that isn’t the point. For the same reason that we are currently doing a lot of micro benchmarks rather than the 48 hours ones we have in the pipeline.

I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.

With regards to Foundation DB, I am sure they are as happy about it as I am about them making silly claims about RavenDB transaction support. And the source code is available if you really want to, in fact, we got the Foundation DB there because we had an explicit customer request, and because they contributed the code for that.

Next, let us move to something else equally important. This is my personal blog. I publish here things that I do on a daily basis. And if I am currently in a performance boost stage, you’re going to be getting a lot of details on that. Those are the results of performance runs, they aren’t benchmarks. They don’t get anywhere beyond this blog. When we’ll put the results on ravendb.net, or something like that, then it will be a proper benchmark.

And while I fully agree that making decisions based on micro benchmarks is a silly way to go about doing so, the reality is that many people do just that. So one of the things that I’m focusing on is exactly those things. It helps that we currently see a lot of places to improve in those micro benchmarks. We already have a plan (and code) to see how we do on a 24 – 48 hours benchmark, which would also allow us to see all sort of interesting things (mixed reads & writes, what happens when you go beyond physical memory size, longevity issues, etc).

time to read 1 min | 199 words

Those are interim results, but it gives me some hope. This is running on a different machine than the numbers I posted before. But I have enough to compare to each other. You can see it here. All numbers are in operations / sec.

Writing 100,000 sequential items (100 items per transaction in 1,000 transactions):

image

And writing 100,000 random items:

image

 

And here is what we really care about. Comparing Voron to Esent:

image

This is actually pretty amazing, because we are now at about 25% write speed of Esent for sequential writes (we used to be at less than 5%). When talking about random writes (what we really care about) we are neck in neck.

time to read 1 min | 124 words

I am currently in the states, and I can’t go anywhere without seeing a lot of signs for Black Friday. Since it seems that this is a pretty widely spread attempt to do a load test on everyone’s servers (and physical stores, for some reason). I decided that I might as well join the fun and see how we handle the load.

You can use one of the following coupon codes (each one has between 4 – 16 uses) to get a 20% discount for any of our products, if you buy in the next 48 hours.

    1. pink-Sunday
    2. white-Monday
    3. green-Tuesday
    4. orange-Wednesday
    5. red-Thursday
    6. black-Friday
    7. yellow-Saturday

This explicitly includes 20% discounts for RavenDB and the Profilers.

time to read 1 min | 93 words

I keep getting asked by people: “What is the configuration option to make NHibernate run faster?”

People sort of assume that NHibernate is configured to be slow by default because it amuses someone.

Well, while there isn’t a “Secret_incantation” = “chicken/sacrifice” option in NHibernate, there is this one:

image

And it pretty much does the same thing.

No, I won’t explain why. Go read the docs.

time to read 3 min | 595 words

I have a method that does a whole bunch of work, then submit an async write to the disk. The question here is how should I design it?

In general, all of the method’s work is done, and the caller can continue doing whatever they want to do with they life, all locks are released, all the data structures are valid again, etc.

However, it isn’t until the async work complete successfully that we are actually done with the work.

I could do it like this:

   1: public async Task CommitAsync();

But this implies that you have to wait for the task to complete, which isn’t true. The commit function looks like this:

   1: {
   2:     // do work
   3:     return SubmitWriteAsync();
   4: }

So by the time you have the task from the method, all the work has been done. You might want to wait of the task to ensure that the data is actually durable on disk, but in many cases, you can get away with not waiting.

I am thinking about doing it like this:

   1: public Task Commit();

To make it explicit that there isn’t any async work done that you have to wait for.

Thoughts?

time to read 23 min | 4490 words

I really like the manner in which C# async tasks work. And while building Voron, I run into a scenario in which I could really make use of Windows async API. This is exposed via the Overlapped I/O. The problem is that those are pretty different models, and they don’t appear to want to play together very nicely.

Since I don’t feel like having those two cohabitate in my codebase, I decided to see if I could write a TPL wrapper that would provide nice API on top of the underlying Overlapped I/O implementation.

Here is what I ended up with:

   1: public unsafe class Win32DirectFile : IDisposable
   2: {
   3:     private readonly SafeFileHandle _handle;
   4:  
   5:     public Win32DirectFile(string filename)
   6:     {
   7:         _handle = NativeFileMethods.CreateFile(filename,
   8:             NativeFileAccess.GenericWrite | NativeFileAccess.GenericWrite, NativeFileShare.None, IntPtr.Zero,
   9:             NativeFileCreationDisposition.CreateAlways,
  10:             NativeFileAttributes.Write_Through | NativeFileAttributes.NoBuffering | NativeFileAttributes.Overlapped, IntPtr.Zero);
  11:  
  12:         if (_handle.IsInvalid)
  13:             throw new Win32Exception();
  14:  
  15:         if(ThreadPool.BindHandle(_handle) == false)
  16:             throw new InvalidOperationException("Could not bind the handle to the thread pool");
  17:     }

Note that I create the file with overlapped enabled, as well as write_through & no buffering (I need them for something else, not relevant for now).

It it important to note that I bind the handle (which effectively issue a BindIoCompletionCallback under the cover, I think), so we won’t have to use events, but can use callbacks. This is much more natural manner to work when using the TPL.

Then, we can just issue the actual work:

   1: public Task WriteAsync(long position, byte* ptr, uint length)
   2: {
   3:     var tcs = new TaskCompletionSource<object>();
   4:  
   5:     var nativeOverlapped = CreateNativeOverlapped(position, tcs);
   6:     
   7:     uint written;
   8:     var result = NativeFileMethods.WriteFile(_handle, ptr, length, out written, nativeOverlapped);
   9:     
  10:     return HandleResponse(result, nativeOverlapped, tcs);
  11: }

As you can see, all the actual details are handled in the helper functions, we can just run the code we need, passing it the overlapped structure it requires. Now, let us look at those functions:

   1: private static NativeOverlapped* CreateNativeOverlapped(long position, TaskCompletionSource<object> tcs)
   2: {
   3:     var o = new Overlapped((int) (position & 0xffffffff), (int) (position >> 32), IntPtr.Zero, null);
   4:     var nativeOverlapped = o.Pack((code, bytes, overlap) =>
   5:     {
   6:         try
   7:         {
   8:             switch (code)
   9:             {
  10:                 case ERROR_SUCCESS:
  11:                     tcs.TrySetResult(null);
  12:                     break;
  13:                 case ERROR_OPERATION_ABORTED:
  14:                     tcs.TrySetCanceled();
  15:                     break;
  16:                 default:
  17:                     tcs.TrySetException(new Win32Exception((int) code));
  18:                     break;
  19:             }
  20:         }
  21:         finally
  22:         {
  23:             Overlapped.Unpack(overlap);
  24:             Overlapped.Free(overlap);
  25:         }
  26:     }, null);
  27:     return nativeOverlapped;
  28: }
  29:  
  30: private static Task HandleResponse(bool completedSyncronously, NativeOverlapped* nativeOverlapped, TaskCompletionSource<object> tcs)
  31: {
  32:     if (completedSyncronously)
  33:     {
  34:         Overlapped.Unpack(nativeOverlapped);
  35:         Overlapped.Free(nativeOverlapped);
  36:         tcs.SetResult(null);
  37:         return tcs.Task;
  38:     }
  39:  
  40:     var lastWin32Error = Marshal.GetLastWin32Error();
  41:     if (lastWin32Error == ERROR_IO_PENDING)
  42:         return tcs.Task;
  43:  
  44:     Overlapped.Unpack(nativeOverlapped);
  45:     Overlapped.Free(nativeOverlapped);
  46:     throw new Win32Exception(lastWin32Error);
  47: }

The complexity here is that we need to handle 3 cases:

  • Successful completion
  • Error (no pending work)
  • Error (actually success, work is done in an async manner).

But that seems to be working quite nicely for me so far.

time to read 1 min | 189 words

For the past few days I have been talking about our findings with regards to creating ACID storage solution. And mostly I’ve been focusing on how it works with Windows, using Windows specific terms and APIs.

The problem is that I am not sure if those are still relevant if we talk about Linux. I know that fsync perf is still an issue (if only because both Win & Lin are running on the same hardware). But would the same solutions apply?

For example, the nearest that I can find to FILE_FLAG_NO_BUFFERING is O_DIRECT and FILE_FLAG_WRITE_THROUGH appears to be similar to  O_SYNC. But I am not sure if they are actually behaving in the same fashion.

Any ideas? Anyone has something like Process Monitor for Linux and can look at the actual behavior of industry grade databases commit behavior?

From my exploring, it appears that PostgreSQL is using fdatasync() as the default approach, but it can use O_DIRECT and O_DSYNC as well, so that is promising. But I would like to have someone who actually know Linux intimately tell me if I am even in the right direction.

time to read 1 min | 200 words

In Voron, we use a double buffer approach. We use the first two pages of the file to alternately write the last version of the database info. For example, the last transaction id, among other things.

The problem is that when we make those changes, we have to call fsync on that, and as we have seen, that is something that we would like to avoid if possible. Because of that, we are going to try something different. We are going to extract the header information from the first few pages of the file in favor of holding them as separate files: header.one and header.two

The idea is that they are very small files, and as such, it would be cheap to fsync them independently. Moreover, we can take advantage of the fact that very small files (and in this case, I am not sure we even above 256 bytes) are usually stored in the MFT in NTFS and in the inode in ext4. That means that fsync would get both data and metadata at the same time, hopefully just writing out a single block.

I am not sure how useful that is going to be, but I have hopes.

time to read 5 min | 870 words

This post is here to answer several queries in the mailing list, and some questions that were raised in this blog post. I think that this is important enough to warrant a post here, instead of an email to the list, or just a comment.

To summarize, we had a few issues recently that impacted our users’ systems. Those are usually (but not always) cases where a combination of features wasn’t working properly (feature intersection), or just actual bugs. That led to some questions that are worth answering. You can find all the details below, but I would like to talk about what we are actually doing.

In the past 4 or 5 years, we have managed to create a NoSQL database for the .NET platform, and it has been doing nothing but picking up steam ever since we released it. We have been working hard to provide performance, features and stability for our users. On a personal note, it has been quite an amazing ride, seeing more people put RavenDB to use and creating interesting applications and features.

First, there seems to be some concerns about the new things that we are doing. Voron, in particular, appears to be a cause for concern. We have relied on Esent as our storage engine for the past four or five years, to great success. Not least of its properties is the fact that Esent has been around the block for a while now, and is proven to be robust and safe in the simplest of methods, high and constant use over multiple decades. Esent also have its share of problems, but we didn’t forget why we chose it in the first place. Indeed, I still think that that was an excellent choice. With Voron, the only change you’ll see is that it won’t be the only choice.

Voron is meant to allow us to run on Linux machines, and to provide us with a fully owned stack, so we can do more interesting things across the board. But we aren’t letting go of Esent, and in any way you care to name, Esent is still going to be the core (and default) option we have for storage in RavenDB. With RavenDB 3.0, you’ll have the option to make an informed choice about selecting Voron as a storage engine, with a list of pros & cons.

Second, we do acknowledge that we suffer from a typical blindness for how we approach RavenDB. Since we built it, we know how things are supposed to be, and that is how we usually test them. Even when we try to go for the edge cases, we are constrained by our own thinking. We are currently working on getting an external testing team to do just that. Actively work to make use of RavenDB in creative ways specifically to try to break it.

Third, our own internal policies for releasing RavenDB need to be adjusted slightly. In particular, we are usually faced with two competing pressures: Release Already and Super Stable. We have always tried to release both unstable and stable versions, and the process for moving from unstable to stable is a pretty good one, I think. We have:

  • The test suite, now clocking at just over 3,000 tests.
  • A separate test suite that is meant to stress test the database.
  • Performance test suite, to make sure that we are in line for general performance.
  • Longevity tests, making sure that we don’t have any issues in long term usage.
  • Finally, as an act of dog fooding, we upgrade our own servers to the new build, and let it run in production for a while, just to make absolutely sure.

We are going to add additional tests (see the 2nd point) to the process, and we are going to extend the duration of all of those steps.  I think that in the past few months we have have leaned too far toward the “Release Already” mode, so we are going to try to lean back (hopefully not too much) the other way. 

Fourth, with regards to licensing. It has been our policy to provide anyone with a free trail license of RavenDB if they want to test it on a temporary basis. We require permanent non developer servers to have a license. I think that this strikes the appropriate balance.

Fifth, we are going to be working on additional tooling round deployment and upgrades. For customers that jump multiple versions (moving from 1.x to 2.5, for example), the update process of the RavenDB internal storage data during upgrades can be lengthy and there is too little visibility into it at the moment. We are also working on building tools that help figure out what is going on with a production instance (more ops endpoint, more visibility into internal operations, etc).

In summary, we are grateful for our users for bringing any issues to our attention. We are trying hard to have a very responsive feedback cycle, and we can usually resolve most issues within 24 – 48 hours. But I know we need to do better in making sure that users have a more streamlined experience.

time to read 3 min | 466 words

So far, we have got to the conclusion that we are going to ditch fsync in favor of unbuffered write through calls. We also saw that it can play nicely with memory mapped files, which is what we are using for Voron.

However, there is a problem here. Before we can write the data to the journal file, we need some way to actually put it. Previously, we could use the memory directly from the memory mapped journal file, and then just flush it. However, now we cannot do that, the only writes that we can do to the journal are using the unbufered write through I/O. Otherwise, we have to deal call fsync again. And sadly, we cannot call WriteFile on the memory that is mapped to the same part of the file that we write to.

That means that we need some scratch space to work with. And that means that we need to make some choices here. The obvious place to handle this scratch space is memory. The problem with that is that this means that we are going to compete with the rest of the system for available memory. In particular, we would need some way to free up memory after we use it, or we may hold into it forever. But if we free the memory, we might need to use it again, in which case we have a free/alloc pattern that isn’t going to be good.

Ideally, we want to get a continuous range of memory, so that probably explains why we care about its size and not releasing it early. One thing that I should note is that we are worried mostly about big transactions, ones that might need to touch hundreds or thousands of megabytes. Those tend to be rare, yes, but I hate to have any sort of hard limits in my software.

So what we’ll probably do is create another memory mapped file, of a size that is at least as big as the current journal file. And we will put all of our in flight transactional data in there. The good news about it is that we can re-use the space on every transaction, just overwriting what previous values. It also means that we easily expand the size of the current transaction buffer, so to speak. And under high memory pressure, we have an easier way to handle things.

When the transaction actually commits, we will write directly from the scratch space to the journal file, as a single, sequential, unbuffered write through write. Externally, there would be much of a change. And most of that would probably just have to do with the transaction commit semantics. Speaking of which, we probably want to talk about the way we store the header information…

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}