Ayende @ Rahien

Refunds available at head office

Message passing, performance–take 2

In my previous post, I did some rough “benchmarks” to see how message passing options behave. I got some great comments, and I thought I’d expand on that.

The baseline for this was a blocking queue, using which we managed to get:

145,271,000 msgs in 00:00:10.4597977 for 13,888,510 ops/sec

And the async BufferBlock, using which we got:

43,268,149 msgs in 00:00:10 for 4,326,815 ops/sec.

Using LMAX Disruptor we got a disappointing:

29,791,996 msgs in 00:00:10.0003334 for 2,979,100 ops/sec

However, it was pointed out that I could significantly improve on this by changing the code to:

var disruptor = new Disruptor.Dsl.Disruptor<Holder>(() => new Holder(), new SingleThreadedClaimStrategy(256), new YieldingWaitStrategy(), TaskScheduler.Default);

After which we get a very nice:
141,501,999 msgs in 00:00:10.0000051 for 14,150,193 ops/sec
Another request I got was to test this with a concurrent queue, which is pretty much what it was designed for. The code is the same as for the blocking queue; we just changed Bus<string> to ConcurrentQueue<string>.
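The benchmark harness itself isn’t shown here, but roughly speaking the test has one producer enqueueing as fast as it can while a consumer dequeues and counts for ten seconds. A sketch of that shape (not the exact benchmark code):

    // Not the actual benchmark code - just a sketch of the shape of the test:
    // one producer enqueues as fast as it can, one consumer dequeues and counts.
    var queue = new ConcurrentQueue<string>();
    var cts = new CancellationTokenSource();

    var producer = Task.Run(() =>
    {
        while (cts.IsCancellationRequested == false)
            queue.Enqueue("message");
    });

    long count = 0;
    var duration = Stopwatch.StartNew();
    while (duration.Elapsed < TimeSpan.FromSeconds(10))
    {
        string msg;
        if (queue.TryDequeue(out msg))
            count++;
    }
    cts.Cancel();
    producer.Wait();

    Console.WriteLine("{0:N0} msgs in {1} for {2:N0} ops/sec",
        count, duration.Elapsed, count / duration.Elapsed.TotalSeconds);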
 
Using that, we got:
170,726,000 msgs in 00:00:10.0000042 for 17,072,593 ops/sec
And yes, this is pretty much just because I could. Any of those options is significantly faster than anything I actually need.

The cost of working with strings

Following my last post, I decided that it might be better to actually show what the difference is between direct string manipulation and working at lower levels.

I generated a sample CSV file with 10 million lines and 6 columns. The file size was 658MB. I then wrote the simplest code that I could possibly think of:

    public class TrivialCsvParser
    {
        private readonly string _path;

        public TrivialCsvParser(string path)
        {
            _path = path;
        }

        public IEnumerable<string[]> Parse()
        {
            using (var reader = new StreamReader(_path))
            {
                while (true)
                {
                    var line = reader.ReadLine();
                    if (line == null)
                        break;
                    var fields = line.Split(',');
                    yield return fields;
                }
            }
        }
    }

This ran in 8.65 seconds (with a no-op action) and kept memory utilization at about 7MB.

The next thing to try was just reading through the file without doing any parsing. So I wrote this:

    public class NoopParser
    {
        private readonly string _path;

        public NoopParser(string path)
        {
            _path = path;
        }

        public IEnumerable<object> Parse()
        {
            var buffer = new byte[1024];
            using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
            {
                while (true)
                {
                    var result = stream.Read(buffer, 0, buffer.Length);
                    if (result == 0)
                        break;
                    yield return null; // noop
                }
            }
        }
    }

Note that this isn’t actually doing anything with the data. But it took 0.83 seconds, so we see a pretty big difference here. By the way, the amount of memory used isn’t noticeably different; both use about 7 MB, probably because we aren’t actually holding on to any of the data in any meaningful way.

I ran these tests using a release build, and I ran them multiple times, so the file is probably all in the OS cache; I/O cost is pretty minimal here. However, note that we aren’t doing a lot of the work that the TrivialCsvParser does, such as finding line boundaries and splitting each line into fields. Interestingly enough, just removing the split reduces the cost from 8.65 seconds to 3.55 seconds.
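The 3.55 second figure comes from a variant that still reads lines but skips the Split call; a sketch of it (not the exact test code) would be:

    // Same as TrivialCsvParser.Parse, but without splitting each line into fields.
    public IEnumerable<string> ParseLinesOnly(string path)
    {
        using (var reader = new StreamReader(path))
        {
            while (true)
            {
                var line = reader.ReadLine();
                if (line == null)
                    break;
                yield return line;
            }
        }
    }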

Voron & the FreeDB dataset

I got tired of doing arbitrary performance testing, so I decided to take the FreeDB dataset and start working with that. FreeDB is a data set used to look up CD information based on a nearly unique disk id. This is a good dataset because it contains a lot of data (over three million albums and over 40 million songs), and it is production data. That means that it is dirty, which makes it perfect for running all sorts of interesting scenarios.

The purpose of this post (and maybe the next few) is to show off a few things. First, we want to see how Voron behaves with a realistic data set. Second, we want to show off the way Voron works, its API, etc.

To start with, I ran my FreeDB parser, pointing it at /dev/null. The idea is to measure the cost of just going through the data. We are using freedb-complete-20130901.tar.bz2 from Sep 2013.

After 1 minute, we went through 342,224 albums, and after 6 minutes we were at 2,066,871 albums. Reading all 3,328,488 albums took a bit over ten minutes. So just parsing and reading the FreeDB dataset is pretty expensive.

The end result is a list of objects that looks like this:

[image: a sample of the parsed album objects]

Now, let us see how we want to actually use this. We want to be able to:

  • Look up an album by its disk ids
  • Look up all the albums by an artist*.
  • Look up albums by album title*.

This gets interesting, because we need to deal with questions such as: “Given Pearl Jam, if I search for Pearl, do I get them? Do I get them for jam?” For now, we are going to go with case insensitive matching. We won’t be doing full text search, but we will allow prefix searches.

We are using the following abstraction for the destination:

    public abstract class Destination
    {
        public abstract void Accept(Disk d);
        public abstract void Done();
    }

Basically, we read data as fast as we can, and we shove it to the destination, until we are done. Here is the Voron implementation:

   1: public class VoronDestination : Destination
   2: {
   3:     private readonly StorageEnvironment _storageEnvironment;
   4:     private WriteBatch _currentBatch;
   5:     private readonly JsonSerializer _serializer = new JsonSerializer();
   6:     private int counter = 1;
   7:  
   8:     public VoronDestination()
   9:     {
  10:         _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));
  11:         using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
  12:         {
  13:             _storageEnvironment.CreateTree(tx, "albums");
  14:             _storageEnvironment.CreateTree(tx, "ix_artists");
  15:             _storageEnvironment.CreateTree(tx, "ix_titles");
  16:             tx.Commit();
  17:         }
  18:         _currentBatch = new WriteBatch();
  19:     }
  20:  
  21:     public override void Accept(Disk d)
  22:     {
  23:         var ms = new MemoryStream();
  24:         _serializer.Serialize(new JsonTextWriter(new StreamWriter(ms)), d);
  25:         ms.Position = 0;
  26:         var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
  27:         _currentBatch.Add(key, ms, "albums");
  28:  
  29:         if(d.Artist != null)
  30:             _currentBatch.MultiAdd(d.Artist.ToLower(), key, "ix_artists");
  31:         if (d.Title != null)
  32:             _currentBatch.MultiAdd(d.Title.ToLower(), key, "ix_titles");
  33:  
  34:         if (counter%1000 == 0)
  35:         {
  36:             _storageEnvironment.Writer.Write(_currentBatch);
  37:             _currentBatch = new WriteBatch();
  38:         }
  39:  
  40:     }
  41:  
  42:     public override void Done()
  43:     {
  44:         _storageEnvironment.Writer.Write(_currentBatch);
  45:     }
  46: }

Let us go over this in detail, shall we?

In line 10 we create a new storage environment. In this case, we want to just import the data, so we can create the storage inline. On lines 13 – 15, we create the relevant trees.

You can think about Voron trees in a very similar manner to the way you think about tables. They are a way to separate data into different parts of the storage. Note that this all still resides in a single file, so there isn’t a physical separation.

Note that we created an albums tree, which will contain the actual data, and the ix_artists and ix_titles trees, which are indexes into the albums tree. You can see them being used just a little lower.

In the Accept method, you can see that we use a WriteBatch, a native Voron notion that allows us to batch multiple operations into a single transaction. In this case, for every album, we are making 3 writes.

First, we write all of the data, as a JSON string, into a stream and put it in the albums tree. Then we create a simple incrementing integer to be the actual album key. Finally, we add the artist and title entries (lower case, so we don’t have to worry about case sensitivity in searches) into the relevant indexes.
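A small note on the key: we use EndianBitConverter.Big because byte-wise comparison of big-endian integers matches their numeric order, so the incrementing counter produces keys that always sort after the previous ones and the inserts stay sequential in the tree. A quick illustration (not code from the importer):

    // Big-endian keys sort byte-by-byte in the same order as the numbers themselves:
    var k255 = EndianBitConverter.Big.GetBytes(255); // 00 00 00 FF
    var k256 = EndianBitConverter.Big.GetBytes(256); // 00 00 01 00 - compares greater, byte by byte
    // With little-endian bytes, 256 (00 01 00 00) would sort *before* 255 (FF 00 00 00).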

At 60 seconds, we had written 267,998 values to Voron. In fact, I explicitly designed the import so we can see the relevant metrics. At 495 seconds we had read 1,995,385 entries from the FreeDB file, parsed 1,995,346 of them, and written 1,610,998 to Voron. As you can imagine, each step runs in a dedicated thread, so we can see how they behave on an individual basis. The good thing about this is that I can physically see the various costs, which is actually pretty cool.

Here is the Voron directory at 60 seconds:

[image: the Voron data directory at 60 seconds]

You can see that we have two journal files active (haven’t been applied to the data file yet) and the db.voron file is at 512 MB. The compression buffer is at 32 MB (this is usually twice as big as the biggest transaction, uncompressed).

The scratch buffer is used to hold in-flight transaction information (until we send it to the data file), and you can see it is sitting at 256MB in size.

At 15 minutes, we have the following numbers: 3,035,452 entries read from the file,  3,035,426 parsed and 2,331,998 written to Voron. Note that we are reading the file & writing to Voron on the same disk, so that might impact the read performance.

At that time, we can see the following on the disk:

[image: the Voron data directory at 15 minutes]

Note that we increase the size of most of our files by a factor of 2, so some of the space in the db.voron file is probably not used. We also needed more scratch space to handle the in-flight information.

The entire process took 22 minutes, start to finish. Although I have to note that this hasn’t been optimized at all, and I know we are doing a lot of stupid stuff throughout.

You might have noticed something else: we actually “crash” closed the Voron db. This was done to see what would happen when we open a relatively large db after an unordered shutdown.

We’ll actually get to play with the data in my next post. So far this has been pretty much just to see how things are behaving. And… I just realized something, I forgot to actually add an index on disk id.

Which means that I have to import the data again. But before that, I also wrote the following:

    public class JsonFileDestination : Destination
    {
        private readonly GZipStream _stream;
        private readonly StreamWriter _writer;
        private readonly JsonSerializer _serializer = new JsonSerializer();

        public JsonFileDestination()
        {
            _stream = new GZipStream(new FileStream("freedb.json.gzip", FileMode.CreateNew, FileAccess.ReadWrite), CompressionLevel.Optimal);
            _writer = new StreamWriter(_stream);
        }

        public override void Accept(Disk d)
        {
            _serializer.Serialize(new JsonTextWriter(_writer), d);
            _writer.WriteLine();
        }

        public override void Done()
        {
            _writer.Flush();
            _stream.Dispose();
        }
    }

This completed in ten minutes, for 3,328,488 entries, or a rate of about 5,538 entries per second. The result is an 845MB gzip file.

My reasons for doing this were twofold. First, it gave me something to compare ourselves to. More to the point, I can reuse this gzip file for my next tests, without having to go through the expensive parsing of the FreeDB file.

I did just that and ended up with the following:

    public class VoronEntriesDestination : EntryDestination
    {
        private readonly StorageEnvironment _storageEnvironment;
        private WriteBatch _currentBatch;
        private int counter = 1;

        public VoronEntriesDestination()
        {
            _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));
            using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
            {
                _storageEnvironment.CreateTree(tx, "albums");
                _storageEnvironment.CreateTree(tx, "ix_diskids");
                _storageEnvironment.CreateTree(tx, "ix_artists");
                _storageEnvironment.CreateTree(tx, "ix_titles");
                tx.Commit();
            }
            _currentBatch = new WriteBatch();
        }

        public override int Accept(string d)
        {
            var disk = JObject.Parse(d);

            var ms = new MemoryStream();
            var writer = new StreamWriter(ms);
            writer.Write(d); writer.Flush();
            ms.Position = 0;
            var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
            _currentBatch.Add(key, ms, "albums");
            int count = 1;

            foreach (var diskId in disk.Value<JArray>("DiskIds"))
            {
                count++;
                _currentBatch.MultiAdd(diskId.Value<string>(), key, "ix_diskids");
            }

            var artist = disk.Value<string>("Artist");
            if (artist != null)
            {
                count++;
                _currentBatch.MultiAdd(artist.ToLower(), key, "ix_artists");
            }
            var title = disk.Value<string>("Title");
            if (title != null)
            {
                count++;
                _currentBatch.MultiAdd(title.ToLower(), key, "ix_titles");
            }

            if (counter % 100 == 0)
            {
                _storageEnvironment.Writer.Write(_currentBatch);
                _currentBatch = new WriteBatch();
            }
            return count;
        }

        public override void Done()
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _storageEnvironment.Dispose();
        }
    }

Now we are actually properly disposing of things, and I also decreased the size of the batch, to see how it would respond. Note that it is now being fed directly from the gzip file, at a greatly reduced cost.
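The feeding side isn’t shown here, but it is roughly just streaming the gzip file line by line into Accept; something along these lines (a sketch, not the actual import code):

    // Sketch of the reader side: decompress the gzip file and feed each JSON line
    // (one album per line, as written by JsonFileDestination) to the destination.
    var destination = new VoronEntriesDestination();
    using (var stream = new GZipStream(new FileStream("freedb.json.gzip", FileMode.Open, FileAccess.Read), CompressionMode.Decompress))
    using (var reader = new StreamReader(stream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            destination.Accept(line);
        }
    }
    destination.Done();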

I also added tracking not only for how many albums we write, but also for how many entries. By entries I mean Voron entries (which include the values we add to the indexes).

I did find a bug where we would just double the file size without due consideration to its size, so now we are doing smaller file size increases.

Word of warning: I didn’t realize until after I was done with all the benchmarks that I actually ran all of them in DEBUG configuration, which basically means they are utterly useless as a performance metric. That is especially true because we have a lot of verifier code that runs in DEBUG mode. So please don’t take those numbers as actual performance metrics; they aren’t valid.

Time # of albums # of entries
4 minutes 773,398 3,091,146
6 minutes 1,126,998 4,504,550
8 minutes 1,532,858 6,126,413
18 minutes 2,781,698 11,122,799
24 minutes 3,328,488 13,301,496

The status of the file system midway through the run is below. You can see that now we increase the file in smaller increments, and that we are using more scratch space, probably because we are under very heavy write load.

[image: the Voron data directory midway through the run]

After the run:

[image: the Voron data directory after the run]

Scratch & compression buffers are only used when the database is running, and are deleted on close. The database is 7GB in size, which is quite respectable. Now, on to working with it, but I’ll save that for my next post, this one is long enough already.

Performance Testing with Voron OMG

I am not going to have a lot of commentary here, I think that I can let the numbers speak for themselves. Here are the current performance tests of Voron vs. Esent on my machine.

[image: sequential write results, Voron vs. Esent, in millions of writes per second]

Note, the numbers are in millions of writes per second here!

For Random Writes, the numbers are in thousands of writes per second:

[image: random write results, Voron vs. Esent, in thousands of writes per second]

Okay, I lied about no commentary. What you can see is our test inserting 1.5 million items into storage. In sequential runs, we get such big variance because the test run itself is so short. (Nitpickers: yes, we’ll post longer benchmarks.)

Note that this was running on a pretty good SSD drive, but what about HDD? Yes, that isn’t a mistake, Esent is doing 901 writes per second.

[image: write results on an HDD, Voron vs. Esent]

As we have seen, we are pretty good at reading, but what are the current results?

[image: read results, Voron vs. Esent]

As you can see, we are significantly better than Esent for reads in single and dual threaded modes, but Esent is faster when it can use more threads.

That annoyed me, so I set out to figure out what was blocking us. As it turned out, searching the B-tree was quite expensive, but when I added a very small optimization (remembering the last few searched pages), we saw something very interesting.

[image: read results after the page cache optimization]

(The Esent results are different from the previous run because we ran it on a different disk (still an SSD).)

What about random reads?

[image: random read results, Voron vs. Esent]

Now, one area where we currently suck is large writes, where we are doing really badly, but we’re working on that.

Overall, I am pretty happy about it all.

Versioned Collections

Last time we talked about collections, I talked about Linked Dictionary and Safe List. The first did some fancy dancing to reduce the cost of allocations, and the second would re-create the list on change.

Those implementations were good enough for us to go to hundreds of thousands of writes per second in the sequential scenario, but we noticed them cracking up under the pressure for the random writes scenario.

So I started looking at more options. As it turns out, there is a very simple, immutable, and highly efficient design for a list: just hold on to the tail of the list, and have every append create a new node that points to the old tail. This makes adding to the list an O(1) operation. It is very fast, simple and easy to work with. A minimal sketch of that idea is below.
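This is not the actual Voron class, just the smallest illustration of the design:

    // Each "list" is just a node holding a value and a reference to the previous
    // tail. Append is O(1) and never mutates existing nodes, but walking the list
    // yields items in reverse insertion order.
    public class AppendOnlyList<T>
    {
        private readonly T _value;
        private readonly AppendOnlyList<T> _previous;
        private readonly bool _isEmpty;

        public static readonly AppendOnlyList<T> Empty = new AppendOnlyList<T>();

        private AppendOnlyList() { _isEmpty = true; }

        private AppendOnlyList(T value, AppendOnlyList<T> previous)
        {
            _value = value;
            _previous = previous;
        }

        // O(1) append: the new node points back at the old tail; nothing is ever mutated.
        public AppendOnlyList<T> Append(T value)
        {
            return new AppendOnlyList<T>(value, this);
        }

        // The catch: iteration is only efficient in reverse insertion order.
        public IEnumerable<T> InReverseOrder()
        {
            for (var node = this; node._isEmpty == false; node = node._previous)
                yield return node._value;
        }
    }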

This has just one issue: it allows efficient iteration only in reverse order. And there are a few places where we are actually relying on the ordering, so it would be nice not to lose that. Alex suggested using an internally mutable but externally immutable structure, such as this one. The benefit here is that we still get external immutability, but we get the performance of mutable access.

That works, as long as we use a list, which we needed for some pretty important things. However, the key collection, and the one that caused us the most performance issues, was the page table, which is a dictionary; the cost of getting values from it was what hurt. We needed a really good solution there. I started to think that we would need to change all the code that uses it, but then I realized something very important: I don’t actually need immutability, I just need snapshotting. Another aspect of this is that we have only a single writer thread for all of those items. So while we require that once we gave an instance to a reader it can never change, that only applies to its externally observable behavior.

I would like to thank Tyler for the suggestion. Basically, we start off with a ConcurrentDictionary (the only reason we need it is so we can write to the dictionary while we are reading from it, something that isn’t safe to do with a standard dictionary). The value in that dictionary is an immutable list of the values, each tagged with the transaction id that wrote it. Whenever we modify a value, we can drop all the values older than the oldest running transaction.

And when we search, we look for a value whose transaction id is not greater than our own. That means that we effectively have a frozen snapshot of the page table, at very little cost. I felt bad calling this by a generic collection name, so we actually named the class PageTable, since it has a very specific role in the code.
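To make that concrete, here is a simplified sketch of the shape of such a page table (this is not the actual Voron code; the real class also prunes versions that no live transaction can see, and its values are richer):

    public class PageTableSketch<TValue>
    {
        private class Entry
        {
            public long TransactionId;
            public TValue Value;
        }

        // ConcurrentDictionary only so that the single writer can update it while
        // readers are looking at it; ImmutableList is used here just for brevity.
        private readonly ConcurrentDictionary<long, ImmutableList<Entry>> _table =
            new ConcurrentDictionary<long, ImmutableList<Entry>>();

        // Called only from the single writer thread, so entries are appended in
        // increasing transaction id order.
        public void Set(long pageNumber, long txId, TValue value)
        {
            var entries = _table.GetOrAdd(pageNumber, ImmutableList<Entry>.Empty);
            _table[pageNumber] = entries.Add(new Entry { TransactionId = txId, Value = value });
        }

        // Readers get a frozen snapshot: the newest value written at or before their txId.
        public bool TryGet(long pageNumber, long txId, out TValue value)
        {
            ImmutableList<Entry> entries;
            if (_table.TryGetValue(pageNumber, out entries))
            {
                for (int i = entries.Count - 1; i >= 0; i--)
                {
                    if (entries[i].TransactionId <= txId)
                    {
                        value = entries[i].Value;
                        return true;
                    }
                }
            }
            value = default(TValue);
            return false;
        }
    }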

After this was done, we could no longer find any major cost associated with collections. So it is onward to the next challenge, reducing the cost of key comparisons…

Machine bias in profiler based optimizations

Take a look at the following profiler results. They both showcase pretty much the same codebase, without any major changes in between. However, they were run on different machines, and they result in very different performance optimization paths.

Main machine:

Laptop:

[image: profiler results from the two machines]

As you can see, when running the code on my main machine, the most expensive thing is actually writing to disk (WriteToJournal). However, on my laptop, writing to disk is actually very cheap. The likely reason is, again, that my laptop is faking writing to disk in favor of buffering writes, even though we explicitly tell it not to. That means it behaves as if we had already optimized writing to disk. My main machine behaves more like a server, and we see a lot more I/O costs, which pretty much mask any other costs that we might have.

Using the profiler to guide you with regards to how to optimize the system on any one of those machines would lead you in very different paths. With Voron, we are testing it in a variety of scenarios, to make sure that we aren’t optimizing for one particular machine and hurting another.

Reducing the cost of writing to disk

So, we found out that the major cost of random writes in our tests was actually writing to disk. Writing 500K sequential items resulted in about 300 MB being written. Writing 500K random items resulted in over 2.3 GB being written.

Note: I would like to point out Alex’s comment, which helped setup this post.

So the obvious thing to do would be to use compression. I decided to try and see what that would give us, and the easiest thing to do is to just enable file compression at the NTFS level. But we can probably do better. The major cost we have is writes to the journal file, and we only ever read from the journal file when we recover the database. We are also usually writing multiple pages at a time for most transactions; for the scenario we care about, writing 100 random values in a single transaction, we usually write about a hundred pages anyway. That means that we have a pretty big buffer that we can try to compress all at once.
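Conceptually, the change is just to run the transaction’s pages through a compressor before they hit the journal. A sketch of the idea, using DeflateStream purely as a BCL stand-in for the LZ4/Snappy implementations actually tested:

    // Sketch only: compress a whole transaction's worth of journal pages as one block.
    // Recovery would need to decompress the block before replaying it.
    private static byte[] CompressJournalEntry(byte[] pages, int length)
    {
        using (var output = new MemoryStream())
        {
            using (var compressor = new DeflateStream(output, CompressionLevel.Fastest, leaveOpen: true))
            {
                compressor.Write(pages, 0, length);
            }
            return output.ToArray();
        }
    }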

Sadly, we don’t really see a meaningful improvement under that scenario. Using NTFS compression slows us down considerably, while both LZ4 and Snappy resulted in greatly reduced file writes, but roughly the same performance.

[image: results with NTFS compression, LZ4 and Snappy]

Note that I have done just the minimal amount of work required to test this out: I changed how we are writing to the journal, and disabled reading from it. That was pretty disappointing, to tell you the truth; I fully expected that we’d see some interesting improvements there. The LZ4 compression factor is about x10 for our data set. That means that 100 – 121 pages are usually compressed to 10 – 13 pages.

The good thing about this is that running this through the profiler shows that we are running on a high I/O machine, where the amount and speed of I/O doesn’t impact our performance much.

To test it out, I ran the same test on an Amazon m3.2xlarge machine, and got the following results (writing to the C: drive):

[image: results on an Amazon m3.2xlarge machine]

So for now I think that I’ll do the following. We’ll focus on the other costs beyond I/O, and profile heavily on systems with higher I/O costs.

Compression should work, in fact, my first thought was that we pay for less I/O with more CPU, but that is a guess without looking at the numbers.

Here is the cost of committing using compression (lz4):

[image: profiler output for commit with LZ4 compression]

And here it is without compression:

[image: profiler output for commit without compression]

So the good news is that we can visibly see that we reduced the I/O cost from 8 seconds to 3.8 seconds. And even with a compression cost of 2.6 seconds, we are still ahead.

However, the cost of the collections used is actually a lot worse from our perspective. So we are back to reducing computation costs first, then looking at compression again at a later point in time.

The cost of random writes

After pretty much getting to where we wanted with sequential writes, it is time to try to tackle the cost of random writes. Here is Esent & Voron doing random writes for 1 million entries:

[image: random write results for 1 million entries, Esent vs. Voron]

Yes, it is sad that this is the case, because we can do about 40 times better for sequential writes.

However… what is the reason for the slowness? As usual, the answer is to profile, profile & profile. In fact, we want to do profiling on multiple machines, to check that we don’t have a machine bias.

At any rate, as you can see, we are actually pretty close to Esent in random writes, but why are we so much slower than our performance during sequential writes?

Well, we actually covered that already. The major reason for the cost is the fact that we have to deal with very different B-Tree shapes. You can see the difference between sequentially and randomly inserted trees here. The more random you are, the bigger your tree is, and the more pages you have to touch whenever you update. Our scenario calls for making 100 random modifications in every transaction. That means that on every transaction, we have to touch many different parts of the tree.

Here are the results (in pages) for how much this cost us (in this test, we ran 5,000 transactions of 100 items each):

 

            Sequential      Random
Average           7.06      111.37
Max              11.00      134.00
Total        35,320.00  556,973.00
Std Dev           0.35        7.09

In other words, this means that on average, a sequential transaction wrote 7 pages, or about 28KB, while the average random transaction wrote about 450KB!

The total size written for sequential was 137MB, with 319MB allocated, while random wrote 2.1 GB out of 2.3 GB allocated. And that explains quite a lot about the whole thing. (Note that we count only the journal writes, not the data writes, because those are the limiting factor for write performance.)

Esent, by the way, writes about 465 MB for sequential and 2.3GB for random writes. Now we know why we are pretty close: the vast majority of the time is actually spent just writing to the disk, and we are stalled by the sheer amount of data we have.

Interestingly enough, I don’t think that there is a good way to handle that nicely. Certainly not with a B-Tree. We’ll see how we can work with that at a future post.

Transaction merging, locks, background threads and performance

The following code is taken from the leveldb codebase:

[image: the transaction merging code from the leveldb codebase]

This code is responsible for merging concurrent transactions. It operates using a mutex and condition variables. And the idea really impressed me when I first saw it.

I implemented that in Voron because I think that this is a great way to gain higher concurrency rates while still having a single writer mode (which is much easier to work with).

However, as it turned out, this isn’t a really good idea in practice. The way condition variables work, you lock the mutex, check your condition, and then wait on the condition variable. Waking up from a condition variable means that you have re-acquired the mutex.

What I want to happen:

  • Multiple concurrent threads will try to write at the same time.
  • One of them goes through and starts writing; it takes all the pending writes and writes them to disk.
  • It then releases all the waiting threads whose work it already completed.
  • They all move on from there without having to wait on one another.

However, because of the way condition variables are implemented, what will actually happen is that they each wake up one at a time, acquiring the mutex lock and then releasing it. While they are being released, they compete with one another, and they also compete with the next write.

We profiled this approach, and the result was that we could see how we were spending pretty much all of our time in just synchronizing threads.

Instead, we moved to a different approach, in which the first write will actually start a new thread, dedicated to writing batches. The way it works is that when a thread wants to write a batch, it adds the batch to a queue, wakes up the background writer thread, and then waits on an event. The background writer will read all the current batches and merge them into a single write transaction.

When it is done, it will wake up all the sleeping writers. We initially tried to use TaskCompletionSource for that, but we found that the inline nature of tasks made it too expensive to use. Instead, we use ManualResetEventSlim, and we explicitly wait / wake them. We even reuse events, so we don’t have to keep creating and disposing them.
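In outline, the mechanism looks something like the following sketch (this is not the actual Voron code; WriteAllAsSingleTransaction is a placeholder for the merge-and-write logic, and the real code reuses the events instead of allocating one per write):

    private class PendingWrite
    {
        public WriteBatch Batch;
        public ManualResetEventSlim Completed;
    }

    private readonly ConcurrentQueue<PendingWrite> _pending = new ConcurrentQueue<PendingWrite>();
    private readonly AutoResetEvent _writerWakeup = new AutoResetEvent(false);

    public void Write(WriteBatch batch)
    {
        var pending = new PendingWrite { Batch = batch, Completed = new ManualResetEventSlim(false) };
        _pending.Enqueue(pending);
        _writerWakeup.Set();       // wake the background writer thread
        pending.Completed.Wait();  // block until our batch has been written
    }

    private void BackgroundWriter()
    {
        while (true)
        {
            _writerWakeup.WaitOne();

            var completed = new List<PendingWrite>();
            PendingWrite pending;
            while (_pending.TryDequeue(out pending))
                completed.Add(pending);

            if (completed.Count == 0)
                continue;

            // Merge all the batches into a single write transaction (details elided).
            WriteAllAsSingleTransaction(completed.Select(x => x.Batch));

            foreach (var item in completed)
                item.Completed.Set(); // release every writer whose work we just did
        }
    }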

The end result is that we have a pretty sequential process for actually doing concurrent writes, which turns it into a simple producer/consumer problem from a threading perspective, and turns the actual writes into simply writing things out as fast as you can process them.

This also gives us the chance to do some re-ordering of operations to get better performance overall.

Voron Performance, the end is nigh (sequentially)

For the past few months, we have been doing a lot of work on performance for Voron. Performance for a storage engine is a very complex problem, composed of many competing issues.

For example, sequential writes vs. random writes. Or the cost of writes vs. the cost of reads. In this post, I am going to focus specifically on the single scenario we have focused most of our efforts on, sequential writes.

Sequential writes are important because they present the absolute sweet spot for a storage engine. In other words, this is the very first thing that needs to be fast, because if that isn’t fast, nothing else will be. And because this represents the absolute sweet spot, it is perfectly fine and very common to explicitly design your system to ensure that this is what you’ll use.

We have been doing a lot there. In particular, some of our changes included:

  • Moving from an fsync model (very slow) to unbuffered write-through writes.
  • Moving to vectored writes, reducing sys calls and the need for sequential memory space.
  • Writing our own immutable collection classes, meant specifically for what we’re doing.
  • Reducing the number of tree searches.
  • Optimizing the cost of copying data from the user’s stream to our own store.

There have probably been others, but those have been the major ones. Let us look at the numbers, shall we? I am going to compare us to Esent, since this is our current storage engine.

The test is writing out 10 million items, in 100,000 transactions, with 100 items per transaction. This is done in a purely sequential manner. Each item is a 128 byte value with a 16 byte key.

Here are the original code numbers.

[image: results for the original code, Esent vs. Voron]

Note that this was run on a W520 Lenovo with an SSD drive. The actual details don’t really matter, what matters is that we are comparing two runs happening on the same machine.

And now, let us compare Esent and Voron as they currently stand:

[image: current results, Esent vs. Voron]

Yes, you see clearly.

The big jump in the Esent numbers happened because we also worked on the test suite and optimized things there, but the important thing is that we are now significantly faster than Esent.

In fact, we are fast enough that it took me a while to accept that we are actually doing it. But yes, you are seeing an order of magnitude improvement over what we used to have.

Feel free to take a moment to do a victory dance, that is what I’m doing right now.

However, note the title of this post. This is for single threaded sequential inserts. The numbers are a lot worse for random writes, but we’ll deal with that in another post.

Voron Performance, the single biggest booster

One of the surprising points for improvement in our performance run was the following logic, responsible for copying the data from the user to our own memory:

    using (var ums = new UnmanagedMemoryStream(pos, value.Length, value.Length, FileAccess.ReadWrite))
    {
        value.CopyTo(ums);
    }

Those three lines of code were responsible for no less than 25% of our performance. It was obvious that something needed to be done. My belief is that the unmanaged memory stream is just not optimized for this scenario, resulting in a lot of copying, allocations and costs.

Here is what we did instead. We create a temporary space that is allocated once, like this:

[image: the code that allocates and pins the temporary buffer]

You can see that we are doing some interesting stuff there. In particular, we allocate a managed buffer, but also force the GC to pin it. We keep this around for the entire lifetime of the database, too. The idea here is that we want to avoid the cost of pinning & unpinning it all the time, even if it means that we have unmovable memory.

At any rate, the important thing about this is that it gives us access to the same memory from both managed and unmanaged perspectives. And that, in turn, leads to the following code:

[image: the code that copies the data through the pinned buffer]

We first read the values from the stream into the managed buffer, then copy them from the pinned buffer’s unmanaged pointer into our own memory.

The idea here is that we accept a Stream abstraction, and that can only work with managed buffers, so we have to go through this route, instead of having to copy the memory directly. The reason we do that is that we don’t want to force the user of our API to materialize the data fully. We want to be able to stream it into the database.
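Since the code itself is in the images above, here is a rough reconstruction of the idea (class and member names are made up, and the real code differs):

    public unsafe class PinnedCopyBuffer
    {
        private readonly byte[] _buffer = new byte[4096];
        private readonly GCHandle _handle;
        private readonly byte* _pointer;

        public PinnedCopyBuffer()
        {
            // Pin once, for the lifetime of the database, so we never pay for pin/unpin again.
            _handle = GCHandle.Alloc(_buffer, GCHandleType.Pinned);
            _pointer = (byte*)_handle.AddrOfPinnedObject();
        }

        // Stream can only fill managed arrays, so we read into the pinned byte[] and
        // then copy from its unmanaged view into the destination page.
        public void CopyStreamTo(Stream value, byte* destination)
        {
            while (true)
            {
                var read = value.Read(_buffer, 0, _buffer.Length);
                if (read == 0)
                    break;
                for (int i = 0; i < read; i++)   // the real code would use an optimized memcpy
                    destination[i] = _pointer[i];
                destination += read;
            }
        }
    }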

At any rate, this has made some serious improvement to our performance, but I’ll be showing the details on a future post.

Voron Performance, let go off that tree, dude!

B-Trees have a few really awesome properties, chief among them that they play very well with the hierarchical nature of data access that we have in our systems (L1 – L3 caches, main memory, disk, etc.). Another really nice property is that searches in a B-Tree have very low costs.

However, it became apparent to us during performance tests that actually searching the tree was very costly. Assuming that we have the following pseudo code:

    for i in range(1, 1000 * 1000):
        add(i, "val")

What is the cost of actually doing this? Well, an interesting tidbit about this is that every time we add an item to the tree, we need to search it, so we can find where in the tree that new value goes.

Cost of searching the tree, based on size:

Number of entries    Cost
1                    0.19
1,000                1.93
10,000               2.57
25,000               2.83
50,000               3.02
100,000              3.21
250,000              3.47
500,000              3.66
1,000,000            3.86
2,000,000            4.05

The cost, by the way, is O(log36 (N+1)). The log36 comes from the number of entries that we can fit in a single page. This cost ignores the actual search inside a page. But it is a good enough approximation for our needs.

Now, the cost of actually inserting 1 million items is the sum of all of those costs, which comes to 3,576,242.35. This is another example of the Schlemiel the Painter algorithm.

What we did was introduce a small cache, which remembers the last few pages that we inserted into. That turned the cost down from searching the tree to checking the cache, where we can usually find the value, and gave us a nice performance boost.
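The cache itself isn’t shown in the post, but the idea is roughly this (Page, Slice and SearchTreeFromRoot are stand-ins for the actual Voron types and methods):

    // Check the last few pages we used before paying for a full descent from the root.
    private readonly List<Page> _recentPages = new List<Page>();

    private Page FindPageFor(Slice key)
    {
        // Most recently used pages are at the end of the list.
        for (int i = _recentPages.Count - 1; i >= 0; i--)
        {
            var page = _recentPages[i];
            if (page.FirstKey.CompareTo(key) <= 0 && key.CompareTo(page.LastKey) <= 0)
                return page; // cache hit: no tree search needed
        }

        var found = SearchTreeFromRoot(key); // the usual O(log N) descent

        _recentPages.Add(found);
        if (_recentPages.Count > 4)          // remember only the last few pages
            _recentPages.RemoveAt(0);
        return found;
    }

Sequential inserts almost always hit the cache, which is why this helps that scenario so much.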

What about F# collections?

After disqualifying the BCL immutable collections for performance, I decided that I probably also need to look at the F# collection package, to see if they do any better.

I referenced the FSharp.Core dll and wrote the following:

    private static FSharpMap<long, object> FSharpAdd(int iterations)
    {
        var dic = MapModule.Empty<long, object>();

        var sp = Stopwatch.StartNew();
        var rnd = new Random(32);
        for (int i = 0; i < iterations; i++)
        {
            foreach (var item in Enumerable.Range(rnd.Next(0, i), Math.Max(i * 2, 16)))
            {
                dic = dic.Add(item, null);
            }
        }

        Console.WriteLine(sp.Elapsed + " Adding items, fsharp map");

        return dic;
    }

As I was writing this post, the code was running, and I have had time to do some emails, write the entire post, and check the CPU status, and it is still running. It is likely going to be worse than the immutable collections scenario.

That makes sense, since in the immutable collection scenario we had the ability to do many mutations all at once (using SetItems, which isn’t available on FSharpMap).

However, it clocks in at just over 1 hour and 16 minutes, making it the slowest contender overall.


More on collection performance, immutability and designing for a specific scenario

Note: I started writing this a few minutes after starting the performance test; it took that long to run.

As you are probably already aware, we used the Immutable Collections package in Voron. That turned out to be a cause for pretty major performance issues, so I started investigating other approaches for dealing with the problem of both immutability and high performing code.

Before I go on, I would like to explain what we are using the immutable collections for. Basically, we have a page translation table, which tells us which pages belong where. This table changes on every write transaction.

However, locking isn’t a good option for us, because once started, a transaction’s view of the world can never change. That means that while we are modifying it in a write transaction, read transactions should never see the modifications. This is exactly the case where immutable collections make sense. Read only views aren’t going to work, since once the write tx modifies the page table, that change is going to be visible to read transactions, which isn’t what we want.

Our test scenario is that after each write tx commit, we will update the page table with all the new page locations. That means that we wrote the following test case:

    var tmp = new Dictionary<long, object>();

    var sp = Stopwatch.StartNew();
    var rnd = new Random(32);
    for (int i = 0; i < iterations; i++)
    {
        foreach (var item in Enumerable.Range(rnd.Next(0, i), Math.Max(i * 2, 16)))
        {
            tmp[item] = null;
        }
    }

    Console.WriteLine(sp.Elapsed + " Adding items, mutable dictionary");

Every iteration is a “transaction”, and the work inside the for loop simulates us updating the page locations. Note, however, that the code above uses a mutable dictionary, and the changes are neither thread safe, nor are we isolating other readers from seeing our changes.

We can do this using immutable dictionary using the following code:

    var dic = ImmutableDictionary<long, object>.Empty;

    var sp = Stopwatch.StartNew();

    var rnd = new Random(32);
    for (int i = 0; i < iterations; i++)
    {
        var tmp = new Dictionary<long, object>();
        foreach (var item in Enumerable.Range(rnd.Next(0, i), Math.Max(i * 2, 16)))
        {
            tmp[item] = null;
        }

        dic = dic.SetItems(tmp);
    }

    Console.WriteLine(sp.Elapsed + " Adding items, immutable dictionary");

Note that we are using the optimal “SetItems” method, so we only modify the immutable dictionary once per transaction. That means that we create roughly 1 instance per transaction.

I then decided to see what would be the cost of running this with 50,000 transactions. Note that this means that in the end we have 149,697 items in the dictionary.

  • Cost for running this in mutable dictionary: 1 minute and 12 seconds.
  • Cost for running this in immutable dictionary: About 30 minutes.

Cost for reading 10,000 items from the dictionary:

  • 0.0002 seconds for mutable dictionary
  • 0.0061 seconds for immutable dictionary

Remember, the reason we went down this path is that we actually got performance issues specifically because of the SetItems calls. But note that we also have a performance problem for reads as well.

Initially, I tried to do the simple thing of just creating a new instance for every transaction. That worked quite nicely to remove the read costs, but it had a non trivial cost for adds, quite aside from the GC cost. The test is:

    var dic = new Dictionary<long, object>();

    var sp = Stopwatch.StartNew();
    var rnd = new Random(32);
    for (int i = 0; i < iterations; i++)
    {
        var tmp = new Dictionary<long, object>();
        foreach (var item in Enumerable.Range(rnd.Next(0, i), Math.Max(i * 2, 16)))
        {
            tmp[item] = null;
        }
        dic = new Dictionary<long, object>(dic);
        foreach (var o in tmp)
        {
            dic[o.Key] = o.Value;
        }
    }

    Console.WriteLine(sp.Elapsed + " Adding items, safe dictionary");

Cost for adding using new instance every time:

  • 7 minutes and 37 seconds.

So right off the bat, we are over 4 times faster than the immutable dictionary. But this is still expensive. At least it has the same read speed as a standard dictionary, because it is one.

In order to deal with that, I created something that I call LinkedDictionary. A linked dictionary is based on the following idea: since dictionaries can only contain a single value per key, it is perfectly safe to layer them on top of one another. This means that we can just have a linked list of dictionaries, and when we want to read from it, we just need to walk through the list in reverse addition order. That means that it is pretty much perfect for an append-only linked list model.

The downside is that we now have to traverse the linked list in order to find our value. The good thing is that we are able to limit that by introducing a merge step along the way. The merge step merges all the pending parts into a single dictionary. Currently we are doing a merge every 32 writes or so. This strikes a nice balance between read and write speed.
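A simplified sketch of the idea (the real implementation, in the gist linked at the end of this post, also handles removals and other operations):

    public class LinkedDictionary<TKey, TValue>
    {
        private readonly LinkedDictionary<TKey, TValue> _previous;
        private readonly Dictionary<TKey, TValue> _values;
        private readonly int _depth;

        public static readonly LinkedDictionary<TKey, TValue> Empty =
            new LinkedDictionary<TKey, TValue>(null, new Dictionary<TKey, TValue>(), 0);

        private LinkedDictionary(LinkedDictionary<TKey, TValue> previous, Dictionary<TKey, TValue> values, int depth)
        {
            _previous = previous;
            _values = values;
            _depth = depth;
        }

        public LinkedDictionary<TKey, TValue> SetItems(Dictionary<TKey, TValue> changes)
        {
            if (_depth >= 32) // merge step: collapse all the layers into a single dictionary
            {
                var merged = new Dictionary<TKey, TValue>();
                CopyTo(merged);
                foreach (var kvp in changes)
                    merged[kvp.Key] = kvp.Value;
                return new LinkedDictionary<TKey, TValue>(null, merged, 0);
            }
            // Otherwise, just add a new layer on top of the existing ones.
            return new LinkedDictionary<TKey, TValue>(this, new Dictionary<TKey, TValue>(changes), _depth + 1);
        }

        public bool TryGetValue(TKey key, out TValue value)
        {
            // Walk the layers in reverse addition order; the newest layer wins.
            for (var current = this; current != null; current = current._previous)
            {
                if (current._values.TryGetValue(key, out value))
                    return true;
            }
            value = default(TValue);
            return false;
        }

        private void CopyTo(Dictionary<TKey, TValue> target)
        {
            if (_previous != null)
                _previous.CopyTo(target); // older layers first, so newer values overwrite them
            foreach (var kvp in _values)
                target[kvp.Key] = kvp.Value;
        }
    }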

The end result is that for 50,000 iterations, we have the following values:

              Mutable Dictionary        Immutable Dictionary        Cloned Dictionary         Linked Dictionary
50,000 adds   1 minute and 12 seconds   29 minutes and 28 seconds   7 minutes and 37 seconds  4 minutes and 37 seconds
10,000 reads  0.0002 seconds            0.0064 seconds              0.0002 seconds            0.0032 seconds

In other words, our writes cost about 4 times what the standard mutable dictionary costs, and our reads cost half of what the immutable dictionary does.

For reference, the entire code is in the following gist.

HotSpot Shell Game

One of the annoying things about doing performance work is that you sometimes feel like you found the problem. You fix it, and another pops out.

After the previous run of work, we have tackled two important perf issues:

  • Immutable collections – which were replaced with more memory intensive but hopefully faster safe collections.
  • Transaction contention from background work – which we reduced and close to eliminated.

Once we have done that, it is time to actually look at the code in the profiler again. Note that I am writing this while I am traveling on the train (on battery power), so you can pretty much ignore any actual numbers, we want to look at the percentages.

Let us look at the contention issue first. Here is the old code:

[image: profiler output for the old contention code]

And the new one:

[image: profiler output for the new contention code]

We still have some cost, but that appears to have been drastically cut down. I’ll call it good for now.

What about transaction commit? Here is the old code:

[image: profiler output for the old transaction commit]

And here is the new one:

[image: profiler output for the new transaction commit]

What we can see is that our change didn’t actually improve anything much. In fact, we are worse off. We’ll probably go with a more complex system of chained dictionaries, rather than just the plain copying that appears to be causing so much trouble.

However, there is something else to note. Both Commit and NewTransaction together are less than 60% of the cost of the operation. The major cost is actually in adding items to the in-memory tree:

[image: profiler output for adding items to the in-memory tree]

In particular, it looks like finding the right page to place the item is what is costing us so much. We added 2 million items (once sequential, once random), which means that we need to do an O(logN) operation 2 million times. I don’t know if we can improve on the way it works for random writes, but we should be able to do something about how it behaves for sequential writes.


SafeList, SafeDictionary–fast immutable structures

After the issues we ran into with immutable structures, I decided that I still wanted to maintain the same guarantees that immutable data structures gave us, while keeping the speed of mutable structures.

I managed to do that by making different design choices when building my implementation. While the BCL immutable collections are based on binary trees, and attempt to spare GC pressure in favor of CPU cycles, I find that for a lot of the things that we do, we can spend the memory in favor of actually being faster overall. In particular, this is true since the immutable collections are an order of magnitude slower for reads than their mutable counterparts. I could have dealt with the slow writes, somehow. But we do a lot of reads, and that is utterly unacceptable.

Here is how I implemented SafeList:

    public class SafeList<T> : IEnumerable<T>
    {
        List<T> _inner = new List<T>();

        public static readonly SafeList<T> Empty = new SafeList<T>();

        private SafeList()
        {
        }

        public SafeList<T> AddRange(IEnumerable<T> items)
        {
            var inner = new List<T>(_inner);
            inner.AddRange(items);
            return new SafeList<T>
            {
                _inner = inner
            };
        }

        public SafeList<T> Add(T item)
        {
            return new SafeList<T>
            {
                _inner = new List<T>(_inner) { item }
            };
        }

        public int Count
        {
            get { return _inner.Count; }
        }

        public T this[int i]
        {
            get { return _inner[i]; }
        }

        public SafeList<T> RemoveAll(Func<T, bool> filter)
        {
            return new SafeList<T>
            {
                _inner = new List<T>(_inner.Where(x => filter(x) == false))
            };
        }

        public T Find(Predicate<T> predicate)
        {
            return _inner.Find(predicate);
        }

        public SafeList<T> RemoveAllAndGetDiscards(Predicate<T> filter, out List<T> discards)
        {
            var list = new List<T>();

            discards = list;

            return new SafeList<T>
            {
                _inner = new List<T>(_inner.Where(x =>
                {
                    if (filter(x) == false)
                    {
                        list.Add(x);
                        return false;
                    }
                    return true;
                }))
            };
        }

        // (Elided in the original post) IEnumerable<T> implementation, delegating to the inner list.
        public IEnumerator<T> GetEnumerator()
        {
            return _inner.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }

The implementation for SafeDictionary is pretty similar as well. Note that I have dedicated & optimized methods for the type of things that I need in my code.

But the key part here is that I gain the safety by always creating another copy of the underlying implementation. We are never actually mutating anything.

Sure, this results in us having to deal with a lot more allocations, but we get the same speed for reads as you do for the standard mutable collections. And more importantly, we are still faster for writes.


Immutable collections performance, take II

Why is an immutable list over 16 times slower than a standard list? I took a peek at what it was actually doing, and it made a lot of sense. In order to maintain efficient indexed access, the actual storage of the data in the immutable list is a binary tree, with the index being used as the key.

This results in a much higher cost for pretty much everything. Let us look at the following:

    var listsp = Stopwatch.StartNew();
    var list = new List<int>(Enumerable.Range(0, 10*1000*1000));

    for (int i = 0; i < list.Count; i++)
    {
        var _ = list[i];
    }

    Console.WriteLine(listsp.Elapsed);

    var ilist = ImmutableList<int>.Empty.AddRange(list);
    listsp.Restart();

    for (int i = 0; i < ilist.Count; i++)
    {
        var _ = ilist[i];
    }
    Console.WriteLine(listsp.Elapsed);

The List<T> run takes 0.23 seconds; ImmutableList<T> takes 1.28 seconds. When I use foreach instead, we get 0.22 seconds vs. 2.29 seconds.

As you can see from the blog post describing them, because immutable collections are mostly implemented as binary trees, I don’t really think that there is a good way to approach this as is. The problem is that the immutable collections actually need to do a lot more than what I need them to do.

Now, it might have been acceptable to use them if the perf issue was limited to just writing to them. But as you can see, we have the same problem when we are reading, and that is quite unacceptable.


Immutable Collections performance

After finding out that a lot of our costs in Voron is actually related to immutable collections, I decided to run some isolated perf tests.

    private static void Dictionary()
    {
        var dic = new Dictionary<object, object>();
        var sp = Stopwatch.StartNew();

        for (int i = 0; i < 10*1000*1000; i++)
        {
            var obj = new object();
            dic[obj] = obj;
        }

        Console.WriteLine(sp.Elapsed);
    }

    private static void ImmutableDictionary()
    {
        var dic = ImmutableDictionary<object, object>.Empty;
        var sp = Stopwatch.StartNew();

        for (int i = 0; i < 10 * 1000 * 1000; i++)
        {
            var obj = new object();
            dic = dic.SetItem(obj, obj);
        }

        Console.WriteLine(sp.Elapsed);
    }

Dictionary: 3.639 seconds

ImmutableDictionary: 1 minute and 58 seconds

Yes, that is correct, the immutable version is over thirty times slower. And given that we are only using 10 million items, that is a ridiculously high cost.

I decided to do some tests to see if it is just the number of calls that we are making here:

    private static void ImmutableDictionary()
    {
        var dic = ImmutableDictionary<object, object>.Empty;
        var sp = Stopwatch.StartNew();

        dic = dic.SetItems(Enumerable.Range(0, 10*1000*1000).Select(i =>
        {
            var obj = new object();
            return new KeyValuePair<object, object>(obj, obj);
        }));

        Console.WriteLine(sp.Elapsed);
    }

ImmutableDictionary with SetItems: 1 minute

So it appears that it isn’t the number of calls, but something intrinsic to the way it works. And how about lists?

    private static void List()
    {
        var list = new List<object>();
        var sp = Stopwatch.StartNew();

        for (int i = 0; i < 10 * 1000 * 1000; i++)
        {
            var obj = new object();
            list.Add(obj);
        }

        Console.WriteLine(sp.Elapsed);
    }

    private static void ImmutableList()
    {
        var list = ImmutableList<object>.Empty;
        var sp = Stopwatch.StartNew();

        for (int i = 0; i < 10 * 1000 * 1000; i++)
        {
            var obj = new object();
            list = list.Add(obj);
        }

        Console.WriteLine(sp.Elapsed);
    }

List: 0.9 seconds

ImmutableList: 16 seconds

Ouch.

I think we are going to need to do something about this. Even though I love the idea of immutable collections, 16 – 32 times slower is just utterly unacceptable.

But this is for writes, how about reads?

I had written the following for both of them:

    var sp = Stopwatch.StartNew();
    for (int i = 0; i < 10*1000*1000; i++)
    {
        object value;
        dic1.TryGetValue(i, out value);
    }

    Console.WriteLine(sp.Elapsed);

    var sp1 = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        object value;
        dic2.TryGetValue(i, out value);
    }
    Console.WriteLine(sp1.Elapsed);

Dictionary: 0.45 seconds

ImmutableDictionary: 3.11 seconds

So the summary is that the performance for our workloads is probably going to be simply unacceptable. We’ll have to find another way to handle this. The way to handle it nicely is to basically copy the values. So I wanted to know how long it would take to fully clone a 10 million item dictionary. The answer: 0.6 seconds. Doing the same with the immutable dictionary? 16 seconds.
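To be clear about what “fully clone” means here (the exact test isn’t shown; dic stands for the 10 million item source dictionary, and the timings are the ones just quoted):

    var clone = new Dictionary<long, object>(dic);              // copy constructor: ~0.6 seconds
    var immutableClone = ImmutableDictionary.CreateRange(dic);  // rebuilding an immutable dictionary: ~16 seconds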

So we are going to need to do something else.

The difference between benchmarks & performance tests

Also known as: Please check the subtitle of this blog.

This post is in response to this one. Kelly took offence at this post about Voron performance. In particular, it appears that the major issues are:

This benchmark doesn’t actually provide much useful information. It is too short and compares fully featured DBMS systems to storage engines. I always stress very much that people never make decisions based on benchmarks like this.

These paint the fully featured DBMS systems in a negative light that isn’t a fair comparison. They are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.

This isn’t a benchmark. This is just an interim step along the way of developing Voron. It is a way for us to see where we stand and where we need to go. A benchmark includes full details about what you did (machine specs, running environment, full source code, etc). This is just us putting stress on our machine and comparing where we are at. And yes, we could have done it in isolation, but that wouldn’t really give us any major advantage. We need to see how we compare to other databases.

And yes, we compare apples to oranges here when we compare a low level storage engine like Voron to SQL Server. I am well aware of that. But that isn’t the point. For the same reason that we are currently doing a lot of micro benchmarks rather than the 48 hours ones we have in the pipeline.

I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.

With regards to FoundationDB, I am sure they are as happy about it as I am about them making silly claims about RavenDB transaction support. And the source code is available if you really want it; in fact, we got FoundationDB in there because we had an explicit customer request, and because they contributed the code for that.

Next, let us move to something else equally important. This is my personal blog. I publish here things that I do on a daily basis. And if I am currently in a performance boost stage, you’re going to be getting a lot of details on that. Those are the results of performance runs, they aren’t benchmarks. They don’t get anywhere beyond this blog. When we’ll put the results on ravendb.net, or something like that, then it will be a proper benchmark.

And while I fully agree that making decisions based on micro benchmarks is a silly way to go about doing so, the reality is that many people do just that. So one of the things that I’m focusing on is exactly those things. It helps that we currently see a lot of places to improve in those micro benchmarks. We already have a plan (and code) to see how we do on a 24 – 48 hours benchmark, which would also allow us to see all sort of interesting things (mixed reads & writes, what happens when you go beyond physical memory size, longevity issues, etc).

Hunting the performance bottleneck

We got a support request from a customer, and this time it was quite a doozy. They were using the FreeDB dataset as a test bed for a lot of experiments, and they found very slow indexing speed with it. In fact, what they found was utterly unacceptable indexing speed, to be frank. It took days to complete. However, that ran contrary to all of the optimizations that we have done in the past few years.

So something there was quite fishy. Luckily, it was fairly simple to reproduce, and the culprit was very easily identified. It was the SQL Replication Bundle. But why? That turned out to have a pretty complex answer.

The FreeDB dataset currently contains over 3.1 million records, and the sample the customer sent us had about 8 indexes, varying in complexity from the trivial to full text analyzed and map reduce indexes. We expect such a workload, plus the replication to SQL, to take a while. But it should be a pretty fast process.

What turned out to be the problem was the way the SQL Replication bundle works. Since we don’t know what changes you are going to replicate to SQL, the first thing we do is delete all the data that might have previously been replicated. In practice, it means that we execute something like:

DELETE FROM Disks WHERE DocumentId in (@p1, @p2, @p3)

INSERT INTO Disks (...) VALUES( ... )
INSERT INTO Disks (...) VALUES( ... )
INSERT INTO Disks (...) VALUES( ... )

And over time, this became slower & slower. Now, SQL Replication needs access to a lot of documents (potentially all of them), so we use the same prefetcher technique that we use for indexing. And we also use the same optimizations to decide how much to load.

However, in this case, we had the SQL Replication being slow, and because we use the same optimization, to the optimizer it looked like we had a very slow index. That calls for reducing the load on the server, so we can have greater responsiveness and reduce overall resource utilization. And that impacted the indexes. In effect, SQL Replication being slow forced us to feed the data into the indexes in small tidbits, and that drastically increased the I/O costs that we had to pay.

So the first thing to do was to actually break them apart; we now have different optimizer instances for indexes and SQL Replication (and RavenDB Replication, for that matter), and they cannot impact one another.

But the root cause was that SQL Replication was slow. And I think you should be able to figure out why from the information outlined above.

As we replicated more and more data into SQL, we increased the table size. And as we increased the table size, statements like our DELETE would take more and more time. SQL was doing a lot of table scans.

To be honest, I never really thought about it. RavenDB in the same circumstances would just self optimize and things would quickly get faster. SQL Server (and any other relational database) would just be dog slow until you came and added the appropriate index.
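Just to illustrate the kind of fix involved, something like the following is what ends up helping here. The table and column names are taken from the example statement above; the actual schema (and the connection string) on the customer’s side is an assumption:

using System.Data.SqlClient;

class AddReplicationIndex
{
    static void Main()
    {
        // index the column used by the DELETE ... WHERE DocumentId IN (...) statement,
        // so SQL Server can seek instead of scanning the whole table on every batch
        using (var connection = new SqlConnection("Data Source=.;Initial Catalog=ReplicationTarget;Integrated Security=true"))
        using (var command = new SqlCommand(
            "CREATE NONCLUSTERED INDEX IX_Disks_DocumentId ON Disks (DocumentId)", connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}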

Once that was done, our performance was back on track and we could run things speedily both for indexes and for SQL Replication.

What is your control group?

One of the areas where we think Voron can be improved is the free space utilization policy. In particular, smarter free space utilization can lead to better performance, since we won’t have to seek so much.

I spent some time working on that, and I got something that, on paper at least, looks much better performance wise. But… actual benchmarks showed little to no improvement, and in some cases, actual degradation! That was the point when I realized that I actually needed to have some sort of a control, to see what would be the absolute optimal scenario for us. So I wrote a null free space policy. With no free space to reuse, Voron will always go to the end of the file, giving us the best case scenario of sequential writes.
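Conceptually, the null policy is just a free space source that never has anything to hand out. Here is a rough sketch of the idea; the interface shown is made up for illustration, not Voron’s actual abstraction:

using System.Collections.Generic;

// Hypothetical interface, for illustration only.
public interface IFreeSpacePolicy
{
    long? TryAllocateFromFreeSpace(int numberOfPages);
    void RegisterFreePages(IEnumerable<long> pages);
}

public class NullFreeSpacePolicy : IFreeSpacePolicy
{
    public long? TryAllocateFromFreeSpace(int numberOfPages)
    {
        return null; // never reuse freed pages, so the caller always allocates at the end of the file
    }

    public void RegisterFreePages(IEnumerable<long> pages)
    {
        // intentionally forget about freed pages
    }
}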

This gives us the following behavior:

Flush      1 with   2 pages -   8 kb writes and   1 seeks (  2 leaves,   0 branches,   0 overflows)
Flush      2 with   8 pages -  32 kb writes and   1 seeks (  7 leaves,   1 branches,   0 overflows)
Flush      3 with  10 pages -  40 kb writes and   1 seeks (  9 leaves,   1 branches,   0 overflows)
Flush     27 with  74 pages - 296 kb writes and   1 seeks ( 72 leaves,   2 branches,   0 overflows)
Flush     28 with  74 pages - 296 kb writes and   1 seeks ( 72 leaves,   2 branches,   0 overflows)
Flush     29 with  72 pages - 288 kb writes and   1 seeks ( 70 leaves,   2 branches,   0 overflows)
Flush  1,153 with 155 pages - 620 kb writes and   1 seeks (102 leaves,  53 branches,   0 overflows)
Flush  1,154 with 157 pages - 628 kb writes and   1 seeks (104 leaves,  53 branches,   0 overflows)
Flush  1,155 with 165 pages - 660 kb writes and   1 seeks (108 leaves,  57 branches,   0 overflows)
Flush  4,441 with 191 pages - 764 kb writes and   1 seeks (104 leaves,  87 branches,   0 overflows)
Flush  4,442 with 196 pages - 784 kb writes and   1 seeks (107 leaves,  89 branches,   0 overflows)
Flush  4,443 with 198 pages - 792 kb writes and   1 seeks (108 leaves,  90 branches,   0 overflows)
Flush  7,707 with 200 pages - 800 kb writes and   1 seeks (106 leaves,  94 branches,   0 overflows)
Flush  7,708 with 204 pages - 816 kb writes and   1 seeks (106 leaves,  98 branches,   0 overflows)
Flush  7,709 with 211 pages - 844 kb writes and   1 seeks (113 leaves,  98 branches,   0 overflows)
Flush  9,069 with 209 pages - 836 kb writes and   1 seeks (107 leaves, 102 branches,   0 overflows)
Flush  9,070 with 205 pages - 820 kb writes and   1 seeks (106 leaves,  99 branches,   0 overflows)
Flush  9,071 with 208 pages - 832 kb writes and   1 seeks (108 leaves, 100 branches,   0 overflows)

And with this, 10,000 transactions with 100 random values each:

fill rnd buff separate tx          :    106,383 ms      9,400 ops / sec

And that tells me that for the best case scenario, there is something else that is causing this problem, and it ain’t the cost of doing seeks. I dropped the number of transactions to 500 and ran it through a profiler, and I got the following:

[image: profiler output]

In other words, pretty much the entire time was spent just calling FlushViewOfFile. However, I think that we optimized that enough already, didn’t we? Looking at the calls, it seems that we have just one FlushViewOfFile per transaction in this scenario.

In fact, looking at the actual system behavior, we can see:

[image: the system I/O behavior during the test]

So seeks wise, we are good. What I can’t understand, however, is why we see those ReadFile calls. Looking at the data, it appears that we run into this whenever we access a new portion of the file, so this is the mmap subsystem paging the file contents into memory before we actually access it. It is actually pretty great that it is able to page 1 MB at a time.

Next, let us see what else we can do here. I ran the 500 tx test on an HDD, and it gave me the following results.

fill rnd sync separate tx          :     25,540 ms      1,958 ops / sec

But note that each commit involves two writes. One at the end of the file, and one at the file beginning (which is the actual final act of the commit). What happens if we just remove that part?

This gives me a very different number:

fill rnd sync separate tx          :     21,764 ms      2,297 ops / sec

So just seeking and writing a single page cost us 17% of our performance. Here are the details from running this test:

[image: details from running this test]

Now, this is a meaningless test, added just to check what the relative costs are. We have to do the header write, otherwise we can’t do real transactions.

For fun, I ran the same thing using sequential writes, giving me 3,619 ops / sec. Since in both cases we are actually doing sequential writes, the major difference was how much we actually wrote. This is the view of writing sequentially:

[image: the I/O view of writing sequentially]

As you can see, we only have to write 8 – 10 pages per transaction, compared to 110 – 130 in the random case. And that obviously has a lot of implications.

All of this has taught me something very important. In the end, the actual free space policy matters, but not that much. So I need to select something that is good, but that is about it.

Raven Storage–Voron’s Performance

Voron is the code name for another storage engine that we are currently trying out, based on LMDB. After taking it for a spin for a while, it is now pretty complete, and I decided to give it some perf test runs. So far, there has been zero performance work. All the usual caveats apply here, with regards to short test runs, etc.

Just like before, we found that this was horribly slow on the first run. The culprit was our debug code that verified the entire structure whenever we added or removed something. Once we ran it in release mode we started getting more interesting results.

Here is the test code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var value = new byte[100];
        new Random().NextBytes(value);
        var ms = new MemoryStream(value);
        for (long i = 0; i < Count; i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }

        tx.Commit();
    }
     using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
     {
         DebugStuff.RenderAndShow(tx, tx.GetTreeInformation(env.Root).Root);

         tx.Commit();
     }
}

We write 1 million entries, each with a 100 byte value and an 8 byte key. We run it in three modes:

fill seq none                      :      9,578 ms    104,404 ops / sec
fill seq buff                      :     10,802 ms     92,575 ops / sec
fill seq sync                      :      9,387 ms    106,528 ops / sec

None means no flushing to disk; let the OS deal with that completely. Buff means flush to the OS buffers, but not to disk, and sync means do a full fsync.
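In FileStream terms, the three modes roughly correspond to the following. This is just an analogue for explanation; the FlushMode enum and the mapping are mine, not the actual pager code:

using System.IO;

public enum FlushMode
{
    None,
    Buffers,
    Full
}

public static class FlushModes
{
    public static void Flush(FileStream file, FlushMode mode)
    {
        switch (mode)
        {
            case FlushMode.None:
                break;             // let the OS write things back whenever it feels like it
            case FlushMode.Buffers:
                file.Flush(false); // hand the data to the OS, but don't force it to disk
                break;
            case FlushMode.Full:
                file.Flush(true);  // full fsync, the data is actually durable
                break;
        }
    }
}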

Now, this test is a pretty stupid way to go about it, I have to say. It is doing everything in a single transaction, and we are actually counting the time to close & open the db here as well, but that is okay for now. We aren’t interested in real numbers, just some rough ideas.

Now, let us see how we read it:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    using (var tx = env.NewTransaction(TransactionFlags.Read))
    {
        var ms = new MemoryStream(100);
        for (int i = 0; i < Count; i++)
        {
            var key = i.ToString("0000000000000000");
            using (var stream = env.Root.Read(tx, key))
            {
                ms.Position = 0;
                stream.CopyTo(ms);
            }
        }

        tx.Commit();
    }
}

And this gives us:

read seq                           :      3,289 ms    304,032 ops / sec

And again, this is with opening & closing the entire db.

We could do better with pre-allocation of space on the disk, etc. But I wanted to keep things realistic and to allow us to grow.

Next, I wanted to see how much we would gain by parallelizing everything, so I wrote the following code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        var currentBase = i;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                for (int j = 0; j < Count / parts; j++)
                {
                    var current = j * currentBase;
                    var key = current.ToString("0000000000000000");
                    using (var stream = env.Root.Read(tx, key))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }

                tx.Commit();
            }

            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}

I then ran it with 1 – 16 parts, to see how it behaves. Here are the details for this machine:

[image: machine details]

And the results pretty much match what I expected:

read seq                           :      3,317 ms    301,424 ops / sec
read parallel 1                    :      2,539 ms    393,834 ops / sec
read parallel 2                    :      1,950 ms    512,711 ops / sec
read parallel 4                    :      2,201 ms    454,172 ops / sec
read parallel 8                    :      2,139 ms    467,387 ops / sec
read parallel 16                   :      2,010 ms    497,408 ops / sec

We get a 2x perf improvement from running on two cores. Running on 4 threads requires some dancing around, which caused some perf drop, and after that we mostly see more time spent in thread switching than anything else. As you can see, we get a really pretty jump in performance the more cores we use, until we reach the actual machine limitations.

Note that I made sure to clear the OS buffer cache before each test. If we don't do that, we get:

read seq                           :      2,562 ms    390,291 ops / sec
read parallel 1                    :      2,608 ms    383,393 ops / sec
read parallel 2                    :      1,868 ms    535,220 ops / sec
read parallel 4                    :      1,646 ms    607,283 ops / sec
read parallel 8                    :      1,673 ms    597,557 ops / sec
read parallel 16                   :      1,581 ms    632,309 ops / sec

So far, I am pretty happy with those numbers. What I am not happy with is the current API, but I’ll talk about this in a separate post.

Optimizing writes in Voron

As I mentioned, one of the things that I have been working on with Voron is optimizing the sad case of random writes.  I discussed some of the issues that we had already, and now I want to explain how we approach resolving them.

With LMDB, free space occurs on every write, because we don’t make modifications in place; instead, we make modifications to a copy, and free the existing page to be reclaimed later. The way the free space reclamation works, a new page can be allocated anywhere in the file. That can lead to a lot of seeks. With Voron, we use a more complex policy. The file is divided into 4 MB sections, and we aggregate free space in each section. When we need more space, we will find a section with enough free space and use that, and we will continue to use that section for as long as we can. The end result is that we tend to be much more local in the way we are reusing space.
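Here is a rough sketch of that idea. The names and data structures are made up for illustration; the real implementation is more involved:

using System.Collections.Generic;
using System.Linq;

public class SectionedFreeSpacePolicy
{
    private const int PagesPerSection = 1024; // 4 MB sections, assuming 4 KB pages

    private readonly Dictionary<long, SortedSet<long>> _freePagesBySection =
        new Dictionary<long, SortedSet<long>>();
    private long _currentSection = -1;

    public void RegisterFreePage(long pageNumber)
    {
        var section = pageNumber / PagesPerSection;
        SortedSet<long> pages;
        if (_freePagesBySection.TryGetValue(section, out pages) == false)
            _freePagesBySection[section] = pages = new SortedSet<long>();
        pages.Add(pageNumber);
    }

    public long? TryAllocate()
    {
        // keep using the current section for as long as it still has free pages
        SortedSet<long> pages;
        if (_currentSection != -1 &&
            _freePagesBySection.TryGetValue(_currentSection, out pages) &&
            pages.Count > 0)
        {
            return TakePage(pages);
        }

        // otherwise, move to the section with the most free space and stick with it
        var best = _freePagesBySection
            .Where(kvp => kvp.Value.Count > 0)
            .OrderByDescending(kvp => kvp.Value.Count)
            .FirstOrDefault();

        if (best.Value == null)
            return null; // no free space anywhere, the caller extends the file instead

        _currentSection = best.Key;
        return TakePage(best.Value);
    }

    private static long TakePage(SortedSet<long> pages)
    {
        var page = pages.Min;
        pages.Remove(page);
        return page;
    }
}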

Here are the original results:

Flush     1 with  12 pages   - 48 kb writes and 1  seeks   (11 leaves, 1 branches, 0 overflows)
Flush     2 with  13 pages   - 52 kb writes and 1  seeks   (12 leaves, 1 branches, 0 overflows)
Flush     3 with  21 pages   - 84 kb writes and 1  seeks   (20 leaves, 1 branches, 0 overflows)
 
Flush    27 with  76 pages   - 304 kb writes and 1 seeks  (75 leaves,  1 branches, 0 overflows)
Flush    28 with  73 pages   - 292 kb writes and 1 seeks  (72 leaves,  1 branches, 0 overflows)
Flush    29 with  84 pages   - 336 kb writes and 1 seeks  (80 leaves,  4 branches, 0 overflows)
 
Flush 1,153 with 158 pages - 632 kb writes and 67  seeks (107 leaves, 51 branches, 0 overflows)
Flush 1,154 with 168 pages - 672 kb writes and 65  seeks (113 leaves, 55 branches, 0 overflows)
Flush 1,155 with 165 pages - 660 kb writes and 76  seeks (113 leaves, 52 branches, 0 overflows)
 
Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
 
Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
 
Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

And here are the improved results:

Flush      1 with   2 pages -     8 kb writes and   1 seeks (  2 leaves,   0 branches,   0 overflows)
Flush      2 with   8 pages -    32 kb writes and   1 seeks (  7 leaves,   1 branches,   0 overflows)
Flush      3 with  10 pages -    40 kb writes and   1 seeks (  9 leaves,   1 branches,   0 overflows)
  
Flush     27 with  73 pages -   292 kb writes and   1 seeks ( 72 leaves,   1 branches,   0 overflows)
Flush     28 with  72 pages -   288 kb writes and   1 seeks ( 71 leaves,   1 branches,   0 overflows)
Flush     29 with  71 pages -   284 kb writes and   1 seeks ( 70 leaves,   1 branches,   0 overflows)
  
Flush  1,153 with 157 pages -   628 kb writes and  11 seeks (105 leaves,  52 branches,   0 overflows)
Flush  1,154 with 159 pages -   636 kb writes and   2 seeks (107 leaves,  52 branches,   0 overflows)
Flush  1,155 with 167 pages -   668 kb writes and  17 seeks (111 leaves,  56 branches,   0 overflows)
  
Flush  4,441 with 210 pages -   840 kb writes and  11 seeks (121 leaves,  86 branches,   3 overflows)
Flush  4,442 with 215 pages -   860 kb writes and   1 seeks (124 leaves,  88 branches,   3 overflows)
Flush  4,443 with 217 pages -   868 kb writes and   9 seeks (126 leaves,  89 branches,   2 overflows)
  
Flush  7,707 with 231 pages -   924 kb writes and   7 seeks (136 leaves,  93 branches,   2 overflows)
Flush  7,708 with 234 pages -   936 kb writes and   9 seeks (136 leaves,  97 branches,   1 overflows)
Flush  7,709 with 241 pages -   964 kb writes and  13 seeks (140 leaves,  97 branches,   4 overflows)

Flush  9,069 with 250 pages - 1,000 kb writes and   6 seeks (144 leaves, 101 branches,   5 overflows)
Flush  9,070 with 250 pages - 1,000 kb writes and  13 seeks (145 leaves,  98 branches,   7 overflows)
Flush  9,071 with 248 pages -   992 kb writes and  12 seeks (143 leaves,  99 branches,   6 overflows)

Let us plot this in a chart, so we can get a better look at things:

[image: chart of seeks per transaction]

As you can see, this is a pretty major improvement. But it came at a cost; let us see the size per transaction…

[image: chart of size per transaction]

So we improved on the seeks / tx, but got worse on the size / tx. That is probably because of the overhead of keeping the state around, but it also relates to some tunable configuration that we added (the amount of free space in a section that will make it eligible for use).

Annoyingly, after spending quite a bit of time & effort on this, we don’t see a major perf boost here. But I am confident that it’ll come.

Seek, and you shall find, or maybe delay a lot

I have been doing a lot more analysis of the actual disk access patterns that we saw in Voron. In the beginning, I strongly suspected that the problem was with how I was using memory mapped files. In particular, some experiments using my better knowledge of the environment have led to substantial performance gains. Instead of calling FlushViewOfFile(0, fileLen), I was able to call FlushViewOfFile on just the ranges that I knew changed.
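For reference, the FlushViewOfFile declaration below is the standard Win32 P/Invoke signature; the dirty range handling around it is just a sketch of the approach, not the actual Voron code:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Runtime.InteropServices;

public static class PartialFlusher
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

    // flush only the ranges we know were modified, instead of the entire mapped view
    public static void FlushDirtyRanges(IntPtr viewBase, IEnumerable<Tuple<long, long>> dirtyRanges)
    {
        foreach (var range in dirtyRanges) // Item1 = offset into the view, Item2 = length in bytes
        {
            var start = new IntPtr(viewBase.ToInt64() + range.Item1);
            if (FlushViewOfFile(start, new UIntPtr((ulong)range.Item2)) == false)
                throw new Win32Exception(Marshal.GetLastWin32Error());
        }
    }
}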

That helped, but it wasn’t nearly enough. So I ran some quick tests using a file stream, and realized that the fault was obviously with the broken way Windows is using memory mapped files. So I decided (no, I didn’t write my own mmap impl, thank you very much) to take manual care of how Voron is writing to disk. I used WriteFile with all the bells and whistles, even had async I/O for a while there. I was able to directly pinpoint the exact locations where we needed to write, pass that to Windows in an efficient manner, and be done with it. It required me to write a malloc implementation in managed code, but it was quite efficient looking code.

And then I ran it. Just to give you some perspective, the scenario in question here is 10,000 transactions doing 100 random writes each. Using the memory map approach, after the FlushViewOfFile range optimization, I got roughly 7,000 operations / second. Using my own hand written, optimized I/O, I got… 262 operations / second. Okaaaay… so maybe I need to go back to the drawing board.

I sat down and started working on figuring out what is actually going on. I looked at the actual output that we had; in particular, how many writes did we have per transaction? We are writing 100 records with a 16 byte key and a 100 byte value. That means that the absolute minimum amount we can write would be 11,600 bytes. However, we are writing in 4Kb pages, which brings us to 3 pages and 12,288 bytes per transaction. Of course, this ignores things like the writing of branch pages in the B+Tree, so let us see what the real numbers are.

   1: Flush     1 with  12 pages   - 48 kb writes and 1  seeks   (11 leaves, 1 branches, 0 overflows)
   2: Flush     2 with  13 pages   - 52 kb writes and 1  seeks   (12 leaves, 1 branches, 0 overflows)
   3: Flush     3 with  21 pages   - 84 kb writes and 1  seeks   (20 leaves, 1 branches, 0 overflows)
   4:  
   5: Flush    27 with  76 pages   - 304 kb writes and 1 seeks  (75 leaves,  1 branches, 0 overflows)
   6: Flush    28 with  73 pages   - 292 kb writes and 1 seeks  (72 leaves,  1 branches, 0 overflows)
   7: Flush    29 with  84 pages   - 336 kb writes and 1 seeks  (80 leaves,  4 branches, 0 overflows)
   8:  
   9: Flush 1,153 with 158 pages - 632 kb writes and 67  seeks (107 leaves, 51 branches, 0 overflows)
  10: Flush 1,154 with 168 pages - 672 kb writes and 65  seeks (113 leaves, 55 branches, 0 overflows)
  11: Flush 1,155 with 165 pages - 660 kb writes and 76  seeks (113 leaves, 52 branches, 0 overflows)
  12:  
  13: Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
  14: Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
  15: Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
  16:  
  17: Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
  18: Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
  19: Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
  20:  
  21: Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
  22: Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
  23: Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

The very first transactions are already showing something very interesting: we are actually writing 12 - 21 pages, or 48 - 84 Kb of data, instead of 12 Kb. Why do we write 4 times as much data as we wanted?

The answer is that we are writing data that is random in nature, so it can’t all sit in the same page; we get a lot of page splits and very quickly we end up with a lot of pages. This is pretty much inevitable, since this is how trees work. But look at what happens down the road. In particular, look at lines 9 – 10. You can see that we are now at a pretty stable state. We are writing ~160 pages per transaction. And since we write random data, we tend to touch about 100 leaf pages per transaction (the stuff after 100 is usually page splits). But something that is much more interesting can be seen in the count of seeks.

The way LMDB works, we use copy-on-write, so whenever we modify a page, we actually modify a copy and mark the actual page as available for future transactions to re-use. This has the great advantage that we don’t ever need to do compaction, but it also means that when we do want to do writes, we have to make them pretty much all over the place. And it actually gets worse as more time goes by.

Now, you have to realize that this is pretty much the worst case scenario, a transaction that does a lot of writes all over the place. But it means that toward the end, we are really hurting.

You already know the numbers, right? What about just straight out writing 1MB? What is the cost of seeks? I wrote the following code:

var buffer = new byte[1024 * 1024];
var random = new Random();
random.NextBytes(buffer);
if (File.Exists("test.bin"))
    File.Delete("test.bin");
using (var fs = new FileStream("test.bin", FileMode.CreateNew, FileAccess.ReadWrite))
{
    fs.SetLength(1024 * 1024 * 768);
    // warm up
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.Write(buffer, 0, random.Next(0, 1024));
    }

    var sp = Stopwatch.StartNew();
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.WriteByte(1);
    }
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("200 seeks & 200 bytes {0:#,#} ms", sp.ElapsedMilliseconds);

    sp = Stopwatch.StartNew();
    fs.Position = random.Next(0, (int)fs.Length);
    fs.Write(buffer, 0, buffer.Length);
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("1 MB write {0:#,#} ms", sp.ElapsedMilliseconds);
}

The results are quite interesting:

200 seeks & 200 bytes 146 ms
1 MB write 6 ms

Just to note, this is when running on an SSD; the numbers are supposed to be a lot worse when running on an HDD.

In other words, it ain’t the size of the write, but how you spread it around that really matters. Just to compare, here are the numbers for when we are doing sequential writes:

Flush      1 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
Flush      2 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
Flush      3 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)

Flush    159 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush    160 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
Flush    161 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush    162 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)

Flush  1,320 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
Flush  1,321 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush  1,322 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)

Flush  4,316 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush  4,317 with   7 pages -  28 kb writes and   2 seeks (  5 leaves,   2 branches,   0 overflows)
Flush  4,318 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)

Flush  7,409 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
Flush  7,410 with   9 pages -  36 kb writes and   2 seeks (  6 leaves,   3 branches,   0 overflows)
Flush  7,411 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)

Flush  9,990 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
Flush  9,991 with   9 pages -  36 kb writes and   4 seeks (  6 leaves,   3 branches,   0 overflows)
Flush  9,992 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
Flush  9,993 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)

Because we are always writing at the end, we only need to touch very few pages. Note that even with the small number of pages, we still need to do quite a few seeks, relative to the number of pages we write.

I have some ideas about this, but they are still unformed. Mostly, we need to balance the amount of free space that gets reused against the number of seeks it costs us. I think we can get there by being smart about tracking the number of pages modified by a transaction, and waiting until free space becomes available in sufficient amounts to be relevant. Something like that would allow auto tuning of the amount of garbage we accumulate vs. the number of seeks we require.
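Something along these lines is the shape of the heuristic I have in mind. The threshold and the names are made up, this is just to show the idea:

public class FreeSpaceReuseHeuristic
{
    // only bother seeking to a section once it has accumulated enough free pages
    // to make the seek worthwhile; 256 pages is 1 MB worth of 4 KB pages
    private const int MinFreePagesToReuseSection = 256;

    public bool IsEligibleForReuse(int freePagesInSection)
    {
        return freePagesInSection >= MinFreePagesToReuseSection;
    }
}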


Degenerate performance scenario for LMDB

I was working peacefully through some stuff with Voron, when I ran into some really bad performance in a pretty important scenario. I decided that this was probably something that I was doing wrong, and set out to reproduce the same scenario in LMDB, to figure out what I was doing wrong.

Here is the code:

int main(int argc,char * argv[])
{
    int i = 0, j = 0, rc;
    UUID id;
    MDB_env *env;

    MDB_val key, data;

    MDB_stat mst;
    MDB_cursor *cursor;
    char sval[32];
    clock_t start = clock(), end;

    srandom(time(NULL));

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, 1024*1024*768);
    rc = mdb_env_set_flags(env, MDB_WRITEMAP, 1);
    rc = mdb_env_open(env, "E:\\data\\lmdb2", 0, 0664);

    key.mv_size = sizeof(UUID);
    data.mv_size = sizeof(sval);
    data.mv_data = sval;

    for (i = 0; i < 100; i++)
    {
        MDB_txn *txn;
        MDB_dbi dbi;
        rc = mdb_txn_begin(env, NULL, 0, &txn);
        rc = mdb_open(txn, NULL, 0, &dbi);

        for (j = 0; j < 100; j++)
        {
            UuidCreate(&id);
            key.mv_data = &id;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }

        rc = mdb_txn_commit(txn);
        mdb_close(env, dbi);
    }

    end = clock();
    printf("%i", (end - start));
    fgetc(stdin);

    mdb_env_close(env);
    return 0;
}

As you can see, we are inserting 10,000 items (100 transactions of 100 items each). Each item has a key of 16 bytes and a value of 100 bytes. Now, you might note that this is probably the worst case scenario for a B+Tree: UUIDs are unordered, and this generates a lot of fragmentation in the tree. A bad scenario, yes, but also a relatively common one, and one that needs to be handled properly for our needs. It took me a long while to narrow down what is actually going on. At first I was convinced that the problem was with Windows’ implementation of memory mapped files, but eventually I realized that select ain’t broken.

Here is the Process Monitor’s trace of the first few transactions. The highlighted calls are to FlushBuffersFile, which is what FlushFileBuffers looks like in the trace, and which indicates a commit.

[image: Process Monitor trace of the first few transactions]

As I said, those are the first few transactions. You can see that the OS is doing a pretty good job of merging writes and avoiding seeks. However, as time goes by…

[image: Process Monitor trace, later in the run]

And as more time goes by, we get to… (there are more rows at the top that I had to remove so I could take the snapshot).

[image: Process Monitor trace, even later in the run]

In fact, I put the data in Excel and got:

 

[image: Excel chart of seeks and writes per commit]

And those are some really bad numbers, I think you’ll agree. After a very short period of time, the cost of committing a transaction goes up to over 50 seeks and writing 350KB.

Let us compare that to what happens when we are actually writing sequential data (using UuidCreateSequential instead of UuidCreate). The test completes in half the time, and the actual statistics are also very interesting.

[image: statistics for the sequential write scenario]

 

We are writing a lot less, but much more importantly, we are doing far fewer seeks. For reference, here is the final transaction that we have with the sequential write scenario:

[image: the final transaction in the sequential write scenario]

I ran those tests on an HDD, so seek times matter a lot, but I have gotten similar results when running on an SSD.

Now, the interesting question here is what exactly is going on? And I think that a large part of this is the way LMDB allocates pages. It uses a copy on write method, which means that after the transaction is done, the previously used pages are free. That means that they can be reclaimed, and the next transaction will do just that. The problem with this approach is that it tends to spread writes all over the file. This saves disk space, but requires a lot more seeks when you need to commit a transaction.

That means that in order to optimize this particular scenario, I probably need to do some thinking about the free page allocation behavior. We want to be a lot more conservative about when / how we give out those pages, to make sure that we aren’t generating a really bad situation for the db when commit time comes.
