Ayende @ Rahien

Refunds available at head office

With malice aforethought, we can try even better

Continuing on with the same theme from our last post. How can we improve the speed in which we write to disk. In particular, I am currently focused on my worst case scenario:

fill rnd buff 10,000 tx            :    161,812 ms      6,180 ops / sec

This is 10,000 transactions all running one after another, and taking really way too long to go about doing their thing. Now, we did some improvements and we got it all the way to 6,340 ops / sec, but I think you’ll agree that even this optimization is probably still bad. We spent more time there, trying to figure out exactly how we can do micro optimizations, and we got all the way up to 8,078 ops /sec.

That is the point where I decided that I would really like to look at the raw numbers that I can get from this system. So I wrote the following code:

   1: var key = Guid.NewGuid().ToByteArray();
   2: var buffer = new byte[100];
   3: new Random().NextBytes(buffer);
   4:  
   5: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
   6: {
   7:     fs.SetLength(1024*1024*768);
   8:  
   9:     var sp = Stopwatch.StartNew();
  10:  
  11:     for (int i = 0; i < 10*1000; i++)
  12:     {
  13:         for (int j = 0; j < 100; j++)
  14:         {
  15:             fs.Write(key,0, 16);
  16:             fs.Write(buffer, 0, 100);
  17:         }
  18:         fs.Flush(true);
  19:     }
  20:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000*1000)/sp.Elapsed.TotalSeconds);
  21: }

This code mimic the absolute best scenario we could hope for. Zero cost for managing the data, pure sequential writes. Note that we call to Flush(true) to simulate 10,000 transactions. This code gives me: 147,201 ops / sec.

This is interesting, mostly because I thought the reason our random writes with 10K transactions are bad was the calls to Flush(), but it appears that this is actually working very well. I then tested this with some random writes, by adding the following lines before line 13:

   1: var next = random.Next(0, 1024*1024*512);
   2: fs.Position = next - next%4096;

I then decided to try it with memory mapped files, and I wrote:

   1: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
   2: {
   3:     fs.SetLength(1024 * 1024 * 768);
   4:  
   5:     var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
   6:                                     "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
   7:                                     null, HandleInheritability.None, true);
   8:     var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();
   9:  
  10:     byte* p = null;
  11:     memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
  12:  
  13:     var sp = Stopwatch.StartNew();
  14:  
  15:     for (int i = 0; i < 10 * 1000; i++)
  16:     {
  17:         var next = random.Next(0, 1024 * 1024 * 512);
  18:         byte* basePtr = p + next;
  19:         using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024,12*1024, FileAccess.ReadWrite))
  20:         {
  21:             for (int j = 0; j < 100; j++)
  22:             {
  23:                 ums.Write(key, 0, 16);
  24:                 ums.Write(buffer, 0, 100);
  25:             }
  26:         }
  27:     }
  28:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
  29: }

You’ll note that I am not doing any flushing here. That is intention for now, using this, I am getting 5+ million ops per second. But since I am not doing flushing, this is pretty much me testing how fast I can write to memory.

Adding a single flush cost us 1.8 seconds for a 768MB file. And what about doing the right thing? Adding the following in line 26 means that we are actually flushing the buffers.

   1: FlushViewOfFile(basePtr, new IntPtr(12 * 1024));

Note that we are not flushing to disk, we still need to do that. But for now, let us try doing it. This single line changed the code from 5+ million ops to doing 170,988 ops / sec. And that does NOT include actual flushing to disk. When we do that, too, we get a truly ridiculous number: 20,547 ops / sec. And that explains quite a lot, I think.

For reference, here is the full code:

   1: unsafe class Program
   2: {
   3:     [DllImport("kernel32.dll", SetLastError = true)]
   4:     [return: MarshalAs(UnmanagedType.Bool)]
   5:     extern static bool FlushViewOfFile(byte* lpBaseAddress, IntPtr dwNumberOfBytesToFlush);
   6:  
   7:     static void Main(string[] args)
   8:     {
   9:         var key = Guid.NewGuid().ToByteArray();
  10:         var buffer = new byte[100];
  11:         var random = new Random();
  12:         random.NextBytes(buffer);
  13:  
  14:         using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
  15:         {
  16:             fs.SetLength(1024 * 1024 * 768);
  17:  
  18:             var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
  19:                                             "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
  20:                                             null, HandleInheritability.None, true);
  21:             var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();
  22:  
  23:             byte* p = null;
  24:             memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
  25:  
  26:             var sp = Stopwatch.StartNew();
  27:  
  28:             for (int i = 0; i < 10 * 1000; i++)
  29:             {
  30:                 var next = random.Next(0, 1024 * 1024 * 512);
  31:                 byte* basePtr = p + next;
  32:                 using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024, 12 * 1024, FileAccess.ReadWrite))
  33:                 {
  34:                     for (int j = 0; j < 100; j++)
  35:                     {
  36:                         ums.Write(key, 0, 16);
  37:                         ums.Write(buffer, 0, 100);
  38:                     }
  39:                 }
  40:                 FlushViewOfFile(basePtr, new IntPtr(12 * 1024));
  41:                 fs.Flush(true);
  42:             }
  43:             Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
  44:         }
  45:     }
  46: }

This is about as efficient a way you can get for writing to the disk using memory mapped files if you need to do that using memory mapped files in a transactional manner. And that is the absolute best case scenario, pretty much. Where we know exactly what we wrote and were we wrote it. And we always write a single entry, of a fixed size, etc. In Voron’s case, we might write to multiple pages at the same transaction (in fact, we are pretty much guaranteed to do just that).

This means that I need to think about other ways of doing that.

With a little knowledge and a profiler, let us optimize!

As part of the work we have been doing on Voron, I wrote a few benchmarks and looked at where the hot spots are. One of the major ones was this function:

   1: public override void Flush()
   2: {
   3:     if (_flushMode == FlushMode.None)
   4:         return;
   5:  
   6:     PagerState.Accessor.Flush();
   7:     _fileStream.Flush(_flushMode == FlushMode.Full);
   8: }

This is the “fsync()” call, effectively. Accessor.Flush() call will resolve to a FlushViewOfFile(0, size); and _fileStream.Flush(true) will resolve to FlushFileBuffers on Windows.

It isn’t surprising that this would be THE hotspot, it is the part where we actually have to wait for the hardware to do stuff, after all. But further investigation revealed that it wasn’t the FlushFileBuffers that was really costly, it was the FlushViewOfFile. What FlushViewOfFile will do is scan all of the pages in range, and flush them to the OS (not to disk) if they are dirty. That is great, but it is effectively an O(N) operation. We have more knowledge about what is going on, so we can do better. We already know what are the dirty pages, so we can actually use that, instead of letting the OS do all the work.

But then we run into another problem. If we actually just call FlushViewOfFile for every page separately, we are going to have to spend a lot of time just calling to the OS when we have to do a large write. So we need to balance the amount of data we send to FlushViewOfFile with the number of times we are going to call FlushViewOfFile. Therefor, I came up with the following logic. We are going to group calls to FlushViewOfFile, as long as they are nearby (within 256KB of one another), this will give us the best balance between reducing the number of pages that FlushViewOfFile needs to call and the number of times we call FlushViewOfFile.

This now looks like this:

   1: public override void Flush(List<long> sortedPagesToFlush)
   2: {
   3:     if (_flushMode == FlushMode.None || sortedPagesToFlush.Count == 0)
   4:         return;
   5:  
   6:     // here we try to optimize the amount of work we do, we will only 
   7:     // flush the actual dirty pages, and we will do so in sequential order
   8:     // ideally, this will save the OS the trouble of actually having to flush the 
   9:     // entire range
  10:     long start = sortedPagesToFlush[0];
  11:     long count = 1;
  12:     for (int i = 1; i < sortedPagesToFlush.Count; i++)
  13:     {
  14:         var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
  15:         // if the difference between them is not _too_ big, we will just merge it into a single call
  16:         // we are trying to minimize both the size of the range that we flush AND the number of times
  17:         // we call flush, so we need to balance those needs.
  18:         if (difference < 64)
  19:         {
  20:             count += difference;
  21:             continue;
  22:         }
  23:         FlushPages(start, count);
  24:         start = sortedPagesToFlush[i];
  25:         count = 1;
  26:     }
  27:     FlushPages(start, count);
  28:  
  29:     if (_flushMode == FlushMode.Full)
  30:         _fileStream.Flush(true);
  31: }

A side affect of this is that we are more likely to be writing to the disk in a sequential fashion because of this.

The end result of this change was doubling the performance of the system under worse case scenario to “just” 25% faster under best conditions.

Tags:

Published at

Originally posted at

Comments (14)

Memory Mapped Files, File I/O & Performance

I have been testing out several approaches for writing out to files. And I thought that the results are interesting enough to share. In all cases, I was writing a 128Kb buffer of random data to a file with size of 256Mb.

The first thing that I wanted to try was the trivial managed memory map approach:

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        for (int i = 0; i < accessor.Capacity; i += buffer.Length)
        {
            accessor.WriteArray(i, buffer, 0, buffer.Length);
        }
        accessor.Flush();
    }
}

This completed in 3.871 seconds.

Next, I Wanted to see what would happen if I were using direct memory access, and used CopyMemory to do that:

[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
static extern void CopyMemory(byte* dst, byte* src, long size);

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        byte* p = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
        fixed (byte* src = buffer)
        {
            for (int i = 0; i < accessor.Capacity; i += buffer.Length)
            {
                CopyMemory(p + i, src, buffer.Length);
            }
        }
        accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        accessor.Flush();
    }
}

As you can see, this is somewhat more complex, and require unsafe code. But this completed in 2.062 seconds. Nearly twice as fast.

Then I decided to try with raw file IO:

using (var f = new FileStream("test.bin",FileMode.Create))
{
    f.SetLength(1024*1024*256);
    for (int i = 0; i < f.Length; i += buffer.Length)
    {
        f.Write(buffer, 0, buffer.Length);
    }
    f.Flush(true);
}

This is about the most trivial code that you can think of, and this completed in about 1.956 seconds. Slightly faster, but within the margin of error (note, in repeated tests, they were consistently very close, and the file I/O was very near).

So, in other words, the accessor code adds a lot of overhead when using Memory Mapped Files.

Raven Storage, early perf numbers

So, our managed implementation of leveldb is just about ready to go out and socialize. Note that those are just for a relatively short duration, but they give us good indicator of where we are. We are still running longer term perf tests now. Also note that they are early numbers. We did performance work, but it is still not done.

The following tests were done on a HDD, and all include writing a million records (16 bytes key, 100 bytes values) to storage.

  • Writing 1 million sequential keys - 52,152 op/s
  • Writing 1 million random keys -  11,986 op/s
  • Writing 1 million sequential keys with fsync - 17,225 op/s

And now, for the reads portion:

  • Sequential reads - 104,620 op/s
  • Reverse sequential reads - 57,932 op/s
  • Random reads - 3,191 op/s

Note that I am pretty sure that the reason for the later performance is that it is using an HDD, instead of SSD.

Some thoughts about compression & storage

One of the advantages that keeps showing up with leveldb is the notion that it compresses the data on disk by default. Since reading data from disk is way more expensive than the CPU cost of compression & decompression, that is a net benefit.

Or is it? In the managed implementation we are currently working on, we chose to avoid this for now. For a very simple reason. By storing the compressed data on disk, it means that you cannot just give a user the memory mapped buffer and be done with it, you actually have to decompress the data yourself, then hand the user the buffer to the decompressed memory. In other words, instead of having a single read only buffer that the OS will manage / page / optimize for you, you are going to have to allocate memory over & over again, and you’ll pay the cost of decompressing again and again.

I think that it would be better to have the client make that decision. They can send us data that is already compressed, so we won’t need to do anything else, and we would still be able to just hand them a buffer of data. Sure, it sounds like we are just moving the cost around, isn’t it? But what actually happens is that you have a better chance to do optimizations. For example, if I am storing the data compressing via gzip. And I’m exposing the data over the wire, I can just stream the results from the storage directly to the HTTP stream, without having to do anything about it. It can be decompressed on the client.

On the other hand, if I have storage level decompression, I am paying for the cost of reading the compressed data from disk, then allocating new buffer, decompressing the data, then going right ahead and compressing it again for sending over the wire.

What comes after leveldb?

Kellabyte rightfully points out that leveldb was designed primarily for mobile devices, and that there have been many things that can be done that weren’t because of that design goal. I was already pointed out to HyperDex’s port of leveldb, and I am also looking into what Basho is doing with it.

You can read a bit about what HyperDex did to improve things here. And Basho challenges are detailed in the following presentation. The interesting thing about this is why people want to go to leveldb to start with.

To put it simply, it is a really good product, in the sense that it is doing what it needs to do, and it is small enough so you can understand it in a reasonably short order. The fun starts when you disagree with the decisions made for you by leveldb, it is actually quite easy to make changes to the root behavior of the project without making significant changes to the overall architecture. Looking over the HyperDex changes overview, it appears that they were able to make very targeted changes. All the hard problems (storing, searching, atomicity, etc) were already solved, now the problem is dealing with optimizing them.

I’m going to review the changes made by both Basho & HyperDex to leveldb, because I think that it would be a facinating object of study. In particular, I was impressed with how leveldb merged transactions, and I am looking forward at how HyperDex parallelized them.

Published at

Originally posted at

The RavenDB indexing process: Optimization–Tuning? Why, we have auto tuning

The final aspect of RavenDB’s x7 jump in indexing performance is the fact that we made it freakishly smart.

During standard operation, most indexes only update when new information comes in, we are usually talking about a small number of documents for every indexing run. The problem is what happens when you have a sudden outpour of documents into RavenDB? For example, during nightly ETL batch, or just if you suddenly have a flood of users doing write operations.

The problem here is that we actually have to balance a lot of variable at the same time:

  • The number of documents that we have to index*.
  • The current memory utilization**.
  • How any cores I have available to do the index work with?
  • How much time do I have to do this?

Basically, the idea goes like this, if I have a small batch size, I am able to index more quickly, ensuring that we have fresher results. If I have big batch size, I am able to index more documents, and my overall indexing times goes down.

There is a non trivial cost associated with every indexing run, so reducing the number of indexing run is good, but the more documents I shove into a single run, the more memory will I use, and the more time it will take before the results are visible to the users.

* It is non trivial because there is no easy way for us to even know how many documents we have left to index (to find out is costly).

** Memory utilization is hard to figure out in a managed world. I don’t actually have a way to know how much memory I am using for indexing and how much for other stuff, and there is no real way to say “free the memory from the last indexing run”, or even estimate how much memory that took.

What we have decided on doing is to start from a very small (low hundreds) indexing batch size, and see what is actually going on live. If we see that we have more documents to index than the current batch size, we will slowly double the size of the batch. Slowly, because bigger batches requires more memory, and we also have to take into account current utilization, memory usage, and a bunch of other factors as well. We also go the other way around, able to reduce the indexing batch size on demand based on how much work we have to do right now.

We also provide an upper limit, because at some point it make sense to just do a big batch and make the indexing results visible than to try to do everything all at once.

The fun part in all of that is that once we have found the appropriate algorithm for this, it means that RavenDB will automatically adjust itself based on real production load. If you have an low update rate, it will favor small indexing batches and immediately execute indexing on the new documents. However, if you suddenly have a spike in traffic and the update rate goes up, RavenDB will adjust the indexing batch size so it will be able to keep up with your rate.

We have done some (read, a huge amount) testing with regards to this new optimization, and it turns out that under slow update frequency, we are seeing an average of 15 – 25 ms between a document update and it showing up in the indexes. That is pretty good, but what is going on when we have data just pouring in?

We tested this with a 3 million documents and 3 indexes. And it turn out that under this scenario, where we are trying to shove data into RavenDB as fast as it can accept it, we do see an increase in index latency. Under those condition, latency rose all the way to 1.5 seconds.

This is actually something that I am very happy about, because we were able to automatically adjust to the changing conditions, and were still able to index things at a reasonable rate (note that under this scenario, the batch size was usually 8 – 16 thousands documents, vs. the 128 – 256 that it is normally).

Because we were able to adjust the batch size on the fly, we could handle sustained writes at this rate with no interruption in service and no real need to think about this from the users perspective.. Exactly what the RavenDB philosophy calls for.

The RavenDB indexing process: Optimization–Getting documents from disk

As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.

In Pseudo Code, it looks like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  

As it turned out, we had a major optimization option here, because of the way the data is actually structured on disk. In simple terms, we have an on disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.

Instead of loading the documents in the orders in which they were modified, we decided to try something different. We first query the information we need to find the document on disk from the index, then we sort them based on the optimal access pattern, to reduce disk movement and ensure that we have as sequential reads as possible. Then we take those results in memory and sort them based on their last update time again.

This seems to be a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually something that is very easy not to notice. The end result is quite promising, and it contributed to the 7+ times improvements in perf that we had for indexing costs.

But surprisingly, it wasn’t the major factor, I’ll discuss a huge perf boost in this area tomorrow.

The RavenDB indexing process: Optimization–De-parallelizing work

One of the major dangers in doing perf work is that you have a scenario, and you optimize the hell out of that scenario. It is actually pretty easy to do without even noticing it. The problem is that when you do things like that, you are likely to be optimizing a single scenario to perform really well, but you are hurting the overall system performance.

In this example, we have moved heaven and earth to make sure that we are indexing things as fast as possible, and we tested with 3 indexes, on an 4 cores machine. As it turned out, we actually had improved things, for that particular scenario.

Using the same test case on a single core machine was suddenly far more heavy weight, because we were pushing a lot of work at the same time. More than the machine could process. The end result was that it actually got there, but much more slowly than if we would have run things sequentially.

Of course, I give you the outliers, but those are good indicators for what we found out. Initially, we thought that we could resolve that by using the TPL’s MaxDegreeOfParallelism, but it turned out to be more complex than that. We have IO bound and we have CPU bound tasks that we need to execute, and trying to execute IO heavy tasks with this would actually cause issues in this scenario.

We had to manually throttle things ourselves, both to ensure limited number of parallel work, and because we have a lot more information about the actual tasks than the TPL have. We can schedule them in a way that is far more efficient because we can tell what is actually going on.

The end result is that we are actually using less parallelism, overall, but in a more efficient manner.

In my next post, I’ll discuss the auto batch tuning support, which allows us to do some really amazing things from the point of view of system performance.

The RavenDB indexing process: Optimization–Parallelizing work

One of the things that we are doing during the index process for RavenDB is applying triggers and deciding what, if and how a document will be indexed. The actual process is a bit more involved, because we have to do additional things (like figure out which indexes have already indexed those particular documents).

At any rate, the interesting thing is that this is a process which is pretty basic:

for doc in docs:
    matchingIndexes = FindIndexesFor(doc)
    if matchingIndexes.Count > 0:
       doc = ExecuteTriggers(doc) 
       if doc != null:
          yield doc

The interesting thing about this is that this is a set of operations that only works on a single document at a time, and the result is the modified documents.

We were able to gain significant perf boost by simply moving to a Parallel.ForEach call.  This seems simple enough, right? Parallelize the work, get better benefits.

Except that there are issues with this as well, which I’ll touch on my next post.

The RavenDB indexing process: Optimization

The actual process done by RavenDB to index documents is a fairly complex one. In order to understand what exactly happened, I decided to break it apart to pseudo code.

It looks something like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  
  filtered_docs = execute_read_filters(docs_to_index)
  
  indexing_work = []
  
  for index in stale:
    
    index_docs = select_matching_docs(index, filtered_docs)
    
    if index_docs.empty:
      set_indexed(index, lastIndexedEtag)
    else
      indexing_work.add(index, index_docs)
      
  for work in indexing_work:
  
     work.index(work.index_docs)

And now let me show you the areas in which we did some perf work:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  
  filtered_docs = execute_read_filters(docs_to_index)
  
  indexing_work = []
  
  for index in stale:
    
    index_docs = select_matching_docs(index, filtered_docs)
    
    if index_docs.empty:
      set_indexed(index, lastIndexedEtag)
    else
      indexing_work.add(index, index_docs)
      
  for work in indexing_work:
  
     work.index(work.index_docs)

All of which gives us a major boost in the system performance. I’ll discuss each part of that work in detail, don’t worry Winking smile

Performance implications of method signatures

In my previous post, I asked: What are the performance implications of the two options?

image_thumb

Versus:

image_thumb1

And the answer is quite simple. The chance to optimize how it works.

In the first example, we have to return an unknown amount of information. In the second example, we know how much data we need to return. That means that we can optimize ourselves based on that.

What do I mean by that?

Look at the method signatures, those requires us to scan a secondary index, and get the results back. From there, we need to get back to the actual data. If we knew what the size of the data that we need to return is, we could fetch just the locations from the index, then optimize our disk access pattern to take advantage of sequential reads.

In the first example, we have to assume that every read is the last read. Callers may request one item, or 25 or 713, so we don’t really have a way to optimize things. The moment that we have the amount that the caller wants, things change.

We can scan the index to get just actual position of the document on disk, and then load the documents from the disk based on the optimal access pattern in terms of disk access. It is a very small change, but it allowed us to make a big optimization.

If you throttle me any me I am going to throttle you back!

It is interesting to note that for a long while, what we were trying to do with RavenDB was make it use less and less resources. One of the reasons for that is that less resources is obviously better, because we aren’t wasting anything.

The other reason is that we have users running us on a 512MB/650 MHz Celeron 32 bit machines. So we really need to be able to fit into a small box (and also allow enough processing power for the user to actually do something with the machine).

We have gotten really good in doing that, actually.

The problem is that we also have users running RavenDB on standard server hardware (32 GB / 16 cores, RAID and what not) in which case they (rightly) complain that RavenDB isn’t actually using all of their hardware.

Now, being conservative about resource usage is generally good, and we do have the configuration in place which can tell RavenDB to use more memory. It is just that this isn’t polite behavior.

RavenDB in most cases shouldn’t require anything special for you to run, we want it to be truly a zero admin database. The solution?  Take into account the system state and increase the amount of work that we do to get things done. And yes, I am aware of the pitfalls.

As long as there is enough free RAM available, we will increase the amount of documents that we are going to index in a single batch. That is subject to some limits (for example, if we just created a new index on a big database, we need to make sure we aren’t trying to load it entirely to memory), and it knows how to reserve some room for other things, and how to throttle down and as well as up.

This post is written before I had the chance to actually test this on production level size dataset, but I am looking forward to seeing how it works.

Update: Okay, that is encouraging, it looks like what we did just made things over 7 times faster. And this isn’t a micro benchmark, this is when you throw this on a multi GB database with full text search indexing.

Next, we need to investigate what we are going to do about multiple running indexes and how this optimization affects them. Fun Smile.

Watch your 6, or is it your I/O? It is the I/O, yes

As I said in my previous post, tasked with having to load 3.1 million files into RavenDB, most of them in the 1 – 2 KB range.

Well, the first thing I did had absolutely nothing to do with RavenDB, it had to do with avoiding dealing with this:

image

As you can see, that is a lot.

But when the freedb dataset is distributed, what we have is actually:

image

This is a tar.bz2, which we can read using the SharpZipLib library.

The really interesting thing is that reading the archive (even after adding the cost of decompressing it) is far faster than reading directly from the file system. Most file systems do badly on large amount of small files, and at any rate, it is very hard to optimize the access pattern to a lot of small files.

However, when we are talking about something like reading a single large file? That is really easy to optimize and significantly reduces the cost on the input I/O.

Just this step has reduced the cost of importing by a significant factor, we are talking about twice as much as before, and with a lot less disk activity.

Tags:

Published at

Originally posted at

Comments (9)

Watch your 6, or is it your I/O?

One of the interesting things about the freedb dataset is that it is distributed as a 3.1 million separate files, most of them in the 1 – 2 KB range.

Loading that to RavenDB took a while, so I set out to fix that. Care to guess what is the absolutely the first thing that I did?

Tags:

Published at

Originally posted at

Comments (24)

When you pit RavenDB & SQL Server against one another…

Here is how it works. I hate benchmarks, because they are very easily manipulated. Whenever I am testing performance stuff, I am posting numbers, but they are usually in reference to themselves (showing improvements).

That said…

Mark Rodseth .Net Technical Architect at Fortune Cookie in London, UK and he did a really interesting comparison between RavenDB & SQL Server. I feel good about posting this because Mark is a totally foreign agent (hm…. well, maybe not that Smile ) but he has no association with RavenDB or Hibernating Rhinos.

Also, this post really made my day.

Update: Mark posted more details on his test case.

Mark setup a load test for two identical applications, one using RavenDB, the other one using SQL Server. The results:

SQL Load Test
Transactions: 111,014 (Transaction = Single Get Request)
Failures: 110,286 (Any 500 or timeout)

And for RavenDB ?

RavenDB Load Test
Transactions: 145,554 (Transaction = Single Get Request)
Failures: 0 (Any 500 or timeout)

And now that is pretty cool.

Stupid smart code: Solution

The reason that I said that this is very stupid code?

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

Because all of this lovely code can be replaced with a simple:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    req.ContentLength = Encoding.UTF8.GetByteCount(data);

    using (var dataStream = req.GetRequestStream())
    using(var writer = new StreamWriter(dataStream, Encoding.UTF8))
    {
        writer.Write(data);
        writer.Flush();
    }
}

And that is so much better.

Tags:

Published at

Originally posted at

Comments (12)

Stupid smart code

We had the following code:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteArray = Encoding.UTF8.GetBytes(data);

    req.ContentLength = byteArray.Length;

    using (var dataStream = req.GetRequestStream())
    {
        dataStream.Write(byteArray, 0, byteArray.Length);
        dataStream.Flush();
    }
}

And that is a problem, because it allocates the memory twice, once for the string, once for the buffer. I changed that to this:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

And I was quite proud of myself.

Then I realized that I was stupid. Why?

Tags:

Published at

Originally posted at

Comments (49)

You can’t cache DateTime.Now

One of the things that were itching me was the fact that it seems that not all the queries in RaccoonBlog were hitting the cache. Oh, it is more than fast enough, but I couldn’t really figure out what is going on there. Then again, it was never important enough for me to dig in.

I was busy doing the profiling stuff for RavenDB and I used RaccoonBlog as my testing ground, when I realized what the problem was:

image

Everything was working just fine, the problem was here:

image

Do you get the light bulb moment that I had? I was using Now in a query, and since Now by definition changes, we keep generating new queries, which can’t be cached, etc.

I changed all of the queries that contained Now to:

image

Which means that it would only use a different value every minute. Once I fixed that, I still saw that there was a caching problem, which led me to discover that there was an error in how we calculated etags for dynamic indexes after they have been promoted. Even a very basic profiling tool helped us fix two separate bugs (in Raccoon Blog and in RavenDB).

RavenDB Aggressive Caching Mode

RavenDB has the notion of HTTP caching out of the box, what this means is that by default, without you having to take any action, RavenDB will cache as much as possible for you. It can get away with doing this because it is utilizing the notion of the 304 Not Modified response. The second time that we load an entity or execute a query, we can simply ask the server whatever the data has been modified since the last time we saw it, and if it wasn’t, we can just skip executing the code and get the data directly from the cache.

This saves a lot in terms of bandwidth and processing power, it also means that we don’t have to worry about RavenDB’s caching returning any stale results, because we checked for that. It does mean, however, that we have to send a request to the server. There are situation where we want to squeeze even better performance from the system, and we can move to RavenDB’s aggressive caching mode.

Aggressive caching means that RavenDB won’t even ask the server whatever anything has changed, it will simply return the reply directly from the local cache if it is there. This means that you might get stale data, but it also means that you’ll get it fast.

You can activate this mode using:

using (session.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
    session.Load<User>("users/1");
}

Now, if there is a value in the cache for users/1 that is at most 5 minutes old, we can directly use that.

It also works on queries too:

using (session.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
    session.Query<User>().ToList();
}

This is an explicit step beyond the normal caching, and it does mean that you might get out of date information, but if you really want to reduce the number of remote calls, it is a really nice feature.

Performance numbers in the pub

Originally posted at 3/31/2011

I am currently sitting with 3 guys in the pub, and the discussion naturally turned to performance. I asked all three the following question: “How many CLR objects can you create in one second?”

I got the following replies:

  • 2,000 objects per second
  • 50,000 objects per second
  • 100,000 objects per second

Then I sat down to check:

class Program
{
    static void Main(string[] args)
    {
        var sp = Stopwatch.StartNew();

        int i = 0;
        while(sp.ElapsedMilliseconds < 1000)
        {
            new MyClass();
            i++;
        }
        sp.Stop();
        Console.WriteLine("Created {0} in {1}", i, sp.Elapsed);
    }
}

public class MyClass
{
    public string A;
    public int B;
    public DateTime C;
}

The result?

Created 7,715,305 in 00:00:01

You are only as fast as your slowest bottleneck

Chris points out something very important:

“A much better solution would have been to simply put the database on a compressed directory, which would slow down some IO ..."

I don't agree.
Compression needs CPU. We got a lot of more IO by switching on compression (it's just less to write and read). Previous our CPU was about 40%, now averaging at 70%. Compression rate saves us about 30% per file. After switching on compression our IO bound application was about 20% faster.
We are currently planning switching on compression on all our production servers over Christmas, because using cpu-cores for compression is even cheaper than adding hard disks and raid for performance.

In general, most operations today are mostly IO bound, with the CPU mostly sitting there twiddling  the same byte until that byte threatens to sue for harassment. It make sense to trade off IO for CPU time, because our systems are being starved for IO.

In fact, you can just turn on compression at the File System level in most OSes, and it is likely to result in a significant saving for the application performance, assuming that the data does not already fits in memory.

How to become a speaker?

I get asked that quite frequently. More to the point, how to become an international speaker?

I was recently at a gathering where no less than three different people asked me this question, so I thought that it might be a good post.

Note: this post isn’t meant for someone who isn’t already speaking. And if you are speaking but are bad at it, this isn’t for you. The underlying assumption here is that you can speak and are reasonably good at it.

Note II: For this post, speaking is used to refer to presenting some technical content in front of an audience.

Why would you want to be a speaker anyway?

I heard that it is actually possible to make a living as a speaker. I haven’t found it to be the case, but then again, while I speak frequently, I don’t speak that frequently.

There are several reasons to want to be a speaker:

  • reputation (and in the end, good reputation means you get to raise your rates, get more work, etc).
  • contacts (speaking put you in front of dozens or hundreds of people, and afterward you get to talk with the people who are most interested in what you talked about)
  • advertising for your product (all those “lap around Visual Studio 2010” are actually an hour long ad that you paid to see :-) ).

I’ll focus on the first two, reputation & contacts gives you a much wider pool of potential work that you can choose from, increase the money you can make, etc.

So how do I do that, damn it?

Honestly, I have no idea. The first time that I presented at a technical conference, it was due to a mixup in communication. Apparently when in the US “it would have been delightful” means “we regret to inform”, but in Israel we read that as “great, let us do it”, and put the guy on the spot, so he had to scramble and do something.

Okay, I lied, I do have some idea about how to do this.

Again, I am assuming you are a reasonably good speaker (for myself, I know that my accent is a big problem when speaking English), but there are a lot of reasonably good speakers out there.

So, what is the answer? Make yourself different.

Pick a topic that is near & dear to your heart (or to your purse, which also works) and prepare a few talks on it. Write about it in a blog, comment on other people blogs about the topic. Your goal should be that when people think about topic X, your name would be on that list.  Forums like Stack Overflow can help, writing articles (whatever it is for pay or in places like CodeProject). Join a mailing list and be active there (and helpful). Don’t focus on regionally associated forums / mailing list, though. The goal is international acknowledgement.

This will take at least a year, probably, for people to start recognizing your name (it took over 2 years for me). If it is possible, produce a set of tools that relate to your topic. Publish them for free, and write it off as an investment in your future.

For myself, NHibernate Query Analyzer would a huge boost in terms of getting recognized. And Rhino Mocks was probably what clinched the deal. I honestly have no idea how much time & effort I put into Rhino Mocks, but Ohloh estimate that project at $ 12,502,089(!). While I disagree about that number, I did put a lot of effort into it, but it paid itself off several times over.

If you don’t have a blog, get one. Don’t get one at a community site, either. Community sites like blogs.microsoft.co.il are good to get your stuff read, but they have a big weakness in terms of branding yourself. You don’t want to get lost in a crowd, you want people to notice who you are. And most people are going to read your posts in a feed reader, and they are going to notice that the community feed is interesting, not that you are interesting.

Post regularly. I try to have a daily post, but that would probably not be possible for you, try to post at least once a week, and try to time it so it is always on the same date & time. Monday’s midnight usually works.

Okay, I did all of that, what now?

Another note, this is something that you may want to do in parallel to the other efforts.

Unless you become very well known, you won’t be approached, you’ll have to submit session suggestions. Keep an eye on the conferences that interest you, and wait until they have a call for sessions. Submit your stuff. Don’t get offended if they reject you.

If you live in a place that host international conferences (which usually rule Israel out), a good bet is to try to get accepted as a speaker there. You would be considerably cheaper than bringing someone from out of town/country. And that also play a role. Usually, if you managed to get into a conference once, they’ll be much more likely to have you again. They have your speaker eval, and unless you truly sucked (like going on stage and starting to speak in Hebrew at Denmark), and that gives them more confidence in bringing you a second time.

And that is about it for now.