With a little knowledge and a profiler, let us optimize!

As part of the work we have been doing on Voron, I wrote a few benchmarks and looked at where the hot spots are. One of the major ones was this function:

public override void Flush()
{
    if (_flushMode == FlushMode.None)
        return;

    PagerState.Accessor.Flush();
    _fileStream.Flush(_flushMode == FlushMode.Full);
}

This is effectively the “fsync()” call. The Accessor.Flush() call will resolve to FlushViewOfFile(0, size), and _fileStream.Flush(true) will resolve to FlushFileBuffers on Windows.

It isn’t surprising that this would be THE hotspot; it is the part where we actually have to wait for the hardware to do stuff, after all. But further investigation revealed that it wasn’t the FlushFileBuffers that was really costly, it was the FlushViewOfFile. What FlushViewOfFile will do is scan all of the pages in range, and flush them to the OS (not to disk) if they are dirty. That is great, but it is effectively an O(N) operation. We have more knowledge about what is going on, so we can do better. We already know which pages are dirty, so we can use that knowledge instead of letting the OS do all the work.

But then we run into another problem. If we just call FlushViewOfFile for every page separately, we are going to spend a lot of time calling into the OS whenever we have a large write. So we need to balance the amount of data we send to FlushViewOfFile against the number of times we call it. Therefore, I came up with the following logic: we group calls to FlushViewOfFile as long as the pages are nearby (within 256KB, or 64 pages, of one another). This gives us the best balance between reducing the number of pages that FlushViewOfFile needs to scan and the number of times we call it.

This now looks like this:

public override void Flush(List<long> sortedPagesToFlush)
{
    if (_flushMode == FlushMode.None || sortedPagesToFlush.Count == 0)
        return;

    // here we try to optimize the amount of work we do, we will only
    // flush the actual dirty pages, and we will do so in sequential order
    // ideally, this will save the OS the trouble of actually having to flush the
    // entire range
    long start = sortedPagesToFlush[0];
    long count = 1;
    for (int i = 1; i < sortedPagesToFlush.Count; i++)
    {
        var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
        // if the difference between them is not _too_ big, we will just merge it into a single call
        // we are trying to minimize both the size of the range that we flush AND the number of times
        // we call flush, so we need to balance those needs.
        if (difference < 64)
        {
            count += difference;
            continue;
        }
        FlushPages(start, count);
        start = sortedPagesToFlush[i];
        count = 1;
    }
    FlushPages(start, count);

    if (_flushMode == FlushMode.Full)
        _fileStream.Flush(true);
}
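The post doesn’t show FlushPages itself. A minimal sketch of what it might look like, assuming 4KB pages and a _baseAddress field pointing at the start of the mapped view (both are my assumptions, not taken from the post); these members would live inside the pager class:

// using System;
// using System.ComponentModel;
// using System.Runtime.InteropServices;

[DllImport("kernel32.dll", SetLastError = true)]
static extern unsafe bool FlushViewOfFile(byte* lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

private unsafe void FlushPages(long startPage, long count)
{
    const int pageSize = 4096; // assumed page size
    byte* rangeStart = _baseAddress + (startPage * pageSize);
    // flush just the merged range of dirty pages to the OS
    if (FlushViewOfFile(rangeStart, new IntPtr(count * pageSize)) == false)
        throw new Win32Exception(Marshal.GetLastWin32Error());
}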

A side effect of this is that we are more likely to be writing to the disk in a sequential fashion.

The end result of this change ranged from doubling the performance of the system under the worst case scenario to “just” 25% faster under the best conditions.


Memory Mapped Files, File I/O & Performance

I have been testing out several approaches for writing out to files, and I thought that the results are interesting enough to share. In all cases, I was repeatedly writing a 128KB buffer of random data to fill a 256MB file.

The first thing that I wanted to try was the trivial managed memory map approach:

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        for (int i = 0; i < accessor.Capacity; i += buffer.Length)
        {
            accessor.WriteArray(i, buffer, 0, buffer.Length);
        }
        accessor.Flush();
    }
}

This completed in 3.871 seconds.

Next, I wanted to see what would happen if I used direct memory access, via CopyMemory:

[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
static extern void CopyMemory(byte* dst, byte* src, long size);

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        byte* p = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
        fixed (byte* src = buffer)
        {
            for (int i = 0; i < accessor.Capacity; i += buffer.Length)
            {
                CopyMemory(p + i, src, buffer.Length);
            }
        }
        accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        accessor.Flush();
    }
}

As you can see, this is somewhat more complex, and requires unsafe code. But this completed in 2.062 seconds. Nearly twice as fast.

Then I decided to try with raw file IO:

using (var f = new FileStream("test.bin",FileMode.Create))
{
    f.SetLength(1024*1024*256);
    for (int i = 0; i < f.Length; i += buffer.Length)
    {
        f.Write(buffer, 0, buffer.Length);
    }
    f.Flush(true);
}

This is about the most trivial code that you can think of, and it completed in about 1.956 seconds. Slightly faster, but within the margin of error (in repeated tests, the two approaches were consistently very close, with the file I/O usually just ahead).

So, in other words, the accessor code adds a lot of overhead when using Memory Mapped Files.

Raven Storage, early perf numbers

So, our managed implementation of leveldb is just about ready to go out and socialize. Note that these results are from relatively short runs, but they give us a good indicator of where we are. We are still running longer term perf tests now. Also note that these are early numbers: we have done performance work, but it is not finished.

The following tests were done on a HDD, and all include writing a million records (16 bytes key, 100 bytes values) to storage.

  • Writing 1 million sequential keys - 52,152 op/s
  • Writing 1 million random keys -  11,986 op/s
  • Writing 1 million sequential keys with fsync - 17,225 op/s

And now, for the reads portion:

  • Sequential reads - 104,620 op/s
  • Reverse sequential reads - 57,932 op/s
  • Random reads - 3,191 op/s

Note that I am pretty sure the reason for the latter numbers, especially the random reads, is that we are running on an HDD instead of an SSD.

Some thoughts about compression & storage

One of the advantages that keeps showing up with leveldb is the notion that it compresses the data on disk by default. Since reading data from disk is way more expensive than the CPU cost of compression & decompression, that is a net benefit.

Or is it? In the managed implementation we are currently working on, we chose to avoid this for now, for a very simple reason. By storing compressed data on disk, you cannot just hand the user the memory mapped buffer and be done with it; you have to decompress the data yourself, then hand the user a buffer of decompressed memory. In other words, instead of having a single read only buffer that the OS will manage / page / optimize for you, you are going to have to allocate memory over & over again, and pay the cost of decompression again and again.

I think that it would be better to have the client make that decision. They can send us data that is already compressed, so we won’t need to do anything else, and we would still be able to just hand them a buffer of data. Sure, it sounds like we are just moving the cost around, doesn’t it? But what actually happens is that you get a better chance to do optimizations. For example, if I am storing data compressed via gzip, and I’m exposing the data over the wire, I can just stream the results from storage directly to the HTTP stream, without having to do anything to it. It can be decompressed on the client.

On the other hand, if I have storage level decompression, I am paying for the cost of reading the compressed data from disk, then allocating a new buffer, decompressing the data, then going right ahead and compressing it again for sending over the wire.
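To make the first option concrete, here is a sketch of the pass-through idea. HttpListener is used purely for illustration; the point is that already-compressed bytes can go straight to the wire:

using System.IO;
using System.Net;

static void WriteStoredValue(HttpListenerResponse response, Stream storedGzipBytes)
{
    // the stored bytes are already gzip-compressed, so we just declare
    // that fact and copy them through: no decompress/recompress step
    response.AddHeader("Content-Encoding", "gzip");
    storedGzipBytes.CopyTo(response.OutputStream);
    response.OutputStream.Close();
}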

What comes after leveldb?

Kellabyte rightfully points out that leveldb was designed primarily for mobile devices, and that there are many things that could have been done but weren’t, because of that design goal. I was already pointed to HyperDex’s port of leveldb, and I am also looking into what Basho is doing with it.

You can read a bit about what HyperDex did to improve things here. And Basho’s challenges are detailed in the following presentation. The interesting thing about all of this is why people choose leveldb as a starting point in the first place.

To put it simply, it is a really good product, in the sense that it does what it needs to do, and it is small enough that you can understand it in reasonably short order. The fun starts when you disagree with the decisions leveldb made for you; it is actually quite easy to change the core behavior of the project without making significant changes to the overall architecture. Looking over the HyperDex changes overview, it appears that they were able to make very targeted changes. All the hard problems (storing, searching, atomicity, etc.) are already solved; now the problem is optimizing them.

I’m going to review the changes made by both Basho & HyperDex to leveldb, because I think that it would be a fascinating object of study. In particular, I was impressed with how leveldb merges transactions, and I am looking forward to seeing how HyperDex parallelized them.


The RavenDB indexing process: Optimization–Tuning? Why, we have auto tuning

The final aspect of RavenDB’s x7 jump in indexing performance is the fact that we made it freakishly smart.

During standard operation, most indexes only update when new information comes in; we are usually talking about a small number of documents for every indexing run. The problem is what happens when you have a sudden outpouring of documents into RavenDB? For example, during a nightly ETL batch, or if you suddenly have a flood of users doing write operations.

The problem here is that we actually have to balance a lot of variables at the same time:

  • The number of documents that we have to index*.
  • The current memory utilization**.
  • How many cores are available to do the indexing work?
  • How much time do I have to do this?

Basically, the idea goes like this: if I have a small batch size, I can index more quickly, ensuring that we have fresher results. If I have a big batch size, I can index more documents per run, and my overall indexing time goes down.

There is a non trivial cost associated with every indexing run, so reducing the number of indexing runs is good. But the more documents I shove into a single run, the more memory I will use, and the longer it will take before the results are visible to the users.

* It is non trivial because there is no easy way for us to even know how many documents we have left to index (finding out is costly).

** Memory utilization is hard to figure out in a managed world. I don’t actually have a way to know how much memory I am using for indexing and how much for other stuff, and there is no real way to say “free the memory from the last indexing run”, or even estimate how much memory that took.

What we have decided to do is start from a very small indexing batch size (low hundreds) and see what is actually going on live. If we see that we have more documents to index than the current batch size, we slowly double the size of the batch. Slowly, because bigger batches require more memory, and we also have to take into account current utilization, memory usage, and a bunch of other factors. We also go the other way, reducing the indexing batch size on demand based on how much work we have to do right now.

We also provide an upper limit, because at some point it makes more sense to complete a batch and make the indexing results visible than to try to do everything all at once.
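To illustrate the shape of this, here is a minimal sketch of such an auto tuner. The names and the numbers are illustrative only, not RavenDB’s actual code:

using System;

public class BatchSizeTuner
{
    private int batchSize = 256; // start small, favoring index freshness
    private const int MinBatchSize = 128;
    private const int MaxBatchSize = 64 * 1024; // the upper limit discussed above

    public int CurrentBatchSize { get { return batchSize; } }

    public void RecordBatch(bool moreWorkPending, bool memoryPressure)
    {
        if (memoryPressure)
            batchSize = Math.Max(MinBatchSize, batchSize / 2); // bigger batches cost memory we don't have
        else if (moreWorkPending)
            batchSize = Math.Min(MaxBatchSize, batchSize * 2); // we are falling behind, grow slowly
        else
            batchSize = Math.Max(MinBatchSize, batchSize / 2); // caught up, drift back toward low latency
    }
}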

The fun part in all of that is that once we have found the appropriate algorithm for this, RavenDB will automatically adjust itself based on real production load. If you have a low update rate, it will favor small indexing batches and immediately index new documents. However, if you suddenly have a spike in traffic and the update rate goes up, RavenDB will adjust the indexing batch size so it can keep up with your rate.

We have done some (read: a huge amount of) testing of this new optimization, and it turns out that under a slow update frequency, we see an average of 15 – 25 ms between a document update and it showing up in the indexes. That is pretty good, but what happens when we have data just pouring in?

We tested this with 3 million documents and 3 indexes. It turns out that under this scenario, where we are trying to shove data into RavenDB as fast as it can accept it, we do see an increase in indexing latency. Under those conditions, latency rose all the way to 1.5 seconds.

This is actually something that I am very happy about, because we were able to adjust automatically to the changing conditions and still index things at a reasonable rate (note that under this scenario, the batch size was usually 8,000 – 16,000 documents, vs. the 128 – 256 that it is normally).

Because we were able to adjust the batch size on the fly, we could handle sustained writes at this rate with no interruption in service and no real need for the user to think about it. Exactly what the RavenDB philosophy calls for.

The RavenDB indexing process: Optimization–Getting documents from disk

As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.

In pseudo code, it looks like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  

As it turned out, we had a major optimization opportunity here, because of the way the data is actually structured on disk. In simple terms, we have an on disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.

Instead of loading the documents in the order in which they were modified, we decided to try something different. We first query the index for the information we need to find each document on disk, then we sort by the optimal access pattern, to reduce disk seeks and make our reads as sequential as possible. Then we take those results in memory and sort them by their last update time again.
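A sketch of what that looks like; GetDocumentLocationsSince and ReadDocumentAt are hypothetical helpers standing in for the index lookup and the document read:

// using System.Linq;

var locations = GetDocumentLocationsSince(lastIndexedEtag, batchSize); // (Etag, DiskPosition) pairs

var docs = locations
    .OrderBy(x => x.DiskPosition) // read in disk order, as sequentially as possible
    .Select(x => new { x.Etag, Document = ReadDocumentAt(x.DiskPosition) })
    .OrderBy(x => x.Etag)         // restore update order, now cheaply, in memory
    .Select(x => x.Document)
    .ToList();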

This seems like a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually something that is very easy not to notice. The end result is quite promising, and it contributed to the 7+ times improvement in indexing performance.

But surprisingly, it wasn’t the major factor; I’ll discuss a huge perf boost in this area tomorrow.

The RavenDB indexing process: Optimization–De-parallelizing work

One of the major dangers in doing perf work is that you have a scenario, and you optimize the hell out of that scenario. It is actually pretty easy to do without even noticing it. The problem is that when you do things like that, you are likely to be optimizing a single scenario to perform really well, but you are hurting the overall system performance.

In this example, we had moved heaven and earth to make sure that we were indexing things as fast as possible, and we tested with 3 indexes on a 4 core machine. As it turned out, we had indeed improved things, for that particular scenario.

Using the same test case on a single core machine was suddenly far more heavyweight, because we were pushing a lot of work at the same time, more than the machine could process. The end result was that it got there eventually, but much more slowly than if we had run things sequentially.

Of course, I am giving you the outliers, but those are good indicators of what we found. Initially, we thought that we could resolve this by using the TPL’s MaxDegreeOfParallelism, but it turned out to be more complex than that. We have both IO bound and CPU bound tasks to execute, and running the IO heavy tasks under that setting would actually cause issues in this scenario.

We had to throttle things manually ourselves, both to ensure a limited amount of parallel work and because we have a lot more information about the actual tasks than the TPL has. We can schedule them far more efficiently because we can tell what is actually going on.
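A minimal sketch of the kind of throttling involved; this is my illustration, not RavenDB’s actual scheduler:

using System;
using System.Threading;
using System.Threading.Tasks;

// cap the number of concurrently executing CPU bound tasks at the core
// count; IO bound tasks would get a separate budget of their own
var cpuSlots = new SemaphoreSlim(Environment.ProcessorCount);

Task RunThrottled(Action work)
{
    return Task.Run(async () =>
    {
        await cpuSlots.WaitAsync(); // wait for a free slot
        try { work(); }
        finally { cpuSlots.Release(); }
    });
}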

The end result is that we are actually using less parallelism, overall, but in a more efficient manner.

In my next post, I’ll discuss the auto batch tuning support, which allows us to do some really amazing things from the point of view of system performance.

The RavenDB indexing process: Optimization–Parallelizing work

One of the things that we do during the indexing process for RavenDB is apply triggers and decide whether, and how, a document will be indexed. The actual process is a bit more involved, because we have to do additional things (like figure out which indexes have already indexed those particular documents).

At any rate, the interesting thing is that this is a process which is pretty basic:

for doc in docs:
    matchingIndexes = FindIndexesFor(doc)
    if matchingIndexes.Count > 0:
       doc = ExecuteTriggers(doc) 
       if doc != null:
          yield doc

The interesting thing about this is that it is a set of operations that works on a single document at a time, and the result is the set of modified documents.

We were able to gain a significant perf boost by simply moving to a Parallel.ForEach call. This seems simple enough, right? Parallelize the work, get better performance.
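In C#, that change is roughly the following sketch, where FindIndexesFor and ExecuteTriggers are the operations from the pseudo code above and Document is a stand-in type:

// using System.Collections.Concurrent;
// using System.Threading.Tasks;

var results = new ConcurrentQueue<Document>();
Parallel.ForEach(docs, doc =>
{
    // each document is independent, so iterations can run on any core
    var matchingIndexes = FindIndexesFor(doc);
    if (matchingIndexes.Count == 0)
        return;
    var processed = ExecuteTriggers(doc);
    if (processed != null)
        results.Enqueue(processed);
});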

Except that there are issues with this as well, which I’ll touch on in my next post.

The RavenDB indexing process: Optimization

The actual process done by RavenDB to index documents is a fairly complex one. In order to understand what exactly happens, I decided to break it apart into pseudo code.

It looks something like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  
  filtered_docs = execute_read_filters(docs_to_index)
  
  indexing_work = []
  
  for index in stale:
    
    index_docs = select_matching_docs(index, filtered_docs)
    
    if index_docs.empty:
      set_indexed(index, lastIndexedEtag)
    else
      indexing_work.add(index, index_docs)
      
  for work in indexing_work:
  
     work.index(work.index_docs)

And now let me show you the areas in which we did some perf work. In the original post, the relevant lines of the pseudo code above were highlighted; they are the ones this series covers: get_documents_since (reading the documents from disk), execute_read_filters (filtering and triggers), the loop that executes the indexing work, and the batch_size itself.

All of which gives us a major boost in system performance. I’ll discuss each part of that work in detail, don’t worry :-)

Performance implications of method signatures

In my previous post, I asked: What are the performance implications of the two options?

[Image: the first method signature, which returns an open-ended result set]

Versus:

[Image: the second method signature, which takes an explicit count]

And the answer is quite simple: the chance to optimize how it works.

In the first example, we have to return an unknown amount of information. In the second example, we know how much data we need to return. That means that we can optimize based on that knowledge.

What do I mean by that?

Look at the method signatures: both require us to scan a secondary index and get the results back, and from there we need to get to the actual data. If we know the size of the data that we need to return, we can fetch just the locations from the index, then optimize our disk access pattern to take advantage of sequential reads.

In the first example, we have to assume that every read is the last read. Callers may request one item, or 25, or 713, so we don’t really have a way to optimize things. The moment we know how much the caller wants, things change.

We can scan the index to get just the actual positions of the documents on disk, and then load the documents based on the optimal disk access pattern. It is a very small change, but it allowed us to make a big optimization.
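Reconstructing the general shape of the two options as a sketch (the names and types are my guesses, not the actual RavenDB signatures):

// using System.Collections.Generic;

public interface IDocumentReader
{
    // option 1: open-ended; every read might be the last, so we cannot
    // plan the disk access pattern up front
    IEnumerable<Document> GetDocumentsAfter(Etag etag);

    // option 2: bounded; knowing 'take' up front lets us fetch just the
    // locations from the index, sort them, and read sequentially
    IEnumerable<Document> GetDocumentsAfter(Etag etag, int take);
}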

If you throttle me, I am going to throttle you back!

It is interesting to note that for a long while, what we were trying to do with RavenDB was make it use fewer and fewer resources. One reason for that is that using fewer resources is obviously better, because we aren’t wasting anything.

The other reason is that we have users running us on 512MB / 650 MHz Celeron 32 bit machines. So we really need to be able to fit into a small box (and also leave enough processing power for the user to actually do something with the machine).

We have gotten really good at doing that, actually.

The problem is that we also have users running RavenDB on standard server hardware (32 GB / 16 cores, RAID and what not) in which case they (rightly) complain that RavenDB isn’t actually using all of their hardware.

Now, being conservative about resource usage is generally good, and we do have configuration options that can tell RavenDB to use more memory. It is just that requiring that isn’t polite behavior.

In most cases, RavenDB shouldn’t require anything special for you to run it; we want it to be a truly zero admin database. The solution? Take the system state into account and increase the amount of work we do to get things done. And yes, I am aware of the pitfalls.

As long as there is enough free RAM available, we will increase the number of documents that we index in a single batch. That is subject to some limits (for example, if we just created a new index on a big database, we need to make sure we aren’t trying to load the whole thing into memory), it knows how to reserve some room for other things, and it knows how to throttle down as well as up.

This post is written before I had the chance to actually test this on production level size dataset, but I am looking forward to seeing how it works.

Update: Okay, that is encouraging. It looks like what we did just made things over 7 times faster. And this isn’t a micro benchmark; this is throwing it at a multi GB database with full text search indexing.

Next, we need to investigate what we are going to do about multiple running indexes and how this optimization affects them. Fun :-)

Watch your 6, or is it your I/O? It is the I/O, yes

As I said in my previous post, I was tasked with loading 3.1 million files into RavenDB, most of them in the 1 – 2 KB range.

Well, the first thing I did had absolutely nothing to do with RavenDB; it had to do with avoiding having to deal with this:

[Image: a directory listing with millions of extracted files]

As you can see, that is a lot.

But when the freedb dataset is distributed, what we have is actually:

[Image: the freedb dump, distributed as a single tar.bz2 archive]

This is a tar.bz2, which we can read using the SharpZipLib library.
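A sketch of reading the archive that way; the file name is made up, and the exact SharpZipLib API varies a bit between versions:

using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.BZip2;
using ICSharpCode.SharpZipLib.Tar;

using (var file = File.OpenRead("freedb-complete.tar.bz2"))
using (var bz2 = new BZip2InputStream(file))
using (var tar = new TarInputStream(bz2))
{
    TarEntry entry;
    while ((entry = tar.GetNextEntry()) != null)
    {
        if (entry.IsDirectory)
            continue;
        using (var ms = new MemoryStream())
        {
            tar.CopyEntryContents(ms); // read the current entry's bytes
            var record = Encoding.UTF8.GetString(ms.ToArray());
            // ... parse the record and hand it to the import ...
        }
    }
}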

The really interesting thing is that reading the archive (even after adding the cost of decompressing it) is far faster than reading directly from the file system. Most file systems do badly with large numbers of small files, and at any rate, it is very hard to optimize the access pattern for a lot of small files.

However, when we are talking about something like reading a single large file? That is really easy to optimize and significantly reduces the cost on the input I/O.

Just this step reduced the cost of importing by a significant factor; we are talking about twice as fast as before, with a lot less disk activity.


Watch your 6, or is it your I/O?

One of the interesting things about the freedb dataset is that it is distributed as 3.1 million separate files, most of them in the 1 – 2 KB range.

Loading that into RavenDB took a while, so I set out to fix it. Care to guess what the very first thing I did was?


When you pit RavenDB & SQL Server against one another…

Here is how it works. I hate benchmarks, because they are very easily manipulated. Whenever I am testing performance stuff, I post numbers, but they are usually in reference to themselves (showing improvements).

That said…

Mark Rodseth is a .Net Technical Architect at Fortune Cookie in London, UK, and he did a really interesting comparison between RavenDB & SQL Server. I feel good about posting this because Mark is a totally foreign agent (hm… well, maybe not quite that :-) ), with no association with RavenDB or Hibernating Rhinos.

Also, this post really made my day.

Update: Mark posted more details on his test case.

Mark set up a load test for two identical applications, one using RavenDB, the other using SQL Server. The results:

SQL Load Test
Transactions: 111,014 (Transaction = Single Get Request)
Failures: 110,286 (Any 500 or timeout)

And for RavenDB ?

RavenDB Load Test
Transactions: 145,554 (Transaction = Single Get Request)
Failures: 0 (Any 500 or timeout)

And now that is pretty cool.

Stupid smart code: Solution

The reason that I said that this is very stupid code?

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

Because all of this lovely code can be replaced with a simple:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    req.ContentLength = Encoding.UTF8.GetByteCount(data);

    using (var dataStream = req.GetRequestStream())
    // use an encoding without a BOM, so the bytes we write
    // actually match the ContentLength declared above
    using (var writer = new StreamWriter(dataStream, new UTF8Encoding(false)))
    {
        writer.Write(data);
        writer.Flush();
    }
}

And that is so much better.


Stupid smart code

We had the following code:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteArray = Encoding.UTF8.GetBytes(data);

    req.ContentLength = byteArray.Length;

    using (var dataStream = req.GetRequestStream())
    {
        dataStream.Write(byteArray, 0, byteArray.Length);
        dataStream.Flush();
    }
}

And that is a problem, because it allocates the memory twice, once for the string, once for the buffer. I changed that to this:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

And I was quite proud of myself.

Then I realized that I was stupid. Why?


You can’t cache DateTime.Now

One of the things that was itching me was the fact that it seemed not all the queries in RaccoonBlog were hitting the cache. Oh, it is more than fast enough, but I couldn’t really figure out what was going on there. Then again, it was never important enough for me to dig in.

I was busy doing the profiling work for RavenDB, using RaccoonBlog as my testing ground, when I realized what the problem was:

[Image: profiler output showing the RaccoonBlog queries]

Everything was working just fine, the problem was here:

[Image: the query, which used DateTime.Now in its where clause]

Do you get the light bulb moment that I had? I was using Now in a query, and since Now by definition changes, we keep generating new queries, which can’t be cached, etc.

I changed all of the queries that contained Now to:

[Image: the updated query, rounding Now down to the minute]
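A sketch of the idea behind that change (the helper name is mine, not RaccoonBlog’s actual code):

using System;

public static class DateTimeRounding
{
    // truncate to whole minutes, so every query issued within the same
    // minute uses the same value and can be served from the HTTP cache
    public static DateTime RoundToMinute(this DateTime value)
    {
        return new DateTime(value.Year, value.Month, value.Day,
                            value.Hour, value.Minute, 0, value.Kind);
    }
}

// e.g. session.Query<Post>().Where(p => p.PublishAt <= DateTime.Now.RoundToMinute())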

Which means that it only uses a new value once a minute. Once I fixed that, I still saw a caching problem, which led me to discover an error in how we calculated etags for dynamic indexes after they had been promoted. Even a very basic profiling tool helped us fix two separate bugs (one in RaccoonBlog and one in RavenDB).

RavenDB Aggressive Caching Mode

RavenDB has the notion of HTTP caching out of the box. What this means is that by default, without you having to take any action, RavenDB will cache as much as possible for you. It can get away with doing this because it utilizes the 304 Not Modified response. The second time that we load an entity or execute a query, we simply ask the server whether the data has been modified since the last time we saw it, and if it wasn’t, we can skip executing the query and get the data directly from the cache.

This saves a lot in terms of bandwidth and processing power. It also means that we don’t have to worry about RavenDB’s caching returning stale results, because we checked for that. It does mean, however, that we still have to send a request to the server. There are situations where we want to squeeze even better performance out of the system, and then we can move to RavenDB’s aggressive caching mode.

Aggressive caching means that RavenDB won’t even ask the server whether anything has changed; it will simply return the reply directly from the local cache if it is there. This means that you might get stale data, but it also means that you’ll get it fast.

You can activate this mode using:

using (session.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
    session.Load<User>("users/1");
}

Now, if there is a value in the cache for users/1 that is at most 5 minutes old, we can directly use that.

It works on queries, too:

using (session.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
    session.Query<User>().ToList();
}

This is an explicit step beyond the normal caching, and it does mean that you might get out of date information, but if you really want to reduce the number of remote calls, it is a really nice feature.

Performance numbers in the pub

Originally posted at 3/31/2011

I am currently sitting with 3 guys in the pub, and the discussion naturally turned to performance. I asked all three the following question: “How many CLR objects can you create in one second?”

I got the following replies:

  • 2,000 objects per second
  • 50,000 objects per second
  • 100,000 objects per second

Then I sat down to check:

class Program
{
    static void Main(string[] args)
    {
        var sp = Stopwatch.StartNew();

        int i = 0;
        while(sp.ElapsedMilliseconds < 1000)
        {
            new MyClass();
            i++;
        }
        sp.Stop();
        Console.WriteLine("Created {0} in {1}", i, sp.Elapsed);
    }
}

public class MyClass
{
    public string A;
    public int B;
    public DateTime C;
}

The result?

Created 7,715,305 in 00:00:01

You are only as fast as your slowest bottleneck

Chris points out something very important:

“A much better solution would have been to simply put the database on a compressed directory, which would slow down some IO ..."

I don't agree.
Compression needs CPU. We got a lot of more IO by switching on compression (it's just less to write and read). Previous our CPU was about 40%, now averaging at 70%. Compression rate saves us about 30% per file. After switching on compression our IO bound application was about 20% faster.
We are currently planning switching on compression on all our production servers over Christmas, because using cpu-cores for compression is even cheaper than adding hard disks and raid for performance.

In general, most operations today are mostly IO bound, with the CPU mostly sitting there twiddling the same byte until that byte threatens to sue for harassment. It makes sense to trade IO for CPU time, because our systems are starved for IO.

In fact, you can just turn on compression at the file system level in most OSes, and it is likely to result in a significant saving in application performance, assuming that the data does not already fit in memory.

How to become a speaker?

I get asked that quite frequently. More to the point, how to become an international speaker?

I was recently at a gathering where no less than three different people asked me this question, so I thought that it might be a good post.

Note: this post isn’t meant for someone who isn’t already speaking. And if you are speaking but are bad at it, this isn’t for you. The underlying assumption here is that you can speak and are reasonably good at it.

Note II: For this post, speaking is used to refer to presenting some technical content in front of an audience.

Why would you want to be a speaker anyway?

I heard that it is actually possible to make a living as a speaker. I haven’t found it to be the case, but then again, while I speak frequently, I don’t speak that frequently.

There are several reasons to want to be a speaker:

  • reputation (and in the end, good reputation means you get to raise your rates, get more work, etc).
  • contacts (speaking puts you in front of dozens or hundreds of people, and afterward you get to talk with the people who are most interested in what you spoke about)
  • advertising for your product (all those “lap around Visual Studio 2010” talks are actually hour long ads that you paid to see :-) ).

I’ll focus on the first two; reputation & contacts give you a much wider pool of potential work to choose from, increase the money you can make, etc.

So how do I do that, damn it?

Honestly, I have no idea. The first time that I presented at a technical conference, it was due to a mixup in communication. Apparently, in the US “it would have been delightful” means “we regret to inform you”, but in Israel we read it as “great, let’s do it”, and put the guy on the spot, so he had to scramble and do something.

Okay, I lied, I do have some idea about how to do this.

Again, I am assuming you are a reasonably good speaker (for myself, I know that my accent is a big problem when speaking English), but there are a lot of reasonably good speakers out there.

So, what is the answer? Make yourself different.

Pick a topic that is near & dear to your heart (or to your purse, which also works) and prepare a few talks on it. Write about it on a blog, and comment on other people’s blogs about the topic. Your goal should be that when people think about topic X, your name comes up. Forums like Stack Overflow can help, as can writing articles (whether for pay or in places like CodeProject). Join a mailing list and be active and helpful there. Don’t focus on regionally associated forums or mailing lists, though; the goal is international recognition.

It will probably take at least a year for people to start recognizing your name (it took over 2 years for me). If possible, produce a set of tools that relate to your topic. Publish them for free, and write it off as an investment in your future.

For myself, NHibernate Query Analyzer was a huge boost in terms of getting recognized, and Rhino Mocks was probably what clinched the deal. I honestly have no idea how much time & effort I put into Rhino Mocks, but Ohloh estimates the project at $12,502,089(!). While I disagree with that number, I did put a lot of effort into it, and it paid for itself several times over.

If you don’t have a blog, get one. Don’t get one at a community site, either. Community sites like blogs.microsoft.co.il are good for getting your stuff read, but they have a big weakness in terms of branding yourself. You don’t want to get lost in the crowd; you want people to notice who you are. Most people are going to read your posts in a feed reader, and they are going to notice that the community feed is interesting, not that you are interesting.

Post regularly. I try to post daily; that will probably not be possible for you, so try to post at least once a week, and try to time it so it always goes out on the same day & time. Monday at midnight usually works.

Okay, I did all of that, what now?

Another note, this is something that you may want to do in parallel to the other efforts.

Unless you become very well known, you won’t be approached; you’ll have to submit session proposals. Keep an eye on the conferences that interest you, and wait until they put out a call for sessions. Submit your stuff, and don’t get offended if they reject you.

If you live in a place that hosts international conferences (which usually rules Israel out), a good bet is to try to get accepted as a speaker there. You would be considerably cheaper than bringing someone in from out of town or country, and that plays a role too. Usually, if you manage to get into a conference once, they’ll be much more likely to have you again; they have your speaker eval, and unless you truly sucked (like going on stage and starting to speak Hebrew in Denmark), that gives them more confidence in bringing you back a second time.

And that is about it for now.

Paxos enlightment

Paxos is an algorithm used to reach consensus among a group of machines, one that is resilient to failures. For a long time, I had a really hard time understanding Paxos. Or, to be more exact, I didn’t have an issue with Paxos per se; I understood the protocol. What I had trouble with was its application.

My problem always was that I couldn’t figure out what you do with it. That had to do with a basic problem on my part: I failed to understand how to go from shared consensus on a numeric value to something actually useful. After a while, I had enough of feeling stupid, and I started reading all the material I could find, including going through the source code of available Paxos implementations (such as libpaxos). The main problem was that I had a huge misconception in my head.

I kept thinking that Paxos is a consensus algorithm for arriving at a single value in a distributed system. It isn’t, and because I kept thinking that it is, I had a really hard time understanding what it does and how to apply it.

Leslie Lamport [pdf], the original author of the algorithm, describes the goal of Paxos as follows:

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value.

This is from the Paxos Made Simple paper. I don’t know about you, but I had a hard time going from “proposed values” to something useful. What finally made everything click was Paxos Made Code, which describes the implementation of libpaxos. While reading that paper, I had a light bulb moment.

Paxos is an algorithm used to ensure consistent ordering semantics over a set of values in a cluster of machines. Why is this important? Because if you have consistent ordering over a set of values, and the values are events (or commands, or states), you can be sure that every machine in the cluster has either the same state or a previous version of that state.

Let us see how this can be useful. We have the following scenario:

  • Jane has no money in her bank account.
  • Joe sends $100 to Jane.
  • When Jane is notified that she got the $100, she sends a $50 check to the IRS.
  • The IRS cashes the check and spends it on something stupid.

[Diagram: the intended timeline of these four events]

Without a consistent ordering, each machine in the cluster may view the events in any order, which means that the following three timelines are allowed:

[Diagram: three machines, each applying the events in a different order]

As you can imagine, Jane isn’t very happy about the overdraft fee she was just charged. It is important to note that Paxos’ sole use here is to ensure that all machines have a consistent view of the events. That view might not match the order in which those events actually arrived. It is possible to provide that guarantee as well, on top of Paxos, but that isn’t the topic of this post.

Now that we all (hopefully) understand what Paxos is and what it is used for, let us talk about the algorithm itself.

Paxos has three actor types: proposers, acceptors, and learners (usually the different actors are all part of the same system, either all three together or two of the roles together). We will assume that we have 3 of each in our cluster, and that we are interested in events ordered by sequential integer event ids with no gaps:

[Diagram: the proposer, acceptor, and learner roles in the cluster]

When you want to make a change in the system, you go to the proposer and tell it that you want to add event SendCheck { To= “IRS”, Amount = 50 } to the system.

The proposer then checks the latest event id that it knows about, increments it, and asks all the acceptors in the cluster to reserve that event id for its use. (Please note that I am intentionally skipping the details of how this is done; I am aiming for a high level description here. You can read the actual algorithm description for all the details.)

There are several options here:

  • Another proposer is currently trying to reserve that event id.
  • Another proposer successfully claimed this event id.
  • Another proposer tried to claim this id and then crashed midway.
  • Etc… :-)

What Paxos ensures is that in the end, even in the presence of network and machine failures, the proposer will be able to write SendCheck { To= “IRS”, Amount = 50 } to an event id in such a way that no other machine will ever see a different value at that location, and that eventually all machines in the cluster will see that value at that location.
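To make the reservation dance a bit more concrete, here is a minimal single-slot acceptor sketch. This is my illustration, not code from the implementation linked below; the key invariant is that an acceptor never honors a ballot older than one it has promised, and it always reports what it has already accepted:

public class PrepareResponse
{
    public bool Promised;
    public long AcceptedBallot = -1;
    public object AcceptedValue;
}

public class Acceptor
{
    private long promisedBallot = -1;
    private long acceptedBallot = -1;
    private object acceptedValue;

    // phase 1: a proposer asks us to reserve a ballot number
    public PrepareResponse Prepare(long ballot)
    {
        if (ballot <= promisedBallot)
            return new PrepareResponse { Promised = false }; // a newer proposer got here first
        promisedBallot = ballot;
        // report anything we already accepted, so the new proposer
        // adopts that value instead of overwriting it
        return new PrepareResponse
        {
            Promised = true,
            AcceptedBallot = acceptedBallot,
            AcceptedValue = acceptedValue,
        };
    }

    // phase 2: the proposer asks us to accept a value under its ballot
    public bool Accept(long ballot, object value)
    {
        if (ballot < promisedBallot)
            return false; // we already promised a newer ballot
        promisedBallot = ballot;
        acceptedBallot = ballot;
        acceptedValue = value;
        return true;
    }
}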

It is important to understand that even with Paxos, the following timelines are possible:

[Diagram: one machine fully up to date, one lagging behind, and one with a gap in its event log]

That is, due to some error, we may not be fully up to date on some machines (as seen on machine #2), or we may have missing events (as seen on machine #3).

What Paxos provides is that there won’t be a scenario in which we have missing events without being explicitly aware of it. At that point, we can defer processing the check until we know what event we are missing, explicitly decide to ignore the discontinuity in the timeline, or do something else that fits the business needs.

In order to understand how this all works, I had to write my own implementation of Paxos in C#. There doesn’t seem to be anything like that available to the public, and I (at least) find the code I wrote much easier to understand than libpaxos’ C or Erlang implementations. You can find the implementation here: http://github.com/ayende/Paxos.Demo