Ayende @ Rahien

Refunds available at head office

Design practice: Building a search engine library

Note: This is done purely as a design practice. We don’t have any current plans to implement this, but I find that it is a good exercise in general.

How would I go about building a search engine for RavenDB to replace Lucene. Well, we have Voron as the basis for storage, so from the get go, we have several interesting changes. To start with, we inherit the transactional properties of Voron, but more importantly, we don’t have to do merges, so we don’t have any of those issues. In other words, we can actually generate stable document ids.

But I’m probably jumping ahead a bit. We’ll start with the basics. Analysis / indexing is going to be done in very much the same way. Users will provide an analyzer and a set of documents, like so:

   1: var index = new Corax.Index(storageOptions, new StandardAnalyzer());
   3: var scope = index.CreateIndexingScope();
   5: long docId = scope.Add(new Document
   6: {
   7:     {"Name", "Oren Eini", Analysis.Analyzer},
   8:     {"Name", "Ayende Rahien", Analysis.Analyzer},
   9:     {"Email", "ayende@ayende.com", Analyzed.AsIs, Stored.Yes}
  10: });
  12: docId = scope.Add(new Document
  13: {
  14:     {"Name", "Arava Eini", Analysis.Analyzer},
  15:     {"Email", "arava@doghouse.com", Analyzed.AsIs, Stored.Yes}
  16: });
  18: index.Sumbit(index);

Some notes about this API. It is modeled quite closely after the Lucene API, and it would probably need additional work. The idea here is that you are going to need to get an indexing scope, which is single threaded. And you can have multiple indexing scopes running a the same time. You can batch multiple writes into a single scope, and it behaves like a transaction.

The idea is to deal with all of the work associated with indexing the document into a single threaded work, so that make it easier for us. Note that we immediately get the generated document id, but that the document will only be available for searching when you have submitted the scope.

Under the hood, this does all of the work at the time of calling Add(). The analyzer will run on the relevant fields, and we will create the appropriate entries. How does that work?

Every document has a long id associated with it. In Voron, we are going to have a ‘Documents’ tree, with the long id as the key, and the value is going to be all the data about the document we need to keep. For example, it would have the stored fields for that documents, or the term positions, if required, etc. We’ll also have a a Fields tree, which will have a mapping of all the field names to a integer value.

Of more interest is how we are going to deal with the terms and the fields. For each field, we are going to have a tree called “@Name”, “@Email”, etc. Those are multi trees, with the keys in that tree being the terms, and the multi values being the document ids that has those threes. In other words, the code above is going to generate the following data in Voron.

  • Documents tree:
    • 0 – { “Fields”: { 1: “ayende@ayende.com” } }
    • 1 – { “Fields”: { 1: “arava@doghouse.com” } }
  • Fields tree
    • 0 – Name
    • 1 – Email
  • @Name tree
    • ayende – [ 0 ]
    • arava – [ 1 ]
    • oren – [ 0 ]
    • eini – [ 0, 1 ]
    • rahien – [ 0 ]
  • @Email tree
    • ayende@ayende.com – [ 0 ]
    • arava@doghouse.com – [ 1 ]

Given this model, we can now turn to the actual querying part of the business. Let us assume that we have the following query:

   1: var queryResults = index.Query("Name: Oren OR Email: ayende@ayende.com");
   2: foreach(var result in queryResult.Matches)
   3: {
   4:     Console.WriteLine("{0}: {1}", result.Id, result["Email"] )
   5: }

How does querying works? The query above would actually be something like:

   1: new BooleanQuery(
   2:     Match.Or,
   3:     new TermQuery("Name", "Oren"),
   4:     new TermQuery("Email", "ayende@ayende.com")
   5:     )

The actual implementation of this query would just access the relevant field tree and load the values in the particular key. The result is a set of ids for both parts of the query, and since we are using an OR here, we will just union them and return the list of results back.

Optimizations here can be to run those queries in parallel, or to just cache the results of particular term query, so we can speed things even more.  This looks simple, and it is, because a lot of the work has already been done in Voron. Searching is pretty much complete, we only need to tell it what to search with.

My distributed build system

Yes, I know that you are probably getting geared up to hear about some crazy setup, and in some manner, it is crazy. My distributed build system is this:


Yep, that is me manually distributing the build to do a cross check on a reasonable time frame.

I’ve mentioned before that our build time is atrocious. With over 3,500 tests, this is no longer a feasible alternative to just run them normally. So we started parallel efforts (pun intended), to reduce the time it takes the system to build and reduce individual test times as well as the ability to parallelize things.

We are actually at the point where we can run concurrent tests, even those we previously had to distribute. And we can even do that with replicated tests, which are the ones usually taking the longest. But what we’ll probably have in the end is just a bunch of test projects (currently we have ~5) that we will run on different machines at the same time.

We are using Team City for build system, and I know that it has capabilities in this regard, but I’ve never actually looked into that. We’re going to be pretty busy in the next couple of weeks, so I thought that this would be a good time to ask.

My current thinking is:

  • One build that would actually do the compilation, etc.
  • Then branch off that to a build per test project, which can run on different agents
  • Then a build on top of that that can actually deploy the build.

Any recommendations on that? We are going to need 5 – 7 agents to run the tests in parallel. Any recommendations on how to get those? Ideally I would like to avoid having to pay for those machines to sit idle 60% – 80% of the time.

Any other alternatives?

Maximum addressable virtual memory

I wonder what is says about what I am doing right now that I really wish that I could have the OS give me more control over virtual memory allocation. At any rate, the point of this post is to point out something quite important to people writing databases, especially databases that make use of virtual memory a lot.

There isn’t quite as much of it as I thought it would be. Oh, on 32 bits the 4GB limits is really hard to swallow. But on 64 bits, the situation is much better, but still constrained.

On Windows, using x64, you are actually limited to merely 8TB of address space. In Windows 8.1, (and I assume, but couldn’t verify, Windows 2012 R2) you can use up to 128TB of virtual address space. With Linux, at least since 2.6.32, and probably earlier, the limit is 128TB per process.

Implications for Voron, by the way, is that the total size of of all databases in a single process can be up to 8TB (probably somewhat less than that, the other stuff will also need memory). Currently the biggest RavenDB database that I am aware of was approaching the 1.5 – 2.0 TB mark last I checked (several months ago), and Esent, our current technology, is limited to 16TB per database.

So it isn’t great news, but it is probably something that I can live with. And at least I can give proper recommendations. In practice, I don’t think that this would be an issue. But that is good to know.


Published at

Originally posted at

Comments (16)

Big Data Search: Sorting randomness

As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4GB of random numbers, let us find how many times we have the number 1.

Because I wanted to ensure a consistent answer, I wrote:

   1: public static IEnumerable<uint> RandomNumbers()
   2: {
   3:     const long count = 1024 *  1024 * 1024L  * 1;
   4:     var random = new MyRand();
   5:     for (long i = 0; i < count; i++)
   6:     {
   7:         if (i % 1024 == 0)
   8:         {
   9:             yield return 1;
  10:             continue;
  11:         }
  12:         var result = random.NextUInt();
  13:         while (result == 1)
  14:         {
  15:             result = random.NextUInt();
  16:         }
  17:         yield return result;
  18:     }
  19: }
  21: /// <summary>
  22: /// Based on Marsaglia, George. (2003). Xorshift RNGs.
  23: ///  http://www.jstatsoft.org/v08/i14/paper
  24: /// </summary>
  25: public class MyRand
  26: {
  27:     const uint Y = 842502087, Z = 3579807591, W = 273326509;
  28:     uint _x, _y, _z, _w;
  30:     public MyRand()
  31:     {
  32:         _y = Y;
  33:         _z = Z;
  34:         _w = W;
  36:         _x = 1337;
  37:     }
  39:     public uint NextUInt()
  40:     {
  41:         uint t = _x ^ (_x << 11);
  42:         _x = _y; _y = _z; _z = _w;
  43:         return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
  44:     }
  45: }

I am using a custom Rand function because it is significantly faster than System.Random. This generate 4GB of random numbers, at also ensure that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I run the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random number, sorting then compressing the values resulted in about 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for real world scenario, it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes time. Implementing parallel sort cut this down to less than a minute per batch of 26 million.

That let us complete the entire process, but then it halts with the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceed 40 items, but those are still 1 billion * O(log 40) or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallel, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them.

Obviously we can sort each 10 files set in parallel, then sort the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this.

For what is it worth, you can find the code for this here.


Published at

Originally posted at

Comments (6)

Big Data Search: Sorting Optimizations

I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here.

In no particular order, I think that there are at least several things that we could do to significantly improve the time it takes to sort. Right now we defined 2 indexes on top of a 1GB file, and it took under 1 minute to complete. That gives us a runtime of about 10 days over a 15TB file.

Well, one of the reason for this performance is that we execute this in a serial fashion, that is, one after another. But we have to completely isolated indexes, there is no reason why we can’t parallelize the work between them.

For that matter, we are buffering in memory up to a certain point, then we sort, then we buffer some more, etc. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and adding to a buffer while we are adding to the buffer.

We wrote to intermediary files, but we wrote to those using plain file I/O. But it is usually a lot more costly to write to disk than to compress and then write to disk.  We are writing sorted data, so it is probably going to compress pretty well.

Those are the things that pop to mind. Can you think of additional options?


Published at

Originally posted at

Comments (6)

Big Data Search: The index format is horrible

I have completed my own exercise, and while I did wanted to try it with “few allocations” rule, it is interesting to see just how far out there the code is.

This isn’t something that you can really for anything except as a basis to see how badly you are doing. Let us start with the index format. It is just a CSV file with the value and the position in the original file.

That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance.

Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced roughly 80MB index. Assuming the same is true for the larger file, that gives us a 1.2 TB file.

In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are in the order of 10ms or so. So the cost of actually searching the index is close to quarter of a second.  Now, to be fair, there is going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with ere.

Of course, the fun thing about this is that even with a 1.2 TB file, we are still talking about less than 40 seeks (the beauty of O(logN) in action), but that is still pretty expensive. Even worse, this is what happens when we are running on a single query at a time.

What do you think will happen if we are actually running this with multiple threads generating queries. Now we will have a lot of seeks (effective random) that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDD disks. Sure, we get multiple spindles, so we get better performance overall, but still…

Obviously, there are multiple solutions for this issue.

B+Trees solve the problem by packing multiple keys into a single page, so instead of doing a O(log2N), you are usually doing O(log36N) or O(log100N). Consider those fan outs, we will have 6 – 8 seeks to do to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the trees are likely to reside in memory (and probably in L1, if we are speaking about that).

However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry to the file. This means that in our 15 TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2 TB). Note that the numbers actually are probably worse. Because SST (note that when talking about SST I am talking specifically about the leveldb implementation) utilize many forms of compression, it is actually that the file size will be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here.

However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is rely on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking. Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here.

This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format. That make it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do. Another factor that is probably as important is that as I understand it, a lot of the time, SSTs are usually used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb.

It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.


Published at

Originally posted at

Comments (11)

Big Data Search: Binary Search of Textual Data

The index I created for the exercise is just a text file, sorted by the indexed key. When doing a search by a human, that make it very easy to work with. Much easier than trying to work with a binary file, it also helps debugging.

However, it does make it running a binary search on the data a bit harder. Mostly because there isn’t a nice way to say “give me the #th line”. Instead, I wrote the following:

   1: public void SetPositionToLineAt(long position)
   2: {
   3:     // now we need to go back until we either get to the start of the file
   4:     // or find a \n character
   5:     const int bufferSize = 128;
   6:     _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);
   8:     var charCount = _encoding.GetMaxCharCount(bufferSize);
   9:     if (charCount > _charBuf.Length)
  10:         _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];
  12:     while (true)
  13:     {
  14:         _input.Position = position - (position < bufferSize ? 0 : bufferSize);
  15:         var read = ReadToBuffer(bufferSize);
  16:         var buffer = _buffer.GetBuffer();
  17:         var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
  18:         for (int i = chars - 1; i >= 0; i--)
  19:         {
  20:             if (_charBuf[i] == '\n')
  21:             {
  22:                 _input.Position = position - (bufferSize - i) + 1;
  23:                 return;
  24:             }
  25:         }
  26:         position -= bufferSize;
  27:         if (position < 0)
  28:         {
  29:             _input.Position = 0;
  30:             return;
  31:         }
  32:     }
  33: }

This code starts at an arbitrary byte position, and go backward until it find the new line character ‘\n’. This give me the ability to go to a rough location and get the line oriented input.

Once I have that, the rest is pretty easy. Here is the binary search:

   1: while (lo <= hi)
   2: {
   3:     position = (lo + hi) / 2;
   4:     _reader.SetPositionToLineAt(position);
   6:     bool? result;
   7:     do
   8:     {
   9:         result = _reader.ReadOneLine();
  10:     } while (result == null); // skip empty lines
  12:     if (result == false)
  13:         yield break; // couldn't find anything
  15:     var entry = _reader.Current.Values[0];
  16:     match = Utils.CompareArraySegments(expectedIndexEntry, entry);
  18:     if (match == 0)
  19:     {
  20:         break;
  21:     }
  22:     if (match > 0)
  23:         lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
  24:     else
  25:         hi = position - 1;
  26: }
  28: if (match != 0)
  29: {
  30:     // no match
  31:     yield break;
  32: }

The idea is that this positions us on the location of the index that has an entry with a value that is equal to what we are searched on.

We then write the following to actually get the data from the actual data file:

   1: // we have a match, now we need to return all the matches
   2: _reader.SetPositionToLineAt(position);
   4: while(true)
   5: {
   6:     bool? result;
   7:     do
   8:     {
   9:         result = _reader.ReadOneLine();
  10:     } while (result == null); // skip empty lines
  12:     if(result == false)
  13:         yield break; // end of file
  15:     var entry = _reader.Current.Values[0];
  16:     match = Utils.CompareArraySegments(expectedIndexEntry, entry);
  17:     if (match != 0)
  18:         yield break; // out of the valid range we need
  20:     _buffer.SetLength(0);
  21:     _data.Position = Utils.ToInt64(_reader.Current.Values[1]);
  23:     while (true)
  24:     {
  25:         var b = _data.ReadByte();
  26:         if (b == -1)
  27:             break;
  28:         if (b == '\n')
  29:         {
  30:             break;
  31:         }
  32:         _buffer.WriteByte((byte)b);
  33:     }
  35:     yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
  36: }

As you can see, we are moving forward in the index file, reading one line at a time. Then we take the second value, the position of the relevant line in the data file, and read that.

We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems. I’ll discuss that in my next post.


Published at

Originally posted at

Comments (11)

Big Data Search: Setting up

The interesting thing about this problem is that I was very careful in how I phrased things. I said what I wanted to happen, but didn’t specify what needs to be done. That was quite intentional. For that matter, the fact that I am posting about what is going to be our acceptance criteria is also intentional. The idea is to have a non trivial task, but something that should be very well understood and easy to research. It also means that the candidate needs to be able to write some non trivial code. And I can tell a lot about a dev from such a project. At the same time, this is a very self contained scenario. The idea is that this is something that you can do in a short amount of time.

The reason that this is an interesting exercise is that this is actually at least two totally different but related problems. First, in a 15TB file, we obviously cannot rely on just scanning the entire file. That means that we have to have an index. And that mean that we have to build it. Interestingly enough, an index being a sorted structure, that means that we have to solve the problem of sorting more data than can fit in main memory.

The second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that. Note that I am not really interested in actual efficiencies for this particular scenario. I care about being able to see the code. See that it works, etc. My solution, for example, is a single threaded system that make no attempt at parallelism or I/O optimizations. It clocks at over 1 GB / minute and the memory consumption is at under 150MB. Queries for a unique value return the result in 0.0004 seconds. Queries that returned 153K results completed in about 2 seconds.

When increasing the used memory to about 650MB, there isn’t really any difference in performance, which surprised me a bit.

Then again, the entire code is probably highly inefficient. But that is good enough for now.

The process is kicked off with indexing:

   1: var options = new DirectoryExternalStorageOptions("/path/to/index/files");
   2: var input = File.OpenRead(@"/path/to/data/Crimes_-_2001_to_present.csv");
   3: var sorter = new ExternalSorter(input, options, new int[]
   4: {
   5:     1,// case number
   6:     4, // ICHR
   8: });
  10: sorter.Sort();

I am actually using the Chicago crime data for this. This is a 1GB file that I downloaded from the Chicago city portal in CSV format. This is what the data looks like:


The ExternalSorter will read and parse the file, and start reading it into a buffer. When it gets to a certain size (about 64MB of source data, usually), it will sort the values in memory and output them into temporary files.

Those file looks like this:


Initially, I tried to do that with binary data, but it turns out that that was too complex to be easy, and writing this in a human readable format made it much easier to work with. The format is pretty simple, you have the value of the left, and on the right you have start position of the row for this value.

We generate about 17 such temporary files for the 1GB file. One temporary file per each 64 MB of the original file. This lets us keep our actual memory consumption very low, but for larger data sets, we’ll probably want to actually do the sort every 1 GB or maybe more. Our test machine has 16 GB of RAM, so doing a sort and outputting a temporary file every 8 GB can be a good way to handle things. But that is beside the point.

The end result is that we have multiple sorted files, but they aren’t sequential. In other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. We need to merge all of them together. Luckily, this is easy enough to do. We basically have a heap that we feed entries from the files into. And that pretty much takes care of this. See merge sort if you want more details about this.

The end result of merging all of those files is… another file, just like them, that contains all of the data sorted. Then it is time to actually handle the other issue, actually searching the data.

We can do that using simple binary search, with the caveat that because this is a text file, and there is no fixed size records or pages, it is actually a big hard to figure out where to start reading.

In effect, what I am doing is to select an arbitrary byte position, then walk backward until I find a ‘\n’. Once I found the new line character, I can read the full line, check the value, and decide where I need to look next. Assuming that I actually found my value, I can now go to the byte position of the value in the original file and read the original line, giving it to the user.

Assuming an indexing rate of 1 GB / minute a 15 TB file would take about 10 days to index. But there are ways around that as well, but I’ll touch on them in my next post.

What all of this did was bring home just how much we usually don’t have to worry about such things. But I consider this research well spent, we’ll be using this in the future.

The cost of working with strings

Following my last post, I decided that it might be better to actually show what the difference is between direct string manipulation and working at lower levels.

I generated a sample CSV file with 10 million lines and 6 columns. The file size was 658MB. I then wrote the simplest code that I could possibly think of:

   1: public class TrivialCsvParser
   2: {
   3:     private readonly string _path;
   5:     public TrivialCsvParser(string path)
   6:     {
   7:         _path = path;
   8:     }
  10:     public IEnumerable<string[]> Parse()
  11:     {
  12:         using (var reader = new StreamReader(_path))
  13:         {
  14:             while (true)
  15:             {
  16:                 var line = reader.ReadLine();
  17:                 if (line == null)
  18:                     break;
  19:                 var fields = line.Split(',');
  20:                 yield return fields;
  21:             }
  22:         }
  23:     }
  24: }

This run in 8.65 seconds (with a no-op action) and kept the memory utilization at about 7MB.

Then next thing to try was just reading through the file without doing any parsing. So I wrote this:

   1: public class NoopParser
   2: {
   3:     private readonly string _path;
   5:     public NoopParser(string path)
   6:     {
   7:         _path = path;
   8:     }
  10:     public IEnumerable<object> Parse()
  11:     {
  12:         var buffer = new byte[1024];
  13:         using (var stream = new FileStream(_path,FileMode.Open, FileAccess.Read))
  14:         {
  15:             while (true)
  16:             {
  17:                 var result = stream.Read(buffer, 0, buffer.Length);
  18:                 if (result == 0)
  19:                     break;
  20:                 yield return null; // noop
  21:             }
  22:         }
  23:     }
  24: }

Note that this isn’t actually doing anything. But this took 0.83 seconds, so we see a pretty important big difference here. By the way, the amount of memory used isn’t noticeably different here. Both use about 7 MB. Probably because we aren’t actually holding up to any of the data in any meaningful way.

I have run the results using release build, and I run it multiple times, so the file is probably all in the OS cache. So I/O cost is pretty minimal here. However, note that we aren’t doing a lot of stuff that is being done by the TrivialCsvParser. For example, doing line searches, splitting the string to fields, etc. But interestingly enough, just removing the split will reduce the cost from 8.65 seconds to 3.55 seconds.

Without strings, it is a dark, cold place…

So I set out to do some non trivial stuff with file parsing. The file format is CSV, and I am explicitly trying to do it with as few string allocations as possible.

In effect, I am basically relying on a char array that I manually manage. But as it turns out, this is not so easy. To start with, 65279 should be taken out and shot. That is the BOM marker (U+FEFF), and it is has a very nasty habit of showing up when you are mixing StreamWriter and reading from a byte stream, even when I made sure to use the UTF8 encoding anyway.

It is possible, as I said, but it is anything but nice. I set out to do non trivial stuff using this approach, but I wonder how useful this actually is. From experience, this can kill a system performance. This has been more than just my experience: http://joeduffyblog.com/2012/10/30/beware-the-string

Of course, the moment that you start dealing with your own string type, it is all back in the good bad days of C++ and BSTR vs cstr vs std::string vs. MyString vs OmgStr. For example, how do you look at the value during debug…

I am pretty sure that in general, that isn’t something that you’ll want to do. In my spike, quite a lot of the issues that came up were directly associated with this. On the other hand, this did let me do things like string pooling, efficient parsing with no allocations, etc.

But I’ll talk about that specific project in my next post.


Published at

Originally posted at

Comments (1)

Growable memory

I wish that I could have more control over virtual memory. In particular, what I would really like at this time is the ability to map two virtual address ranges to the same physical memory. And yes, I am well aware (and already using) the ability to do just that using memory mapped files.  However, what I would like to do is to have a pure memory based system, without any files being involved. This is meant to be used for the memory only system, which is commonly meant to be be used for unit testing.

The reason that I want to be able to map the physical memory multiple times is that I have a buffer. Let us say that it is 1MB in size. And now I need to grow it. I can’t ask the OS to just grow the buffer, since it might not have the virtual address available past the buffer end by the time I request it. What I would like to do is to request a new virtual allocation, let us say of 2 MB, and then ask the OS to map the same physical memory for the first buffer to the first section of the new buffer, and new memory for the rest.

The end result is that the first part of the buffer is mapped twice, and any changes you make there will be visible in both locations. Now, it is pretty easy to do this with memory mapped files, but I couldn’t find a way to do it sans files.

What I ended up doing is reserve a large portion of the virtual address space (using VirtualAlloc) and then committing it on demand. But I would really have liked to do something better, because now just moved the problem to when I run out of the reserved buffered space.

The difference between fsync & Write Through, through the OS eyes

I got an interesting question from Thomas:

How can the OS actually ensure that the write through calls go to disk before returning, when it cannot do the same for fsync. I mean couldn't there still be some driver / raid controller / hard disk doing some caching but telling windows the data is written?

And I think that the answer is worth a post.

Let us look at the situation from the point of view of the operation system. You have an application that issue a write request. And the OS will take the data to be written and write it to its own buffers, or maybe it will send it to the disk, with instructions to write the data, but nothing else. The disk driver is then free to decide what the optimal way to actually do that would be. In many cases, that means not writing the data right now, but placing that in its own buffer, and do a lazy write when it feels like it. This is obviously a very simplified view of how it works, but it is good enough for what we are doing.

Now, when we call fsync, we have to do that with a file handle. But as it turned out, that isn’t quite as useful as you might have thought it would be.

The OS is able to use the file handle to find all of the relevant data that has been written to this file and weren’t send to the disk yet. And it will call the disk and tell it, “hi, how about writing those pieces too, if you don’t mind*”. However, that is only part of what it needs to do. What about data that has already been written by the OS to the disk drive, but is still in the disk drive cache?

* It is a polite OS.

Well, we need to force the drive to flush it to the actual physical media, but here we run into an interesting problem. There is actually no way for the OS to tell a disk drive “flush just the data belong to file X”. That is because at the level of the disk drive, we aren’t actually talking about files, we talk about sectors. Therefor, there isn’t any way to say, flush just this data. And since the disk drive won’t tell the OS when it actually flushed the data to disk, the OS has no way of telling (nor does it needs to track it) what specific pieces need to actually be flushed.

Therefor, what it does is go to the disk driver and tell it, flush everything that is in your cache, and tell me when you are done. As you can imagine, if you are currently doing any writes, and someone call fsync, that can be a killer for performance, because the disk needs to flush the entire cache. It is pretty common for disks to come with 64MB or 128MB caches. That means that when fsync is called, it might be doing a lot of work. the FireFox fsync issue is probably the most high profile case where this was observed. There have been a lot of people looking into that, and you can read a lot of fascinating information about it.

Now, what about Write Through? Well, for that the OS does something slightly differently. Instead of just handing the data to the disk driver and telling it do whatever it wants with it, what it does is to tell the disk driver that it needs to write this data right now.  Because we can give the disk driver the specific instructions about what to flush to disk, it can do that without having to flush everything in its cache. That is the difference between writing a few KB and writing tens of MB.

I said that this is a great oversimplification. There are some drivers that would choose to just ignore fsync. They can do that, and they should do that, under certain circumstances. For example, if you are using a disk that comes with its own battery backed memory, there is no reason to actually wait for the flush, we are already ensured that the data cannot go away if someone pulls the plug. However, there are plenty of drives that would just ignore fsync (or handling only 3rd fsync, or whatever) because it leads to better performance.

This also ignore things like the various intermediaries along the way. If you are using hardware RAID, for example, you also have the RAID cache, etc, etc, etc. And yes, I think that there are drivers there that would ignore write through as well.

At the low level Write Through uses SCSI commands with Force Unit Access, and fsync uses SYNCHRONIZE_CACHE  for SCSI and FLUSH_CACHE for ATAPI. I think that ATAPI 7 has Force Unit Access, as well, but I am not sure.

Deleting ain’t easy

We started getting test failures in Voron regarding to being unable to cleanup the data directory. Which was a major cause for concern. We thought that we had a handle leak, or something like that.

As it turned out, we could reproduce this issue with the following code:

   1: class Program
   2: {
   3:     static void Main(string[] args)
   4:     {
   5:         var name = Guid.NewGuid().ToString();
   6:         for (int i = 0; i < 1000; i++)
   7:         {
   8:             Console.WriteLine(i);
   9:             Directory.CreateDirectory(name);
  11:             for (int j = 0; j < 10; j++)
  12:             {
  13:                 File.WriteAllText(Path.Combine(name, "test-" +j), "test");
  14:             }
  16:             if (Directory.Exists(name))
  17:                 Directory.Delete(name, true);
  18:         }
  20:     }
  21: }

If you run this code, in 6 out of 10 runs, it would fail with something like this:


I am pretty sure that the underlying cause is some low lever “thang” somewhere on my system. FS Indexing, AV, etc. But it basically means that directory delete is not atomic, and that I should basically just suck it up and write a robust delete routine that can handle failure and retry until it is successful.


Published at

Originally posted at

Comments (16)

Why Voron has a Write Ahead Journal, not a Write Ahead Log

Most of the database literature talks about Write Ahead Log, or WAL for short.  Voron uses a Write Ahead Journal. And the reason isn’t technical. The reason is human.

I can’t tell you just how many times I have had to deal with people who send me the RavenDB transaction logs instead of the debug log.

Calling what Voron is doing a Journal means that we have a very clear separation between the things we do to keep the data, and the things we output to inform the user what is actually going on.

Poor man’s “Data” breakpoints in managed debugging

One of the things that is really annoying with unmanaged code is that it is easy to overwrite memory that doesn’t strictly belongs to you. That can lead to interesting corruption bugs, especially if the problem occurs only a while after the actual change happened. This is why it is so good to have a memory breakpoint. That will force a breakpoint in the debugger when a particular memory location is written. However, in VS, those are only available for unmanaged code, not for managed code that directly manipulate memory.

Therefor, I resorted to using the following method, instead:


I run in the debugger, and issue this statement after every line, waiting for that value to change, which will tell me what line is the culprit.

Not as easy, but it works.


Published at

Originally posted at

Comments (6)

Suspending tests

I am doing a big merge, and as you can imagine, some tests are failing. I am currently just to get it out the door, and failing tests are really annoying.

I identified the problem, but resolving it would be too hard right now, so I opted for the following solution instead:



Published at

Originally posted at

Comments (9)

File I/O - Flush or WriteThrough?

Daniel Crabtree has asked a really interesting question:

I've been playing around with file I/O and I am trying to figure out when to use FileOptions.WriteThrough. In your experience, if durability is priority 1 and performance priority 2, is there any reason to use WriteThrough on
file streams in addition to or instead of FlushToDisk?

From my experiments, I found the following:


  • Durable
  • Calls to Write block
  • Slower than FlushToDisk


  • Durable
  • Calls to Flush block
  • Faster than WriteThrough

Both WriteThrough & FlushToDisk

  • Durable
  • Calls to Write block
  • Same performance as WriteThrough alone

I'm asking you, as I notice you've used both approaches.
You used WriteThrough:

Well… that caught up with me, is appears. Basically, I was a bit lazy with the terms I was using in those blog posts, so I think that I have better clarify.

WriteThrough is the .NET name for FILE_FLAG_WRITE_THROUGH which tells the OS to skip any caching and goes directly to the disk. Writes will block until the data is sent to the disk. That is usually the wrong thing to do, actually, since this means that you can’t get any benefit from OS buffers, batching, etc. In practice, what this means is that you are effectively calling fsync() after each and every write call. That is usually the wrong thing to do, since you would usually want to do multiple writes and then flush all those changes to disk, and you do want those changes to take advantage of the work the OS can do to optimize your system.

Instead, you want to use Flush(). Note that even in Munin, both options are used, and WriteThrough can be removed. Although Munin by no means should be seen as a good impl of a storage engine.

That said, you also have to be aware that Flush doesn’t always does its work: http://connect.microsoft.com/VisualStudio/feedback/details/792434/flush-true-does-not-always-flush-when-it-should

It looks like this is fixed in 4.5, but be aware if you need to run on 4.0 or previous versions.


Published at

Originally posted at

Sparse files & memory mapped files

One of the problems with memory mapped files is that you can’t actually map beyond the end of the file. So you can’t use that to extend your file.  I had a thought and set out to check, what will happen if I create a sparse file, a file that only take space when you write to it, and at the same time, map it?

As it turn out, this actually work pretty well in practice. You can do so without any issues. Here is how it works:

using (var f = File.Create(path))
    int bytesReturned = 0;
    var nativeOverlapped = new NativeOverlapped();
    if (!NativeMethod.DeviceIoControl(f.SafeFileHandle, EIoControlCode.FsctlSetSparse, IntPtr.Zero, 0,
                                      IntPtr.Zero, 0, ref bytesReturned, ref nativeOverlapped))
        throw new Win32Exception();

This create a sparse file that is 64Gb in size. Then we can map it normally:

using (var mmf = MemoryMappedFile.CreateFromFile(path))
using (var memoryMappedViewAccessor = mmf.CreateViewAccessor(0, 1024*1024*1024*64L))
    for (long i = 0; i < memoryMappedViewAccessor.Capacity; i += buffer.Length)
        memoryMappedViewAccessor.WriteArray(i, buffer, 0, buffer.Length);

And then we can do stuff to it.  And that include writing to yet unallocated parts of the file. This also means that you don’t have to worry about writing past the end of the file, the OS will take care of all of that for you.

Happy happy, joy joy, etc.

One problem with this method, however. It means that you have a 64Gb file, but you don’t have that much allocated. What that means in turn is that you might not have that much space available for the file. Which brings up an interesting question, what happens when you are trying to commit a new page, and the disk is out of space? Using file I/O you would get an IO error with the right code. But using memory mapped files, the error would actually turn up during access, which can happen pretty much anywhere. It also means that it is a Standard Exception Handling error in Windows, which require special treatment.

To test this out, I wrote the following so it would write to a disk that had only about 50 Gb free. I wanted to know what would happen when it run out of space. That is actually something that happens, and we need to be able to address this issue robustly. The kicker is that this might actually happen at any time, so that would really result is some… interesting behavior with regards to robustness. In other words, I don’t think that this is a viable option, it is a really cool trick, but I don’t think it is a very well thought out option.

By the way, the result of my experiment was that we had an effectively a frozen process. No errors, nothing, just a hung. Also, I am pretty sure that WriteArray() is really slow, but I’ll check this out at another pointer in time.

JSON Packing, Text Based Formats and other stuff that come to mind at 5 AM

This post was written at 5:30AM, I run into this while doing research for another post, and I couldn’t really let it go.

XML as a text base format is really wasteful in space. But that wasn’t what really made it lose its shine. That was when it became so complex that it stopped being human readable. For example, I give you:

   1: <?xml version="1.0" encoding="UTF-8" ?>
   2:  <SOAP-ENV:Envelope
   3:   xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
   4:   xmlns:xsd="http://www.w3.org/1999/XMLSchema"
   5:   xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
   6:    <SOAP-ENV:Body>
   7:        <ns1:getEmployeeDetailsResponse
   8:         xmlns:ns1="urn:MySoapServices"
   9:         SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  10:            <return xsi:type="ns1:EmployeeContactDetail">
  11:                <employeeName xsi:type="xsd:string">Bill Posters</employeeName>
  12:                <phoneNumber xsi:type="xsd:string">+1-212-7370194</phoneNumber>
  13:                <tempPhoneNumber
  14:                 xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
  15:                 xsi:type="ns2:Array"
  16:                 ns2:arrayType="ns1:TemporaryPhoneNumber[3]">
  17:                    <item xsi:type="ns1:TemporaryPhoneNumber">
  18:                        <startDate xsi:type="xsd:int">37060</startDate>
  19:                        <endDate xsi:type="xsd:int">37064</endDate>
  20:                        <phoneNumber xsi:type="xsd:string">+1-515-2887505</phoneNumber>
  21:                    </item>
  22:                    <item xsi:type="ns1:TemporaryPhoneNumber">
  23:                        <startDate xsi:type="xsd:int">37074</startDate>
  24:                        <endDate xsi:type="xsd:int">37078</endDate>
  25:                        <phoneNumber xsi:type="xsd:string">+1-516-2890033</phoneNumber>
  26:                    </item>
  27:                    <item xsi:type="ns1:TemporaryPhoneNumber">
  28:                        <startDate xsi:type="xsd:int">37088</startDate>
  29:                        <endDate xsi:type="xsd:int">37092</endDate>
  30:                        <phoneNumber xsi:type="xsd:string">+1-212-7376609</phoneNumber>
  31:                    </item>
  32:                </tempPhoneNumber>
  33:            </return>
  34:        </ns1:getEmployeeDetailsResponse>
  35:    </SOAP-ENV:Body>
  36: /SOAP-ENV:Envelope>

After XML was thrown out of the company of respectable folks, we had JSON show up and entertain us. It is smaller and more concise than XML, and so far has resisted the efforts to make it into some sort of a uber complex enterprisiey tool.

But today I run into quite a few effort to do strange things to JSON. I am talking about things like JSON DB (a compressed json format, not actual json database), JSONH, json.hpack, and friends. All of those attempt to reduce the size of JSON documents.

Let us take an example. the following is a JSON document representing one of RavenDB builds:

   1: {
   2:   "BuildName": "RavenDB Unstable v2.5",
   3:   "IsUnstable": true,
   4:   "Version": "2509-Unstable",
   5:   "PublishedAt": "2013-02-26T12:06:12.0000000",
   6:   "DownloadsIds": [],
   7:   "Changes": [
   8:     {
   9:       "Commiter": {
  10:         "Email": "david@davidwalker.org",
  11:         "Name": "David Walker"
  12:       },
  13:       "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
  14:       "Href": "/app/rest/changes/id:21039",
  15:       "TeamCityId": 21039,
  16:       "Username": "david walker",
  17:       "Comment": "Do not save Has-Api-Key header to metadata\n",
  18:       "Date": "2013-02-20T23:22:43.0000000",
  19:       "Files": [
  20:         "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  21:       ]
  22:     },
  23:     {
  24:       "Commiter": {
  25:         "Email": "david@davidwalker.org",
  26:         "Name": "David Walker"
  27:       },
  28:       "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
  29:       "Href": "/app/rest/changes/id:21040",
  30:       "TeamCityId": 21040,
  31:       "Username": "david walker",
  32:       "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
  33:       "Date": "2013-02-20T23:23:59.0000000",
  34:       "Files": [
  35:         "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  36:       ]
  37:     },
  38:     {
  39:       "Commiter": {
  40:         "Email": "ayende@ayende.com",
  41:         "Name": "Ayende Rahien"
  42:       },
  43:       "Version": "5919521286735f50f963824a12bf121cd1df4367",
  44:       "Href": "/app/rest/changes/id:21035",
  45:       "TeamCityId": 21035,
  46:       "Username": "ayende rahien",
  47:       "Comment": "Better disposal\n",
  48:       "Date": "2013-02-26T10:16:45.0000000",
  49:       "Files": [
  50:         "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
  51:       ]
  52:     },
  53:     {
  54:       "Commiter": {
  55:         "Email": "ayende@ayende.com",
  56:         "Name": "Ayende Rahien"
  57:       },
  58:       "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
  59:       "Href": "/app/rest/changes/id:21036",
  60:       "TeamCityId": 21036,
  61:       "Username": "ayende rahien",
  62:       "Comment": "Will ensure that the value is always positive or zero (never negative).\nWhen using numeric calc, will div by 1,024 to get more concentration into buckets.\n",
  63:       "Date": "2013-02-26T10:17:23.0000000",
  64:       "Files": [
  65:         "Raven.Database/Indexing/IndexingUtil.cs"
  66:       ]
  67:     },
  68:     {
  69:       "Commiter": {
  70:         "Email": "ayende@ayende.com",
  71:         "Name": "Ayende Rahien"
  72:       },
  73:       "Version": "7bf51345d39c3993fed5a82eacad6e74b9201601",
  74:       "Href": "/app/rest/changes/id:21037",
  75:       "TeamCityId": 21037,
  76:       "Username": "ayende rahien",
  77:       "Comment": "Fixing a bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
  78:       "Date": "2013-02-26T10:53:01.0000000",
  79:       "Files": [
  80:         "Raven.Database/Indexing/MapReduceIndex.cs",
  81:         "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
  82:         "Raven.Database/Storage/IMappedResultsStorageAction.cs",
  83:         "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
  84:         "Raven.Tests/Issues/RavenDB_784.cs",
  85:         "Raven.Tests/Storage/MappedResults.cs",
  86:         "Raven.Tests/Views/ViewStorage.cs"
  87:       ]
  88:     },
  89:     {
  90:       "Commiter": {
  91:         "Email": "ayende@ayende.com",
  92:         "Name": "Ayende Rahien"
  93:       },
  94:       "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
  95:       "Href": "/app/rest/changes/id:21038",
  96:       "TeamCityId": 21038,
  97:       "Username": "ayende rahien",
  98:       "Comment": "No need for so many repeats\n",
  99:       "Date": "2013-02-26T11:27:49.0000000",
 100:       "Files": [
 101:         "Raven.Tests/Bugs/MultiOutputReduce.cs"
 102:       ]
 103:     },
 104:     {
 105:       "Commiter": {
 106:         "Email": "ayende@ayende.com",
 107:         "Name": "Ayende Rahien"
 108:       },
 109:       "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
 110:       "Href": "/app/rest/changes/id:21041",
 111:       "TeamCityId": 21041,
 112:       "Username": "ayende rahien",
 113:       "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
 114:       "Date": "2013-02-26T11:41:39.0000000",
 115:       "Files": [
 116:         "Raven.Abstractions/Extensions/MetadataExtensions.cs"
 117:       ]
 118:     }
 119:   ],
 120:   "ResolvedIssues": [],
 121:   "Contributors": [
 122:     {
 123:       "FullName": "Ayende Rahien",
 124:       "Email": "ayende@ayende.com",
 125:       "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
 126:     },
 127:     {
 128:       "FullName": "David Walker",
 129:       "Email": "david@davidwalker.org",
 130:       "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
 131:     }
 132:   ],
 133:   "BuildTypeId": "bt8",
 134:   "Href": "/app/rest/builds/id:588",
 135:   "ProjectName": "RavenDB",
 136:   "TeamCityId": 588,
 137:   "ProjectId": "project3",
 138:   "Number": 2509
 139: }

This document is 4.52KB in size. Running this through JSONH gives us the following:

   1: [
   2:     14,
   3:     "BuildName",
   4:     "IsUnstable",
   5:     "Version",
   6:     "PublishedAt",
   7:     "DownloadsIds",
   8:     "Changes",
   9:     "ResolvedIssues",
  10:     "Contributors",
  11:     "BuildTypeId",
  12:     "Href",
  13:     "ProjectName",
  14:     "TeamCityId",
  15:     "ProjectId",
  16:     "Number",
  17:     "RavenDB Unstable v2.5",
  18:     true,
  19:     "2509-Unstable",
  20:     "2013-02-26T12:06:12.0000000",
  21:     [
  22:     ],
  23:     [
  24:         {
  25:             "Commiter": {
  26:                 "Email": "david@davidwalker.org",
  27:                 "Name": "David Walker"
  28:             },
  29:             "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
  30:             "Href": "/app/rest/changes/id:21039",
  31:             "TeamCityId": 21039,
  32:             "Username": "david walker",
  33:             "Comment": "Do not save Has-Api-Key header to metadata\n",
  34:             "Date": "2013-02-20T23:22:43.0000000",
  35:             "Files": [
  36:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  37:             ]
  38:         },
  39:         {
  40:             "Commiter": {
  41:                 "Email": "david@davidwalker.org",
  42:                 "Name": "David Walker"
  43:             },
  44:             "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
  45:             "Href": "/app/rest/changes/id:21040",
  46:             "TeamCityId": 21040,
  47:             "Username": "david walker",
  48:             "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
  49:             "Date": "2013-02-20T23:23:59.0000000",
  50:             "Files": [
  51:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  52:             ]
  53:         },
  54:         {
  55:             "Commiter": {
  56:                 "Email": "ayende@ayende.com",
  57:                 "Name": "Ayende Rahien"
  58:             },
  59:             "Version": "5919521286735f50f963824a12bf121cd1df4367",
  60:             "Href": "/app/rest/changes/id:21035",
  61:             "TeamCityId": 21035,
  62:             "Username": "ayende rahien",
  63:             "Comment": "Better disposal\n",
  64:             "Date": "2013-02-26T10:16:45.0000000",
  65:             "Files": [
  66:                 "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
  67:             ]
  68:         },
  69:         {
  70:             "Commiter": {
  71:                 "Email": "ayende@ayende.com",
  72:                 "Name": "Ayende Rahien"
  73:             },
  74:             "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
  75:             "Href": "/app/rest/changes/id:21036",
  76:             "TeamCityId": "...bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
  77:             "Date": "2013-02-26T10:53:01.0000000",
  78:             "Files": [
  79:                 "Raven.Database/Indexing/MapReduceIndex.cs",
  80:                 "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
  81:                 "Raven.Database/Storage/IMappedResultsStorageAction.cs",
  82:                 "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
  83:                 "Raven.Tests/Issues/RavenDB_784.cs",
  84:                 "Raven.Tests/Storage/MappedResults.cs",
  85:                 "Raven.Tests/Views/ViewStorage.cs"
  86:             ]
  87:         },
  88:         {
  89:             "Commiter": {
  90:                 "Email": "ayende@ayende.com",
  91:                 "Name": "Ayende Rahien"
  92:             },
  93:             "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
  94:             "Href": "/app/rest/changes/id:21038",
  95:             "TeamCityId": 21038,
  96:             "Username": "ayende rahien",
  97:             "Comment": "No need for so many repeats\n",
  98:             "Date": "2013-02-26T11:27:49.0000000",
  99:             "Files": [
 100:                 "Raven.Tests/Bugs/MultiOutputReduce.cs"
 101:             ]
 102:         },
 103:         {
 104:             "Commiter": {
 105:                 "Email": "ayende@ayende.com",
 106:                 "Name": "Ayende Rahien"
 107:             },
 108:             "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
 109:             "Href": "/app/rest/changes/id:21041",
 110:             "TeamCityId": 21041,
 111:             "Username": "ayende rahien",
 112:             "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
 113:             "Date": "2013-02-26T11:41:39.0000000",
 114:             "Files": [
 115:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
 116:             ]
 117:         }
 118:     ],
 119:     [
 120:     ],
 121:     [
 122:         {
 123:             "FullName": "Ayende Rahien",
 124:             "Email": "ayende@ayende.com",
 125:             "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
 126:         },
 127:         {
 128:             "FullName": "David Walker",
 129:             "Email": "david@davidwalker.org",
 130:             "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
 131:         }
 132:     ],
 133:     "bt8",
 134:     "/app/rest/builds/id:588",
 135:     "RavenDB",
 136:     588,
 137:     "project3",
 138:     2509
 139: ]

It reduced the document size to 2.93KB! Awesome, nearly half of the size was gone. Except: This is actually generating utterly unreadable mess. I mean, can you look at this and figure out what the hell is going on.

I thought not. At this point, we might as well use a binary format. I happen to have a zip tool at my disposal, so I checked what would happen if I threw this through that. The end result was a file that was 1.42KB. And I had no more loss of readability than I have with the JSONH stuff.

To be frank, I just don’t get efforts like this. JSON is a text base human readable format. If you lose the human readable portion of the format, you might as well drop directly to binary. It is likely to be more efficient and you don’t lose anything by it.

And if you want to compress your data, it is probably better to use something like a compression tool. HTTP Compression, for example, is practically free, since all servers and clients should be able to consume it now. And any tool that you use should be able to inspect through it. And it is likely to generate much better results on your JSON documents than if you will try a clever format like this.

New interview question

So, I think that I run through enough string sorting and tax calculations. My next interview question is going to be:

Given a finite set of unique numbers, find all the runs in the set. Runs are 1 or more consecutive numbers.

That is, given {1,59,12,43,4,58,5,13,46,3,6}, the output should be: {1}, {3,4,5,6}, {12,13}, {43}, {46},{58,59}.

Note that the size of the set may be very large.

This seems pretty simple, as a question, and should introduce interesting results.

Just to give you some idea, this is a real world problem we run into. We need to find runs in a sequence of freed pages so we can optimize sequential I/O.


Published at

Originally posted at

Comments (67)

Get out of the way, we are coding, Part II

Another thing that is pretty common in development cycles is the notion of who can do more. Hours, that is, rather than work. That is a pretty important distinction.

In general, I appreciate Work much better than Hours. For the simple reason that someone doing 12 hours a day in the office usually do a lot less actual work. Sprints are possible, and we do that sometimes, usually if there is a major production issue or we are gearing up for a release.

Then again, we have just released RavenDB 2.5, and we haven’t had the need for doing that. It was simpler & easiest to push the date by a week than do long hours just to hit an arbitrary point in time. I think that in the last six months, we had people stay in the office past 5 – 6 PM twice.

There are three reasons for that. The two obvious ones are:

  • people doing 12 – 18 hours of work each day turn do crappy stuff, so that is bad for the product.
  • people doing 12 – 18 hours of work each day also tend to have… issues. They burn out, quite rapidly, too. Leaving aside issues such as this one. People crash and burn.

I know that I said it before, but it is important to note. Burn out will do nasty things to you. Leaving aside the proven physical and mental health issues that this cause, it boils down to this. I’ve burned out before, it sucks. Let’s us not do that is a pretty important aspect of what I do on a daily basis. That is why I turned to building products, because being on the road 60% of the time isn’t sustainable, and if it is something that I feel, this is certain for other people who work for Hibernating Rhinos.

But I said that there are three reasons. And the third might be just as important as the others. Hibernating Rhinos was built to be a place where people retire from. This is the ideal, and we are probably talking 40 years from now, considering all factors, but that is the idea. We aren’t a startup, chasing the pot of gold for that one in a hundred chance to make it rich.

And that is why I had to kick people out of the office and tell them to continue working on that issue tomorrow.


Published at

Originally posted at

Comments (12)

The importance of comments

A while ago I got a comment to a post of mine “if you [ in here the commenter is talking to another commenter ] think good code does need comments you are obviously unqualified to be in this conversation”. And I wanted to demonstrate that this is a very false statement.

Let us look at the following piece of code.

   1: private long GetNewLength(long current)
   2: {
   3:     DateTime now = DateTime.UtcNow;
   4:     TimeSpan timeSinceLastIncrease = (now - _lastIncrease);
   5:     if (timeSinceLastIncrease.TotalSeconds < 30)
   6:     {
   7:         _increaseSize = Math.Min(_increaseSize * 2, MaxIncreaseSize);
   8:     }
   9:     else if (timeSinceLastIncrease.TotalMinutes > 2)
  10:     {
  11:         _increaseSize = Math.Max(MinIncreaseSize, _increaseSize / 2);
  12:     }
  13:     _lastIncrease = DateTime.UtcNow;
  14:     current = Math.Max(current, 256 * PageSize);
  15:     var actualIncrease = Math.Min(_increaseSize, current / 4);
  16:     return current + actualIncrease;
  17: }

This piece of code is responsible for decided the new file size when we run out of room in the current space we use.

That is 10 lines of code. And what happens is quite clear, even if I say so myself. But the why for this code is lost. In particular, there is a lot of reasoning behind the actual logic here. You can see that we play with the actual increase size, to make sure that if we increase often, we will reserve more space from the OS. That seems clear enough, but what about lines 14 – 15?

In this case, just before those two lines we have:

   1: // At any rate, we won't do an increase by over 25% of current size, to prevent huge empty spaces
   2: // and the first size we allocate is 256 pages (1MB)
   3: // 
   4: // The reasoning behind this is that we want to make sure that we increase in size very slowly at first
   5: // because users tend to be sensitive to a lot of "wasted" space. 
   6: // We also consider the fact that small increases in small files would probably result in cheaper costs, and as
   7: // the file size increases, we will reserve more & more from the OS.
   8: // This also plays avoids "I added 300 records and the file size is 64MB" problems that occur when we are too
   9: // eager to reserve space

And yes, this isn’t about comments such as // increment by 1. I try writing comments like that when explaining policy or heuristics. Or when I am doing something pretty funky all around for some (hopefully) good reason.

Another example might be this:

   1: // here we try to optimize the amount of work we do, we will only 
   2: // flush the actual dirty pages, and we will do so in sequential order
   3: // ideally, this will save the OS the trouble of actually having to flush the 
   4: // entire range
   5: long start = sortedPagesToFlush[0];
   6: long count = 1;
   7: for (int i = 1; i < sortedPagesToFlush.Count; i++)
   8: {
   9:     var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
  10:     // if the difference between them is not _too_ big, we will just merge it into a single call
  11:     // we are trying to minimize both the size of the range that we flush AND the number of times
  12:     // we call flush, so we need to balance those needs.
  13:     if (difference < 64)
  14:     {
  15:         count += difference;
  16:         continue;
  17:     }
  18:     FlushPages(start, count);
  19:     start = sortedPagesToFlush[i];
  20:     count = 1;
  21: }
  22: FlushPages(start, count);

Again, relatively short code. And really easy to follow. But good luck trying to figure out that this code is responsible for a 100% perf boost, or that you need to balance multiple aspects to get an optimal solution without getting comments.

Now, I think that when people talk about “good code doesn’t need comments”, they think about code like this (all samples taken from a popular OSS project):

   1: var query = _backInStockSubscriptionRepository.Table;
   2: //customer
   3: query = query.Where(biss => biss.CustomerId == customerId);
   4: //store
   5: if (storeId > 0)
   6:     query = query.Where(biss => biss.StoreId == storeId);
   7: //product
   8: query = query.Where(biss => !biss.Product.Deleted);
   9: query = query.OrderByDescending(biss => biss.CreatedOnUtc);

Yes, I can read, thank you. And here is an example of two comments, the first one good, and the second bad:

   1: if (_commonSettings.UseStoredProceduresIfSupported && _dataProvider.StoredProceduredSupported)
   2: {
   3:     //stored procedures are enabled and supported by the database. 
   4:     //It's much faster than the LINQ implementation below 
   6:     #region Use stored procedure
   8:     //pass category identifiers as comma-delimited string
   9:     string commaSeparatedCategoryIds = "";
  10:     if (categoryIds != null)
  11:     {
  12:         for (int i = 0; i < categoryIds.Count; i++)

The first comment explain why. And that is crucial. You should only need to explain the how very rarely, and then, you need a comment to justify why you need the comment. (Performance, maybe?)

Get out of the way, we are coding, Part I

One of the things that always bugged me when working at a corporate environment was the sheer amount of work you had to do just to be able to do your work.  For example, I know of people who purchase our profilers out of their own pocket, because they want to do a better job, but their workplace won’t or can’t authorize that.

A lot of the time this has to do with the way the business allocate money. They may have a developer budget, but no tools budget, etc. Honestly, I think that this is stupid. And one of the things that I try to do is avoid being stupid.

In the end, almost any tool a developer want is going to cost significantly less then the lack of tool.



Sure, it is a 10$ charge, but we do the same for pretty much every tool requested.


Published at

Originally posted at

Comments (19)

Building async Unit of Work with MVC 4

In the RavenDB mailing list, we had a question about how we can combine the standard unit of work pattern of working with RavenDB in MVC applications with async. In particular, the problematic code was:

   1: public class HomeController : Controller
   2: {
   3:     public IAsyncDocumentSession Db { get; set; }
   5:     public async Task<ActionResult> Index()
   6:     {
   7:         var person = new Person {Name = "Khalid Abuhakmeh"};
   8:         await Db.StoreAsync(person);
  10:         return View(person);
  11:     }
  13:     protected override void OnActionExecuting(ActionExecutingContext filterContext)
  14:     {
  15:         Db = MvcApplication.DocumentStore.OpenAsyncSession();
  16:         base.OnActionExecuting(filterContext);
  17:     }
  19:     protected override void OnActionExecuted(ActionExecutedContext filterContext)
  20:     {
  21:         Db.SaveChangesAsync()
  22:             .ContinueWith(x => { });
  24:         base.OnActionExecuted(filterContext);
  25:     }
  27:     public class Person
  28:     {
  29:         public string Id { get; set; } 
  30:         public string Name { get; set; }
  31:     }
  32: }

As you probably noticed, the problem is with line 21. We want to execute the save changes in an async manner, but we don’t want to do that in a way that would block the thread. The current code just assume the happy path, and any error would be ignored. That ain’t right. If we were using Web API, this would be trivially easy, but we aren’t. So let us see what can be done about it.

I created a new MVC 4 application and wrote the following code:


As you can see, I have a break point after the await, which means that when that break point is hit, I’ll be able to see what is responsible for handling async calls in MVC4. When the breakpoint was hit, I looked at the call stack, and saw:


Not very useful, right? But we can fix that:


And now we get:


This is a whole bunch of stuff that doesn’t really help, I am afraid. But then I thought about putting the breakpoint before the await, which gave me:


And this means that I can check the code here. I got the code and started digging. At first I thought that I couldn’t do it, but then I discovered that I could. See, all you have to do is to create you own async action invoker, like so:

   1: public class UnitOfWorkAsyncActionInvoker : AsyncControllerActionInvoker
   2: {
   3:     protected override IAsyncResult BeginInvokeActionMethod(
   4:         ControllerContext controllerContext,
   5:         ActionDescriptor actionDescriptor,
   6:         IDictionary<string, object> parameters, AsyncCallback callback,
   7:         object state)
   8:     {
   9:         return base.BeginInvokeActionMethod(controllerContext, actionDescriptor, parameters,
  10:                                             result => DoSomethingAsyncAfterTask().ContinueWith(task => callback(task)),
  11:                                             state);
  12:     }
  14:     public async Task DoSomethingAsyncAfterTask()
  15:     {
  16:         await Task.Delay(1000);
  17:     }
  18: }

And then register it:

   1: DependencyResolver.SetResolver(type =>
   2:     {
   3:         if (type == typeof (IAsyncActionInvoker))
   4:             return new UnitOfWorkAsyncActionInvoker();
   5:         return null;
   6:     }, type => Enumerable.Empty<object>());

And you are golden.

Note: Except for doing a minimum of F5 in the debugger, I have neither tested nor verified this code. It appears to do what I want it to, and since I am only getting to this because a customer asked about this in the mailing list, that is about as much investigation time that I can dedicate to it.

Memory Mapped Files, File I/O & Performance

I have been testing out several approaches for writing out to files. And I thought that the results are interesting enough to share. In all cases, I was writing a 128Kb buffer of random data to a file with size of 256Mb.

The first thing that I wanted to try was the trivial managed memory map approach:

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
    using (var accessor = mmf.CreateViewAccessor())
        for (int i = 0; i < accessor.Capacity; i += buffer.Length)
            accessor.WriteArray(i, buffer, 0, buffer.Length);

This completed in 3.871 seconds.

Next, I Wanted to see what would happen if I were using direct memory access, and used CopyMemory to do that:

[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
static extern void CopyMemory(byte* dst, byte* src, long size);

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
    using (var accessor = mmf.CreateViewAccessor())
        byte* p = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
        fixed (byte* src = buffer)
            for (int i = 0; i < accessor.Capacity; i += buffer.Length)
                CopyMemory(p + i, src, buffer.Length);

As you can see, this is somewhat more complex, and require unsafe code. But this completed in 2.062 seconds. Nearly twice as fast.

Then I decided to try with raw file IO:

using (var f = new FileStream("test.bin",FileMode.Create))
    for (int i = 0; i < f.Length; i += buffer.Length)
        f.Write(buffer, 0, buffer.Length);

This is about the most trivial code that you can think of, and this completed in about 1.956 seconds. Slightly faster, but within the margin of error (note, in repeated tests, they were consistently very close, and the file I/O was very near).

So, in other words, the accessor code adds a lot of overhead when using Memory Mapped Files.