Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 15 min | 2925 words

As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4GB of random numbers, let us find how many times we have the number 1.

Because I wanted to ensure a consistent answer, I wrote:

    public static IEnumerable<uint> RandomNumbers()
    {
        const long count = 1024 * 1024 * 1024L * 1;
        var random = new MyRand();
        for (long i = 0; i < count; i++)
        {
            if (i % 1024 == 0)
            {
                yield return 1;
                continue;
            }
            var result = random.NextUInt();
            while (result == 1)
            {
                result = random.NextUInt();
            }
            yield return result;
        }
    }

    /// <summary>
    /// Based on Marsaglia, George. (2003). Xorshift RNGs.
    /// http://www.jstatsoft.org/v08/i14/paper
    /// </summary>
    public class MyRand
    {
        const uint Y = 842502087, Z = 3579807591, W = 273326509;
        uint _x, _y, _z, _w;

        public MyRand()
        {
            _y = Y;
            _z = Z;
            _w = W;
            _x = 1337;
        }

        public uint NextUInt()
        {
            uint t = _x ^ (_x << 11);
            _x = _y; _y = _z; _z = _w;
            return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
        }
    }

I am using a custom random number generator because it is significantly faster than System.Random. This generates 4GB of random numbers, and also ensures that we get exactly 1,048,576 instances of the number 1. Generating this in an empty loop takes about 30 seconds on my machine.
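The speed claim is easy to sanity check. Below is a rough benchmark sketch (not from the original post) that inlines the xorshift state transition from MyRand and times it against System.Random; the iteration count is an assumption, scaled down from the full billion so it finishes quickly.

```csharp
using System;
using System.Diagnostics;

public static class RandBenchmark
{
    // Times the xorshift generator (state transition copied from MyRand above)
    // against System.Random over the same number of iterations.
    public static (TimeSpan xorshift, TimeSpan systemRandom) Compare(long iterations)
    {
        uint x = 1337, y = 842502087, z = 3579807591, w = 273326509;
        uint sink = 0;

        var sw = Stopwatch.StartNew();
        for (long i = 0; i < iterations; i++)
        {
            uint t = x ^ (x << 11);
            x = y; y = z; z = w;
            w = (w ^ (w >> 19)) ^ (t ^ (t >> 8));
            sink ^= w;
        }
        var xorshiftTime = sw.Elapsed;

        var random = new Random(1337);
        sw.Restart();
        for (long i = 0; i < iterations; i++)
            sink ^= (uint)random.Next();
        var systemTime = sw.Elapsed;

        GC.KeepAlive(sink); // keep the loops from being optimized away
        return (xorshiftTime, systemTime);
    }
}
```

On my understanding, the xorshift loop is a handful of shifts and xors per number, while System.Random goes through a virtual call and a subtractive generator, which is where the difference comes from.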

For fun, I ran the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256MB in size, we flush it every 128MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random numbers, sorting and then compressing the values resulted in about a 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for a real world scenario it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes time. Implementing a parallel sort cut this down to less than a minute per batch of 26 million.
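The post doesn’t show the parallel sort itself; here is one minimal way to do it, assuming a split-sort-merge approach: sort the two halves of the batch concurrently with Parallel.Invoke, then do a standard two-way merge. A sketch, not the actual implementation.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelSorter
{
    // Sorts the two halves of the array on separate cores, then merges them.
    public static uint[] Sort(uint[] items)
    {
        int mid = items.Length / 2;
        var left = new uint[mid];
        var right = new uint[items.Length - mid];
        Array.Copy(items, 0, left, 0, left.Length);
        Array.Copy(items, mid, right, 0, right.Length);

        Parallel.Invoke(
            () => Array.Sort(left),
            () => Array.Sort(right));

        // standard two-way merge of the sorted halves
        var result = new uint[items.Length];
        int i = 0, j = 0, k = 0;
        while (i < left.Length && j < right.Length)
            result[k++] = left[i] <= right[j] ? left[i++] : right[j++];
        while (i < left.Length) result[k++] = left[i++];
        while (j < right.Length) result[k++] = right[j++];
        return result;
    }
}
```

Splitting into more than two partitions (one per core) follows the same pattern, at the cost of a k-way merge at the end.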

That let us complete the entire process, but then it halts at the merge. The reason for that is that we push all the values through a heap, and there are 1 billion of them. Now, the heap never exceeds 40 items, but those are still 1 billion * O(log 40), or about 5.4 billion comparisons, that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallelize this, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them.
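The heap-based merge described above can be sketched like this, using PriorityQueue&lt;TElement, TPriority&gt; (a .NET 6+ type, which is an assumption about the runtime). The heap holds one cursor per sorted input, so it never grows past the number of files.

```csharp
using System;
using System.Collections.Generic;

public static class KWayMerge
{
    // Merges any number of sorted sequences into one sorted sequence.
    // The priority queue holds one enumerator per input, keyed on the
    // current head value, so its size stays at the number of inputs.
    public static IEnumerable<uint> Merge(IEnumerable<IEnumerable<uint>> sortedInputs)
    {
        var heap = new PriorityQueue<IEnumerator<uint>, uint>();
        foreach (var input in sortedInputs)
        {
            var it = input.GetEnumerator();
            if (it.MoveNext())
                heap.Enqueue(it, it.Current);
        }
        while (heap.Count > 0)
        {
            var it = heap.Dequeue();
            yield return it.Current;
            if (it.MoveNext())
                heap.Enqueue(it, it.Current); // re-queue with the new head value
        }
    }
}
```

Each output element costs one dequeue and one enqueue, both O(log k) for k inputs, which is exactly where the sequential 1 billion * O(log 40) comparisons come from.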

Obviously we can merge each set of 10 files in parallel, then merge the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this.

For what it is worth, you can find the code for this here.

time to read 2 min | 250 words

I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here.

In no particular order, I think that there are several things that we could do to significantly improve the time it takes to sort. Right now we defined 2 indexes on top of a 1GB file, and it took under 1 minute to complete. That gives us a runtime of about 10 days over a 15TB file.

Well, one of the reasons for this performance is that we execute things in a serial fashion, one after another. But we have two completely isolated indexes; there is no reason why we can’t parallelize the work between them.
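A minimal sketch of that idea; BuildAll and the delegates it takes are hypothetical names, standing in for whatever produces a single index.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelIndexing
{
    // Runs each independent index build concurrently and blocks until
    // all of them complete. Safe only because the indexes share no state.
    public static void BuildAll(params Action[] indexBuilds) =>
        Parallel.Invoke(indexBuilds);
}
```

With two indexes this roughly halves the wall-clock time, assuming the disk can keep both builds fed.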

For that matter, we buffer in memory up to a certain point, then we sort, then we buffer some more, and so on. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and filling a new buffer while the previous one is being sorted.
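One way to sketch that pipeline, under the assumption of at most one background sort in flight at a time; PipelinedSorter and the sortAndFlush callback are hypothetical names, not code from the post.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class PipelinedSorter
{
    private readonly int _bufferSize;
    private readonly Action<List<uint>> _sortAndFlush; // the sort + write-to-temp-file step
    private List<uint> _current;
    private Task _pending = Task.CompletedTask;

    public PipelinedSorter(int bufferSize, Action<List<uint>> sortAndFlush)
    {
        _bufferSize = bufferSize;
        _sortAndFlush = sortAndFlush;
        _current = new List<uint>(bufferSize);
    }

    public void Add(uint value)
    {
        _current.Add(value);
        if (_current.Count < _bufferSize)
            return;
        // hand the full buffer to a background task and keep parsing
        var full = _current;
        _current = new List<uint>(_bufferSize);
        _pending.Wait(); // keep at most one sort in flight
        _pending = Task.Run(() => _sortAndFlush(full));
    }

    public void Complete()
    {
        _pending.Wait();
        if (_current.Count > 0)
            _sortAndFlush(_current); // flush the final partial buffer
    }
}
```

The parsing thread only stalls when it fills a second buffer before the previous sort finishes, so sort time is hidden behind I/O and parsing.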

We write to intermediary files using plain file I/O. But it is usually a lot more costly to write raw data to disk than to compress it and then write it to disk. We are writing sorted data, so it is probably going to compress pretty well.
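A sketch of what that could look like with GZipStream from the BCL wrapped around the intermediary file writes; the run format here (one line per entry) is a simplification of the actual temp file format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class CompressedRuns
{
    // Writes a sorted run through a gzip stream; sorted entries share long
    // prefixes, which is what makes them compress well.
    public static void WriteRun(Stream destination, IEnumerable<string> sortedLines)
    {
        using var gzip = new GZipStream(destination, CompressionLevel.Fastest, leaveOpen: true);
        using var writer = new StreamWriter(gzip);
        foreach (var line in sortedLines)
            writer.WriteLine(line);
    }

    // Reads the run back, decompressing on the fly during the merge phase.
    public static IEnumerable<string> ReadRun(Stream source)
    {
        using var gzip = new GZipStream(source, CompressionMode.Decompress, leaveOpen: true);
        using var reader = new StreamReader(gzip);
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }
}
```

CompressionLevel.Fastest is a deliberate choice here: the point is to reduce bytes written, not to maximize the compression ratio.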

Those are the things that pop to mind. Can you think of additional options?

time to read 5 min | 921 words

I have completed my own exercise, and while I did want to try it with a “few allocations” rule, it is interesting to see just how far out there the code is.

This isn’t something that you can really use for anything except as a baseline to see how badly you are doing. Let us start with the index format. It is just a CSV file with the value and the position in the original file.

That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance.

Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced a roughly 80MB index. Assuming the same is true for the larger file, that gives us a 1.2TB index file.
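The extrapolation above, spelled out as arithmetic (IndexSizeEstimate is just a name for this calculation):

```csharp
public static class IndexSizeEstimate
{
    // Scales an observed index-size ratio (MB of index per GB of data)
    // up to a data set measured in TB, returning the index size in TB.
    public static double IndexSizeInTb(double indexMbPerGb, double dataTb)
    {
        double dataGb = dataTb * 1024;           // TB -> GB
        double indexMb = indexMbPerGb * dataGb;  // MB of index
        return indexMb / (1024.0 * 1024.0);      // MB -> TB
    }
}
```

IndexSizeInTb(80, 15) comes out to about 1.17 TB, which matches the rough 1.2TB figure above.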

In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are in the order of 10ms or so. So the cost of actually searching the index is close to a quarter of a second. Now, to be fair, there are going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with here.

Of course, the fun thing about this is that even with a 1.2TB file, we are still talking about less than 40 seeks (the beauty of O(log N) in action), but that is still pretty expensive. Even worse, this is what happens when we are running a single query at a time.

What do you think will happen if we are actually running this with multiple threads generating queries? Now we will have a lot of (effectively random) seeks that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDD disks. Sure, we get multiple spindles, so we get better performance overall, but still…

Obviously, there are multiple solutions for this issue.

B+Trees solve the problem by packing multiple keys into a single page, so instead of doing O(log2 N), you are usually doing O(log36 N) or O(log100 N). Considering those fanouts, we will have 6 – 8 seeks to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the tree are likely to reside in memory (and probably in L1, if we are speaking about that).
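The seek counts follow directly from the fanout; a one-liner makes the relationship explicit:

```csharp
using System;

public static class SeekMath
{
    // Number of probes for a search tree with the given fanout:
    // log base `fanout` of the entry count (change-of-base formula).
    public static double Probes(double entries, double fanout) =>
        Math.Log(entries) / Math.Log(fanout);
}
```

For the 64.5 billion entries discussed below, Probes(64.5e9, 2) is about 36, matching the “less than 40 seeks” for plain binary search, while Probes(64.5e9, 100) is about 5.4, matching the 6 – 8 seeks for a wide B+Tree.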

However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry in the file. This means that in our 15TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2TB). Note that those numbers are probably overstated. Because SST (note that when talking about SST I am talking specifically about the leveldb implementation) utilizes many forms of compression, the actual file size is likely to be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here.
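The “every 16th entry” scheme can be sketched as a sparse index: binary search the sampled keys, then scan at most 16 entries from the returned offset. This only loosely mimics the leveldb layout; SparseIndex and its members are names invented for the sketch.

```csharp
using System;
using System.Collections.Generic;

public class SparseIndex
{
    public const int Interval = 16;
    private readonly List<string> _keys = new List<string>();  // every 16th key
    private readonly List<long> _offsets = new List<long>();   // its byte position

    // Called while writing the sorted file; samples one entry per Interval.
    public void Add(string key, long offset, int entryIndex)
    {
        if (entryIndex % Interval == 0)
        {
            _keys.Add(key);
            _offsets.Add(offset);
        }
    }

    // Returns the file offset from which a sequential scan of at most
    // Interval entries is guaranteed to reach `key`, if it exists.
    public long FindScanStart(string key)
    {
        int idx = _keys.BinarySearch(key, StringComparer.Ordinal);
        if (idx < 0)
            idx = ~idx - 1; // block whose first key precedes `key`
        return idx < 0 ? 0 : _offsets[idx];
    }
}
```

The sampled index is 1/16th the number of entries, which is where the ~15GB of pointers for 64.5 billion entries comes from.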

However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is relying on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking. Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here.

This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format, which makes it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably be several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do. Another factor that is probably as important is that, as I understand it, a lot of the time SSTs are used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb.

It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.

time to read 29 min | 5758 words

The index I created for the exercise is just a text file, sorted by the indexed key. When a human is doing a search, that makes it very easy to work with. It is much easier than trying to work with a binary file, and it also helps debugging.

However, it does make running a binary search on the data a bit harder, mostly because there isn’t a nice way to say “give me the Nth line”. Instead, I wrote the following:

    public void SetPositionToLineAt(long position)
    {
        // now we need to go back until we either get to the start of the file
        // or find a \n character
        const int bufferSize = 128;
        _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);

        var charCount = _encoding.GetMaxCharCount(bufferSize);
        if (charCount > _charBuf.Length)
            _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];

        while (true)
        {
            _input.Position = position - (position < bufferSize ? 0 : bufferSize);
            var read = ReadToBuffer(bufferSize);
            var buffer = _buffer.GetBuffer();
            var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
            for (int i = chars - 1; i >= 0; i--)
            {
                if (_charBuf[i] == '\n')
                {
                    _input.Position = position - (bufferSize - i) + 1;
                    return;
                }
            }
            position -= bufferSize;
            if (position < 0)
            {
                _input.Position = 0;
                return;
            }
        }
    }

This code starts at an arbitrary byte position, and goes backward until it finds the newline character ‘\n’. This gives me the ability to go to a rough location and get line oriented input.

Once I have that, the rest is pretty easy. Here is the binary search:

    while (lo <= hi)
    {
        position = (lo + hi) / 2;
        _reader.SetPositionToLineAt(position);

        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // couldn't find anything

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);

        if (match == 0)
        {
            break;
        }
        if (match > 0)
            lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
        else
            hi = position - 1;
    }

    if (match != 0)
    {
        // no match
        yield break;
    }

The idea is that this positions us at the location in the index that has an entry with a value equal to what we are searching on.

We then write the following to actually get the data from the actual data file:

    // we have a match, now we need to return all the matches
    _reader.SetPositionToLineAt(position);

    while (true)
    {
        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // end of file

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);
        if (match != 0)
            yield break; // out of the valid range we need

        _buffer.SetLength(0);
        _data.Position = Utils.ToInt64(_reader.Current.Values[1]);

        while (true)
        {
            var b = _data.ReadByte();
            if (b == -1)
                break;
            if (b == '\n')
                break;
            _buffer.WriteByte((byte)b);
        }

        yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
    }

As you can see, we are moving forward in the index file, reading one line at a time. Then we take the second value, the position of the relevant line in the data file, and read that.

We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems. I’ll discuss that in my next post.

time to read 8 min | 1484 words

The interesting thing about this problem is that I was very careful in how I phrased things. I said what I wanted to happen, but didn’t specify what needs to be done. That was quite intentional. For that matter, the fact that I am posting about what is going to be our acceptance criteria is also intentional. The idea is to have a non trivial task, but something that should be very well understood and easy to research. It also means that the candidate needs to be able to write some non trivial code. And I can tell a lot about a dev from such a project. At the same time, this is a very self contained scenario. The idea is that this is something that you can do in a short amount of time.

The reason that this is an interesting exercise is that it is actually at least two totally different but related problems. First, in a 15TB file, we obviously cannot rely on just scanning the entire file. That means that we have to have an index. And that means that we have to build it. Interestingly enough, an index is a sorted structure, which means that we have to solve the problem of sorting more data than can fit in main memory.

The second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that. Note that I am not really interested in actual efficiencies for this particular scenario. I care about being able to see the code, see that it works, etc. My solution, for example, is a single threaded system that makes no attempt at parallelism or I/O optimizations. It clocks at over 1 GB / minute and the memory consumption is at under 150MB. Queries for a unique value return the result in 0.0004 seconds. Queries that returned 153K results completed in about 2 seconds.

When increasing the used memory to about 650MB, there isn’t really any difference in performance, which surprised me a bit.

Then again, the entire code is probably highly inefficient. But that is good enough for now.

The process is kicked off with indexing:

    var options = new DirectoryExternalStorageOptions("/path/to/index/files");
    var input = File.OpenRead(@"/path/to/data/Crimes_-_2001_to_present.csv");
    var sorter = new ExternalSorter(input, options, new int[]
    {
        1, // case number
        4, // IUCR
    });
    sorter.Sort();

I am actually using the Chicago crime data for this. This is a 1GB file that I downloaded from the Chicago city portal in CSV format. This is what the data looks like:


The ExternalSorter will read and parse the file, accumulating the parsed rows in a buffer. When the buffer gets to a certain size (usually about 64MB of source data), it will sort the values in memory and output them into temporary files.

Those files look like this:


Initially, I tried to do that with binary data, but it turned out to be too complex to be worthwhile, and writing this in a human readable format made it much easier to work with. The format is pretty simple: you have the value on the left, and on the right you have the start position of the row for this value.

We generate about 17 such temporary files for the 1GB file, one temporary file per 64MB of the original file. This lets us keep our actual memory consumption very low, but for larger data sets, we’ll probably want to do the sort every 1GB or maybe more. Our test machine has 16GB of RAM, so doing a sort and outputting a temporary file every 8GB can be a good way to handle things. But that is beside the point.

The end result is that we have multiple sorted files, but they aren’t sequential. In other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. We need to merge all of them together. Luckily, this is easy enough to do. We basically have a heap that we feed entries from the files into. And that pretty much takes care of this. See merge sort if you want more details about this.

The end result of merging all of those files is… another file, just like them, that contains all of the data sorted. Then it is time to actually handle the other issue, actually searching the data.

We can do that using a simple binary search, with the caveat that because this is a text file, with no fixed size records or pages, it is actually a bit hard to figure out where to start reading.

In effect, what I am doing is selecting an arbitrary byte position, then walking backward until I find a ‘\n’. Once I find the newline character, I can read the full line, check the value, and decide where to look next. Assuming that I found my value, I can then go to the byte position of the value in the original file and read the original line, giving it to the user.

Assuming an indexing rate of 1 GB / minute, a 15TB file would take about 10 days to index. There are ways around that as well, but I’ll touch on them in my next post.

What all of this did was bring home just how much we usually don’t have to worry about such things. But I consider this research time well spent; we’ll be using this in the future.

