Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,495
Comments: 51,046
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 259 words

I wonder what is says about what I am doing right now that I really wish that I could have the OS give me more control over virtual memory allocation. At any rate, the point of this post is to point out something quite important to people writing databases, especially databases that make use of virtual memory a lot.

There isn’t quite as much of it as I thought it would be. Oh, on 32 bits the 4GB limits is really hard to swallow. But on 64 bits, the situation is much better, but still constrained.

On Windows, using x64, you are actually limited to merely 8TB of address space. In Windows 8.1, (and I assume, but couldn’t verify, Windows 2012 R2) you can use up to 128TB of virtual address space. With Linux, at least since 2.6.32, and probably earlier, the limit is 128TB per process.

Implications for Voron, by the way, is that the total size of of all databases in a single process can be up to 8TB (probably somewhat less than that, the other stuff will also need memory). Currently the biggest RavenDB database that I am aware of was approaching the 1.5 – 2.0 TB mark last I checked (several months ago), and Esent, our current technology, is limited to 16TB per database.

So it isn’t great news, but it is probably something that I can live with. And at least I can give proper recommendations. In practice, I don’t think that this would be an issue. But that is good to know.

time to read 9 min | 1684 words

Dan Liebster has been kind enough to send me a real world time series database. The data has been sanitized to remove identifying issues, but this is actually real world data, so we can learn a lot more about this.

This is what this looks like:


The first thing that I did was take the code in this post, and try it out for size. I wrote the following:

   1: int i = 0;
   2: using (var parser = new TextFieldParser(@"C:\Users\Ayende\Downloads\TimeSeries.csv"))
   3: {
   4:    parser.HasFieldsEnclosedInQuotes = true;
   5:    parser.Delimiters = new[] {","};
   6:    parser.ReadLine();//ignore headers
   7:    var startNew = Stopwatch.StartNew();
   8:    while (parser.EndOfData == false)
   9:    {
  10:        var fields = parser.ReadFields();
  11:        Debug.Assert(fields != null);
  13:        dts.Add(fields[1], DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture), double.Parse(fields[3]));
  14:        i++;
  15:        if (i == 25*1000)
  16:        {
  17:            break;
  18:        }
  19:        if (i%1000 == 0)
  20:            Console.Write("\r{0,15:#,#}          ", i);
  21:    }
  22:    Console.WriteLine();
  23:    Console.WriteLine(startNew.Elapsed);
  24: }

Note that we are using a separate transaction per line, which means that we are really doing a lot of extra work. But this simulate very well incoming events coming one at a time. We were able to process 25,000 events in 8.3 seconds. At a rate of just over 3 events per millisecond.

Now, note that we have in here the notion of “channels”. From my investigation, it seems clear that some form of separation is actually very common in time series data. We are usually talking about sensors or some such, and we want to track data across different sensors over time. And there is little if any call for working over multiple sensors / channels at the same time.

Because of that, I made a relatively minor change in Voron, that allows it to have an infinite number of separate trees. That means that I can use as many trees as you want, and we can model a channel as a tree in Voron. I also changed things so we instead of doing a single transaction per line, we will do a transaction per 1000 lines. That dropped the time to insert 25,000 lines to 0.8 seconds. Or a full order of magnitude faster.

That done, I inserted the full data set, which is just over 1,096,384 records. That took 36 seconds. In the data set I have, there are 35 channels.

I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That allows doing things like doing averages over time, comparing data, etc.

You can see the code implementing this in the following link.

time to read 1 min | 145 words

I just had what I think is a great idea. As you know by now, we are going to do a full blown conference to celebrate RavenDB 3.0

We are going to do a lot of talking about RavenDB, but I think that a more active experience would also be good. So what I’m planning is to have at least three RavenDB Core Dev Team present and do something. Right now I am torn between doing something that the people showing up want to do and build an application using RavenDB and doing a major new feature live with everyone pitching in. Or any combination of the two.

Thoughts and suggestions are welcome…

time to read 2 min | 308 words

I mentioned that a quick way to setup things for me to think that a candidate is a bad idea is to send us a UI project. This is usually a very strong indication that the candidate doesn’t really have any idea what they are doing. They have been doing Win Forms projects, so they write the code for the task at hand in buttom1_Click event handler. Or inside the Page_Load code in an ASP.Net WebForms application if they are “web developers”.

On the other hand, here is a strong counter example. We had a candidate send in a WinForms project, as I said, that is usually a bad sign. But then I actually looked at his code:


And here is a single method:


This code is several levels of too complex for the task.  It can be easily simplified to a great degree quite easily.

But the key point from this, and the reason that this candidate has an interview later this week, is that this demonstrate a bunch of things:

  • Understanding of separation of concerns.
  • Code that actually does what it is supposed to do.
  • Proper integration between UI & backend code (for example, we are working with large files, so we have progress bars and off-the-UI-thread work).
  • The UI doesn’t look like it was put together by a hiccupping monkey.

I can work with this. There are things that need to be improved for what we do, but there appears to be a SOLID foundation here.

Fail, fail, fail

time to read 9 min | 1737 words

Sometimes, reading candidates answer is just something that I know is going to piss me off.

We have a question that goes something like this (the actual question is much more detailed):

We have a 15TB csv file that contains web log, the entries are sorted by date (since this is how they were entered). Find all the log entries within a given date range. You may not read more than 32 MB.

A candidate replied with an answered that had the following code:

   1: string line = string.Empty;
   2: StreamReader file;
   4: try
   5: {
   6:     file = new StreamReader(filename);
   7: }
   8: catch (FileNotFoundException ex)
   9: {
  10:     Console.WriteLine("The file is not found.");
  11:     Console.ReadLine();
  12:     return;
  13: }
  15: while ((line = file.ReadLine()) != null)
  16: {
  17:     var values = line.Split(',');
  18:     DateTime date = Convert.ToDateTime(values[0]);
  19:     if (date.Date >= startDate && date.Date <= endDate)
  20:         output.Add(line);
  22:     // Results size in MB
  23:     double size = (GetObjectSize(output) / 1024f) / 1024f;
  24:     if (size >= 32)
  25:     {
  26:         Console.WriteLine("Results size exceeded 32MB, the search will stop.");
  27:         break;
  28:     }
  29: }

My reply was:

The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found.

The candidate then fixed his code. It now includes:

   1: var lines = File.ReadLines(filename);

Yep, this is on a 15TB file.

Now I’m going to have to lie down for a bit, I am not feeling so good.

time to read 3 min | 414 words

Sometimes it takes very little time to know that a candidate is going to be pretty horrible. As you can probably guess, the sort of questions we ask tend to be “find me this data in this sort of file”.

Probably the fastest indication is when they send me projects like this:



Now, it is possible that someone skilled will send us real projects like that, but the experience so far has been that this isn’t going to be the case. If you have someone sending a UI project, it usually indicates that they can’t think about it in any other way.

The code they send pretty much justify this concern. Some code snippets from those projects:


Yup, this is the kind of error handling I want to see. Just for fun, if there hasn’t been an error, this function would return a comma separated string of values.

Which make it just slightly worse than:


And then we have this:


I guess someone really like O(N**2) on 15 TB files.

And then there is this:


I guess we have different definitions on what configurable means.

And then there was this person:


Yes, they did send me code inside a PDF file. That was the only way that they could find to send code around, I’m guessing.

time to read 15 min | 2925 words

As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4GB of random numbers, let us find how many times we have the number 1.

Because I wanted to ensure a consistent answer, I wrote:

   1: public static IEnumerable<uint> RandomNumbers()
   2: {
   3:     const long count = 1024 *  1024 * 1024L  * 1;
   4:     var random = new MyRand();
   5:     for (long i = 0; i < count; i++)
   6:     {
   7:         if (i % 1024 == 0)
   8:         {
   9:             yield return 1;
  10:             continue;
  11:         }
  12:         var result = random.NextUInt();
  13:         while (result == 1)
  14:         {
  15:             result = random.NextUInt();
  16:         }
  17:         yield return result;
  18:     }
  19: }
  21: /// <summary>
  22: /// Based on Marsaglia, George. (2003). Xorshift RNGs.
  23: ///  http://www.jstatsoft.org/v08/i14/paper
  24: /// </summary>
  25: public class MyRand
  26: {
  27:     const uint Y = 842502087, Z = 3579807591, W = 273326509;
  28:     uint _x, _y, _z, _w;
  30:     public MyRand()
  31:     {
  32:         _y = Y;
  33:         _z = Z;
  34:         _w = W;
  36:         _x = 1337;
  37:     }
  39:     public uint NextUInt()
  40:     {
  41:         uint t = _x ^ (_x << 11);
  42:         _x = _y; _y = _z; _z = _w;
  43:         return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
  44:     }
  45: }

I am using a custom Rand function because it is significantly faster than System.Random. This generate 4GB of random numbers, at also ensure that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I run the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random number, sorting then compressing the values resulted in about 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for real world scenario, it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes time. Implementing parallel sort cut this down to less than a minute per batch of 26 million.

That let us complete the entire process, but then it halts with the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceed 40 items, but those are still 1 billion * O(log 40) or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallel, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them.

Obviously we can sort each 10 files set in parallel, then sort the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this.

For what is it worth, you can find the code for this here.

time to read 2 min | 250 words

I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here.

In no particular order, I think that there are at least several things that we could do to significantly improve the time it takes to sort. Right now we defined 2 indexes on top of a 1GB file, and it took under 1 minute to complete. That gives us a runtime of about 10 days over a 15TB file.

Well, one of the reason for this performance is that we execute this in a serial fashion, that is, one after another. But we have to completely isolated indexes, there is no reason why we can’t parallelize the work between them.

For that matter, we are buffering in memory up to a certain point, then we sort, then we buffer some more, etc. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and adding to a buffer while we are adding to the buffer.

We wrote to intermediary files, but we wrote to those using plain file I/O. But it is usually a lot more costly to write to disk than to compress and then write to disk.  We are writing sorted data, so it is probably going to compress pretty well.

Those are the things that pop to mind. Can you think of additional options?

time to read 5 min | 921 words

I have completed my own exercise, and while I did wanted to try it with “few allocations” rule, it is interesting to see just how far out there the code is.

This isn’t something that you can really for anything except as a basis to see how badly you are doing. Let us start with the index format. It is just a CSV file with the value and the position in the original file.

That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance.

Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced roughly 80MB index. Assuming the same is true for the larger file, that gives us a 1.2 TB file.

In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are in the order of 10ms or so. So the cost of actually searching the index is close to quarter of a second.  Now, to be fair, there is going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with ere.

Of course, the fun thing about this is that even with a 1.2 TB file, we are still talking about less than 40 seeks (the beauty of O(logN) in action), but that is still pretty expensive. Even worse, this is what happens when we are running on a single query at a time.

What do you think will happen if we are actually running this with multiple threads generating queries. Now we will have a lot of seeks (effective random) that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDD disks. Sure, we get multiple spindles, so we get better performance overall, but still…

Obviously, there are multiple solutions for this issue.

B+Trees solve the problem by packing multiple keys into a single page, so instead of doing a O(log2N), you are usually doing O(log36N) or O(log100N). Consider those fan outs, we will have 6 – 8 seeks to do to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the trees are likely to reside in memory (and probably in L1, if we are speaking about that).

However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry to the file. This means that in our 15 TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2 TB). Note that the numbers actually are probably worse. Because SST (note that when talking about SST I am talking specifically about the leveldb implementation) utilize many forms of compression, it is actually that the file size will be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here.

However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is rely on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking. Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here.

This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format. That make it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do. Another factor that is probably as important is that as I understand it, a lot of the time, SSTs are usually used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb.

It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.

time to read 1 min | 200 words

We have been working on this for a while now, and now I am very proud to announce that we are having the first RavenDB conference in April 7 – 11 in Raleigh, NC. You can register to the conference right now and please do so, since we have limited number of places.

This conference is also going to be the platform for RavenDB 3.0 launch, and we’ll expose a lot of new things about the new stuff there. In additional to that, the conference will also offer many talks by RavenDB experts and users, deep guidance on how to get the best out of it and a lot more.

We are going to have two days days of talks, and a 3 days in depth workshop that will give you everything you need to know about RavenDB 3.0.

As I said, we have a limited amount of places for the conference, so please register soon.

The conference alone (excluding the 3 days workshop) is going to cost you 89$. And just to make sure that you won’t have any price issues, we will give you a 90$ coupon for a RavenDB purchase.


No future posts left, oh my!


  1. Recording (13):
    05 Mar 2024 - Technology & Friends - Oren Eini on the Corax Search Engine
  2. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
  3. Production postmortem (51):
    12 Dec 2023 - The Spawn of Denial of Service
  4. Challenge (74):
    13 Oct 2023 - Fastest node selection metastable error state–answer
  5. Filtering negative numbers, fast (4):
    15 Sep 2023 - Beating memcpy()
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats