time to read 1 min | 59 words

As I mentioned earlier, I gave that talk at USI Events in Paris a short while ago. The recording of this talk is now available; you can find it here:

http://www.usievents.com/en/conferences/12-paris-usi-2013/sessions/1090-why-you-should-never-write-your-own-database

I am afraid that I had very little time to go through the details, which is why I am also doing the blog post series.

time to read 3 min | 515 words

We got the following question on the mailing list:

…say you have reference data in one document like prices, formulas, etc.

When the user is editing or creating a document you would like to ensure that their edits are based on the current info at the time of edit.

So the document being edited or created is dependent on the other reference document and the commit should raise a concurrency error if that reference document has changed in between the time the edit started and it is committed.

Of course the reference document is not changed with the document being edited/created.

This question is interesting enough to be worth a blog post, mostly because of my reaction when I read it.

Basically, the premise is wrong. The user has a document like “config/tax-rates”, and when they create an order, they need to make sure that the tax rate hasn’t changed. I am assuming here that the “time between edit started & committed” is relatively low, counted in milliseconds, and that there isn’t any human involvement in this process.

The reason this is wrong is the following sequence of events (T+1 is time + 1 ms):

Scenario #1:

  • T +0 – read referenced doc
  • T +1 – create order
  • T +2 – config/tax-rates changed
  • T +3 – save order
  • T +4 – concurrency error because the config/tax-rates changed

Scenario #2:

  • T +0 – read referenced doc
  • T +1 – create order
  • T +2 – save order
  • T +3 – save order completed
  • T +4 – config/tax-rates changed

According to the question above, both scenarios are absolutely correct ways for the system to behave. In practice, scenario #2 is wrong (probably).

Udi Dahan has written about this a lot, but I couldn’t find the appropriate article at the time. Update: http://www.udidahan.com/2010/08/31/race-conditions-dont-exist/

If it actually matters enough for scenario #1 to require an error, scenario #2 is also invalid. This isn’t a technical issue, it is a business issue. If you failed to update the tax rate after you were notified, you are responsible for it, right? Except that there is no way that this works. Tax changes usually take place at midnight, and are announced several days / weeks in advance, at a minimum. And even if you were in a total news blackout, you are still responsible for the rate change, even if you never heard about it.

The way this is set up is simply wrong. You don’t try to stop orders from being processed, because you are going to miss the just-completed order. Instead, you are going to run compensation logic to fix up whatever changed.

You can also tell that this is the case by the fact that this is actually pretty hard to do. Documents are independent of one another. The best way to work with such a scenario is to record the time (or the etag) when you made a decision based on a related document; then, at a later point in time, you can make a decision based on whether it changed, how old it is, etc.
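Here is a minimal sketch of that idea in C# (the types and field names here are hypothetical, just for illustration): the order records which version of the reference document the decision was based on, and a compensation process can inspect that later.

    using System;

    // Hypothetical sketch: the order remembers the version of the
    // reference document that the decision was based on.
    public class Order
    {
        public decimal TaxRate;        // the value we used at decision time
        public string TaxRatesEtag;    // etag of config/tax-rates when we read it
        public DateTime DecidedAt;     // when the decision was made
    }

    public static class Compensation
    {
        // Later, a compensation process can ask: did the reference document
        // change after we made the decision, and does the business care?
        public static bool NeedsCompensation(Order order, string currentTaxRatesEtag)
        {
            return order.TaxRatesEtag != currentTaxRatesEtag;
        }
    }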

That gives you the proper frame of mind, instead of trying to count milliseconds.

time to read 2 min | 371 words

I decided to resume my attempt to figure out the LMDB codebase. Okay, I did some more figuring out, and I think I know where I went wrong. The codebase appears to be trying to be small enough to fit into the CPU caches; in order to do that, they made it dense enough to make sure it won’t fit into my mental cache. But I’m trying to work through that.

One of the things that I figured out was that I made a mistake. I kept thinking about databases in the same way you’ll see in RavenDB or SQL Server, as completely separate items. Instead, LMDB does something quite different. It basically uses a single file for everything. This is called the environment, and it can contain named dbs. If you have multiple databases, all of them are going to reside in the same file. The way it works, it creates a file and then maps it into memory. On Windows, it actually allocates the space for the file upfront (which doesn’t appear to be necessary).

But the basic idea is that you need to specify upfront the size of the file you want, and that is the maximum data size you get. This can change, but only if you close the environment and start it up again with a higher value. This also explains why you have to specify the number of databases you want to have per environment. Once you have created the environment, everything else becomes a matter of just managing the pages. In particular, because LMDB only allows a single writer, it doesn’t really have to worry about concurrency issues. It has mapped all the data into memory, and then it is just a matter of creating the appropriate data structure.
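To make that idea concrete in .NET terms (this is not the LMDB API, just the same concept using System.IO.MemoryMappedFiles): a single file is created with a fixed capacity and mapped into memory, and growing past that capacity means tearing down the mapping and re-creating it with a larger size.

    using System.IO;
    using System.IO.MemoryMappedFiles;

    class EnvironmentSketch
    {
        static void Main()
        {
            const long mapSize = 10L * 1024 * 1024; // must be chosen upfront

            // one file backs everything, just like an LMDB environment
            using (var file = MemoryMappedFile.CreateFromFile(
                "env.data", FileMode.OpenOrCreate, null, mapSize))
            using (var accessor = file.CreateViewAccessor())
            {
                // all named dbs would live as pages inside this one mapping
                accessor.Write(0, (byte)42);
            }

            // to grow beyond mapSize, you have to dispose the mapping and
            // re-create it with a larger capacity, mirroring LMDB's
            // close-and-reopen-the-environment requirement
        }
    }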

A lot of the code is dedicated to managing pages of data, and now that I have gone through enough of the codebase to be sure that I sort of have a vague idea about what is going on, I can still say that I think this is way denser than it should be. And I shudder to think what it would take to make any sort of change to this codebase.

time to read 4 min | 659 words

I decided to resume my attempt to figure out the LMDB codebase. The previous attempt ended when my eyes started bleeding a little, but I think that I can get a bit further now.

One thing that I figured out already is that because this is C code, memory management is really painful. I noted previously how good it was of leveldb to use std::string for most allocations, since it frees both the caller and the library from having to worry about the memory. LMDB only shows how important that was. Take a look at the following method:

[code screenshot]

Quick, what do you think it does? As it turns out, what it does is update the key to point to the key that the cursor is currently pointing at. The reason for that is that the library doesn’t want to own the memory for the key (since then it would have to provide a way to free it).

Now, to be fair, this appears to be an internal method, only used by the lib itself, but such things are pretty pervasive in the codebase, and it is really annoying that the code is all in one spot, with no real structure, since I am still trying to figure out some sort of hierarchy for the code so I could get a grip on what is actually going on.

I decided to just go to the start and figure out how it opens a database:

[code screenshot]

It is interesting that you need a transaction to open the database. However, the opened database survives the transaction closing, and from the docs, if you open a db in a transaction, that is the only thing you can do in that transaction.

I think that I just figured out something: the code in mdb.c is actually written in reverse depth order. So the closer to the end of the file a function is, the more visible it is.

Okay, it appears that there is the notion of multiple dbs, but there is also the notion of the main db. It looks like the main db is also used for tracking things for child dbs. In particular, when opening a named db, we have the following:

[code screenshot]

We search for the db information in the main db. As you can see, mdb_cursor_set is actually setting the data on the cursor to the value we passed in. More interesting is what happens if the db is not there. It appears to merely create the db data and set it in the main db. Okay, looking at this function’s docs, it appears that this is just about allocating a database handle, not about opening the actual database.

I am not really sure that I understand the reasoning behind it, but a new database is actually created in mdb_cursor_put. So only after the first time you create a value will the db actually be created.

[code screenshot]

I guess that explains why this function goes on for over 400 lines of code and has no fewer than 10 goto statements(!) and 7 goto labels!

[code screenshot]

I get that this is C, but come on, seriously!

time to read 1 min | 147 words

On June 30th, I had just about 30K subscribers to this blog. With the death of Google Reader, I dropped down to less than 10% of that.

This sucks, but it also means that this is a much smaller audience, which makes it easier to interact with. In particular, I would like to know what sort of blog posts you, as a reader, like.

  • Features, like “see how I can do this cool thing in RavenDB”?
  • Reviews for applications, like “cringe at how horrible the code is”?
  • Challenges, like “can you figure out what is wrong with this code”?
  • Mystery codebase, like “let us read a codebase in a language I don’t know and try to figure it out”?
  • Architecture, like “let us see how we should resolve this problem”?

Or, you know, something else. I would appreciate your feedback.

time to read 14 min | 2667 words

I am pretty sure that it would surprise you, but the World’s Smallest No SQL Database has a well-defined concurrency model. Basically, it uses Last Write Wins. And we are safe from any concurrency issues. That is pretty much it, right?

Well, not really.  In a real world system, you actually need to do a lot more with concurrency. Some obvious examples:

  • Create this value only if it doesn’t exist already.
  • Update this value only if it didn’t change since I last saw it.

Implementing those is actually going to be pretty simple. All you need to do is have a metadata field, version, that is incremented on every change. Here is the change that we need to make:

   public class Data
   {
       public byte[] Value;
       public int Version;
   }

   static readonly ConcurrentDictionary<string, Data> data =
       new ConcurrentDictionary<string, Data>(StringComparer.InvariantCultureIgnoreCase);

   public HttpResponseMessage Get(string key)
   {
       Data value;
       if (data.TryGetValue(key, out value) == false)
           return new HttpResponseMessage(HttpStatusCode.NotFound);

       var response = new HttpResponseMessage
       {
           Content = new ByteArrayContent(value.Value)
       };
       response.Headers.Add("Version", value.Version.ToString());
       return response;
   }

   public void Put(string key, [FromBody]byte[] value, int version)
   {
       // ConcurrencyException is assumed to be defined elsewhere in the app
       data.AddOrUpdate(key, _ =>
       { // create: a version other than 0 means the caller expected an existing value
           if (version != 0)
               throw new ConcurrencyException();
           return new Data { Value = value, Version = 1 };
       }, (_, prev) =>
       { // update: fail if the caller's version is stale
           if (prev.Version != version)
               throw new ConcurrencyException();
           return new Data { Value = value, Version = prev.Version + 1 };
       });
   }

As you can see, it merely doubled the amount of code that we had to write, but it is pretty obvious how it works. RavenDB actually uses something very similar to that for concurrency control for writes, although the RavenDB ETag mechanism is also doing a lot more.

But the version system that we have above is actually not enough; it only handles concurrency control for updates. What about concurrency control for reads?

In particular, how are we going to handle non repeatable reads or phantom reads?

  • Non repeatable reads happen when you read a value, it is then deleted, and when you try to read it again, it is gone.
  • Phantom reads are the other way around: first you tried and didn’t find anything, then the value was created, and when you read again, you find it.

This is actually interesting, because you only care about those for the duration of a single operation / transaction / session. As it stands now, we actually have no way to handle either issue. This can lead to… interesting bugs that only happen under very specific scenarios.

With RavenDB, we actually handle both cases. In a session’s lifetime, you are guaranteed that if you saw a document, you’ll continue to see this document until the end of the session, which deals with the issue of non repeatable reads. Conversely, if you didn’t see a document, you will continue to not see it until the session is closed. This is done for Load; queries are a little bit different.
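A minimal sketch of how a session can provide that guarantee (this is not RavenDB’s actual implementation, just the idea): keep an identity map of everything the session has loaded, including misses, and answer repeat requests from it.

    using System.Collections.Generic;

    public interface IDocumentStore
    {
        byte[] Get(string key); // returns null when the document is missing
    }

    // Hypothetical session: remembers both hits and misses for its lifetime,
    // so reads are repeatable and phantoms never show up mid-session.
    public class Session
    {
        private readonly Dictionary<string, byte[]> seen =
            new Dictionary<string, byte[]>(); // null value means "known missing"

        private readonly IDocumentStore store;

        public Session(IDocumentStore store)
        {
            this.store = store;
        }

        public byte[] Load(string key)
        {
            byte[] value;
            if (seen.TryGetValue(key, out value))
                return value; // same answer for the whole session

            value = store.Get(key); // may be null
            seen[key] = value;      // cache the hit *and* the miss
            return value;
        }
    }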

Another aspect of concurrency that we need to deal with is locking. Sometimes a user has a really good reason to lock a record for a period of time. This is pretty much the only way to handle “checking out” a record in a scenario where you have multiple users wanting to make changes to it concurrently. Locks can be Write or ReadWrite locks. A Write lock allows users to read the data, but prevents them from changing it. When used in practice, this is usually going to immediately fail an operation, rather than make you wait for it.

The reasoning behind immediately failing a write is that if you encountered a record with a write lock, it means that it was either already written to or is about to be written to. In that case, your write is going to be operating on stale data, so we might as well fail you immediately. For ReadWrite locks, the situation is a bit different. In this case, we want to also prevent readers from moving on. This is usually done to ensure a consistent state system wise, and basically, any operation on the record would have to wait until the lock is removed.
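As a sketch (hypothetical, not taken from any particular database), a fail-fast write lock can be as simple as a lock marker that writers check and error on, while readers ignore it entirely:

    using System;
    using System.Collections.Concurrent;

    // Hypothetical fail-fast write locks: a writer that hits someone
    // else's lock errors immediately instead of waiting; readers are
    // never blocked.
    public class WriteLocks
    {
        private readonly ConcurrentDictionary<string, string> locks =
            new ConcurrentDictionary<string, string>();

        public void Lock(string key, string owner)
        {
            if (locks.TryAdd(key, owner) == false)
                throw new InvalidOperationException("Record already locked: " + key);
        }

        public void AssertCanWrite(string key, string owner)
        {
            string currentOwner;
            if (locks.TryGetValue(key, out currentOwner) && currentOwner != owner)
                throw new InvalidOperationException(
                    "Record is write locked, failing fast: " + key);
        }

        public void Unlock(string key)
        {
            string ignored;
            locks.TryRemove(key, out ignored);
        }
    }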

In practice, ReadWrite locks can cause a lot of issues. The moment that you have people start placing locks, you have to deal with lock expiration, manual unlocking, abandoned lock detection, lock maintenance, etc. About the only thing that they are good for is to allow the user to make a set of changes and present them as one unit, if we don’t have better transaction support. But I’ll discuss that in another post. In the meantime, just keep in mind that from my point of view, ReadWrite locks are pretty useless all around.

time to read 2 min | 320 words

The first item to go over with the World’s smallest No SQL database is how it uses memory. The codebase is really simple, which is pretty much the point, but can you think about the implications of the current codebase with regards to memory?

All the data is kept in a ConcurrentDictionary<string, byte[]>, and there is an unspoken assumption: both keys and values are small. What happens if that isn’t the case? Well, in .NET, you are going to get memory fragmentation pretty quickly. That means that on 32 bits, you are probably going to be able to store a lot less than the 2 GB of virtual memory that you get from the OS.

We will ignore that as well and move on to other aspects of the code. What about how the system actually works in production? There is a lot of work that goes into ensuring that the CPU will be able to access memory efficiently. In particular, most dbs use some form of in-memory data structure that takes advantage of the CPU’s L1, L2, and L3 caches. Usually something like a B+Tree. How does the dictionary compare here?

For that matter, what happens when we grow really big? What about when we are paging? How would that work? Can we just go over the keys in the dictionary without having to page through the values as well?

And…

I think that you get my drift by now. And this is literally just me throwing stuff off the top of my head. There are a lot of other things that we need to worry about. For example, if we implement the sensible notion of a buffer pool, we might actually have a buffer that was released by a write but is currently being sent to the client, so we can’t just use buffers, we have to use ref counting as well.
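A minimal sketch of that last point (hypothetical, just to show the shape of the problem): a pooled buffer is only handed back to the pool when the last user, such as an in-flight response, releases it.

    using System;
    using System.Threading;

    // Hypothetical ref-counted pooled buffer: the buffer goes back to the
    // pool only when everyone (the write that released it *and* any
    // response still streaming it to a client) is done with it.
    public class PooledBuffer
    {
        public readonly byte[] Buffer;
        private readonly Action<byte[]> returnToPool;
        private int refCount = 1; // the creator holds the first reference

        public PooledBuffer(byte[] buffer, Action<byte[]> returnToPool)
        {
            Buffer = buffer;
            this.returnToPool = returnToPool;
        }

        public void AddRef()
        {
            Interlocked.Increment(ref refCount);
        }

        public void Release()
        {
            if (Interlocked.Decrement(ref refCount) == 0)
                returnToPool(Buffer); // safe: no one is reading it anymore
        }
    }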

time to read 3 min | 518 words

Continuing in my quest to learn more, I decided to go over the LMDB codebase.

LMDB is:

LMDB is an ultra-fast, ultra-compact key-value data store developed by Symas for the OpenLDAP Project. It uses memory-mapped files, so it has the read performance of a pure in-memory database while still offering the persistence of standard disk-based databases.

It has an interesting feature set and a really small codebase. I am anxious to see how they managed to do so much.

Interestingly, the data model for LMDB is quite different from the usual append-only / transaction log. Instead, it allows only a single concurrent writer and modifies the data in place. There also appear to be a lot of dire warnings regarding long transactions, since they result in increased file size, presumably because the db can’t scavenge pages that are still visible to ongoing transactions.

One thing that I should note already: the code (I am currently about a third of the way through the lmdb.h file) is very well commented, and it explains a lot about what is going on. If the rest of the code is like that, this is going to be really nice to read. Okay, I take it back. The main files seem to be lmdb.h and mdb.c. The first one is 1,300 lines and the second is over 7,500 lines. Admittedly, this is impressive in the sense that pretty much everything is done there, and there are a lot of docs. But damn, I wish this was better organized. Right now my head feels like I need to pop up and take a breath.

I have read ~2,000 lines of code so far, and I haven’t found anything that actually does something. It is all headers, comments, or macros up until now. I skipped ahead to the code, and it is really hard to understand what is going on.

Take this code snippet:

[code screenshot]

It shows a lot of the things that make this codebase hard to work with: an effectively random naming convention (under_score, Pascal_naming, SHOUT_NAMING, etc.), goto being used WAY too much, and lovely variable names such as n2. This is part of a method that goes on for something like 150 – 200 lines. And it includes the following code:

[code screenshot]

Quick, can you tell me how many variables are declared here? Note that they are all local variables. I counted twice, once getting 17 and once getting 18.

That is just too much; I am not going to go any deeper. The leveldb codebase was easy to follow, it had structure. This codebase is just a code dump. It might be a really good codebase for what it needs to do, but I literally can’t follow it.

time to read 1 min | 103 words

So, what is the point of the code that I have shown here: http://ayende.com/blog/162691/worlds-smallest-no-sql-database

The point was to show where you start, how easy it is, and all the details that you need to handle along the way. I decided that it would probably make for an interesting blog series to go over the sort of things that this example exposes.

In particular, I want to talk about:

  • I/O
  • Memory
  • Concurrency
  • ACID
  • Transactions
  • Searching
  • Scale up
  • Scale out
  • Aggregation
  • High availability
  • Backups
  • Monitoring

I might have a few more items along the way, but those are probably the most important ones.

time to read 2 min | 263 words

Yesterday I showed the following profiler results.

[profiler screenshot]

The first place someone would look at optimizing is the CaseInsensitiveComparator.Compare call, right? That is clearly taking a really long time.

In fact, we re-wrote that method to use pointer magic in order to get it to work faster. But that isn’t the issue. Look at the numbers. Compare is called 68 million times. That means that under the profiler, it is still completing an operation in 0.000073 ms. I am not even sure what time unit that is.

However, we can see that the major caller of Compare was FindGreaterOrEqual, right? And that is called only 11 thousand times. What does this tell you?

That tells me we have a Schlemiel the Painter’s algorithm. So I double checked, and SkipList insert perf should be O(log N) on average. But what we were seeing is O(N) costs. Indeed, it appeared that we had a bug in which our skip list implementation would never do skips, essentially turning it into a sorted linked list, with insert perf of O(N). Once we fixed that bug… well, things got better, a lot better.
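For context, here is a sketch (not our actual implementation) of the piece a skip list needs in order to actually skip: randomly promoting inserts to higher levels. If the level is always 1, every node lives on the bottom level and you get exactly the sorted linked list we accidentally built.

    using System;

    public static class SkipListLevels
    {
        private static readonly Random rnd = new Random();

        // A skip list only gets O(log N) inserts if nodes are randomly
        // promoted; with probability 1/2 per level, roughly half the
        // nodes sit on level 2, a quarter on level 3, and so on.
        public static int NewLevel(int maxLevel)
        {
            int level = 1;
            while (level < maxLevel && rnd.Next(2) == 0)
                level++;
            return level;
        }
    }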

[profiler screenshot]

As you can see, the cost of this went down really hard. And now we can focus on other stuff.
