Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

time to read 3 min | 488 words

Even for a first attempt, the World’s Smallest No SQL Database is actually pretty good. We got 3 of the 4 ACID properties.

The DB is:

  • Atomic, since a change will either go in or be rejected.
  • Consistent, since you can’t see changes halfway through.
  • Isolated, since one change can’t modify/view another.

It isn’t durable, because everything is in memory.

It even has transactions, in the sense that a change goes in or not in an atomic fashion.

But, I think you’ll agree, this is very much a poor man’s solution. And those guarantees only apply as long as you work with a single item. In practice, you would usually want to do a lot more than that.

The basic property you want is the ability to do multi item transactions, so you can modify two items at the same time, and they either both go in or not at all. And that is where it gets really complex, because at that point we need to build our own way to handle atomic updates to multiple values, keep them isolated, and handle all of this consistently.

There isn’t really a simple way to handle that, especially if we want to handle concurrent transactions. You suddenly have to make a lot of decisions. For example, what happens if you have two concurrent transactions trying to update the same value? Is one of them rejected? And if so, at what point? At the initial write? At commit time?
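One possible answer, sketched below (this is not the database’s code, just a minimal illustration; the Data and ConcurrencyException types mirror the ones used in the concurrency post in this series), is to reject the conflicting transaction at commit time, after checking every expected version and before applying any write:

    using System;
    using System.Collections.Generic;

    public class Data
    {
        public byte[] Value;
        public int Version;
    }

    public class ConcurrencyException : Exception { }

    public static class TinyTransactions
    {
        static readonly object commitLock = new object();
        static readonly Dictionary<string, Data> data =
            new Dictionary<string, Data>(StringComparer.InvariantCultureIgnoreCase);

        // Commit a batch of writes atomically: all version checks happen before any
        // write is applied, and a single lock serializes commits so the check + apply
        // pair cannot interleave with another transaction. (Reads are omitted here;
        // they would need to coordinate with the same lock.)
        public static void Commit(Dictionary<string, byte[]> writes,
                                  Dictionary<string, int> expectedVersions)
        {
            lock (commitLock)
            {
                foreach (var expected in expectedVersions)
                {
                    Data existing;
                    var current = data.TryGetValue(expected.Key, out existing) ? existing.Version : 0;
                    if (current != expected.Value)
                        throw new ConcurrencyException(); // rejected at commit time, nothing was changed
                }

                foreach (var write in writes)
                {
                    Data existing;
                    var version = data.TryGetValue(write.Key, out existing) ? existing.Version : 0;
                    data[write.Key] = new Data { Value = write.Value, Version = version + 1 };
                }
            }
        }
    }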

And that is just the start.

  • Do you allow a transaction to live across multiple requests?
  • How do you handle transaction expiration?
  • Do you need to have a transaction span more than one node? How do you handle that?

And we haven’t even talked about making transactions durable. There is no point, since we don’t do any persistence. But we will touch on persistence in a future post, so let’s talk about durability right now.

In general, there are two major ways to handle transactions: either an append only model or a transaction log. In both cases, concurrent transactions make it… interesting to figure out how to properly write to the log. And the moment you pick either option, you have to build in some form of a recovery tool for system or db crashes. You had better also love fsync, since it is a very important tool in making sure that you can actually get durable transactions.
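To make the transaction log option a bit more concrete, here is a hedged sketch (illustrative only, not code from this series) of appending a change as a single length-prefixed record and forcing it to disk, via .NET’s FileStream.Flush(flushToDisk: true), before acknowledging the commit:

    using System;
    using System.IO;

    public class TransactionLog : IDisposable
    {
        private readonly FileStream log;

        public TransactionLog(string path)
        {
            log = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read);
        }

        public void Append(byte[] record)
        {
            var length = BitConverter.GetBytes(record.Length);
            log.Write(length, 0, length.Length);   // length prefix lets recovery find record boundaries
            log.Write(record, 0, record.Length);
            log.Flush(true);                       // force the bytes to the physical disk (fsync)
            // only after this point is it safe to tell the caller the transaction is durable;
            // recovery would replay complete records and discard a torn tail
        }

        public void Dispose()
        {
            log.Dispose();
        }
    }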

Oh, and of course, I forgot, if we allow long running transactions…

  • Who can read the yet uncommitted value? Everyone? Just the current transaction? Can you explicitly ask to see uncommitted values?
  • How do you allow it to be read?
  • Is it available for queries?

This is usually the most touchy part of the system, since it is so critical. It is also something that you have to take into account in pretty much the entire codebase.
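One common answer, sketched below (illustrative only, and it deliberately ignores the multi item atomicity issues discussed above), is to keep uncommitted values in a per transaction write set: the transaction reads its own writes, and everyone else keeps seeing the last committed value until the commit happens.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    public class WriteSetTransaction
    {
        private readonly ConcurrentDictionary<string, byte[]> committed;   // the shared store
        private readonly Dictionary<string, byte[]> writeSet =
            new Dictionary<string, byte[]>(StringComparer.InvariantCultureIgnoreCase);

        public WriteSetTransaction(ConcurrentDictionary<string, byte[]> committed)
        {
            this.committed = committed;
        }

        public void Put(string key, byte[] value)
        {
            writeSet[key] = value;                  // visible only to this transaction for now
        }

        public byte[] Get(string key)
        {
            byte[] value;
            if (writeSet.TryGetValue(key, out value))
                return value;                       // read your own uncommitted writes
            committed.TryGetValue(key, out value);  // otherwise the last committed value (null if missing)
            return value;
        }

        public void Commit()
        {
            foreach (var kvp in writeSet)
                committed[kvp.Key] = kvp.Value;     // publish the changes for everyone else
        }
    }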

time to read 1 min | 59 words

As I mentioned earlier, I actually gave that talk at USI Events in Paris a short while ago. The recording of this talk is now available, you can find it here:

http://www.usievents.com/en/conferences/12-paris-usi-2013/sessions/1090-why-you-should-never-write-your-own-database

I am afraid that I had a very short time to go through the details, which is why I am also doing the blog post series.

time to read 2 min | 371 words

I decided to resume my attempt to figure out the LMDB codebase. Okay, I did some more figuring out and I think that I know where I went wrong. The codebase appears to be trying to be small enough to fit into the CPU caches; in order to do that, they decided to make it hard enough that it won’t fit into my cache. But I’m trying to work through that.

One of the things that I figured out was that I made a mistake. I kept thinking about databases in the same way you’ll see in RavenDB or SQL Server, as completely separate items. Instead, LMDB does something quite different. It basically uses a single file for everything. This is called the environment, and that can contain named dbs. If you have multiple databases, all of them are going to reside in the same file. The way it works, it creates a file and then maps it into memory. On Windows, it actually allocates the space for the file upfront (which doesn’t appear to be necessary).

But the basic idea is that you need to specify upfront the size of the file you want, and that is the maximum data size you get. This can change, but only if you close the environment and start it up again with a higher value. This also explains why you have to specify the number of databases you want to have per environment. Once you have created the environment, everything else becomes an issue of just managing the pages. In particular, because LMDB only allows a single writer, it doesn’t really have to worry about concurrency issues. It has mapped all the data into memory, and then it is just a matter of creating the appropriate data structures.
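This is not LMDB’s API, but the analogous .NET primitive makes the model easy to picture: you decide the maximum size up front, map the whole file into memory, and from then on every read and write is just an offset into that mapping, i.e. page management (a hedged sketch; the file name and sizes are made up):

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    class EnvironmentSketch
    {
        static void Main()
        {
            const long maxSize = 64 * 1024 * 1024;  // fixed upper bound, decided up front

            using (var file = MemoryMappedFile.CreateFromFile(
                "data.env", FileMode.OpenOrCreate, null, maxSize))
            using (var accessor = file.CreateViewAccessor())
            {
                accessor.Write(0, 42L);             // "page management" is just offset arithmetic
                Console.WriteLine(accessor.ReadInt64(0));
            }

            // growing past maxSize means closing the mapping and re-creating it with a
            // larger capacity, which matches the restriction on the environment size
        }
    }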

A lot of the code is dedicated to managing pages of data, and now that I have gone through enough of the codebase to be sure that I sort of have a vague idea about what is going on, I can still say that I think that this is way denser than it should be. And I shudder to think what it would take to make any sort of change to this codebase.

time to read 4 min | 659 words

I decided to resume my attempt to figure out the LMDB codebase. The previous attempt ended when my eyes started bleeding a little, but I think that I can get a bit further now.

One thing that I figured out already is that because this is C code, memory management is really painful. I noted previously how good it was of leveldb to use std::string for most allocations, since it frees both caller and lib from having to worry about the memory. LMDB only shows how important that was. Take a look at the following method:

[image: code screenshot]

Quick, what do you think it does? As it turns out, what it does is update the key to point to the key that the cursor is currently pointing at. The reason for that is that the library doesn’t want to own the memory for the key (since then it would have to provide a way to free it).

Now, to be fair, this appears to be an internal method, only used by the lib itself, but such things are pretty pervasive in the codebase. And it is really annoying that the code is all in one spot, with no real structure, since I am still trying to figure out some sort of hierarchy for the code so I can get a grip on what is actually going on.

I decided to just go to the start and figure out how it opens a database:

[image: code screenshot]

It is interesting that you need a transaction to open the database. However, the opened database survives the transaction closing, and from the docs, if you open a db in a transaction, that is the only thing you can do in that transaction.

I think that I just figured out something: the code in mdb.c is actually written in reverse depth order. So the closer to the end of the file a function is, the more visible it is.

Okay, it appears that there is the notion of multiple dbs, but there is also the notion of the main db. It looks like the main db is also used for tracking things for child dbs. In particular, when opening a named db, we have the following:

[image: code screenshot]

We search for the db information in the main db. As you can see, mdb_cursor_set is actually setting the data on the cursor to the value we passed in. More interesting is what happens if the db is not there. It appears to merely create the db data and set it in the main db. Okay, looking at this function’s docs, it appears that this is just about allocating a database handle, not about opening the actual database.

I am not really sure that I understand the reasoning behind it, but a new database is actually created in mdb_cursor_put. So only after the first time you create a value will the db actually be created.

[image: code screenshot]

I guess that explains why this function goes on for over 400 lines of code and has no less than 10 gotos (!) and 7 goto sections!

[image: code screenshot]

I get that this is C, but come on, seriously!

time to read 14 min | 2667 words

I am pretty sure that it would surprise you, but the World’s Smallest No SQL Database has a well defined concurrency model. Basically, it is using Last Write Wins. And we are safe from any concurrency issues. That is pretty much it, right?

Well, not really.  In a real world system, you actually need to do a lot more with concurrency. Some obvious examples:

  • Create this value only if it doesn’t exist already.
  • Update this value only if it didn’t change since I last saw it.

Implementing those is actually going to be pretty simple. All you need to do is have a metadata field, Version, that is incremented on every change. Here is the change that we need to make:

    public class Data
    {
        public byte[] Value;
        public int Version;
    }

    static readonly ConcurrentDictionary<string, Data> data = 
        new ConcurrentDictionary<string, Data>(StringComparer.InvariantCultureIgnoreCase); 

    public HttpResponseMessage Get(string key)
    {
        Data value;
        if (data.TryGetValue(key, out value) == false)
            return new HttpResponseMessage(HttpStatusCode.NotFound);

        var response = new HttpResponseMessage
        {
            Content = new ByteArrayContent(value.Value)
        };
        // expose the version so the client can send it back on the next Put
        response.Headers.Add("Version", value.Version.ToString());
        return response;
    }

    public void Put(string key, [FromBody]byte[] value, int version)
    {
        data.AddOrUpdate(key, _ =>
        { // create: only allowed if the client did not expect an existing version
            if (version != 0)
                throw new ConcurrencyException();
            return new Data { Value = value, Version = 1 };
        }, (_, prev) =>
        { // update: reject if the stored version no longer matches what the client saw
            if (prev.Version != version)
                throw new ConcurrencyException();
            return new Data { Value = value, Version = prev.Version + 1 };
        });
    }

As you can see, it merely doubled the amount of code that we had to write, but it is pretty obvious how it works. RavenDB actually uses something very similar to that for concurrency control for writes, although the RavenDB ETag mechanism is also doing a lot more.
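To show what the client side of that usually looks like, here is a hedged sketch of the read, modify, write-with-version, retry loop. The two delegates stand in for hypothetical wrappers over the Get (value plus Version header) and the versioned Put above, and ConcurrencyException is the same exception type thrown by that Put:

    using System;
    using System.Threading.Tasks;

    public class VersionedValue
    {
        public byte[] Value;
        public int Version;
    }

    public static class OptimisticUpdates
    {
        public static async Task UpdateWithRetryAsync(
            string key,
            Func<string, Task<VersionedValue>> getWithVersion,   // hypothetical wrapper over Get + Version header
            Func<string, byte[], int, Task> putWithVersion,      // hypothetical wrapper over the versioned Put
            Func<byte[], byte[]> modify)
        {
            while (true)
            {
                var current = await getWithVersion(key);
                var updated = modify(current.Value);
                try
                {
                    await putWithVersion(key, updated, current.Version); // server rejects on version mismatch
                    return;                                              // success, nobody raced us
                }
                catch (ConcurrencyException)
                {
                    // someone else changed the value first; loop and work off the fresh copy
                }
            }
        }
    }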

But the version system that we have above is actually not enough, it only handles concurrency control for updates. What about concurrency controls for reads?

In particular, how are we going to handle non repeatable reads or phantom reads?

  • Non repeatable reads happen when you read a value, it is then deleted, and when you try to read it again, it is gone.
  • Phantom reads are the other way around: first you tried to read it but didn’t find anything, then it was created, and when you read again you find it.

This is actually interesting, because you only care about those for the duration of a single operation / transaction / session. As it stands now, we actually have no way to handle either issue. This can lead to… interesting bugs that only happen under very specific scenarios.

With RavenDB, we actually handle both cases. Within a session’s lifetime, you are guaranteed that if you saw a document, you’ll continue to see this document until the end of the session, which deals with the issue of non repeatable reads. Conversely, if you didn’t see a document, you will continue to not see it until the session is closed. This is done for Load; queries are a little bit different.
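A minimal sketch of how a session can give that guarantee on top of a plain key/value store (this is not RavenDB’s implementation, just the general idea): the session remembers every key it has already answered, including misses, and keeps answering from that cache.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    public class Session
    {
        private readonly ConcurrentDictionary<string, byte[]> store;    // the underlying database
        private readonly Dictionary<string, byte[]> seen =
            new Dictionary<string, byte[]>(StringComparer.InvariantCultureIgnoreCase);

        public Session(ConcurrentDictionary<string, byte[]> store)
        {
            this.store = store;
        }

        public byte[] Load(string key)
        {
            byte[] value;
            if (seen.TryGetValue(key, out value))
                return value;                       // stable answer, even if the store changed meanwhile

            store.TryGetValue(key, out value);      // value stays null on a miss
            seen[key] = value;                      // remember the miss too, so it can't "appear" later
            return value;
        }
    }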

Another aspect of concurrency that we need to deal with is locking. Sometimes a user has a really good reason why they want to lock a record for a period of time. This is pretty much the only way to handle “checking out” of a record in a scenario where you have multiple users wanting to make changes to a record concurrently. Locks can be Write or ReadWrite locks. A Write lock allows users to read the data, but prevents them from changing it. When used in practice, this is usually going to immediately fail an operation, rather than make you wait for it.

The reasoning behind immediate failure for writes is that if you encountered a record with a write lock, it means that it was either already written to or is about to be written to. In that case, your write is going to be operating on stale data, so we might as well fail you immediately. For ReadWrite locks, the situation is a bit different. In this case, we want to also prevent readers from moving on. This is usually done to ensure a consistent state system wise, and basically, any operation on the record would have to wait until the lock is removed.
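A hedged sketch of that fail-fast behavior (illustrative only; the lock table and expiry are made up for this example, and ConcurrencyException is the same exception used by the versioned Put above):

    using System;
    using System.Collections.Concurrent;

    public class WriteLock
    {
        public string Owner;
        public DateTime ExpiresAt;
    }

    public static class WriteLocks
    {
        static readonly ConcurrentDictionary<string, WriteLock> locks =
            new ConcurrentDictionary<string, WriteLock>(StringComparer.InvariantCultureIgnoreCase);

        // Called at the start of a write: reads are unaffected, but a write against a
        // locked key fails immediately instead of waiting, since it would be stale anyway.
        public static void EnsureWritable(string key, string requester)
        {
            WriteLock current;
            if (locks.TryGetValue(key, out current) == false)
                return;                                      // no lock, write away

            if (current.ExpiresAt < DateTime.UtcNow)
            {
                locks.TryRemove(key, out current);           // expired lock, clean it up and allow the write
                return;
            }

            if (current.Owner != requester)
                throw new ConcurrencyException();            // fail fast rather than block
        }
    }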

In practice, ReadWrite locks can cause a lot of issues. The moment that you have people start placing locks, you have to deal with lock expiration, manual unlocking, abandoned lock detection, lock maintenance, etc. About the only thing that they are good for is to allow the user to make a set of changes and present them as one unit, if we don’t have better transaction support. But I’ll discuss that in another post. In the meantime, just keep in mind that from my point of view, ReadWrite locks are pretty useless all around.

time to read 2 min | 320 words

The first item to go over with the World’s smallest No SQL database is how it uses memory. The codebase is really simple, which is pretty much the point, but can you think about the implications of the current codebase with regards to memory?

All the data is kept in a ConcurrentDictionary<string, byte[]>, and there is an unspoken assumption: both keys and values are assumed to be small. What happens if that isn’t the case? Well, in .NET, large values end up on the Large Object Heap, and you are going to get memory fragmentation pretty quickly. That means that on 32 bits, you are probably going to be able to store a lot less than the 2 GB of virtual memory that you get from the OS.

We will ignore that as well and move on to other aspects of the code. What about how the system actually works in production? There is a lot of work that goes into ensuring that the CPU will be able to access memory efficiently. In particular, most dbs use some form of an in memory data structure that takes advantage of the L1, L2, L3 caches of the CPU. Usually something like a B+Tree. How does the dictionary compare here?

For that matter, what happens when we grow really big? What about when we are paging? How would that work? Can we just go over the keys in the dictionary without having to page through the values as well?

And…

I think that you get my drift by now. And this is literally just me throwing stuff off the top of my head. There are a lot of other things that we need to worry about. For example, if we implement the sensible notion of a buffer pool, we might actually have a buffer that was released by a write but is currently being sent to the client, so we can’t just use buffers, we have to use ref counting as well.
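A sketch of what that ref counting might look like (illustrative only, not something this series implements): the buffer only goes back to the pool once both the write path and any in-flight response have released it.

    using System;
    using System.Threading;

    public class PooledBuffer
    {
        public readonly byte[] Bytes;
        private readonly Action<PooledBuffer> returnToPool;
        private int refCount = 1;                       // starts with a single owner

        public PooledBuffer(byte[] bytes, Action<PooledBuffer> returnToPool)
        {
            Bytes = bytes;
            this.returnToPool = returnToPool;
        }

        public void AddRef()
        {
            Interlocked.Increment(ref refCount);        // e.g. before handing the buffer to an outgoing response
        }

        public void Release()
        {
            if (Interlocked.Decrement(ref refCount) == 0)
                returnToPool(this);                     // last holder gone, the memory is safe to reuse
        }
    }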

time to read 3 min | 518 words

Continuing in my quest to learn more, I decided to go over the LMDB codebase.

LMDB is:

LMDB is an ultra-fast, ultra-compact key-value data store developed by Symas for the OpenLDAP Project. It uses memory-mapped files, so it has the read performance of a pure in-memory database while still offering the persistence of standard disk-based databases.

It has an interesting feature set, and a really small codebase. I am anxious to see how they managed to do so much.

Interestingly, the data model for LMDB is quite different from the usual append-only / transaction log approach. Instead, it allows only a single concurrent writer, and modifies the data in place. There also appear to be a lot of dire warnings regarding usage of long transactions, since they would result in increased file size, presumably because the db couldn’t scavenge pages that are locked by ongoing transactions.

One thing that I should note already: the code (I am currently about 1/3 of the way through the lmdb.h file) is very well commented, and it explains a lot about what is going on. If the rest of the code is like that, this is going to be really nice to read. Okay, I take it back. The main files seem to be lmdb.h and mdb.c. The first one is 1,300 lines and the second is over 7,500 lines. Admittedly, this is impressive in the sense that pretty much everything is done there, and there are a lot of docs. But damn, I wish this was better organized. Right now my head feels like I need to pop up and take a breath.

I have read ~2,000 lines of code so far, and I haven’t found anything that actually does something. It is all headers or comments or macros up until now. I skipped ahead to the code, and it is really hard to understand what is going on.

Take this code snippet:

[image: code screenshot]

It shows a lot of the things that make it hard to work with: an effectively random naming convention (under_score, Pascal_naming, SHOUT_NAMING, etc.), the goto trick used WAY too much, and lovely variable names such as n2. This is part of a method that goes on for something like 150 – 200 lines. And it includes the following code:

[image: code screenshot]

Quick, can you tell me how many variables are declared here? And note that they are all local variables. I counted twice, once getting to 17 and once to 18.

That is just too much, I am not going to go any deeper. The leveldb codebase was easy to follow, it had structure. This codebase is just a code dump. It might be a really good codebase, for what it needs to do, but I literally can’t follow it.

time to read 1 min | 103 words

So, what is the point of the code that I have shown here: http://ayende.com/blog/162691/worlds-smallest-no-sql-database

The point was to show where you start, how easy it is to get going, and all the details that you need to handle along the way. I decided that it would probably make for an interesting blog series about the sort of things that this example exposes.

In particular, I want to talk about:

  • I/O
  • Memory
  • Concurrency
  • ACID
  • Transactions
  • Searching
  • Scale up
  • Scale out
  • Aggregation
  • High availability
  • Backups
  • Monitoring

I might have a few more items along the way, but those are probably the most important ones.

time to read 18 min | 3457 words

I used the following in a lecture called “Why you should never write your own database”. It has never been run, tested, or anything, but it serves as a good way to discuss the challenges involved in building real world databases.

Here is the server side code:

    using System;
    using System.Collections.Concurrent;
    using System.Net;
    using System.Net.Http;
    using System.Web.Http;

    public class NoSqlDbController : ApiController
    {
        static readonly ConcurrentDictionary<string, byte[]> data = 
            new ConcurrentDictionary<string, byte[]>(StringComparer.InvariantCultureIgnoreCase); 

        public HttpResponseMessage Get(string key)
        {
            byte[] value;
            if (data.TryGetValue(key, out value) == false)
                return new HttpResponseMessage(HttpStatusCode.NotFound);

            return new HttpResponseMessage
            {
                Content = new ByteArrayContent(value)
            };
        }

        public void Put(string key, [FromBody]byte[] value)
        {
            data.AddOrUpdate(key, value, (_, __) => value);
        }

        public void Delete(string key)
        {
            byte[] value;
            data.TryRemove(key, out value);
        }
    }

And the client side code:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    public class NoSqlDbClient
    {
        private readonly HttpClient[] clients;

        public NoSqlDbClient(string[] urls)
        {
            clients = new HttpClient[urls.Length];
            for (var i = 0; i < urls.Length; i++)
            {
                clients[i] = new HttpClient { BaseAddress = new Uri(urls[i]) };
            }
        }

        public Task PutAsync(string key, byte[] data)
        {
            // mask the sign bit so a negative GetHashCode() can't produce a negative index
            var client = clients[(key.GetHashCode() & 0x7fffffff) % clients.Length];
            return client.PutAsync("?key=" + key, new ByteArrayContent(data));
        }

        public Task DeleteAsync(string key)
        {
            var client = clients[(key.GetHashCode() & 0x7fffffff) % clients.Length];
            return client.DeleteAsync("?key=" + key);
        }

        public async Task<byte[]> GetAsync(string key)
        {
            var client = clients[(key.GetHashCode() & 0x7fffffff) % clients.Length];
            var r = await client.GetAsync("?key=" + key);
            return await r.Content.ReadAsByteArrayAsync();
        }
    }
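For context, a quick hypothetical usage example; the URLs are made up and assume the controller above is reachable at those addresses:

    using System.Text;
    using System.Threading.Tasks;

    public static class Example
    {
        public static async Task Demo()
        {
            // hypothetical addresses; the key's hash decides which server a value lands on
            var db = new NoSqlDbClient(new[]
            {
                "http://server1:8080/nosql/",
                "http://server2:8080/nosql/"
            });

            await db.PutAsync("users/1", Encoding.UTF8.GetBytes("{ \"Name\": \"Oren\" }"));

            var stored = await db.GetAsync("users/1");   // served by the same shard, by key hash

            await db.DeleteAsync("users/1");
        }
    }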

And yes, that is a fully functional, scale out capable, sharding enabled No SQL Key/Value store in less than 60 lines of code.

time to read 2 min | 249 words

As the author of a schemaless database, I find myself in the strange position of the barefoot shoemaker. I need to explain a bit. Our current storage engines, Esent and Munin (which was designed mostly to be like Esent), have rigid schemas. There are tables, and indexes, etc. This means that features that touch the storage layer tend to be much more complex. They require migrations; adding new features that require storage means that we have to create new storage tables, or modify indexes, or do any number of the things that we developed RavenDB so our users wouldn’t have to.

I have been working with our managed implementation of LevelDB quite a lot lately. In order to do more than merely write tests for it, I tried to create a complete feature: an aggregation engine. The code is not production worthy (yet!), but what struck me quite hard was that, except for the fact that the storage layer is full of bugs (well, that is why I was writing stuff on top of it, to expose them), I had a blast actually working with it.

I could make modifications & changes with hardly any friction, and it was a real pleasure to start working with things in this fashion.
