Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, an open source NoSQL document database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 5 min | 960 words

After my post yesterday, I dug a lot deeper into extendible hashing. There is a wealth of information on the topic. What is more interesting, from my point of view, is just how freaking elegant this data structure is.

I spent a few hours implementing it, because I don’t really get a data structure until I have actually sat down and written it in code. I decided to write it in C, because I haven’t written C in a while. It took about 200 lines of code, and I could see how it works.

Well, I say see, because I learned something very important a few years ago when implementing complex data structures: one of the first things that you want to do is to make sure that you have a visualization of the data. That can be absolutely invaluable when debugging.

Here is what this looked like when we had 31 entries in the hash table, for example:

I’m outputting Graphviz text and then rendering it. This means that I can very easily and visually see how things are working. Very helpful to understand where I messed up.

You can find the whole source here. There is a bit more there than I can comfortably discuss in a blog post, but what I found most useful is to insert data, then see the before / after pictures of the table.

What you can see above is a hash table that is split into 4 sections; however, it only has three pages. This is because of the way the hash table works. As shown, all values that end with a 00 binary suffix will go to the leftmost page, and all values that end with 10 will go to the rightmost one. Not coincidentally, these pages are pointed to from the 0 (00) and 2 (10) positions on the hash table buckets’ list.

We also have something a lot more interesting, though. We have the middle page, which is pointed to by both the 1 (01) and 3 (11) positions. In fact, you can see that this page has a depth of 1, so we actually have mixed values in this location. As things currently stand, I didn’t do anything interesting with regards to how to find a value inside the page (I’m doing a simple linear search), but I expect it to be relatively straightforward to treat the page as a hash table as well.
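To make the directory / page relationship concrete, here is a minimal lookup sketch (in C#, not the C code from the repository, and with illustrative names):

// Minimal extendible hashing lookup sketch. Several directory slots may point at the same page.
class Page
{
    public int LocalDepth;                           // how many suffix bits this page actually distinguishes
    public Dictionary<long, long> Entries = new Dictionary<long, long>();
}

class ExtendibleHashTable
{
    public int GlobalDepth = 2;                      // the directory has 2^GlobalDepth slots (4 here)
    public Page[] Directory;                         // slots 01 and 11 can share a page with LocalDepth = 1

    public bool TryGet(long key, out long value)
    {
        // Use the last GlobalDepth bits of the key as the directory index.
        var slot = (int)(key & ((1L << GlobalDepth) - 1));
        return Directory[slot].Entries.TryGetValue(key, out value);
    }
}

When a page whose local depth is smaller than the global depth overflows, only that page is split and the directory slots that pointed to it are updated; the directory itself doubles only when a page whose local depth equals the global depth needs to split.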

I need this feature so that I can use it as a Map<int64, int64>, which would really simplify my life. The key observation is that I don’t actually need to do any hashing here. There can be no collisions, after all, given that I’m using 64 bit keys inside the hash table. Using extendible hashing, I also don’t need to mix the values. If I get values that are clustered into the same place, I’m going to end up splitting the page and solving the problem naturally. Of course, that would eventually cause me to have to double my directory size, so there is some cost for collisions, but I don’t think that this is going to be a problem. I tested this with a million consecutive numbers (with no hash mixing) and got a directory size of 32KB. That doesn’t seem likely to become a problem.

I also ought to mention linear hashing, which uses a different approach. The linked article does a great job explaining how it works. There isn’t any recent work comparing linear hashing to extendible hashing; the comparison I kept running into is from the 80s. The advantage then was using linear hashing on machines with small memories. Remember, this is the 80s that we are talking about, and I’m not sure what they thought of as small memory in that time frame. The article talks about limiting the amount of memory used for extendible hashing’s directory to 1,000 entries. That translates (assuming 32 bit values) to less than 8KB of data. That isn’t something that I really think I need to worry about.

Let’s do some basic math, shall we?

Assuming an 8KB page size and 64 bit pointers, we can have a directory page for extendible hashing that holds 1,024 buckets. Each one of them is going to be 8KB in turn, so at full utilization, we are looking at a single directory page pointing to 8MB of buckets. In each bucket, we are storing key/value pairs that are both 64 bits in size, which gives us 524,288 entries. In practice, I haven’t accounted for metadata, so we can probably expect to be able to address about half a million entries per 8MB. If we need to store a hundred million records, you’ll need roughly 1.6GB of bucket pages using this scheme.
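Spelling that arithmetic out (a back of the envelope sketch that ignores page headers, bucket metadata and utilization):

const int PageSize = 8 * 1024;                       // 8KB pages
const int PointerSize = sizeof(long);                // 64 bit pointers
const int EntrySize = 2 * sizeof(long);              // 64 bit key + 64 bit value

var bucketsPerDirectoryPage = PageSize / PointerSize;                        // 1,024 buckets
var bytesAddressedPerDirectoryPage = bucketsPerDirectoryPage * PageSize;     // 8MB of buckets
var entriesPerDirectoryPage = bytesAddressedPerDirectoryPage / EntrySize;    // 524,288 entries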

Given current hardware, that seems like a pretty good deal to me. There are other considerations to take into account, though. One of the advantages of linear hashing is that you can skip the directory part, and indeed, the directory can grow quite big. In the case of the 100 million records I mentioned earlier? You’ll need about 12 MB just for the directory metadata. Linear hashing can skip this cost (but has other storage costs, related to overflow pages, that I’m ignoring here). However, that assumes that you can actually compute the address of a page from the key. That isn’t really something that we can do if we haven’t dedicated the whole file to this purpose. In Voron, for example, you can’t assume that you have a raw range of pages to work with. Pages are allocated as needed, and may be mixed in with pages used by other parts of the system.

I can’t see any real advantages to linear hashing at this time. I would be happy to hear about scenarios where it makes sense.

time to read 4 min | 700 words

There is a specific scenario that I ran into that could be really helped by an O(1) lookup cost on a disk persistent data structure. Voron, our storage engine library, is built on top of a whole big pile of B+Trees, which have an O(log N) lookup cost. I could use that, but I wanted to see if we could do better.

The natural thing to do when you hear about O(1) costs is to go fetch the nearest hash table, so I spent some time thinking about how to build a hash table that would be persisted to disk. My needs are relatively simple, I believe:

  • O(1) lookup (at least in the common case).
  • Able to support data sets larger than memory.
  • Mutable (writes & deletes)
  • Keys and values are limited to int64.

It doesn’t sound hard, right?

But when I started thinking about it, I ran into a pretty hard limit. If the size of the data is greater than memory, then we have to take into account data access costs. A simple approach here would be to allocate a section in the file for the hash table and use a hash to get to the right location in the file. That works, if you don’t need to support mutations. But when you do, you run into a big problem. At some point, the load factor of the hash table is going to increase to the point where you need to regrow it. At that point, you may need to re-hash the entire thing.
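The reason the regrow is so painful is that with a plain hash modulo table size scheme, a key’s bucket (and therefore its location in the file) depends on the table size, so growing the table moves most of the data. A trivial illustration:

// With bucket = hash % bucketCount, the bucket changes when the table grows:
static long BucketOf(ulong hash, ulong bucketCount) => (long)(hash % bucketCount);

// hash = 7:   7 % 4 = 3,  but  7 % 8 = 7   (the entry has to move)
// hash = 10: 10 % 4 = 2,  and 10 % 8 = 2   (this one happens to stay put)
// On disk, "moving" an entry means reading and rewriting it, for most of the file.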

Assume that the hash table is 4GB in size at this point, you need to re-hash it into 8GB, and you have just 1GB of memory available. That is going to take some time and be a pretty horrible process all around. That is as far as I got when I considered directly translating an in-memory hash table to a disk based one. I’m pretty lucky that I don’t have to do that, because there is a wealth of research on the matter.

These go back to before I was born, although B+Trees predate them by a decade or so. The key here is to use extendible hashing. The Wikipedia article is pretty good, I especially liked the Python code showing how things work there. The original paper on the topic is also quite interesting and is of interest to people who care about the details of how storage engines work.

I believe that my next step is going to be playing with some codebases that implement these ideas. I decided to look at how this is done with the DBM family of systems. They are old, and some of them are probably direct implementations of the extendible hashing paper, but I’m mostly interested in seeing how things fit together at this point.

All of that said, I ran into a lot of red flags along the way.

Modern B-Tree Techniques discusses the issue of B-Trees vs. hash indexes and comes to the conclusion that you shouldn’t bother. It covers quite a few aspects of this issue, from complexity of implementation to usage scenarios.

The Berkeley DB documentation states that for anything requiring locality of reference, B-Trees are the way to go. However, for large amounts of data, their hash implementation uses less metadata, so it might be better. That said, this doesn’t match my expectation for the way the system will behave. Looking at this StackOverflow answer, it seems very likely that if you have a working set that is greater than memory, the hash implementation will hit page faults all the time, while the B-Tree implementation will be able to keep at least most of its metadata in memory, benefiting greatly from that.

Indeed, we have this interesting quote from Berkeley DB as well:

Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory. At that point, it's better to use the memory for data than for indexing structures. This trade-off made a lot more sense in 1990 when main memory was typically much smaller than today.

All in all, this seems like a nice area for me to look into. I’ll go and read some code now, and maybe I’ll write about it.

time to read 2 min | 313 words

I ran into this blog post talking about how to handle optimistic concurrency in MongoDB, and it brought to mind a very fundamental difference in the design philosophy between RavenDB and MongoDB.

If you’ll read the associated blog post, you’ll see guidance on how to build a simple optimistic concurrency using the MongoDB API. It looks like a relatively straightforward thing, but there is a lot of complexity going on here.

With RavenDB, we have decided that the responsibility of such tasks is on us, and not our users. Here is how you’ll write the same thing in RavenDB:

session.Advanced.UseOptimisticConcurrency = true;

And you are done. There are also options to set it globally (for all actions), for a particular session (as shown above), or for a particular document or documents in a bigger transaction. About the only thing that we don’t handle is retrying if the update failed, so that you can re-run your business logic.
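For completeness, here is roughly how that looks inside a session. This is a minimal sketch, assuming the usual .NET client shape (an already opened document store in store and an illustrative Order entity):

// Enable optimistic concurrency for everything done in this session.
using (var session = store.OpenSession())
{
    session.Advanced.UseOptimisticConcurrency = true;

    var order = session.Load<Order>("orders/1-A");
    order.Status = "Shipped";

    // If someone else modified orders/1-A since we loaded it,
    // SaveChanges() fails with a concurrency error instead of silently overwriting.
    session.SaveChanges();
}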

The reason I’m writing this is actually at the very end of the post:

This works just fine if I "remember" to include that Where clause correctly, but there's a better way if we want a general solution. For that, I'd do pretty much what I would have in the Life Beyond Distributed Transactions series - introduce a Repository, Unit of Work, and Identity Map.

This is exactly right. It looks trivial to do something like that when you are looking at a trivial scenario, but put it in a real application and the complexity sprouts. For example, try doing the same thing with multiple documents that need to change together. You have to implement quite a lot of code to do so (identity map, unit of work, and hopefully not a repository :-)).

With RavenDB, all of that is just there and available for you. No need to do anything, It Just Works.

time to read 10 min | 1985 words

A couple of weeks ago I started to talk about the implementation details of building a persistent data structure in RavenDB. As it turns out, I had some evenings available and I was able to sit down and actually write out the code for it. The current state of things is that a few tests work and the overall structure is in place. I ran into some hurdles along the way, which is why I preferred to wait till I had something in hand before writing about it.

Please note, I’m going to be doing a lot of low level talk here. Mostly about how we allocate space and manage bits and bytes in Voron. If you aren’t interested in such details, you can skip all the gory stuff and get to the design details.

If you want to go straight for the code, you can find it here. Just note that this version has been created as a proof of concept and hasn’t yet been through the same process we usually take our code through.

The first thing to understand is what I’m trying to write. The reason I need this data structure is for my series about writing search engines. That means that I want to use this to store posting lists (a posting list is the list of documents matching a term in a search index). But given my experience with building such systems, I know that there are quite a few different use cases that I need to cover.

Let’s consider a few examples, shall we?

The above represents fields, and values for those fields, in a full text search index. There are a few things to note. We can usually assume that the Email field will be unique, or nearly so. In other words, the number of documents where the Email field will match oren@example.com is going to be one (or very nearly so). This is a very frequent scenario when indexing data and it deserves optimal behavior.

The Hobbies field, however, is very different. Quite a few people like Dogs, for example, so we can assume that we’ll have a lot of documents that are matched to this term. That means that we need to optimize for a very large number of matches, the exact opposite of how we need to behave for the Email field.

Sometimes, it is easier to understand when looking at the type signature. If I were writing this in memory, I would use:

Map<FieldName, Map<Term, Set<DocumentId>>> InvertedIndex;

That is the conceptual model that we’ll be working with here. After implementing the actual data structure, we have the following API:

Once we have the data stored, we can now query on it. For example, to find all users that like dogs, you’ll write:
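The actual snippets aren’t reproduced here, but the shape of both the write and the read side is roughly the following (hypothetical names that mirror the conceptual model above, not the real Voron API):

// Writing: associate document ids with a term of a field.
var hobbiesField = index.GetField("Hobbies");
hobbiesField.Set("Dogs", 1L);
hobbiesField.Set("Dogs", 5L);

// Reading: iterate over the posting list, i.e. all document ids matching the term.
foreach (long docId in hobbiesField.Get("Dogs"))
{
    // docId identifies a document that likes dogs
}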

Actually building realistic queries on top of this is a tedious, but fairly straightforward matter. It will also likely be the topic of another post. For now, I want to focus on how I actually built the implementation of this feature.

At this point, Voron features are mostly built on top of… Voron features :-). That is, we don’t need to build complex data structures from scratch, but can usually use a fair bit of the underlying infrastructure that we already have.

In this case, we need to understand one of the most basic building blocks in Voron: The Tree. This versatile data structure is the core of pretty much everything in Voron. It is a B+Tree that can hold arbitrary keys and values, keeping them in sorted order.

In particular, the Tree uses a byte string as its key, and its value can be either a raw value or a complex type. Going back to the type signature, the Tree would be:

SortedMap<ByteString, (byte[] RawValue, Tree NestedTree, FixedSizeTree NestedFixedSizeTree)> Tree;

Note that the value can be a raw value, a nested tree or a fixed size tree (there are other options, but we’ll focus on those). A raw value is simple, it is just a buffer that is stored and retrieved. The two nested tree options are just recursion used to its fullest potential. The difference between Tree and FixedSizeTree is one of optimizations. A Tree can use any byte string as its key, but a fixed size tree can only use an int64 for its key. And as you can expect from the name, its values are also fixed in size. That means that it needs less metadata than its Tree sibling and can be somewhat simpler to implement.

Voron also has the notion of raw data sections. These allow you to allocate / write to the disk directly and are usually paired with another data structure to manage them. You can think about the raw data section as the malloc() of persistent data structures.

I’m going over these details because they are important to how I built the underlying data structure. Here are the rules that I had in mind while building this:

  • Optimize for both storage space and computational power
  • Zero managed allocations for reading
  • Reduce / eliminate managed allocations for writing
  • Document ids are int64
  • Handle single use terms (Email)
  • Handle multiple use terms (Hobbies)

We’ll start from the simple scenario, storing a document id for a particular email address:

emailField.Set("oren@example.com", 1L);

The backing store of the Roaring Set is a Voron Tree, and we’ll use the term as the key, and store the document id (1L, in this case) as the value. That is probably the absolutely simplest way to go about building this feature. Except that we are actually wasting space. 1L (long set to one, basically) takes 8 bytes to store data that can be stored in a single byte. That means that we’ll waste space, quite a lot of it, in fact.

So we aren’t going to store the data as a raw int64. Instead, we are going to use varints. In this way, a value such as 1L can be stored in a single byte.

What happens if we have another value for the same field and term?

emailField.Set("oren@example.com", 3L);

At this point, we’ll encode the next value using a varint as well, but instead of recording the actual value, we’ll record the difference from the previous value. We’ll continue to do so until the size of the buffer we need to record the data reaches 32 bytes.

The idea is that in most cases, we’ll have a single value or very few of them. We have a compact way of representing this information, which works quite nicely for small sets of values.
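Here is a rough sketch of that write path (C# pseudocode against a List<byte>; the real code writes into raw Voron memory):

// Append a new value as a varint encoded delta from the previous (smaller) value.
static bool TryAppend(List<byte> buffer, long previousValue, long newValue, int maxSize = 32)
{
    var delta = (ulong)(newValue - previousValue);   // values are kept sorted, so the delta is non-negative
    var encoded = new List<byte>();
    do
    {
        var b = (byte)(delta & 0x7F);
        delta >>= 7;
        if (delta != 0)
            b |= 0x80;                               // continuation bit: more bytes follow
        encoded.Add(b);
    } while (delta != 0);

    if (buffer.Count + encoded.Count > maxSize)
        return false;                                // past 32 bytes, switch to the roaring set format below
    buffer.AddRange(encoded);
    return true;
}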

Here is how you can read such an encoding:
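The original snippet isn’t shown here, but the read side is just the mirror image of the sketch above:

// Decode the buffer back into absolute values: read a varint, add it to the running total.
static IEnumerable<long> ReadAll(byte[] buffer)
{
    long current = 0;                    // the first entry is stored as a delta from zero
    var pos = 0;
    while (pos < buffer.Length)
    {
        ulong delta = 0;
        var shift = 0;
        byte b;
        do
        {
            b = buffer[pos++];
            delta |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        current += (long)delta;
        yield return current;
    }
}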

As you can see, there is nothing really interesting going on here. There are implementation details that I’m not getting into, such as the fact that we are storing the values sorted (which maximizes the benefit of the delta encoding, since we only keep the difference from the previous number), but that doesn’t actually matter to the core concept.

I mentioned before that this is limited to 32 bytes, right? So what happens when we get beyond that level? This is where things become interesting / complicated.

Instead of using a raw value for the values, we will move to a more complex structure. This is suitable when we have enough values to justify the extra effort. The idea here is to make use of Roaring Bitmaps, which is an efficient way to store bit maps. A bit map is simply an array of bits that are either set or cleared. I’m using them to hold a set of values. In other words, consider a Set<int64>, where the implementation is using a bitmap to figure out if a value exists or not.

Of course, storing such a set using standard bitmaps would be incredibly wasteful in space, but that is what roaring bitmaps are for. I’ll let you go to the actual site for a description of them, but you can think about them as a sparse map. You only need to hold the bits that you care about. That said, the way roaring bitmaps are usually used, they work in 8KB ranges. That is, each such range is capable of holding 65,536 bits. However, when looking into how I’ll be using this in Voron, I ran into an issue.

A Voron page is 8KB in size, and since we have to allocate some space for the page header, we can’t easily store an 8KB value there. I thought about using 4KB instead, but that just made things awkward; I’ll be losing half a page, after all. After some experimentation, I ended up with each roaring set segment using 256 bytes. This is small, but it has several advantages for my needs.

A Voron page has a 64 byte header, which means that I can use 8,128 bytes for real data. Using 256 bytes for the roaring segment size, I also need to account for some metadata per segment, so that turns out to be 260 bytes total. That gives me a total of 30 segments that I can squeeze into a single page. I actually have an additional 10 bytes that I can use per segment, without impacting the total number of segments that can be stored in a page.

A single segment represents the state of a range of 2,048 bits. There are other advantages to the small size, too. This is planned as a persistent and mutable data structure, and having a smaller segment size means that I have an easier time modifying just a single segment. Following the roaring bitmap rules, we have three types of segments (see the sketch after the list):

  • Small (128 or fewer bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of the set bits in the range.
  • Medium (up to 1920 bits set) – stored as a bitmap value (taking 256 bytes).
  • Large (more than 1920 bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of cleared bits in the range.
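In code, choosing the representation for a segment comes down to a pair of thresholds. This is a sketch of the classification described in the list, not the actual Voron implementation:

enum SegmentKind { PackedSetBits, Bitmap, PackedClearedBits }

// A segment covers 2,048 bits and its payload never exceeds 256 bytes.
static SegmentKind Classify(int setBits)
{
    if (setBits <= 128)
        return SegmentKind.PackedSetBits;       // up to 128 int16 offsets of set bits = 256 bytes
    if (setBits <= 1920)
        return SegmentKind.Bitmap;              // 2,048 bits = 256 bytes, regardless of how many are set
    return SegmentKind.PackedClearedBits;       // at most 128 cleared bits remain, stored as int16 offsets
}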

Roaring Bitmaps tend to perform much better than the alternative (even though this is the 8KB version).

Just having the segments isn’t good enough, though. I need to also have a way to search for a segment. After all, the whole idea is that we’ll have a sparse data structure. This is done using a Fixed Size Tree. Each segment gets a key, made up of the segment range (54 bits) and the number of set bits in the range (10 bits). Together, they make up the key that we can use to look up a particular segment. The value for the Fixed Size Tree is the position of the actual segment in the data file.

You can think about this as:

SortedMap<SegmentKey(Range: 54 bits, NumOfSetBits: 10 bits), FileOffset> Segments;
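A sketch of how such a key could be packed into a single int64 (illustrative only; exactly how a completely full segment’s count of 2,048 set bits squeezes into 10 bits is an implementation detail I’m glossing over):

// 54 high bits for the segment range, 10 low bits for the number of set bits.
static long MakeSegmentKey(long segmentRange, int numberOfSetBits)
{
    return (segmentRange << 10) | ((long)numberOfSetBits & 0x3FF);
}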

Adding it up, the total metadata cost for a segment is actually 270 bytes (counting the currently unused space as well) for the segment itself, plus 16 bytes for the key/value in the fixed size tree. In other words, to hold about 10 million values, we’ll need roughly 2.8 MB or so. On the other hand, if we stored the values directly as int64, 10 million values would take around 76MB. The numbers aren’t quite that bad, because for the roaring bitmap we pay per segment, while for a simple array of int64, we pay for each set value.

I think that this post has gone on long enough. You can look at the code, which has all the details (and I would love to get feedback / questions on this), but I now need to face another challenge on this road: tying all of this together so we can create a useful search API. Just having the code I’ve shown at the top of this post is not sufficient, we need to be able to provide more metadata around tying values together. I’ll get to that in another post.

time to read 4 min | 672 words

Trevor asked a really interesting question in the mailing list. Assume that we have the following model:


And what we want to do is to be able to add a note to a book. The API looks like so:

public void SubmitNote(string isbn, string note);

The requirements are simple:

  • There is one book per ISBN
  • Notes on the book shouldn’t be duplicated
  • The Book document may be large, and we don’t actually care about its contents here, we just want to add the note
  • If there is a note on a non existent book, we need to create the book

The key observation here is that Trevor doesn’t want to load the document, modify it and save it back. What he is looking for is a way to send the change to the database server and have it happen there.

Luckily, RavenDB has the right set of features for this, the Patching API. Here is how you’ll write the code to update the document without having to load it:
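The snippet itself isn’t reproduced here, but with the strongly typed patching API it is along these lines (a sketch only; the Book class, its Notes collection and the id convention are my own illustration):

// Ask the server to append the note to the book's Notes collection, without loading the document.
using (var session = store.OpenSession())
{
    session.Advanced.Patch<Book, string>(
        "books/" + isbn,
        b => b.Notes,
        notes => notes.Add(note));

    session.SaveChanges();
}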

We can send the update to the server, have the change happen there, with no need to load and save the document. And we get strongly typed API and compiler validation, joy all around.

This is great, but it misses something. If we run this code twice, we’ll have a duplicated note, which is something that we don’t want.

Luckily, we have more options. The strongly typed API we just wrote is sitting on top of the actual patch API, which is much more powerful. Let’s see how we can tackle this requirement, shall we?

Now, we send a script to the server, which will execute it. If the note already exists, we will not modify the document. So we got that out of the way. But we are still missing a piece of the puzzle, as you can probably see from the title of the post. What happens if the Book document does not exist?

Luckily, we thought about this issue and RavenDB is ready to help here as well. When you send a patch request to RavenDB, you can send a single script, or two. The second one will be executed if the document does not exist, like so:
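Roughly, the shape of it is a deferred patch command that carries both scripts (again, a sketch rather than the exact code from the post; the script bodies and value names are illustrative):

using (var session = store.OpenSession())
{
    session.Advanced.Defer(new PatchCommandData(
        "books/" + isbn,    // document id
        null,               // change vector (null = no concurrency check)
        new PatchRequest    // runs when the document exists
        {
            Script = "if (!this.Notes.includes(args.note)) this.Notes.push(args.note);",
            Values = new Dictionary<string, object> { ["note"] = note }
        },
        new PatchRequest    // patchIfMissing: runs only when the document doesn't exist
        {
            Script = "this.Notes = [args.note];",
            Values = new Dictionary<string, object> { ["note"] = note }
        }));

    session.SaveChanges();
}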

If the document exists, we will run the patch script and modify the document. If the document does not exist, we will create an empty document and then run the patchIfMissing script to populate it. The code above handles all of the stated requirements, and we can call it a day.

But as long as we are here, let’s add another requirement. The only information we have from the SubmitNote(isbn, note) call is the ISBN of the book. Presumably we want to do things to this book, for example, figure out what its title is. Using this mechanism, how do we do that? When we return from the SaveChanges call, there is no way to tell if the document was newly created or was already there.

The answer here is to ask RavenDB to do it for us, by modifying our patchIfMissing script a bit. Note the changes in the code:

If we need to execute the missing patch code, we’ll create the new document and in the same transaction, we’ll also create a task document to fetch additional information about this book. Your code will usually subscribe to the Tasks collection and execute additional work as a result of that.

And this is it, everything we need, in a single operation. Note that in many cases, you don’t actually have a single such call, though. Let’s say that you are getting a batch of work all at once. Your API will actually look like:

public void SubmitNotes(Dictionary<string, string> isbnToNotes);

In this case, we are going to execute the code above for all the items that we got in the call, but call SaveChanges once. All of them will operate as a single transaction and a single round trip to the server.

time to read 1 min | 103 words

After building an R client for RavenDB, I decided to see what it would take to build a basic RavenDB client for PHP in the same manner. It turned out to be fairly simple, and you can find the relevant code here.

Here are some basic CRUD operations:

As you can see, the API is quite straightforward to use. It isn’t the full blown API that we usually provide, but it is more than enough to get you going.

Incidentally, we also published our REST API documentation recently, so you can see how to expand this code to do even more for you.

time to read 3 min | 535 words

I was reminded recently that the RavenDB documentation isn’t putting enough emphasis on the fact that RavenDB can run as an in-memory database. In fact, topologies that other databases seem to think are fancy are trivial in RavenDB. You can run your cluster in a mixed mode, with a couple of nodes that are persistent and write to disk, while other nodes use pure in-memory storage. Combined with RavenDB’s routing capabilities, you’ll end up with a cluster where the nodes you usually interact with are purely in memory, but with the backend pushing data to other nodes that persist to disk. On the face of it, you have both the speed of an in-memory database and the persistence that your data craves.

So why aren’t we making a big deal out of this? This is a nice feature, and surely it can be used as a competitive advantage, no?

The problem is that it isn’t going to play out the way you would expect.

Consider the case where your dataset* is larger than memory. In that case, you are going to have to go to disk anyway. At this point, it doesn’t really matter whether you are swapping to the page file or reading from a data file. On the other hand, if the dataset you work with can fit entirely in memory, you are going to see significant speedups. Except that with RavenDB, you won’t.

That sounds bad, so let me try to express this better. If your dataset can fit into memory, RavenDB is already going to be serving it completely from memory. You don’t need to do anything to avoid disk I/O; by default, RavenDB is already going to do that for you. And if your dataset is larger than memory, RavenDB is going to ensure that we do only the minimum amount of I/O needed to serve your requests.

In other words, because of the way RavenDB is architected, you aren’t going to see any major advantages by going the pure in-memory route. We are already providing most of them out of the box, while still maintaining ACID guarantees as well as on disk persistence.

There are some advantages to running in an in-memory only mode. Transactions are somewhat faster, but we have spent a lot of time optimizing our transactional hot path; you can get to hundreds of thousands of individual writes on a single node while maintaining full persistence and ACID compliance. In the vast majority of cases, you simply don’t need the additional boost. It costs too much to give up persistence.

So the answer is that you can run RavenDB purely in memory, and you can also do that in a mixed mode cluster, but for the most part, it just doesn’t give you enough bang for the buck. You are going to be as fast for reads and almost as fast for writes (almost certainly faster than what you actually need) anyway.

* Well, working set, at least, but I’m being fast and loose with the terms here.

time to read 2 min | 319 words

R is a popular environment for working with data, mostly for statistical analysis and exploration. It is widely used by data scientists, statisticians and people who get a pile of data and need to figure out how to get something out of it.

RavenDB stores data and it can be nice to go through it using R. And now you can quite easily, as you can see here:


Inside your R environment, load (or save locally) using:

And you are ready to R(ock) :-).

In order to set things up, you’ll need to tell R where to find your server, you can do this using:

Note that you can access both secured and unsecured servers, but you need to be aware of where your R script is running. If it is running on Windows, you’ll need to install the PFX and provide the thumbprint. On Linux, you’ll need to provide the paths to the cert.key and cert.crt files instead. This is because on Windows, R is compiled against schannel and… you probably don’t care, right?

Now that you have everything setup, you can start having fun with R. To issue a query, just call: rvn$query(), as shown above.

Note that you can write any query you like here. For example, if I wanted to analyze the popularity of products, I could do it using:

And the result would be:


It doesn’t seem like anything pops out of the data, but then, I’m not a data scientist.

You can also manipulate data using:

And here is the result in RavenDB:


And now, go forth and figure out what this all means.

time to read 2 min | 218 words

RavenDB 5.0 will come out with support for time series. I talked about this briefly in the past, and now we are at the point where the feature is almost ready to stand on its own. Before we get there, I have a few questions before the design is set. Here is what a typical query is going to look like:

We intend to make the queries as obvious as possible, so I’m not going to explain it. If you can’t figure out what the query above is meant to do, I would like to know, though.

What sort of queries would you want to run over this data? For example, here is something that we expect users to want to do, comparing and contrasting different time periods, which you’ll be able to do in the following manner:

The output of this query will give you the daily summaries for the last two months, as well as a time based diff between the two (meaning that it will match on the same dates, ignoring missing values, etc).

What other methods for the “timeseries.*” would you need?

The other factor that we want to get feedback on is what sort of visualizations you would want to see on top of this data in the RavenDB Studio.
