Oren Eini, CEO of RavenDB, a NoSQL Open Source Document Database


time to read 5 min | 915 words

When we started writing RavenDB, a collection was just a way to say that a group of documents is roughly similar. We were deep in the schemaless nature of the system, and it made very little sense at the time to split documents apart. By having all documents in the same location (and by that I mean that they were all effectively stored in the same physical format, and the only way to tell the difference between a User document and an Order document was by reading their metadata), we were able to do some really cool things. Indexes could operate over multiple collections easily, replication was simple, exporting documents was a very natural operation, etc.

Over time, we learned from experience that most of the time, documents in separate collections are truly separate. They are processed differently, behave differently, and users expect to be able to operate on them differently. This is most visible when users with a large database define an index on a small collection and are surprised that it can take a while to index. The fact that we need to go over all the documents (because we can't tell them apart before we read them) is not part of the mental model of most users.

We have worked around most of that by utilizing our own indexing structure. The Raven/DocumentsByEntityName index is used to do quite a lot. For example, we are often able to optimize the "small collection, new index" scenario using Raven/DocumentsByEntityName, deletion and patching of collections through the studio use it, etc.

In RavenDB 4.0 we decided to see what it would take to properly segregate collections. As it turned out, this is actually quite hard to do, because of the etag property.

The etag property of RavenDB goes basically like this: Etags are numbers (128 bits in RavenDB up to 3.x, 64 bits in RavenDB 4.0 and onward) that are always increasing, and each document change will result in a higher etag being generated. You can ask RavenDB to give you all documents since a particular etag, and by continually doing this, you’ll get all documents in the database, including all updates.
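To make that contract concrete, here is a minimal sketch in Python-flavored pseudocode (not the actual RavenDB client API; get_documents_after is a hypothetical helper) of how a consumer such as replication or indexing follows the document stream:

```python
def follow_documents(store, start_etag=0, batch_size=1024):
    """Yield every document in the database, including later updates,
    by repeatedly asking for the batch after the last etag we saw."""
    last_etag = start_etag
    while True:
        # Hypothetical helper: returns documents ordered by etag, with etag > last_etag.
        batch = store.get_documents_after(last_etag, limit=batch_size)
        if not batch:
            break  # caught up; a real consumer would wait for new writes instead
        for doc in batch:
            yield doc
        last_etag = batch[-1].etag  # etags only ever grow, so no change is missed
```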

This property is the key to quite a lot of stuff internally. Replication, indexing, subscriptions, exports, the works.

But by putting documents in separate physical locations, we won't have an easy way to scan through all of them. In RavenDB 3.0, we effectively have an index of [Etag, Document], and the process of getting all documents after a particular etag is extremely simple and cheap. But if we segregate collections, we'll need to merge the information from multiple locations, which is non-trivial and has a complexity of O(N log N).
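Here is a sketch of what that merge looks like, under the assumption that each collection exposes its own etag-ordered cursor (documents_after here is hypothetical); heapq.merge stitches the per-collection streams back into one globally etag-ordered stream:

```python
import heapq

def documents_after(collections, last_etag):
    """Merge each collection's own [etag, document] stream back into a single
    stream ordered by etag. The merge cost grows with the number of collections,
    which is why this is harder than reading one global index."""
    per_collection = (
        collection.documents_after(last_etag)  # hypothetical: yields docs ordered by etag
        for collection in collections
    )
    return heapq.merge(*per_collection, key=lambda doc: doc.etag)
```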

There is also the issue of the document keys namespace, which is global (so you can’t have a User with the document key “users/1” and an Order with the document key “users/1”).

Remember the previous post about choosing Voron as our storage engine for RavenDB 4.0? This is one of the reasons why. Because we control the storage layer, we were able to come up with an elegant solution to the problem.

Each of our collections is going to be stored in a separate physical structure, with its own location on disk, its own indexes, etc. But at the same time, all of those separate collections are also going to share a pair of indexes (document key and document etag). In this manner, we effectively index the document etag twice. First it is indexed in the global index, along with all the documents in the database, regardless of which collection they are in. This index will be used for replication, exports, etc. And it is indexed again in a per-collection index, which is what we'll use for indexing, patching by collection, etc. In the same manner, the document key index is going to allow us to look up documents by their key without needing to know what collection they are in.
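A toy model of that layout, with purely illustrative names (this is not the actual Voron schema), might look like this:

```python
class DocumentStore:
    """Toy model: documents live in per-collection structures, while two global
    indexes (by key, by etag) and one per-collection etag index all point at
    the same storage id."""

    def __init__(self):
        self.global_by_etag = {}   # etag -> storage id: replication, export, etc.
        self.global_by_key = {}    # "users/1" -> storage id: load by key, any collection
        self.collections = {}      # collection name -> {"by_etag": {...}}

    def put(self, collection, key, etag, storage_id):
        self.global_by_etag[etag] = storage_id
        self.global_by_key[key] = storage_id
        per_collection = self.collections.setdefault(collection, {"by_etag": {}})
        per_collection["by_etag"][etag] = storage_id  # indexing, patching by collection
```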

Remember, those indexes actually store an id that gives us O(1) access to the data, which means that processing them is going to be incredibly cheap.

This has a bunch of additional advantages. To start with, it means that we can drop the Raven/DocumentsByEntityName index; it is no longer required, since all its functions are now handled by those internal storage indexes.

Loaded terminology alert: indexes in RavenDB can refer either to the indexes that users define and are familiar with (such as the good ol' Raven/DocumentsByEntityName) or to storage indexes, which are internal structures inside the RavenDB engine and aren't exposed externally.

That has the nice benefit of ensuring that all collection data is now transactional and is updated as part of the write transaction.

So we had to implement a relatively strange internal structure to support this segregation. But aside from the fact that collections are now physically separated from one another, what does this actually give us?

Well, it makes certain tasks that work on a per-collection basis, such as indexing, subscriptions, patching, etc., much easier. You don't need to scan all documents and filter out the stuff that isn't relevant. Instead, you can just iterate over the entire result set directly. And that has its own advantages. Because we are storing documents in separate internal structures per collection, there is a much stronger chance that documents in the same collection will reside near one another on disk. That is going to increase performance, and it opens up some interesting optimization opportunities.
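The difference is easiest to see side by side. In this sketch the cursors (after) and index_document are hypothetical stand-ins, assuming the toy layout from earlier:

```python
def index_orders_v3(global_by_etag, last_etag):
    # 3.x style: walk the single global etag index and discard everything that
    # isn't an Orders document; the read cost is paid either way.
    for doc in global_by_etag.after(last_etag):
        if doc.collection != "Orders":
            continue
        index_document(doc)

def index_orders_v4(collections, last_etag):
    # 4.0 style: walk only the Orders collection's own etag index, so every
    # document read is relevant (and likely adjacent on disk).
    for doc in collections["Orders"].after(last_etag):
        index_document(doc)
```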

time to read 2 min | 326 words

I previously mentioned that a large part of what we need to do as a database is to actively manage our resources. Things like CPU usage and memory are relatively easy to manage (to a certain extent), but one of the things that we trip over again and again is the issue of I/O.

Whether it is a cloud-based system with the I/O rate of an old IBM XT picked up from a garage after a flood, users who pack literally hundreds of extremely active databases onto the same physical storage medium, or a user who reads the entire database (through subscriptions) on every page load, I/O is not something you can ever have enough of. We spend an incredible amount of time trying to reduce our I/O costs, and still we run into issues.

So we decided to approach it from the other side. RavenDB 3.5 now packages Raven.Monitor.exe, which is capable of monitoring the actual I/O and pinpointing who is to blame, live. Here is what this looks like after a one-minute run in a database that is currently importing data plus doing some minor indexing.

[Image: Raven.Monitor.exe output showing the I/O breakdown after a one-minute run]

The idea is that we can use this ability to do two things. We can find out who is consuming the I/O on the system, and even narrow down exactly what is consuming it, but we can also use it to find out how many resources a particular database is using, and tell based on that whether we are doing a good job of utilizing the hardware properly.
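As a rough illustration of the kind of attribution involved (this is not how Raven.Monitor.exe is implemented; the event shape, folder names, and classification are all assumptions), you can think of it as rolling up raw file I/O events per database and per purpose:

```python
from collections import defaultdict

def summarize_io(io_events, database_roots):
    """Roll up raw file I/O events (path, bytes, duration_ms) per database and
    per purpose, so it is visible who is consuming the disk and for what."""
    totals = defaultdict(lambda: {"bytes": 0, "duration_ms": 0.0})
    for event in io_events:
        for db_name, root in database_roots.items():
            if event["path"].startswith(root):
                bucket = totals[(db_name, classify(event["path"]))]
                bucket["bytes"] += event["bytes"]
                bucket["duration_ms"] += event["duration_ms"]
                break
    return totals

def classify(path):
    # Illustrative classification by folder name; a real on-disk layout differs.
    if "Indexes" in path:
        return "indexing"
    if "Journal" in path:
        return "journal"
    return "documents"
```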

As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to see RavenDB 3.5 in all its glory.


time to read 4 min | 647 words

I don’t like Lucene. It is an external dependency that works in somewhat funny ways, and the version we use is a relatively old one that has been mostly ported as-is from Java. This leads to some design decisions that are questionable (for example, using exceptions for control flow in parsing queries), or just awkward (by default, an error in merging segments will kill your entire process). Getting Lucene to run properly in production takes quite a bit of work and effort. So I don’t like Lucene.

We have spiked various alternatives to Lucene multiple times, but it is a hard problem, and most solutions that we looked at led toward pretty much the same approach that Lucene takes. By now, we have been working with Lucene for over eight years, so we have gotten good at managing it, but there is still quite a bit of code in RavenDB that is dedicated to managing Lucene's state, figuring out how to recover in case of errors, etc.

Just off the top of my head, we have code to recover from aborted indexing, background processes that take regular backups of the indexes so we'll be able to restore them in case of an error, etc. At some point we had a lab of machines that was dedicated to testing that our code was able to manage Lucene properly in the presence of hard resets. We got it working, eventually, but it was hard. And we still get issues from users that run into trouble because Lucene can tie itself into knots (for example, a disk-full error midway through indexing can corrupt your index and require us to reset it). And that is leaving aside the joy of what I/O reordering does to you when you need to ensure reliability.

So the problem isn’t with Lucene itself, the problem is that it isn’t reliable. That led us to the Lucene persistence format. While Lucene persistent mode is technically pluggable, in practice, this isn’t really possible. The file format and the way it works are very closely tied to the idea of files. Actually, the idea of process data as a stream of bytes. At some point, we thought that it would be good to implement a Transactional NTFS Lucene Directory, but that idea isn’t really viable, since that is going away.

It was at this point that we realized that we were barking up the entirely wrong tree. We already have the technology in place to make Lucene reliable: Voron!

Voron is a low-level storage engine that offers ACID transactions. All we need to do is develop a VoronLuceneDirectory, and that should handle the reliability part of the equation. There are a couple of details that need to be handled; in particular, Voron needs to know, upfront, how much data you want to write, and a single value in Voron is limited to 2GB. But that is fairly easily done. We write to a temporary file from Lucene, until it tells us to commit. At that point we can write it to Voron directly (breaking it into multiple values if needed).
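A minimal sketch of that commit path, with assumed names (this is not the actual VoronLuceneDirectory code, and the key scheme is purely illustrative):

```python
MAX_VORON_VALUE = 2 * 1024 ** 3 - 1   # a single Voron value must stay under 2GB

def commit_lucene_file(voron_tx, file_name, scratch_path, chunk_size=64 * 1024 * 1024):
    """Copy a Lucene-written scratch file into Voron on commit, splitting it
    into multiple values so no single value crosses the 2GB limit."""
    with open(scratch_path, "rb") as scratch:
        chunk_index = 0
        while True:
            chunk = scratch.read(min(chunk_size, MAX_VORON_VALUE))
            if not chunk:
                break
            # One logical Lucene file becomes many Voron values, keyed by chunk number.
            voron_tx.put(f"{file_name}/{chunk_index:08d}", chunk)
            chunk_index += 1
    # Everything written here becomes visible atomically only when the
    # surrounding Voron transaction commits, which is what buys the reliability.
```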

Voila, we have got ourselves a reliable mechanism for storing Lucene’s data. And we can do all of that in a single atomic transaction.

When reading the data, we can skip all of the hard work and file I/O and serve it directly from Voron's memory map. And having everything inside a single Voron file means that we can skip things like the compound file format Lucene uses and choose a more optimal approach.

And with a reliable way to handle indexing, quite large swaths of code can just go away. We can now safely assume that indexes are consistent, so we don’t need to have a lot of checks on that, startup verifications, recovery modes, online backups, etc.

Improvement by omission indeed.

time to read 2 min | 240 words

Another replication feature in RavenDB is the ability to replicate only specific collections to a sibling, and even the ability to transform the data as it goes out.

Here is an example of how this looks in the UI, which will probably do it much better justice than a thousand words from me.

[Image: filtered replication configuration in the RavenDB Studio UI]

There are several reasons why you'll want to use this feature. In terms of deployment, it gives you an easy way to replicate parts of a database's data to a number of other databases, without requiring you to send the full details. A typical example would be a product catalog that is shared among multiple tenants, but where each tenant can modify the products or add new ones.

Another example would be to have the production database replicate back to the staging database, with certain fields masked so the development / QA team can work with a realistic dataset.
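To make the filter-and-transform idea concrete, here is an illustrative sketch (not the actual replication script syntax; collection names and masked fields are assumptions):

```python
REPLICATED_COLLECTIONS = {"Orders", "Customers"}   # everything else stays home

def outgoing(doc, collection):
    """Decide whether a document leaves the server, and scrub it if it does."""
    if collection not in REPLICATED_COLLECTIONS:
        return None                               # filtered out, never replicated
    if collection == "Customers":
        doc = dict(doc)                           # don't mutate the stored document
        doc["Email"] = "masked@example.com"       # hypothetical masked fields
        doc["CreditCard"] = None
    return doc
```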

An important consideration with filtered replication is that because the data is filtered, a destination that is using filtered replication isn't a viable fallback target, and it will not be considered as such by the client. If you want failover, you need to have multiple replicas, some with the full data set and some with the filtered data.
