Ayende @ Rahien



The design of RavenDB 4.0: Making Lucene reliable

time to read 4 min | 647 words

I don’t like Lucene. It is an external dependency that works in somewhat funny ways, and the version we use is a relatively old one that has been mostly ported as-is from Java. This leads to some design decisions that are questionable (for example, using exceptions for control flow in parsing queries), or just awkward (by default, an error in merging segments will kill your entire process). Getting Lucene to run properly in production takes quite a bit of work and effort. So I don’t like Lucene.

We have spiked various alternatives to Lucene multiple times, but it is a hard problem, and most solutions that we looked at lead toward pretty much the same approach that Lucene takes. By now, we have been working with Lucene for over eight years, so we have gotten good at managing it, but there is still quite a bit of code in RavenDB that is dedicated to managing Lucene's state, figuring out how to recover in case of errors, etc.

Just off the top of my head, we have code to recover from aborted indexing, background processes that take regular backups of the indexes so we'll be able to restore them in the case of an error, etc. At some point we had a lab of machines that were dedicated to testing that our code was able to manage Lucene properly in the presence of hard resets. We got it working, eventually, but it was hard. And we still get issues from users that run into trouble because Lucene can tie itself into knots (for example, a disk full error midway through indexing can corrupt your index and require us to reset it). And that is leaving aside the joy of what I/O re-ordering does to you when you need to ensure reliability.

So the problem isn't so much with Lucene itself as with the fact that it isn't reliable. That led us to look at the Lucene persistence format. While Lucene's persistence mode is technically pluggable, in practice this isn't really possible. The file format and the way it works are very closely tied to the idea of files, or rather, to the idea of processing data as a stream of bytes. At some point we thought that it would be good to implement a Transactional NTFS Lucene Directory, but that idea isn't really viable, since Transactional NTFS is going away.

It was at this point that we realized that we were barking up the entirely wrong tree. We already have the technology in place to make Lucene reliable: Voron!

Voron is a low level storage engine that offers ACID transactions. All we need to do is develop a VoronLuceneDirectory, and that should handle the reliability part of the equation. There are a couple of details that need to be handled; in particular, Voron needs to know, upfront, how much data you want to write, and a single value in Voron is limited to 2GB. But that is fairly easily done. We write to a temporary file from Lucene until it tells us to commit, at which point we can write it to Voron directly (potentially breaking it into multiple values if needed).
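To make that concrete, here is a minimal sketch of the commit-time copy, assuming a hypothetical Voron-like transaction API (IVoronTransaction and CreateValueStream are stand-ins for illustration, not the real Voron surface): each temporary Lucene file is copied into one or more values, each kept under the 2GB limit, all inside the same transaction.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a Voron write transaction; the real API differs.
public interface IVoronTransaction
{
    Stream CreateValueStream(string key, long declaredSize);
}

public static class LuceneCommitToVoron
{
    // A single Voron value is limited to 2GB, so stay comfortably below that.
    private const long MaxChunkSize = 1024L * 1024 * 1024;

    // Copies one temporary Lucene file into Voron, splitting it into chunks if
    // needed. The caller commits the transaction after all files in the Lucene
    // commit have been copied, which makes the whole commit atomic.
    public static void CopyFile(IVoronTransaction tx, string luceneFileName, string tempFilePath)
    {
        using (var file = File.OpenRead(tempFilePath))
        {
            var buffer = new byte[64 * 1024];
            for (long chunk = 0; file.Position < file.Length; chunk++)
            {
                long remaining = Math.Min(MaxChunkSize, file.Length - file.Position);
                // Voron wants to know the size upfront, so we declare it here.
                using (var value = tx.CreateValueStream(luceneFileName + "/" + chunk, remaining))
                {
                    long written = 0;
                    while (written < remaining)
                    {
                        int read = file.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining - written));
                        value.Write(buffer, 0, read);
                        written += read;
                    }
                }
            }
        }
    }
}
```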

Voila, we have got ourselves a reliable mechanism for storing Lucene’s data. And we can do all of that in a single atomic transaction.

When reading the data, we can skip all of the hard work and file I/O and serve it directly from Voron's memory map. And having everything inside a single Voron file means that we can skip doing things like the compound file format Lucene is using, and choose a more optimal approach.

And with a reliable way to handle indexing, quite large swaths of code can just go away. We can now safely assume that indexes are consistent, so we don’t need to have a lot of checks on that, startup verifications, recovery modes, online backups, etc.

Improvement by omission indeed.

RavenDB 3.5 whirl wind tour: You want all the data, you can’t handle all the data

time to read 2 min | 240 words

Another replication feature in RavenDB is the ability to replicate only specific collections to a sibling, and even the ability to transform the outgoing data as it goes out.

Here is an example of how this looks in the UI, which will probably do it much better justice than a thousand words from me.

[Image: the replication configuration UI]

There are several reasons why you'll want to use this feature. In terms of deployment, it gives you an easy way to replicate parts of a database's data to a number of other databases, without requiring you to send the full details. A typical example would be a product catalog that is shared among multiple tenants, but where each tenant can modify the products or add new ones.

Another example would be having the production database replicate back to the staging database, with certain fields masked so the development / QA team can work with a realistic dataset.

An important consideration with filtered replication is that because the data is filtered, a destination that is using filtered replication isn't a viable fallback target, and it will not be considered as such by the client. If you want failover, you need to have multiple replicas, some with the full data set and some with the filtered data.

RavenDB 3.5 whirl wind tour: A large cluster goes into a bar and orders N^2 drinks

time to read 3 min | 443 words

Imagine that you have a two node cluster, set up as master-master replication, and then you write a document to one of the nodes. The node you wrote the document to now contacts the 2nd node to let it know about the new document. The data is replicated, and everything is happy in this world.

But now let us imagine it with three nodes. We write to node 1, which will then replicate to nodes 2 and 3. But node 2 is also configured to replicate to node 3, and given that it has a new document in, it will do just that. Node 3 will detect that it already has this document and turn that into a no-op. But at the same time that node 3 is getting the document from node 2, it is also sending the document it got from node 1 to node 2.

This works, and it evens out eventually, because replicating a document that was already replicated is safe to do. And on high load systems, replication is batched, so you typically don't see a problem until you get to bigger cluster sizes.

Let us take the following six way cluster. In this graph, we are going to have 15 round trips across the network on each replication batch.*

* Nitpicker corner, yes, I know that the number of connections is ( N * (N-1) ) / 2, but N^2 looks better in the title.
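To make the math concrete, here is a tiny snippet with the arithmetic itself (the class and method names are just for illustration):

```csharp
using System;

class MeshTopologyMath
{
    // A fully connected mesh of N nodes has N * (N - 1) / 2 connections.
    static int MeshConnections(int n) => n * (n - 1) / 2;

    static void Main()
    {
        Console.WriteLine(MeshConnections(2));  // 1
        Console.WriteLine(MeshConnections(6));  // 15 -- the six way cluster above
        Console.WriteLine(MeshConnections(10)); // 45 -- the cost grows quadratically
    }
}
```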

[Image: fully connected mesh topology]

The typical answer we have for this is to change the topology: instead of having a fully connected graph, with each node talking to all other nodes, we use something like this:

[Image: tree topology]

Real topologies typically have more than a single path, and it isn't a hierarchy, but this is to illustrate a point.

This works, but it requires the operations team to plan ahead when they deploy, and if you didn't allow for breakage, a single node going down can disconnect a large portion of your cluster. That is not ideal.

So in RavenDB 3.5 we have taken steps to avoid it: nodes are now much smarter about the way they go about talking about their changes. Instead of getting all fired up and starting to send replication messages all over the place, potentially putting some serious pressure on the network, the nodes will wait a bit to see if their siblings already got the documents from the same source. In that case, we only need to ping them periodically to ensure that they are still connected, and we have saved a whole bunch of bandwidth.
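Here is a rough sketch of that behavior with a made-up peer abstraction (IReplicationPeer and its methods are hypothetical, not RavenDB's actual replication code): before shipping a batch, a node checks what the destination has already received from the original source, and falls back to a cheap heartbeat if a sibling got there first.

```csharp
using System;

// Hypothetical abstraction over a replication destination.
public interface IReplicationPeer
{
    long GetLastEtagFrom(Guid sourceServerId); // what has it already seen from that source?
    void Heartbeat();                          // cheap "still connected" ping
    void SendDocuments(byte[] serializedBatch);
}

public class ReplicationBatchSender
{
    private readonly IReplicationPeer _peer;

    public ReplicationBatchSender(IReplicationPeer peer) => _peer = peer;

    public void Replicate(Guid sourceServerId, long batchLastEtag, byte[] serializedBatch)
    {
        long alreadySeen = _peer.GetLastEtagFrom(sourceServerId);
        if (alreadySeen >= batchLastEtag)
        {
            // A sibling already delivered these documents -- just confirm we
            // are still connected instead of re-sending the whole batch.
            _peer.Heartbeat();
            return;
        }
        _peer.SendDocuments(serializedBatch);
    }
}
```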

The design of RavenDB 4.0: The implications of the blittable format

time to read 6 min | 1134 words

I have written extensively about the blittable format already, so I’ll not get into that again. But what I wanted to do in this post is to discuss the implication of the intersection of two very important features:

  • The blittable format requires no further action to be useful.
  • Voron is based on a memory mapped file concept.

Those two, brought together, are quite interesting.

To see why, let us consider the current state of affairs. In RavenDB 3.0, we store the data as json directly. Whenever we need to read a document, we need to load the document from disk, parse the json, load it into .NET objects, and only then do something with it. When we just got started with RavenDB, it didn’t actually matter to us. Our main concern was I/O, and that dominated all our costs. We spent multiple releases improving on that, and the solution was the prefetcher.

  • The prefetcher will load documents from the disk and make them ready to be indexed.
  • The prefetcher is running concurrently to indexing, so we can parallelize I/O and CPU work.

That allows us to reduce most of the I/O wait times, but it still left us with problems. If two indexes are working, and they each use their own prefetcher, then we have double the I/O cost, double the parsing cost, double the memory cost, double the GC cost. So in order to avoid that, we group together indexes that are roughly at the same place in their indexing. But that leads to a different set of problems: if we have one slow index, it would impact all the other indexes, so we need to have a way to “abandon” an index while it is indexing, to give the other indexes in the group a chance to run.

There is also another issue: when inserting documents into the database, we want to index them, but it seems stupid to take the documents, write them to the disk, only to then load them from the disk, parse them, etc. So when we insert a new document, we add it to the prefetcher directly, saving us some work in the common case where indexes are caught up and only need to index new things. That, too, has a cost: it means that the lifetime of such objects tends to be much longer, which means that they are more likely to be pushed into Gen1 or Gen2, so they will not be collected for a while, and when they are, it will be a more expensive collection run.

Oh, and to top it off, all of the structures above need to consider available memory, load on the server, time for the indexing batch, I/O rates, liveness and probably a dozen other factors that don't pop to mind right now. In short, this is complex.

With RavenDB 4.0, we set out to remove all of this complexity. A large part of the motivation for the blittable format and using Voron are driven by the reasoning below.

If we can get to a point where we can just access the values, and reading documents won't incur a heavy penalty in CPU/memory, we could radically shift the cost structure. Let us see how. Now, the only cost for indexing is going to be pure I/O, paging the documents into memory when we access them. Actually indexing them is done by merely accessing the mapped memory directly, so we don't actually need to allocate much memory during indexing.

Optimizing the actual I/O is pretty easily done by just asking the operating system: we can do that explicitly using PrefetchVirtualMemory or madvise(MADV_WILLNEED), or just let the OS handle it based on the actual access pattern. So those are two separate issues that just went away completely. And without needing to spread the cost of loading the documents among all indexes, we no longer have a good reason to go with grouping indexes. So that is out the window, as well as all the complexity that is required to handle a slow index slowing down everyone.
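As a minimal sketch of the explicit route (not necessarily what RavenDB itself does), prefetching a mapped range on Windows boils down to a single P/Invoke call; on Linux the equivalent hint is madvise(addr, len, MADV_WILLNEED):

```csharp
using System;
using System.Runtime.InteropServices;

static class MemoryPrefetch
{
    [StructLayout(LayoutKind.Sequential)]
    private struct WIN32_MEMORY_RANGE_ENTRY
    {
        public IntPtr VirtualAddress;
        public UIntPtr NumberOfBytes;
    }

    [DllImport("kernel32.dll")]
    private static extern IntPtr GetCurrentProcess();

    // Available on Windows 8 / Server 2012 and later.
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool PrefetchVirtualMemory(
        IntPtr hProcess,
        UIntPtr numberOfEntries,
        ref WIN32_MEMORY_RANGE_ENTRY virtualAddresses,
        uint flags);

    // Ask the OS to start paging in [address, address + size) without blocking.
    public static void Prefetch(IntPtr address, long size)
    {
        var range = new WIN32_MEMORY_RANGE_ENTRY
        {
            VirtualAddress = address,
            NumberOfBytes = (UIntPtr)(ulong)size
        };
        PrefetchVirtualMemory(GetCurrentProcess(), (UIntPtr)1u, ref range, 0);
    }
}
```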

And because newly written documents are likely to be memory resident (they have just been accessed, after all), we can just skip the whole “let us remember recently written documents for the indexes”, because by the time we index them, we are expecting them to still be in memory.

What is interesting here is that by using the right infrastructure we have been able to remove quite a lot of code. Now, being able to remove a lot of code is almost always great, but the major change here is that all of the code we removed had to deal with a very large number of factors (if new documents are coming in, but indexing isn't caught up to them, we need to stop putting the new documents into the prefetcher cache and clear it) that are hard to predict and sometimes interact in funny ways. By moving a lot of that complexity to “let us manage what parts of the file are memory resident”, we can simplify much of it and even push a lot of it directly to the operating system.

This has other implications. Because we now no longer need to run indexes in groups, and they can each run and do their own thing, we can split them so each index has its own dedicated thread. Which means, in turn, that if we have a very busy index, it is going to be very easy to point out which one is the culprit. It also makes it much easier for us to handle priorities. Because each index is a thread, it means that we can now rely on the OS prioritization. If you have an index that you really care about running as soon as possible, we can bump its priority higher. And by default, we can very easily mark the indexing threads as lower priority, so we can prioritize answering incoming requests over processing indexes.

Doing it in this manner means that we are able to ask the OS to handle the problem of starvation in the system, where an index doesn’t get to run because it has a lower priority. All of that is already handled in the OS scheduler, so we can lean on that.
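Here is a minimal sketch of the thread-per-index idea with plain .NET threads, leaning on the OS scheduler for prioritization (the naming and the indexingLoop delegate are illustrative, not RavenDB's actual code):

```csharp
using System.Threading;

static class DedicatedIndexThread
{
    public static Thread Start(string indexName, ThreadStart indexingLoop,
                               ThreadPriority priority = ThreadPriority.BelowNormal)
    {
        var thread = new Thread(indexingLoop)
        {
            // Naming the thread makes a busy index easy to spot in a debugger
            // or a profiler.
            Name = "Indexing of " + indexName,
            IsBackground = true,
            // Below normal by default, so answering requests wins over
            // indexing; a specific index can be bumped higher by the admin.
            Priority = priority
        };
        thread.Start();
        return thread;
    }
}
```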

Probably the hardest part in the design of RavenDB 4.0 is that we are thinking very hard about how to achieve our goals (and in many cases exceed them) not by writing code, but by not writing code, by arranging things so the right thing will just happen. Architecture and optimization by omission, so to speak.

As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to learn about RavenDB 4.0 and the direction in which we are going.


RavenDB 3.5 whirl wind tour: I’m the admin, and I got the POWER

time to read 2 min | 294 words

It looks like I’m on a roll with administrator and operations features in RavenDB 3.5, and in this case, I want to introduce the Administrator JS Console.

This is a way for a server administrator to execute arbitrary code on a running system. Here is a small such example:

[Image: the Administrator JS Console in action]

This is 100% the case of giving the administrator the option for running with scissors.

[Image] Yes, I’m feeling a bit nostalgic :)

The idea here is that we have the ability to query and modify the state of the server directly. We don’t have to rely on endpoints that were prepared ahead of time, and on only being able to do whatever it is we thought of beforehand.

If we run into a hard problem, we’ll actually be able to ask the server directly “what ails you?”, and when we find out, tell it how to fix that. For example, interrogating the server about the state of its memory, then telling it to flush a particular cache, or changing (at runtime, without any restarts or downtime) the value of a particular configuration option, or giving the server additional information it is missing.

We already saw three cases where having this ability would have been a major time saver, so we are really excited about this feature, and what it means to our ability to more easily support RavenDB in production.

The design of RavenDB 4.0: Voron takes flight

time to read 4 min | 771 words

RavenDB has been using the low level Esent as our storage engine from day 1. We toyed with building our own storage engine in Munin, but it was only in 2013 that we started paying serious attention to that. A large part of that was the realization that Esent is great, but it isn’t without its own issues and bugs (it is relatively easy to get it into referencing invalid memory, for example, if you run multiple defrag operations) that we have to work around. But two major reasons were behind our decision to invest a lot of time and effort into building Voron.

When a customer has an issue, they don’t care that this is some separate component that is causing it; we need to be able to provide them with an answer, fast. And escalating to Microsoft works, but it is cumbersome and in many cases results in a game of telephone. This can be amusing when you see kids do it, but not so much fun when it is an irate customer with a serious problem on the phone.

The second reason is that Esent was not designed for our workload. It has done great by us, but it isn’t something that we can take and tweak and file away all the rough edges in our own usage. In order to provide the level of performance and flexibility we need, we have to own every critical piece in our stack, and Voron was the result of that. Voron was released as part of RavenDB 3.0, and we have some customers running the biggest databases on it, battle testing it in production for the past two+ years.

With RavenDB 4.0, we have made Voron the only storage engine we support, and have extended it further. In particular, we added low level data structures that changed some operations from O(M * logN) to O(logN), pushed more functionality into the storage engine itself and simplified the concurrency model.

In the Voron that shipped with RavenDB 3.0, the only data structure that we had was the B+Tree. We had multiple of those, and you could nest them, but that was it. Data was stored in the following manner:

  • Documents tree (key: document id, value: document json)
  • Etag tree (key: etag, value: document key)

We had one B+Tree as the primary, which contained the actual data, and a number of additional indexes, which allowed us to search on additional fields and then find the relevant data by looking it up in the data tree.

This had two issues. The first was that our code needed to manually make sure that we were updating all the index trees whenever we updated/deleted/created any values. The second was that a scan over an index would result in the code first doing an O(logN) search over the index tree, then for each matching result doing another lookup in the actual data tree.

In practice, this only showed up as a problem when you have over 200 million entries, in which case the performance cost was noticeable. But for that purpose, and for a bunch of other reasons (which we’ll discuss in the next post), we made the following changes to Voron.

We now have 4 data structures supported.

  • B+Tree – same as before, variable size key & value.
  • Fixed Size B+Tree – int64 key, and a fixed size of value (configured at creation time). Same as the one above, but lets us take advantage of various optimizations when we know what the total size is.
  • Raw data section – allows storing data, handing back an opaque id to access the data later.
  • Table – combination of raw data sections with any number of indexes.

The raw data section is interesting, because it allows us to just ask it to store a bunch of data, and get an id back (int64), and it has an O(1) cost for accessing those values using the id.

We then use this id as the value in the B+Tree, which means that our structure now looks like this:

  • Raw data sections – [document json, document json, etc]
  • Documents tree (key: document id, value: raw data section id)
  • Etags tree (key: etag, value: raw data section id)

So now getting the results from the etags tree is a single O(logN) seek into the tree, and then just an O(1) cost for each document that we need to get from there.
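Here is a sketch of that lookup path, using hypothetical stand-ins for the Voron structures (ISortedTree and IRawDataSection below are illustrative, not the real API): one O(logN) seek on the etags tree, then an O(1) read per document.

```csharp
using System;
using System.Collections.Generic;

// Illustrative abstractions standing in for the real Voron data structures.
public interface ISortedTree
{
    // Seeks to the first key >= fromKey, then iterates forward in key order.
    IEnumerable<(long Key, long Value)> SeekForward(long fromKey);
}

public interface IRawDataSection
{
    // O(1) access to a stored value by its opaque id.
    ReadOnlyMemory<byte> Read(long id);
}

public class EtagScan
{
    // Returns the raw documents with an etag greater than afterEtag.
    public IEnumerable<ReadOnlyMemory<byte>> DocumentsAfter(
        ISortedTree etagsTree, IRawDataSection rawData, long afterEtag)
    {
        // One O(logN) seek to position the iterator in the etags tree...
        foreach (var (etag, rawDataId) in etagsTree.SeekForward(afterEtag + 1))
        {
            // ...then an O(1) read per document, straight from the section.
            yield return rawData.Read(rawDataId);
        }
    }
}
```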

Another very important aspect is the fact that Voron is based on memory mapped files, which allows some really interesting optimization opportunities. But that will be the subject of the next post.

RavenDB 3.5 whirl wind tour: Can you spare me a server?

time to read 2 min | 302 words

One of the most important features in RavenDB is replication, and the ability to use a replica node to get high availability and load balancing across all nodes.

And yes, there are users who choose to run on a single instance, or want to have a hot backup, but not pay for an additional license because they don’t need the high availability mode.

Hence, RavenDB 3.5 will introduce the Hot Spare mode:

[Image: the Hot Spare mode configuration]

A hot spare is a RavenDB node that is available only as a replication target. It cannot be used for answering queries or talking with clients, but it will get all the data from your server and keep it safe.

If there is a failure of the main node, an admin can light up the hot spare, which will turn the hot spare license into a normal license for a period of 96 hours. At that point, the RavenDB server will behave normally as a standard server, clients will be able to fail over to it, etc. After the 96 hours are up, it will revert to hot spare mode, but the administrator will not be able to activate it again without purchasing another hot spare license.

The general idea is that this license is going to be significantly cheaper than a full RavenDB license, but provides a layer of protection for the application. We would still recommend that you get a full cluster, but for users who just want the reassurance that they can hit the breaker and switch, without going into the full cluster investment, this is a good first step.

We’ll announce pricing for this when we release 3.5.

The design of RavenDB 4.0: Over the wire protocol

time to read 4 min | 617 words

So the first thing to tackle is the over the wire protocol. RavenDB is a REST based system, working purely within HTTP. A lot of that was because at the time of conception, REST was pretty much the thing, so it was natural to go ahead with that. For many reasons, that has been the right decision.

The ability to easily debug the over the wire protocol with a widely available tool like Fiddler makes it very easy to support in production, to script it using curl or wget, and in general to understand what is going on.

On the other hand, there are a bunch of things that we messed up on. In particular, RavenDB is using the HTTP headers to pass the document metadata. At the time, that seemed like an elegant decision, and something that is easily in line with how REST is supposed to work. In practice, this limited the kind of metadata that we can use to stuff that can pass through HTTP headers, and forced some constraints on us (case insensitivity, for example), and there have been several cases where proxies of various kinds inserted their own metadata that ended up in RavenDB, sometimes resulting in bad things happening.

When looking at RavenDB 4.0, we made the following decisions:

  • We are still going to use HTTP as the primary communication mechanism.
  • Unless we have a good reason to avoid it, we are going to be human readable over the wire.

So now, instead of sending document metadata as HTTP headers, we send it inside the document, and use headers only to control HTTP itself. That said, we have taken the time to analyze our protocol, and we found multiple places where we can do much better. In particular, the existence of web sockets means that there are certain scenarios that just became tremendously simpler for us. RavenDB has several needs for bidirectional communication: bulk inserts, the Changes API, subscriptions, etc.

The fact that web sockets are now pretty much available across the board means that we have a much easier time dealing with those scenarios. In fact, the presence of web sockets in general is going to have a major impact on how we are doing replication, but that will be the topic of another post.
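For illustration, consuming such a bidirectional channel from .NET is little more than a ClientWebSocket receive loop. The endpoint URL and the message shape below are made up for the example, not the actual RavenDB 4.0 wire contract:

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class ChangesListenerSketch
{
    static async Task Main()
    {
        using var socket = new ClientWebSocket();
        // Hypothetical changes endpoint, just to show the shape of the code.
        await socket.ConnectAsync(
            new Uri("ws://localhost:8080/databases/Northwind/changes"),
            CancellationToken.None);

        var buffer = new byte[4096];
        while (socket.State == WebSocketState.Open)
        {
            var result = await socket.ReceiveAsync(
                new ArraySegment<byte>(buffer), CancellationToken.None);
            if (result.MessageType == WebSocketMessageType.Close)
                break;

            // Human readable over the wire: each notification is just text.
            Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, result.Count));
        }
    }
}
```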

Beyond the raw over the wire protocol, we also looked at a common source for bugs, issues and trouble: IIS. Hosting under IIS means that you are playing by IIS rules, which can be really hard to do if you aren’t a classic web application. In particular, IIS has certain ideas about size of requests, their duration, the time you have to shut down or start up, etc. In certain configurations, IIS will just decide to puke on you (if you have a long URL, it will return 404, with no indication why, for example), resulting in escalated support calls.  One particular support call happened because a database took too long to open (as a result of a rude shutdown by IIS), which resulted in IIS aborting the thread and hanging the entire server because of an abandoned mutex. Fun stuff like that.

In RavenDB 4.0 we are going to be using Kestrel exclusively, which simplifies our life considerably. You can still front it with IIS, of course (or nginx, etc.), for operational reasons, logging, etc. But this means that RavenDB will be running in its own process, worrying about its own initialization, shutdown, etc. It makes our life much easier.

That is about it for the protocol bits, in the next post, I’ll talk about the most important aspect of a database, how the data actually gets stored on disk.

RavenDB 3.5 whirl wind tour: Configuring once is best done after testing twice

time to read 2 min | 222 words

One of the things that happens when you start using RavenDB is that you start using more of it, and more of its capabilities. The problem there is that often users end up with a common set of stuff that they use, and they need to apply it across multiple databases (and across multiple servers).

This is where the Global Configuration option comes into play. It allows you to define the behavior of the system once and have it apply to all your databases, and using the RavenDB Cluster option, you can apply it across all your nodes in one go.

[Image: the Global Configuration screen]

As you can see in the image, we allow configuring this globally, for the entire server (or multiple servers), instead of having to remember to configure it for each one. The rest of the tabs work in pretty much the same manner. You can configure global behavior (and override it on individual databases, of course). Aside from the fact that it is available cluster-wide, this isn't a major feature, or a complex one, but it is one that is going to make it easier for the operations team to work with RavenDB.

