Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


You are judged on what is above the waterline

time to read 6 min | 1182 words

image

This is written a day before the Jewish New Year, so I suppose that I’m a bit introspective. We recently had a discussion in the office about priorities and what things are important for RavenDB. This is a good time to do so; just after the RC release, we can look back and see what we did right and what we did wrong.

RavenDB is an amazing database, even if I say so myself, but one of the things that I find extremely frustrating is that so much work is going into things that have nothing to do with the actual database. For example, the studio. At any given point, we have a minimum of 6 people working on the RavenDB Studio.

Here is a confession: as far as I’m concerned, the studio is a black box. I go and talk to the studio lead, and things happen. About the only time that I actually go into the studio code is when I did something that broke the studio build. But I’ll admit that I usually go to one of the studio guys and have them fix it for me. :-)

Yep, at this stage, I can chuck that “full stack” title right off the edge. I’m a backend guy now, almost completely. To be fair, when I started writing HTML (it wasn’t called web apps, and the fancy new stuff was called DHTML) the hot new thing was the notion that you shouldn’t use <blink/> all the time and we just got <table> for layout.  I’m rambling a bit, but I’ll get there, I think.

I find the way web applications are built today to be strange, horribly complex and utterly foreign (just the size of the node_modules folder is scary). But this isn’t a post about an old timer telling you how good the old days were, when you had 16 colors to play with and users knew their place. This post is about how a product is perceived.

I mentioned that RavenDB is amazing, right? It is certainly the most complex project that I have ever worked on and it is chock full of really good stuff. And none of that matters unless it is in the studio. In fact, that team of people working on the studio? That includes only the people actually working on the studio itself. It doesn’t include all the work done to support the studio on the server side. So it would be pretty fair to say that over half of the effort we put into RavenDB is actually spent on the studio at this point.

And it just sounds utterly crazy, right? We literally spent more time building the animation for the cluster state changes, so you can see them as they happen, than we spent writing the cluster behavior itself. Given my own inclinations, that can be quite annoying.

It is easy to point at all of this hassle and say: “This is nonsense, surely we can do better”. And indeed, a lot of our contemporaries get away with this for their user interface:

I’ll admit that this was tempting. It would free up so much time and effort.

It would also be quite wrong, for several reasons. I’ll start with the good of the project.

Here is one example from the studio (rough cost: a week for the backend work, another week for various enhancements and fixes, about two weeks for the UI work, and some issues still pending) showing the breakdown of the I/O operations made on a live RavenDB instance.

image

There is also some runtime cost to actually tracking this information, and technically speaking we would be able to get the same information with strace or similar. So why would we want to spend all this time and effort on something that a user would likely already have?

Well, this particular graph is responsible for us being able to track down and resolve several extremely hard to pin down performance issues with how we write to disk. Here is how this looks at a slightly higher level: green is writes to the journal, blue is flushes to the data file and orange is fsyncs.

image

At this point, I have a guy in the office who has stared at these graphs so much that he can probably tell me the spin rate of the underlying disk just by taking a glance. Let us be conservative and call it a mere 35% improvement in the speed of writing to disk.

Similarly, the cluster behavior graph that I complained about? Just doing QA on the graph part allowed us to find several issues that we hadn’t noticed before, simply because now they were suddenly visible.

That wouldn’t justify the amount of investment that we put into them, though. We could have built diagnostics tools much more cheaply than that, after all. But they aren’t meant just for us; all of these features are there for actual customer use in production, and if they are useful for us during development, they are going to be invaluable for users trying to diagnose things in production. So while I may consider the art of graph drawing black magic of the highest caliber, I can most certainly see the value in such features.

And then there is the product aspect. Giving a user a complete package, where they can hit the ground running and feel that they have a good working environment, is a really good way to keep said user. There is also the fact that as a database, things like the studio are not meant primarily for developers. In fact, the heaviest users of the studio are the admin staff managing RavenDB, and they have a very different perspective on things. Making the studio useful for such users was an interesting challenge, luckily handled by people who actually know what they are doing in this regard much better than me.

And last, but not least, and tying back to the title of this post: very few people actually take the time to do a full study of any particular topic; we use a lot of shortcuts to make decisions. And seeing the level of investment put into the user interface is often a good indication of overall product quality. And yes, I’m aware of the “slap some lipstick on this pig and ship it” mentality; there is a reason it works, even if it is not such a good idea as a long term strategy. Having a solid foundation and a good user interface is a lot more challenging, but gives a far better end result.

And yet, I’m a little bit sad (and mostly relieved) that there are areas in the project that I can neither work on nor understand.

When disk and hardware fall…

time to read 5 min | 956 words

When your back is against the wall, and your only hope is for black magic (and alcohol).

The title of this post is taken from this song. The topic of this post is a pretty sad one, but a mandatory discussion when dealing with data that you don’t want to lose. We are going to discuss hard system failures.

The source can be anything from actual physical disk errors to faulty memory causing corruption. The end result is that you have a database that is corrupted in some manner. RavenDB actually has multiple levels of protection to detect such scenarios. All the data is verified with checksums on first load from the disk, and the transaction journal is verified when applying it as well. But stuff happens, and thanks to Murphy, that stuff isn’t always pleasant.

One of the hard criteria for the Release Candidate was a good story around catastrophic data recovery. What do I mean by that? I mean that something corrupted the data file in such a way that RavenDB cannot load normally. So sit tight and let me tell you this story.

We first need to define what we are trying to handle. The catastrophic data recovery feature is meant to:

  1. Recover user data (documents, attachments, etc) stored inside a RavenDB file.
  2. Recover as much data as possible, disregarding its state, letting the user verify correctness (i.e., it may recover deleted documents).
  3. Does not include indexed data, configuration, cluster settings, etc. This is because these can be quite easily handled by recreating indexes or setting up a new cluster.
  4. Does not replace high availability, backups or proper preventive maintenance.
  5. Does not attempt to handle malicious corruption of the data.

Basically, the idea is that when you are up shit creek, we can hand you a paddle. That said, you are still up shit creek.

I mentioned previously that RavenDB goes to quite some lengths to ensure that it knows when the data on disk is messed up. We also did a lot of work to make sure that when needed, we can actually do some meaningful work to extract your data. This means that when looking at the raw file format, we actually have extra data there that isn’t used for anything in RavenDB except by the recovery tools. That (the change to the file format) was the reason this was a Stop-Ship priority issue.

Given that we are already in catastrophic data recovery mode, we can make very few assumptions about the state of the data. A database is a complex beast, involving a lot of moving parts, and the on disk format is very complex and subject to a lot of state and behavior. We are already in catastrophic territory, so we can’t just use the data as we normally would. Imagine a tree where following the pointers to the lower level might in some cases lead to garbage data or invalid memory. We have to assume that the data has been corrupted.

Some systems handle this by having two copies of the master data records. Given that RavenDB is assumed to run on modern file systems, we don’t bother with that. ReFS on Windows and ZFS on Linux handle that task better, and we assume that production usage will use something similar. Instead, we designed the way we store the data on disk so we can read through the raw bytes and still make sense of what is going on inside.

In other words, we are going to effectively read one page (8KB) at a time, verify that the checksum matches the expected value and then look at the content. If this is a document or an attachment, we can detect that and recover them, without having to understand anything else about the way the system works. In fact, the recovery tool is intentionally limited to a basic forward scan of the data, without any understanding of the actual file format.
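To make the idea concrete, here is a minimal sketch of such a forward scan. The page layout (an 8KB page whose first 8 bytes hold a checksum, followed by a flag byte) and the helper names are hypothetical, so treat this as an illustration of the approach rather than the actual recovery tool code.

```csharp
using System;
using System.IO;

// Minimal sketch of a catastrophic-recovery style forward scan.
// The page layout (checksum in the first 8 bytes, then a flag byte) is
// hypothetical and only meant to illustrate the approach described above.
static class RecoveryScanSketch
{
    const int PageSize = 8 * 1024;
    const byte DocumentFlag = 1;     // hypothetical marker values
    const byte AttachmentFlag = 2;

    public static void Scan(string dataFilePath)
    {
        var page = new byte[PageSize];
        using (var file = File.OpenRead(dataFilePath))
        {
            for (long pageNumber = 0; file.Read(page, 0, PageSize) == PageSize; pageNumber++)
            {
                ulong stored = BitConverter.ToUInt64(page, 0);
                if (stored != ComputeChecksum(page, 8, PageSize - 8))
                    continue; // corrupted page: skip it and keep scanning forward

                byte flags = page[8];
                if (flags == DocumentFlag || flags == AttachmentFlag)
                    Console.WriteLine($"Recoverable item found on page {pageNumber}");
            }
        }
    }

    // Stand-in checksum (FNV-1a); the real implementation uses a proper page hash.
    static ulong ComputeChecksum(byte[] buffer, int offset, int count)
    {
        ulong hash = 14695981039346656037UL;
        for (int i = offset; i < offset + count; i++)
            hash = (hash ^ buffer[i]) * 1099511628211UL;
        return hash;
    }
}
```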

There are some complications when we are dealing with large documents (they can span more than 8KB), and large attachments (we support attachments that are more than 2GB in size) can require us to jump around a bit, but all of this can be done with very minimal understanding of the file format. The idea was that we can’t rely on any of the complex structures (B+Trees, internal indexes, etc) but can still recover anything that is still recoverable.

This also led to an interesting feature. Because we are looking at the raw data, whenever we see a document, we are going to write it out. But that document might have actually been deleted. The recovery tool doesn’t have a way of checking (it is intentionally limited), so it just writes it out. This means that we can use the recovery tool to “undelete” documents. Note that this isn’t actually guaranteed; don’t assume that you have an “undelete” feature. Depending on the state of the moon and the stomach content of the nearest duck, it may work, or it may not.

The recovery tool is important, but it isn’t magic, so some words of caution are in order. If you have to use the catastrophic data recovery tool, you are in trouble. High availability features such as replication and offsite replicas are the things you should be using, and backups are so important I can’t stress them enough.

The recommended deployment for RavenDB 4.0 is going to be in a High Availability cluster with scheduled backups. The recovery tool is important for us, but you should assume from the get go that if you need to use it, you aren’t in a good place.

Dynamic nodes distribution in RavenDB 4.0

time to read 3 min | 522 words

This is another cool feature that is best explained by showing, rather than explaining.

Consider the following cluster, made of three nodes.

image

On this cluster, we created the following database:

image

This database is hosted on nodes B and C, so we have a replication factor of 2. Here is how the database topology looks:

image

And now we are going to kill node B.

The first thing that is going to happen is that the cluster is going to mark this node as suspect.  In this case, the M on node C stands for Member, and the R on node B stands for Rehab.

image

Shortly afterward, we’ll detect that it is actually down:

image

And this leads to a problem: we were told that we need to have a replication factor of 2 for this database, but one of the nodes is down. There is a grace period, of course, but the cluster will eventually decide that it needs to be proactive about it and not wait around for nodes long past, hoping they’ll call, maybe.

So the cluster is going to take steps and do something about it:

image

And it does: it adds node A to the database group. Node A is added as a promotable member, and node C is now responsible for catching it up with all the data that is in the database.

Once that is done, we can promote node A to be a full member in the database. We’ll remove node B from the database entirely, and when it comes back up, we’ll delete the data from that node.

image

On the other hand, if node B was able to get back on its feet, it would now be a race between node A and node B to see which would be the first node to become up to date and eligible to become a full member again.

In most cases, that would be the pre-existing node, since it has a head start on the new node.

Clients, by the way, are not really aware that any of this is happening; the cluster manages their topology and lets them know which nodes they need to talk to at any given point.
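For example, with the 4.0 .NET client, an application is only configured with the cluster URLs; which node it actually talks to at any given moment comes from the topology the cluster hands it. A minimal sketch (the URLs, database name and document id here are made up):

```csharp
using Raven.Client.Documents;

// Sketch: the client is given the cluster URLs and a database name; node failures
// and topology changes are handled for it behind the scenes.
var store = new DocumentStore
{
    Urls = new[] { "http://node-a:8080", "http://node-b:8080", "http://node-c:8080" },
    Database = "Northwind"
}.Initialize();

using (var session = store.OpenSession())
{
    var order = session.Load<object>("orders/1-A");
    // No special handling here for a node going down or the database group changing.
}
```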

Oh, and just to let you know, these images are screenshots from the studio; you can actually see this happening live when we introduce a failure into the system and watch how the cluster recovers from it.

Distributed work assignment in RavenDB 4.0

time to read 2 min | 278 words

I can talk about this feature for quite some time, but it is really hard to explain via text. I even prepared an animated GIF that I was going to show here, but I draw the line at the <blink/> tag (and yes, I know that dates me).

Honestly, the best way to see this feature is in action, and we’ll be doing some videos doing just that once RavenDB is in Release Candidate mode, but I just couldn’t sit on this feature any longer. I actually talked about it a few times in the blog, but I don’t think anyone really realized what it means.

Let us consider the following three node cluster:

image

It has a few responsibilities: we need to do a daily backup, and there are a couple of offsite replicas that need to be maintained. The cluster assigns the work to the various nodes, and they are each responsible for doing their part. However, what happens when a node goes down?

In this case, the cluster detects that and dynamically redistributes the work among the other nodes in the cluster.

image 

Work can be things like backups, replicating outside of the cluster or managing batch jobs.
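Conceptually, the redistribution boils down to: take every task whose assigned node is no longer reachable and hand it to one of the surviving nodes. Here is a toy sketch of that idea; it is not the actual RavenDB assignment logic, just an illustration of the shape of the problem.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy illustration of redistributing cluster work when a node goes down.
// Not the actual RavenDB algorithm.
static class WorkAssignmentSketch
{
    // assignments: task name -> node tag, liveNodes: nodes currently reachable.
    public static Dictionary<string, string> Redistribute(
        IReadOnlyDictionary<string, string> assignments,
        ISet<string> liveNodes)
    {
        var result = new Dictionary<string, string>();
        var survivors = liveNodes.OrderBy(n => n).ToArray();
        int next = 0;

        foreach (var kvp in assignments)
        {
            // Keep the assignment if the owner is still up, otherwise hand the
            // task to one of the surviving nodes (round-robin here, for simplicity).
            result[kvp.Key] = liveNodes.Contains(kvp.Value)
                ? kvp.Value
                : survivors[next++ % survivors.Length];
        }

        return result;
    }
}
```

So when the node that owned, say, the daily backup goes down, that task simply ends up on one of the surviving nodes.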

In fact, because we have a node failure, the cluster has assigned node C to be the mentor of node A when it comes up again, letting it know about all the changes in the database that happened while it was down, but I’ll talk about that feature in another post.

The etag is dead, long live the change vector

time to read 4 min | 601 words

A few months ago I wrote about the etag simplification in RavenDB 4.0, but ironically, very shortly after that post, we had to retire the etag almost entirely. It is now relegated to semi retirement within the scope of a single node. Instead, we have the notion of a change vector. Those of you familiar with distributed system design may also know that by the term vector clock.

The basic problem is that an etag is not sufficient for change tracking in a distributed environment, and it doesn’t provide enough details to know whether a particular version is the parent of another version or a divergent conflict. One of the core scenarios that we wanted to enable in 4.0 is the ability to just move roles around between servers, and relying on etags for many things makes that much harder. For example, concurrency control, caching and conflict detection all benefit greatly from a change vector.

But what is it?

Let us take a simple example of a database with three nodes. Here is the change vector of one such document.

image

You can see that this document has entries for all three nodes in the cluster. We can tell at what point in its history it relates to each of the nodes. The change vector reflects the cluster wide point in time where something happened. We establish this change vector by gossiping about the state of the nodes in the cluster and incrementing our own counter whenever a document is changed locally.

In this case, this document was changed on node A, but this node knows that nodes B and C’s etags at this point are 1060, so it can include their etags in the timeline as well. This gives us the ability to roughly sort changes in a timeline in a distributed fashion, without relying on synchronized clocks. This also means that when looking at a change vector, I can unambiguously say that this change is after “A:1022, B:391, C:1060” and “A:1040, B:819, C:1007”. You might also note that I can tell that these latter two changes are in conflict with one another (since the first contains a higher etag value for C and the second contains a higher value for A), but that the document that I have in the picture is more up to date than they are and subsumes both updates.

Actually, the representation above is a lie; here is what it actually looks like:

image

This is the full change vector, including the unique database id, not just the node identifier. This means that change vectors are also available when you are working with multiple clusters, and serve much the same function there.
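Comparing two change vectors is conceptually just an entry-by-entry comparison. Here is a rough sketch of that idea, using plain node tags and etags for brevity (so it glosses over the database id part); it is an illustration, not RavenDB’s actual implementation.

```csharp
using System.Collections.Generic;
using System.Linq;

enum VectorRelation { Equal, Before, After, Conflict }

static class ChangeVectorSketch
{
    // Compare two change vectors represented as node -> etag maps.
    public static VectorRelation Compare(
        IReadOnlyDictionary<string, long> a, IReadOnlyDictionary<string, long> b)
    {
        bool aHasLarger = false, bHasLarger = false;

        foreach (var node in a.Keys.Union(b.Keys))
        {
            long av = a.TryGetValue(node, out var x) ? x : 0;
            long bv = b.TryGetValue(node, out var y) ? y : 0;
            if (av > bv) aHasLarger = true;
            if (bv > av) bHasLarger = true;
        }

        if (aHasLarger && bHasLarger) return VectorRelation.Conflict; // divergent updates
        if (aHasLarger) return VectorRelation.After;  // a subsumes b
        if (bHasLarger) return VectorRelation.Before; // b subsumes a
        return VectorRelation.Equal;
    }
}

// For the two vectors mentioned above, {A:1022, B:391, C:1060} vs {A:1040, B:819, C:1007},
// Compare returns Conflict, since each side has changes the other hasn't seen yet.
```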

With this, I can go to a server and give it a change vector and see if a document has changed (for caching or a cluster wide optimistic concurrency check). This makes a lot of things much simpler.

Although I should probably note that with regards to optimistic concurrency, we are just checking that this node didn’t have any updates on top of the last version the client has seen. RavenDB doesn’t attempt to do any form of cross cluster concurrency, favoring a multi master design and merging conflicts if needed. This mode allows us to remain up and functioning even with network splits or major failures, at the cost of having to deal with potentially merging conflicts down the road.
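From the client’s perspective this is mostly invisible. With the .NET client, per-node optimistic concurrency based on the change vector boils down to something like the following sketch (the User class and document id are made up, and store is assumed to be an initialized DocumentStore):

```csharp
// Sketch: store is an initialized DocumentStore, User is a trivial POCO with a Name property.
using (var session = store.OpenSession())
{
    // Fail the save if the document changed on this node since we loaded it,
    // i.e. if its change vector moved forward in the meantime.
    session.Advanced.UseOptimisticConcurrency = true;

    var user = session.Load<User>("users/1-A");
    user.Name = "Oren";

    session.SaveChanges(); // throws a concurrency exception if someone beat us to it
}
```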

Solving a problem in a way that doesn’t generate dozens more

time to read 2 min | 393 words

Yesterday I showed off some of the new features in RQL, and in particular, two very cool features, the ability to declare functions and the ability to project an object literal directly from the select.

Both of these features are shown here, including usage demonstration:

image
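The query itself was shown as an embedded image, so it isn’t reproduced here. A hypothetical query in the same spirit (field names invented, issued through the .NET client’s raw query API) might look like this:

```csharp
// Assumes: using System.Linq; and an initialized DocumentStore named store.
using (var session = store.OpenSession())
{
    // Hypothetical query combining the two features: a declared JavaScript
    // function plus an object literal projection in the select.
    var results = session.Advanced.RawQuery<object>(@"
        declare function fullName(u) {
            return u.FirstName + ' ' + u.LastName;
        }
        from Users as u
        where u.Age > 21
        select {
            Name: fullName(u),
            Nicks: u.Nicks
        }").ToList();
}
```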

Take a minute to read the query, then consider what we are doing here.

One of the things we want to enable is powerful query capabilities. We started with SQL syntax and very quickly hit a wall. Leaving aside the issue of working with just the rectangular format of SQL, there is a huge problem here.

Something like N1QL or the Noise syntax both seem to show that you can deal with non-rectangular formats with enough effort. But the problem is that there is no really good way to express complex imperative stuff easily. For example, N1QL has about 30 functions just to deal with arrays, and 40 just for dates. And those are just the first I looked up that were easy to count.

Going with something with a SQL syntax, including subqueries, complex nesting of expressions, etc would have led to a really complex language. Both from the point of view of implementation and from the point of view of the users.

On the other hand, with RQL, you get the familiar query language, and if you want to just get some data to display, you can do that. When you need to do more, you move to using the object literal syntax and if you need to apply more logic, you write a small function for that and call it.

Imagine writing the code above in SQL (just the fact that the Nicks array would be on a separate table means that you’ll have a correlated subquery there), and this is a pretty simple query. Choosing this approach means that we actually have very little expected support burden.

That is important, because with the other option I could absolutely see a lot of requests for “can you add a function to do X” and “how do I add my own function” popping up all the time in the support channels. This way, we give it all to you already, and the solution is self contained and very easily extensible as needed.

Optimizing JavaScript and solving the halting problem, Part II

time to read 3 min | 544 words

In the previous post I talked about the issues we had with wanting to run untrusted code and wanting to use Jurassic to do so. The problem is that when the IL code is generated, it is then JITed and run on the machine directly; we have no control over what it is doing. And that is a pretty scary thing to have. A simple mistake causing an infinite loop is going to take down a thread, which requires us to Abort it, which means that we are now in a funny state. I don’t like funny states in production systems.

So we were stuck, and then I realized that Jurassic is actually generating IL for the scripts. If it generates the IL, maybe we can put something in the IL? Indeed we can; here is the relevant PR for the Jurassic code. Here is the most important piece of code:

The key here is that we can provide Jurassic with a delegate that will be called at specific locations in the code. Basically, we do that whenever you evaluate a loop condition or just before you jump using goto. That means that our code gets called, which means that we can then check whether limits such as time, number of iterations or memory used have been exceeded and abort.

Now, the code is written in a pretty funny way. Instead of invoking the delegate we gave to the engine, we are actually embedding the invocation directly in each loop prolog. Why do we do that? Calling a delegate is something that has a fair amount of cost when you are doing that a lot. There is a bit of indirection and pointer chasing that you must do.

The code above, however, is actually statically embedding the call to our methods (providing the instance pointer if needed). If we are careful, we can take advantage of that. For example, if we mark our method with aggressive inlining, that will be taken into account. That means that the JavaScript code will turn into IL and then into machine code and our watchdog routine will be there as well. Here is an example of such a method:
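The embedded snippet isn’t shown here, but a minimal sketch of the shape of such a method could look like the following. The names, thresholds and the way the instance gets handed to the engine are all illustrative; the real hook is the one added in the Jurassic PR linked above.

```csharp
using System;
using System.Runtime.CompilerServices;

// Illustrative two-stage watchdog; not the actual RavenDB/Jurassic code.
class ScriptWatchdog
{
    private const int CheckEvery = 1024;   // how often to run the expensive checks
    private int _iterations;
    private readonly DateTime _deadline;

    public ScriptWatchdog(TimeSpan budget)
    {
        _deadline = DateTime.UtcNow + budget;
    }

    // Small and marked for inlining, so the JIT can embed it cheaply in every loop prolog.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void OnLoopIteration()
    {
        if (++_iterations % CheckEvery == 0)
            Stage2();
    }

    // The expensive checks only run once in a while.
    private void Stage2()
    {
        if (DateTime.UtcNow > _deadline)
            throw new TimeoutException("Script exceeded its time budget");

        // Could also look at GC.GetAllocatedBytesForCurrentThread() here to cap memory.
    }
}
```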

We register the watchdog method with the engine. You’ll notice that the code smells funny. Why do we have a watchdog and then a stage 2 watchdog? The reason for that is that we want to inline the watchdog method, so we want it to be small. In the case above, here are the assembly instructions that will be added to a loop prolog.

These are 6 machine instructions that are going to have great branch prediction and in general be extremely lightweight. Once in a while we’ll do a more expensive check. For example, we can check the time or use GC.GetAllocatedBytesForCurrentThread() to see that we didn’t allocate too much. The stage 2 check is still expected to be fast, but by running it only rarely, we can afford to do this in each and every place without having to worry about the performance of it.

This means that the road is now open for us to move to Jurassic fully, since the major limiting factor, trusting the scripts, is now handled.

Optimizing JavaScript and solving the halting problem, Part I

time to read 3 min | 537 words

RavenDB is a JSON document database, and the natural way to process such documents is with JavaScript. Indeed, there is quite a lot of usage of JS scripts inside RavenDB. They are used for ETL, Subscription filtering, patching, resolving conflicts, managing administration tasks and probably other stuff that I’m forgetting.

The key problem that we have here is that some of the places where we need to call out to JavaScript are called a lot. A good example would be a patch request on a query. The result can be a single document modified, or 50 million. If it is the latter case, given our current performance profile, it turns out that the cost of evaluating JavaScript is killing our performance.

We are using Jint as the environment that runs our scripts. It works, it is easy to understand and debug, and it is an interpreter. That means that it is more focused on correctness than performance. Over the years, we were able to extend it a bit to do all sorts of evil things to our scripts, but the most important thing is that it isn’t actually executing machine code directly; it is always Jint code running and handling everything.

Why is that important? Well, these scripts that we are talking about can be quite evil. I have seen anything from 300 KB of script (yes, that is 300 KB to modify a document that was considerably smaller) to just plain O(N^6) algorithms (a document has 6 collections; iterate on each of them while iterating on each of the others). These are just the complex ones. The evil ones do things like this:

We have extended Jint to count the number of operations and abort after a certain limit has passed, as well as to prevent stack overflow attacks. This means that it is much easier to deal with such things. They are just going to run for a while and then abort.
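For illustration, current versions of Jint expose knobs along these lines (a sketch; the limits we used back then were our own extensions, so treat the exact option names as indicative of the idea rather than of our code):

```csharp
using System;
using Jint;

// Sketch: constraining a script engine so a hostile or buggy script cannot run away.
var engine = new Engine(options => options
    .MaxStatements(10000)                       // abort after too many operations
    .LimitRecursion(64)                         // guard against stack overflow attacks
    .TimeoutInterval(TimeSpan.FromSeconds(1))); // hard time budget

try
{
    engine.Execute("while (true) {}");          // the classic evil script
}
catch (Exception e)
{
    Console.WriteLine($"Script aborted: {e.GetType().Name}");
}
```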

Of course, there is the perf issue of running an interpreter. We turned our eyes to Jurassic, which took a very different approach: it generates IL on the fly and then executes it. That means that as far as we are concerned, most operations are going to end up running as fast as the rest of our code. Indeed, benchmarks show that Jurassic is significantly faster (as in, an order of magnitude or more). In our own scenario, we saw anything from simply doubling the speed to an order of magnitude performance improvement.

However, that doesn’t help very much. The way Jurassic works, code like the one above is going to hang or die. In fact, the Jurassic documentation calls this out explicitly as an issue and recommends dedicating a thread or a thread pool for this and calling Thread.Abort if needed. That is not acceptable for us. Unfortunately, trying to fix this in a smart way takes us to the halting problem, and I think we already solved that mess.

This limiting issue was the reason why we kept using Jint for a long time. Today we finally had a breakthrough and were able to fix this issue. Here is the PR, but it tells only part of the story; the rest I’ll leave for tomorrow’s post.

Where do I put the select?

time to read 1 min | 137 words

We have a design issue with the RavenDB Query Language. Consider the following queries:
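The queries themselves were shown as an embedded snippet and aren’t reproduced here; the following is an illustrative pair in the same spirit, with made-up fields, just to show the two orderings under discussion.

```csharp
// Option 1: SQL style, the select comes first (this is what we have now).
var selectFirst = "select u.Name, u.City from Users as u where u.Age > 21";

// Option 2: Linq style, the from comes first and the select comes last.
var fromFirst = "from Users as u where u.Age > 21 select u.Name, u.City";
```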

There are two different ways to express the same concept. The first version is what we have now, and it is modeled after SQL. The problem with that is that it makes it very hard to build good intellisense for this.

The second option is much easier, because the format of the query also follows the flow of writing, and by the time you are in the select, you already know what you are querying on. This is the reason why C# & VB.Net have their Linq syntax in this manner.

The problem is that it is pretty common to want to define aliases for fields in the select, and then refer to them afterward. That now becomes awkward in the second option.

Any thoughts?

The cost of an extension point

time to read 3 min | 450 words

Recently we had a discussion on server side extension points in RavenDB 4.0. Currently, outside of the ability to load custom analyzers for Lucene, we have none. That is in stark contrast to the architecture we have in 3.x.

That is primarily because we found out several things about our extension based architecture over the years.

Firstly, the extension based approach allowed us to develop a lot of standard extensions (called bundles) in 3.x. In fact, replication, versioning, unique constraints and a lot of other cool features are all implemented as extensions. On the one hand, it is a very convenient way to add additional functionality, but it also led to what I can’t help but term an isolationist attitude. In other words, whenever we built a feature, we tended to focus on that particular feature on its own. And a lot of the trouble we have seen has been the result of feature intersection. Versioning and replication in a multi master cluster, for example, can have a fairly tricky behavior.

And the only way to resolve such issues is to actually teach the different pieces about each other and how they should interact with one another. That sort of defeated the point of writing independent extensions. Another issue is that the interfaces that we have to expose are pretty generic. In many cases, that means that we have to do extra work (such as materializing values) in order for us to send the right values to the extension points, even if in most cases that extra work isn’t used.

Another thing is that in RavenDB 4.0, we have been treating all the pieces as a single whole, instead of tacking them on after the fact. Versioning, for example, as a core feature, has implications for maintaining transaction boundaries in a distributed environment, and that requires tight integration with the replication code and other parts of the environment. It is much easier to do so when we are building it as a single coherent piece.

There is another aspect here: not having external extensions also makes it much easier to lay down the law with regards to what can and cannot happen in our codebase. We can make assumptions about what things will do, and this results in much better code and behavior.

Now, there are things that we are going to have to do (allow you to customize your indexes with additional code, for example), and we can do that by allowing you to upload additional code to be compiled with your indexes, but that is a very narrow use case.
