
Security analysis on error reporting

time to read 5 min | 973 words

When we talked about the security errors we generate in RavenDB 4.0, we got some really good comments that are worth discussing. The following are some of the interesting tidbits from the comments there.

  • can this behavior help some malevolent user trying to hack the system?
  • at least make it an admin setting so by default it just gives a 403 and you have to turn on the detailed error reporting.
  • I'm not questioning the technical decision, but simply the fact to returning info to the user. simply logging this error server-side, or activate this behavior to a specific admin setting as Paul suggests, looks more "safe" to me.
  • often when you log to an account with wrong user or password, error message don't specify of password or username is wrong. just an example.

Let me start from scratch and outline the problem. We are using an x509 certificate to authenticate against the server. The user may use a wrong / expired / invalid certificate, may use no certificate at all, or may use a valid certificate but attempt to do something that they don’t have access to.

By default, we can just reject the connection attempt, which will result in the following error:

[screenshot: the connection error]

We consider such an error utterly unacceptable. So we have to accept the connection, figure out what the higher level protocol is (usually HTTP, but sometimes our own internal TCP protocol) and send an error that the user can understand. In particular, we send back a 403 HTTP error with a message saying what the problem is.
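
To make that concrete, here is a minimal sketch of the idea in ASP.NET Core middleware terms. This is not RavenDB’s actual code; the lookup helper and the messages are made up for illustration:

    using System;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Http;

    // Sketch only: let the connection through, then answer at the HTTP level
    // with a 403 and a message that explains exactly what is wrong.
    public class CertificateCheckMiddleware
    {
        private readonly RequestDelegate _next;

        public CertificateCheckMiddleware(RequestDelegate next) => _next = next;

        public async Task InvokeAsync(HttpContext context)
        {
            var cert = context.Connection.ClientCertificate;

            if (cert == null)
            {
                await Reject(context, "A client certificate is required, but none was provided.");
                return;
            }

            if (cert.NotAfter < DateTime.Now)
            {
                await Reject(context, $"Certificate '{cert.FriendlyName}' expired on {cert.NotAfter:u}.");
                return;
            }

            // Hypothetical lookup against whatever store holds the registered certificates.
            if (IsRegistered(cert.Thumbprint) == false)
            {
                await Reject(context, $"Certificate '{cert.FriendlyName}' is not registered on this server.");
                return;
            }

            await _next(context); // authorized, carry on as usual
        }

        private static Task Reject(HttpContext context, string message)
        {
            context.Response.StatusCode = StatusCodes.Status403Forbidden;
            return context.Response.WriteAsync(message);
        }

        private static bool IsRegistered(string thumbprint) => false; // placeholder
    }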

The worry is that there is some information disclosure inherent in the error message. Let us analyze that:

  • For a request without a certificate, we error and announce that they didn’t use a certificate and that one is required. There is no information here that the user is not already in possession of. They may not be aware of it, but they already know that they aren’t using a certificate.
  • For a request with an expired certificate, we error and let the user know that. The user already has the certificate, and can therefore already figure out that it is expired. We don’t disclose any new information here, except that the user may try to use this to figure out what the server time is. That is generally not sensitive information, and it is safe to assume that it is close to what the rest of the world believes to be the current time, so I don’t see anything here that should worry us.
  • For a request with a certificate that the server doesn’t know about, we error and say that the server doesn’t recognize the certificate. Here we get into an actual information disclosure issue. We let the user know that the certificate isn’t known. But what can they gain from this?

In the case of username / password, we always say that the username or password is incorrect; we shouldn’t let the user know that a username exists and that just the password is wrong. That would provide the user with information they didn’t have (that the username is correct). Now, in practical terms, that protection almost never holds, since you have password reset links and they tend to tell you if the email / username you want to reset the password for is valid.

However, with a certificate, there aren’t two pieces of information here, there is just one, so we don’t provide any additional information.

A request with a certificate to do an operation that the certificate isn’t authorized to do. This will result in one of the following errors:

[screenshots: the two authorization error messages]

There are a few things to note here. First, we don’t disclose any information beyond what the user has already provided us. The operation and the database are both provided by the user, and we use the FriendlyName so the user can tell which certificate was in question.

Note that this check runs before we check whether the database actually exists on the server, so there isn’t any information disclosed about the existence of databases either.

Given that the user tried to perform an operation that they are not allowed to, we need to reject that operation (honey pot behavior is a totally separate issue and probably shouldn’t be part of any sane API design). Given that we reject the operation, we need to provide clear and concise errors for the user, rather than the CONNECTION REFUSED error. Given the kind of errors we generate, I believe that they provide sufficient information for the user to figure out what is going on.

As to whether we should log this / hide this behind a configuration setting: that is really confusing to me. We are already logging rejected connections, because the admin may be interested in reviewing them. But requiring the user to call the admin and look at some obscure log file is a bad design in terms of usability. The same is true for hiding this behind a configuration option. Either this is a secure solution and we can report these errors, or we have to put the system into a known unsecured state (in production) just to be able to debug a connection issue. I would be far more afraid of that scenario, especially since that would be the very first thing an admin would do any time there is a connection issue.

So this is the default (and only) behavior, because we have verified that this is both a secure solution and a sane one.

Unique constraints didn’t make the cut for RavenDB 4.0

time to read 3 min | 504 words

Unique Constraints is a bundle in RavenDB 3.x that allowed you to… well, define unique constraints.  Here is the classic example:

[screenshot: the classic unique constraint example]
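
For reference, the classic example looked roughly like this with the 3.x client bundle (quoted from memory, so treat the exact attribute and namespace as approximate):

    using Raven.Client.UniqueConstraints;

    public class User
    {
        public string Id { get; set; }

        // The bundle rejected storing a second User document with the same Email value.
        [UniqueConstraint]
        public string Email { get; set; }

        public string Name { get; set; }
    }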

It was always somewhat awkward to use (you had to mess around with configuration on both the server and the client side), but it worked pretty well. As long as you were running on a single server.

With RavenDB 4.0, we put a lot more emphasis on running in a cluster, and when the time came to discuss how we were going to handle unique constraints in 4.0, it was very obvious that this was going to be neither easy nor performant.

The key here is that in a distributed, multi-master database there is no really good way to provide a unique constraint. Imagine the following database topology:

[diagram: the database topology]

Now, what will happen if we create two different User documents with the same username on node C? It is easy for node C to detect and reject this, right?

But what happens if we create one document on node B and another on node A? This is now a much harder problem to deal with. And that is without even getting into the harder aspects of how to deal with failure conditions.

The main problem here is that, by the time we discover that we have a duplicate, actions have very likely already been taken based on that information, which is generally not a good thing.

As a result, we were left somewhat in the lurch. We didn’t want to have a feature that would work only on a single node, or contain a hidden broken invariant. The way to handle this properly in a distributed environment is to use some form of consensus algorithm. And what do you know, we actually have a consensus algorithm implementation at hand: the Raft protocol that is used to manage the RavenDB cluster.

That, however, led to a different problem. The process of using a unique constraint would now be broken into two distinct parts. First, we would verify that the value is indeed unique, then we would save the document. This can lead to issues if there is a failure right between these two operations, and it puts a heavy burden on the system to always check the unique constraint across the cluster on every update.

The interesting thing about unique constraints is that they rarely change once created. And if they do, that is typically part of a very explicit workflow. That isn’t something that is easy to handle without a lot of context. Therefore, we decided that we couldn’t reliably implement them and dropped the feature.

However… reliable and atomic distributed operations are now on the table, and they allow you to achieve the same thing, and usually in a far better manner. The full details will be in the next post.

re: Different I/O Access Methods for Linux

time to read 6 min | 1063 words

The “Different I/O Access Methods for Linux, What We Chose for Scylla, and Why” article is quite fascinating. It is a pleasure to be able to read in depth about another database implementation strategy and its design decisions. In particular where they don’t match what we are doing, because we can learn from the differences.

The article is good both in terms of discussing the general I/O approaches on Linux in general and the design decisions and implications for ScyllaDB in particular.

ScyllaDB chose to use AIO/DIO (async direct I/O) via a dedicated library called Seastar, and they are effectively managing their own memory and I/O scheduling internally, skipping the kernel entirely. Given how important I/O is for a database, I can say that I strongly resonate with this approach, and it is something that we tried for a while with RavenDB. That included paying careful attention to how we send I/O, controlling our own caching, etc.

We are in a different position from Scylla, of course, since we are using managed code, which introduced a different set of problems (and benefits), but during the design of RavenDB 4.0, we chose to go in a very different direction.

But first, let me show you the single statement in the post that caused me to write this blog post:

The great advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are generally effective. The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next, and even if the application knows differently, it usually has no way to help the kernel guess correctly.

This statement is absolutely correct. The kernel caching algorithms (and the kernel behavior in general) are tailored to suit a generic set of requirements, which means that if you deviate from what the kernel expects, it is going to be doing extra work and you can experience significant problems (performance and otherwise).

Another great resource that I want to point you to is the following article: “You’re Doing It Wrong” which had a major effect on the way RavenDB 4.0 is designed.

What we did, basically, is to look at how the kernel is doing things, and then see how we can fit our own behavior to what the kernel is expecting. Actually, I’m lying here, because there is no one kernel here. RavenDB runs on Windows, Linux and OSX (Darwin kernel). So we have three very different systems, with wildly different optimizations, that we need to run optimally on. Actually, to be fair, we consider Windows & Linux the main targets for deployment, but that still gives us very different stacks to work on.

The key here was to be predictable in all things and be sure that whatever operations we make, the kernel can successfully predict them. This can be a lot of work, but not something that you’ll usually see in the code. It involves laying out the data so it is nearby in the file and ensuring that we have hotspots that the kernel can recognize and optimize for us, etc. And it involves a lot of work with the guts of the system to make sure that we match what the kernel expects.

For example, consider this statement from the article:

…application-level caching allows us to cache not only the data read from disk but also the work that went into merging data from multiple files into a single cache item.

This is a good example of the difference in behavior. ScyllaDB is using an LSM model, which means that in order to read data, they typically need to look it up in multiple files. RavenDB uses a different model (B+Tree with MVCC), which typically means that we store all the data in a single file. Furthermore, the way we store the information, we can access it directly via memory map without doing any work to prepare it ahead of time. That means that we can lean entirely on the page cache and gain all the benefits thereof.
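
As a rough illustration of what leaning on the page cache looks like (a generic .NET sketch, not the actual Voron code), reading through a memory map is just a memory access and the kernel decides what stays cached:

    using System;
    using System.IO.MemoryMappedFiles;

    class MmapReadSketch
    {
        static void Main(string[] args)
        {
            // Map the data file once; the OS page cache decides which pages stay in memory.
            using var file = MemoryMappedFile.CreateFromFile(args[0]);
            using var accessor = file.CreateViewAccessor();

            // Reading a value is a plain memory access. If the page is already cached
            // there is no I/O at all; if not, the kernel faults it in for us.
            long value = accessor.ReadInt64(0);
            Console.WriteLine(value);
        }
    }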

The ScyllaDB approach also limits them to running only on Linux and only on specific configurations. Because they rely on async direct I/O, which is a… fiddly beast on Linux at the best of times, you need to make sure that everything matches just so in order to get it working. Just running this on a stock Ubuntu won’t work, since ext4 will block on many operations. Another problem in my view is that this assumes that they are the only player on the machine. If you need to run additional software on the machine, that can cause fights over resources. For production, this is less of a problem, but for running on a developer machine, that is frequently something that you need to take into account. The kernel will already handle that arbitration for you (which is useful even in production when people put SQL Server & RavenDB on the same box), so you don’t have to worry about it too much. I’m not sure that this concern is even valid for ScyllaDB, since they tend to be deployed in clusters of dedicated machines (or at least Docker instances), so they have better control over the environment. That certainly makes it easier if you can dictate such things.

Another consideration for the RavenDB approach is that we want to be, as much as possible, friendly to the administrator. When we lean on what the kernel does, we usually get that for free. The administrator can usually dictate policies to the kernel and have it follow them, and good sys admins both know how and know when to do that. On the other hand, if we wrote it all ourselves, we would also need to provide the hooks to modify the behavior (and monitor it, and train users in how it works, etc).

Finally, it is no small thing that if you let the kernel cache your data, that cache is still around if you restart the database (but not the machine), which mostly alleviates the issue of a slow cold start when you need to do things like update configuration or the database binaries.

RavenDB 4.0 Unsung Heroes: The design of the security error flow

time to read 2 min | 351 words

This is again a feature that very few people will even notice exists, but a lot of time, effort and thinking went into building it. How should RavenDB handle the case where a user makes a request that they are not authorized to make? In particular, we need to consider the case of a user pointing the browser at a server or database that they aren’t authorized to see, or without having the x509 certificate properly registered.

To understand the problem, we need to figure out what the default experience is like. If we require a client certificate to connect to RavenDB and the client does not provide it, by default the response is some variation of just closing the TCP connection. That results in the client getting an error that looks like this:

TCP connection closed unexpectedly

That is not conducive to a good error experience and will typically cause a user to spend a lot of time trying to figure out what the network problem is, while everything is actually working just fine; the server just doesn’t want to talk to the user.

The problem is that at the TLS level, there isn’t really a good way to give back a meaningful error. We are too low level; all we can do is terminate the connection.

Instead of doing that, RavenDB will accept the connection, regardless of whether it has a valid certificate (or even any certificate), and pass the connection one level up the chain. At that point, we can check whether the certificate is valid, and if it isn’t (or if it doesn’t have the permissions to do what it is trying to do), we can use the protocol’s own mechanism to report errors.

With HTTP, that means we can return a 403 error to the user, including an explanation of why we rejected the connection (no certificate, expired certificate, certificate doesn’t have the right permissions, etc). This makes things much easier when you need to troubleshoot permission issues.
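
Here is a rough sketch of the idea in .NET terms (illustrative only, not RavenDB’s actual networking code): the TLS handshake is allowed to complete no matter what the client certificate looks like, and the validation result is kept around for the higher level to turn into a proper 403:

    using System.Net.Security;
    using System.Net.Sockets;
    using System.Security.Cryptography.X509Certificates;

    static class ConnectionHandshake
    {
        // Sketch: never terminate at the TLS level; record what was wrong with the
        // client certificate and let the HTTP layer respond with 403 plus a reason.
        public static SslStream Accept(TcpClient client, X509Certificate2 serverCert,
            out SslPolicyErrors clientCertErrors)
        {
            var recorded = SslPolicyErrors.None;

            var ssl = new SslStream(client.GetStream(), leaveInnerStreamOpen: false,
                userCertificateValidationCallback: (sender, cert, chain, errors) =>
                {
                    recorded = errors; // remember the problem (missing, invalid, untrusted...)
                    return true;       // ...but accept the handshake anyway
                });

            ssl.AuthenticateAsServer(serverCert,
                clientCertificateRequired: true,
                checkCertificateRevocation: false);

            clientCertErrors = recorded;
            return ssl;
        }
    }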

RavenDB 4.0 Unsung Heroes: The indexing threads

time to read 3 min | 473 words

A major goal in RavenDB 4.0 is to eliminate as much complexity as possible from the codebase. One of the ways we did that is to simplify thread management. In RavenDB 3.0 we used the .NET thread pool, and in RavenDB 3.5 we implemented our own thread pool to optimize indexing based on our understanding of how indexes are used. This works, is quite fast and handles things nicely as long as everything works. When things stop working, we get into a whole different story.

A slow index can impact the entire system, for example, so we had to write code to handle that; noisy indexing neighbors can impact overall indexing performance; and tracking costs when the indexing work is interleaved is anything but trivial. And all the indexing code must be thread safe, of course.

Because of that, we decided we are going to dramatically simplify our lives. An index is going to use a single dedicated thread, always. That means that each index gets its own thread and can only interfere with its own work. It also means that we can have much better tracking of what is going on in the system. Here are some stats from the live system.

[screenshot: per-index resource usage stats]

And here is another:

[screenshot: more per-index stats]

What this means is that we have a fantastically detailed view of what each index is doing, in terms of CPU, memory and even I/O utilization. We can also now define fine grained priorities for each index:

[screenshot: setting the priority of an index]

The indexing code itself can now assume that it is single threaded, which removes a lot of complications and in general makes things easier to follow.
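
In rough terms, the shape of this is something like the following (a simplified sketch with made-up names, not the actual RavenDB code):

    using System;
    using System.Threading;

    // Sketch: one dedicated, named thread per index. Everything the index does
    // happens on this thread, so CPU time and I/O can be attributed to it directly,
    // and its priority can be tuned without touching anything else.
    public class IndexingThread
    {
        private readonly Thread _thread;
        private volatile bool _running = true;

        public IndexingThread(string indexName, ThreadPriority priority)
        {
            _thread = new Thread(Run)
            {
                Name = "Indexing of " + indexName,
                IsBackground = true,
                Priority = priority
            };
        }

        public void Start() => _thread.Start();

        public void Stop()
        {
            _running = false;
            _thread.Join();
        }

        private void Run()
        {
            while (_running)
            {
                // Single threaded by construction: no locks around the index's own
                // state, and a stall here affects only this index.
                ProcessPendingDocuments();
            }
        }

        private void ProcessPendingDocuments() => Thread.Sleep(100); // hypothetical indexing batch
    }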

There is the worry that a user might want to run 100 indexes per database and 100 databases on the same server, resulting in thousands of indexing threads. But given that this is not a recommended configuration, and given that we tested it and it works (not ideal and not fun, but it works), I’m fine with this, especially given the alternative that we have today, where all these indexes fight over the same limited number of threads and stall indexing globally.

The end result is that thread per index allows us to have fine grained control over indexing priorities and account for memory and CPU costs, as well as simplify the code and improve the overall performance significantly. A win all around, in my book.

RavenDB 4.0 Unsung Heroes: Map/reduce

time to read 6 min | 1060 words

One of the hardest things that we did in RavenDB 4.0 will probably go completely unnoticed by users. We completely re-wrote how RavenDB processes map/reduce queries. One of my most popular blog posts is still a Visual Explanation to Map/Reduce, and it still does a pretty good job of explaining what map/reduce is.

The map/reduce code in RavenDB 3.x is one of the more fragile things that we have; it requires you to maintain in your head several completely different states that a particular reduction can be in and how they transition between those states. Currently, there are probably two guys* who still understand how it works and one guy who is still able to find bugs in the implementation. It is also not as fast as we wished it would be.

So with RavenDB 4.0 we set out to build it from scratch, based in no small part on the fact that we had also written our own storage engine for 4.0 and were able to take full advantage of that. You can read about the early design in this blog post, but I’m going to do a quick recap and explain how it works now.

The first stage in map/reduce is… well, the map. We run over the documents and extract the key portions we’ll need for the next part. We then immediately apply the reduce on each of the results independently. This gives us the final map/reduce results for a single document. More to the point, this also tells us what the reduce key for the results is. The reduce key is the value that the index groups on.

We store all of the items with the same reduce key together. And here is where it gets interesting. Up until a certain point, we just store all of the values for a particular reduce key as an embedded value inside a B+Tree. That means that whenever any of the values changes, we can add that value to the appropriate location and reduce all the matching values in one go. This works quite well until the total size of all the values exceeds about 4KB or so.

At this point, we can’t store the entire thing as an embedded value, so we move all the values for that reduce key to their own dedicated B+Tree. This means that we start with a single 8KB page and fill it up, then split it, and so on. But there is a catch. The results of a map/reduce operation tend to be extremely similar to one another. At a minimum, they share the same properties and the same reduce key. That means that we would end up storing a lot of duplicate information. To resolve that, we also apply recursive compression. Whenever a page nears 8KB in size, we will compress all the results stored in that page as a single unit. This tends to have a great compression rate and can allow us to store up to 64KB of uncompressed data in a single page.
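
To get a feel for why compressing a whole page of similar entries as one unit pays off (this is just an illustration using GZip; RavenDB’s actual compression scheme and entry layout are different), compare compressing entries one by one against compressing them together:

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Text;

    class CompressionIllustration
    {
        static byte[] Gzip(byte[] data)
        {
            using var output = new MemoryStream();
            using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
                gzip.Write(data, 0, data.Length);
            return output.ToArray();
        }

        static void Main()
        {
            // Map/reduce entries for the same reduce key share properties and the key itself.
            var entries = Enumerable.Range(0, 200)
                .Select(i => $"{{\"Supplier\":\"suppliers/42\",\"Month\":\"2017-10\",\"Total\":{i * 13}}}")
                .ToArray();

            int oneByOne  = entries.Sum(e => Gzip(Encoding.UTF8.GetBytes(e)).Length);
            int asOneUnit = Gzip(Encoding.UTF8.GetBytes(string.Concat(entries))).Length;

            Console.WriteLine($"Compressed individually: {oneByOne:N0} bytes");
            Console.WriteLine($"Compressed as one unit:  {asOneUnit:N0} bytes");
        }
    }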

When adding items to a map/reduce index, we apply an optimization so it looks like:

results = reduce(results, newResults);

Basically, we can utilize the recursive nature of reduce to optimize things for the append only path.
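 
To make that concrete, here is a tiny sketch (illustrative only, not the RavenDB implementation) with a sum style reduce; because the reduce is applied to its own output, folding the new results into the previously reduced value gives the same answer as reducing everything from scratch:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class RecursiveReduce
    {
        // A typical reduce: group by the reduce key and sum the counts.
        static List<(string Key, int Count)> Reduce(IEnumerable<(string Key, int Count)> items) =>
            items.GroupBy(i => i.Key)
                 .Select(g => (g.Key, g.Sum(i => i.Count)))
                 .ToList();

        static void Main()
        {
            var existingResults = Reduce(new[] { ("users/1", 3), ("users/2", 5) });
            var newResults      = Reduce(new[] { ("users/1", 2) }); // a new document was added

            // Append-only optimization: reduce the already-reduced results together
            // with the new ones, instead of re-reducing the whole dataset.
            var combined = Reduce(existingResults.Concat(newResults));

            foreach (var (key, count) in combined)
                Console.WriteLine($"{key}: {count}"); // users/1: 5, users/2: 5
        }
    }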

When you delete or update documents and results change or are removed, things are more complex. We handle that by running a re-reduce on the results. Now, as long as the number of results is small (this depends on the size of your data, but typically up to a thousand or so) we’ll just run the reduce over the entire result set. Because the data is always held in a single location, this is extremely efficient in terms of memory access, and the tradeoff between computation and storage leans heavily toward just recomputing things from scratch.

When we have too many results (the total uncompressed size exceeds 64KB) we start splitting the B+Tree and adding a level to the tree. At this point, the cost of updating a value is the cost of updating a leaf page plus the reduce operation on the root page. When we have more data still, we will get yet another level, and so on.

The (rough) numbers are:

  • Up to 64KB (roughly 1000 results) – 1 reduce for the entire dataset
  • Up to 16 MB – 2 reduces (1 for up to 1000 results, 1 for up to 254 results)
  • Up to 4 GB – 3 reduces (1 for up to 1000 results, 2 for up to 254 results each)
  • Up to 1 TB – 4 reduces (1 for up to 1000 results, 3 for up to 254 results each)
  • I think you get how it works now, right? The next level up is 1 to 248 TB and will require 5 reduces.

These numbers assume that your reduce data is very small, on the order of a few dozen bytes. If you have larger data, the tree will expand faster, and you’ll get fewer results per reduce at the first level.

Note that at the first level, if there is only an addition (a new document, basically), we can process that as a single operation between two values and then proceed upward as the depth of the tree requires. There are also optimizations in place if we have multiple updates to the same reduce key; in that case, we can first apply all the updates, then do the reduce once for all of them in one shot.

And all of that is completely invisible to the users, unless you want to peek inside, which is possible using the Map/Reduce visualizer:

[screenshot: the Map/Reduce visualizer]

This can give you deep insight into the guts of how RavenDB handles map/reduce operations.

The current status is that map/reduce indexes are actually faster than normal indexes, because they are almost entirely our own code, while a large portion of the normal indexing cost is in Lucene.

* That is an exaggeration, there is one guy who knows how it works. Okay, okay, I’ll admit that we can dive into the code and figure out what is going on, but it takes quite a bit of time if there is a significant issue there.

You are judged on what is above the waterline

time to read 6 min | 1182 words


This is written a day before the Jewish New Year, so I suppose that I’m a bit introspective. We recently had a discussion in the office about priorities and what things are important for RavenDB. This is a good time to do so; just after the RC release, we can look back and see what we did right and what we did wrong.

RavenDB is an amazing database, even if I say so myself, but one of the things that I find extremely frustrating is that so much work is going into things that have nothing to do with the actual database. For example, the studio. At any given point, we have a minimum of 6 people working on the RavenDB Studio.

Here is a confession: as far as I’m concerned, the studio is a black box. I go and talk to the studio lead, and things happen. About the only time that I actually go into the studio code is when I did something that broke the studio build. But I’ll admit that I usually go to one of the studio guys and have them fix it for me :-)

Yep, at this stage, I can chuck that “full stack” title right off the edge. I’m a backend guy now, almost completely. To be fair, when I started writing HTML (it wasn’t called web apps, and the fancy new stuff was called DHTML) the hot new thing was the notion that you shouldn’t use <blink/> all the time and we just got <table> for layout.  I’m rambling a bit, but I’ll get there, I think.

I find the way web applications are built today strange, horribly complex and utterly foreign (just the size of the node_modules folder is scary). But this isn’t a post about an old timer telling you how good the days were when you had 16 colors to play with and users knew their place. This post is about how a product is perceived.

I mentioned that RavenDB is amazing, right? It is certainly the most complex project that I have ever worked on, and it is chock full of really good stuff. And none of that matters unless it shows up in the studio. In fact, that team of people working on the studio? That includes only the people actually working on the studio itself. It doesn’t include all the work done to support the studio on the server side. So it would be pretty fair to say that over half of the effort we put into RavenDB is actually spent on the studio at this point.

And it just sounds utterly crazy, right? We literally spent more time on building the animation for the cluster state changes, so you can see them as they happen, than we spent writing the cluster behavior itself. Given my own inclinations, that can be quite annoying.

It is easy to point at all of this hassle and say: “This is nonsense, surely we can do better”. And indeed, a lot of our contemporaries get away with this for their user interface:

I’ll admit that this was tempting. It would free up so much time and effort.

It would also be quite wrong. For several reasons. I’ll start from the good of the project.

Here is one example from the studio (rough cost: a week for the backend work, another week for various enhancements / fixes, about two weeks for the UI work, with some issues still pending) showing the breakdown of the I/O operations made on a live RavenDB instance.

[screenshot: the I/O operations breakdown in the studio]

There is also some runtime cost to actually tracking this information, and technically speaking we would be able to get the same information with strace or similar.  So why would we want to spend all this time and effort on something that a user would already likely have?

Well, this particular graph is responsible for us being able to track down and resolve several extremely hard to pin down performance issues with how we write to disk. Here is how this looks at a slightly higher level: green is writes to the journal, blue is flushes to the data file and orange is fsyncs.

[graph: journal writes (green), data file flushes (blue) and fsyncs (orange)]

At this point, I have a guy in the office who has stared at these graphs so much that he can probably tell me the spin rate of the underlying disk just by taking a glance. Let us be conservative and call it a mere 35% improvement in the speed of writing to disk.

Similarly, the cluster behavior graph that I complained about? Just doing QA on the graph part allowed us to find several issues that we hadn’t noticed before, because suddenly they were visible, right there.

That wouldn’t justify the amount of investment that we put into them, though. We could have built diagnostics tools much more cheaply than that, after all. But they aren’t meant for us; all of these features are there for actual customer use in production, and if they are useful for us during development, they are going to be invaluable for users trying to diagnose things in production. So while I may consider the art of graph drawing black magic of the highest caliber, I can most certainly see the value in such features.

And then there is the product aspect. Giving a user a complete package, where they can hit the ground running and feel that they have a good working environment, is a really good way to keep said user. There is also the fact that, as a database, things like the studio are not meant primarily for developers. In fact, the heaviest users of the studio are the admin staff managing RavenDB, and they have a very different perspective on things. Making the studio useful for such users was an interesting challenge, luckily handled by people who know what they are doing in this regard much better than I do.

And last, but not least, tying back to the title of this post: very few people actually take the time to do a full study of any particular topic; we use a lot of shortcuts to make decisions. And seeing the level of investment put into the user interface is often a good indication of overall production quality. And yes, I’m aware of the “slap some lipstick on this pig and ship it” mentality; there is a reason it works, even if it is not such a good idea as a long term strategy. Having a solid foundation and a good user interface is a lot more challenging, but gives a far better end result.

And yet, I’m a little bit sad (and mostly relieved) that there are areas in the project that I can neither work on nor understand.

When disk and hardware fail…

time to read 5 min | 956 words

When your back is against the wall, and your only hope is for black magic (and alcohol).

The title of this post is taken from this song. The topic of this post is a pretty sad one, but a mandatory discussion when dealing with data that you don’t want to lose. We are going to discuss hard system failures.

The cause can be anything from actual physical disk errors to faulty memory causing corruption. The end result is that you have a database that is corrupted in some manner. RavenDB actually has multiple levels of protection to detect such scenarios. All the data is verified with checksums on first load from the disk, and the transaction journal is verified when applying it as well. But stuff happens, and thanks to Murphy, that stuff isn’t always pleasant.

One of the hard criteria for the Release Candidate was a good story around catastrophic data recovery. What do I mean by that? I mean that something corrupted the data file in such a way that RavenDB cannot load normally. So sit tight and let me tell you this story.

We first need to define what we are trying to handle. The catastrophic data recovery feature is meant to:

  1. Recover user data (documents, attachments, etc) stored inside a RavenDB file.
  2. Recover as much data as possible, disregarding its state, letting the user verify correctness (i.e., it may recover deleted documents).
  3. Does not include indexed data, configuration, cluster settings, etc. This is because these can be quite easily handled by recreating indexes or setting up a new cluster.
  4. Does not replace high availability, backups or proper preventive maintenance.
  5. Does not attempt to handle malicious corruption of the data.

Basically, the idea is that when you are up shit creek, we can hand you a paddle. That said, you are still up shit creek.

I mentioned previously that RavenDB goes to quite some lengths to ensure that it knows when the data on disk is messed up. We also did a lot of work to make sure that, when needed, we can actually do some meaningful work to extract your data. This means that when looking at the raw file format, we actually have extra data there that isn’t used for anything in RavenDB except by the recovery tools. That reason (the change to the file format) is why this was a Stop-Ship priority issue.

Given that we are already in catastrophic data recovery mode, we can make very few assumptions about the state of the data. A database is a complex beast, involving a lot of moving parts, and the on disk format is very complex and subject to a lot of state and behavior. We are already in catastrophic territory, so we can’t just use the data as we normally would. Imagine a tree where following the pointers to a lower level might in some cases lead to garbage data or invalid memory. We have to assume that the data has been corrupted.

Some systems handle this by having two copies of the master data records. Given that RavenDB is assumed to run on modern file systems, we don’t bother with this. ReFS on Windows and ZFS on Linux handle that task better, and we assume that production usage will use something similar. Instead, we designed the way we store the data on disk so we can read through the raw bytes and still make sense of what is going on inside.

In other words, we are going to effectively read one page (8KB) at a time, verify that the checksum matches the expected value and then look at the content. If this is a document or an attachment, we can detect that and recover it, without having to understand anything else about the way the system works. In fact, the recovery tool is intentionally limited to a basic forward scan of the data, without any understanding of the actual file format.
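
The shape of such a scan is roughly the following (a deliberately naive sketch; the real Voron page layout, checksum algorithm and entry markers are different):

    using System;
    using System.IO;

    class ForwardScanSketch
    {
        const int PageSize = 8 * 1024;

        static void Main(string[] args)
        {
            using var file = File.OpenRead(args[0]);
            var page = new byte[PageSize];

            // Walk the file one page at a time. No B+Tree traversal, no pointers:
            // a corrupted page is simply skipped and the scan continues.
            for (long offset = 0; offset + PageSize <= file.Length; offset += PageSize)
            {
                file.Seek(offset, SeekOrigin.Begin);
                if (file.Read(page, 0, PageSize) != PageSize)
                    break;

                if (ChecksumMatches(page) == false)
                    continue; // damaged page, move on to the next one

                TryRecoverEntries(page, offset);
            }
        }

        // Placeholders: the real tool validates the page checksum and then looks
        // for document / attachment markers it knows how to interpret.
        static bool ChecksumMatches(byte[] page) => true;
        static void TryRecoverEntries(byte[] page, long offset) { }
    }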

There are some complications when we are dealing with large documents (they can span more than 8 KB), and large attachments (we support attachments that are more than 2 GB in size) can require us to jump around a bit, but all of this can be done with very minimal understanding of the file format. The idea was that we can’t rely on any of the complex structures (B+Trees, internal indexes, etc) but can still recover anything that is recoverable.

This also led to an interesting feature. Because we are looking at the raw data, whenever we see a document, we are going to write it out. But that document might have actually been deleted. The recovery tool doesn’t have a way of checking (it is intentionally limited), so it just writes it out. This means that we can use the recovery tool to “undelete” documents. Note that this isn’t actually guaranteed; don’t assume that you have an “undelete” feature. Depending on the state of the moon and the stomach contents of the nearest duck, it may work, or it may not.

The recovery tool is important, but it isn’t magic, so some words of caution are in order. If you have to use the catastrophic data recovery tool, you are in trouble. High availability features such as replication and offsite replicas are the things you should be relying on, and backups are so important I can’t stress it enough.

The recommended deployment for RavenDB 4.0 is going to be in a High Availability cluster with scheduled backups. The recovery tool is important for us, but you should assume from the get go that if you need to use it, you aren’t in a good place.

Dynamic nodes distribution in RavenDB 4.0

time to read 3 min | 522 words

This is another cool feature that is best explained by showing, rather than explaining.

Consider the following cluster, made of three nodes.

[screenshot: the three node cluster]

On this cluster, we created the following database:

[screenshot: the database definition]

This data is hosted on nodes B and C, so we have a replication factor of 2. Here is how the database topology looks:

[screenshot: the database group topology]

And now we are going to kill node B.

The first thing that is going to happen is that the cluster is going to mark this node as suspect.  In this case, the M on node C stands for Member, and the R on node B stands for Rehab.

[screenshot: node B marked as Rehab]

Shortly afterward, we’ll detect that it is actually down:

[screenshot: node B shown as down]

And this leads to a problem: we were told that we need to have a replication factor of 2 for this database, but one of the nodes is down. There is a grace period, of course, but the cluster will eventually decide that it needs to be proactive about it and not wait around for nodes long gone, hoping they’ll call, maybe.

So the cluster is going to take steps and do something about it:

[screenshot: the cluster adding node A to the database group]

And it does: it added node A to the database group. Node A is added as a promotable member, and node C is now responsible for catching it up with all the data that is in the database.

Once that is done, we can promote node A to be a full member in the database. We’ll remove node B from the database entirely, and when it comes back up, we’ll delete the data from that node.

[screenshot: the updated database group]
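
Conceptually, the group record that the cluster maintains can be pictured like this (an illustrative model, not RavenDB’s actual classes); the whole sequence above is just a series of moves between these lists:

    using System;
    using System.Collections.Generic;

    // Illustrative model only: the cluster tracks which nodes are full members,
    // which are still being caught up, and which are in rehab (suspected / down).
    public class DatabaseGroup
    {
        public List<string> Members     { get; } = new List<string> { "B", "C" };
        public List<string> Promotables { get; } = new List<string>();
        public List<string> Rehabs      { get; } = new List<string>();
        public int ReplicationFactor    { get; set; } = 2;

        public void NodeWentDown(string node)
        {
            Members.Remove(node);
            Rehabs.Add(node);               // B: Member -> Rehab
        }

        public void AddPromotable(string node) => Promotables.Add(node); // A joins, C catches it up

        public void Promote(string node)
        {
            Promotables.Remove(node);
            Members.Add(node);              // A becomes a full member
        }
    }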

On the other hand, if node B was able to get back on its feet, it would now be a race between node A and node B to see which would be the first node to become up to date and eligible to become a full member in the cluster again.

In most cases, that would be the pre-existing node, since it has a head start on the new node.

Clients, by the way, are not really aware that any of this is happening; the cluster manages the topology and lets them know which nodes they need to talk to at any given point.

Oh, and just to let you know, these images are screenshots from the studio. You can actually see this happening live when we introduce failure into the system and watch how the cluster recovers from it.

Distributed work assignment in RavenDB 4.0

time to read 2 min | 278 words

I can talk about this feature for quite some time, but it is really hard to explain via text. I even prepared an animated GIF that I was going to show here, but I draw the line at the <blink/> tag (and yes, I know that dates me).

Honestly, the best way to see this feature is in action, and we’ll be doing some videos showing just that once RavenDB is in Release Candidate mode, but I just couldn’t sit on this feature any longer. I actually talked about it a few times on the blog, but I don’t think anyone really realized what it means.

Let us consider the following three node cluster:

[screenshot: the three node cluster and its assigned work]

It has a few responsibilities: we need to do a daily backup, and there are a couple of offsite replicas that need to be maintained. The cluster assigns the work to the various nodes, and each is responsible for doing its part. However, what happens when a node goes down?

In this case, the cluster detects that and dynamically redistributes the work among the other nodes in the cluster.

[screenshot: the work redistributed after the node failure]

Work can be things like backups, replicating outside of the cluster or managing batch jobs.
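
A trivial way to picture the reassignment (purely illustrative; the real implementation goes through the Raft-managed cluster state) is to re-run the assignment over the nodes that are still alive:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class WorkAssignmentSketch
    {
        static void Main()
        {
            var tasks = new[] { "daily backup", "offsite replica #1", "offsite replica #2" };
            var nodes = new List<string> { "A", "B", "C" };

            Print(Assign(tasks, nodes));          // work spread across A, B and C

            nodes.Remove("A");                    // node A goes down...
            Print(Assign(tasks, nodes));          // ...and its tasks move to B and C
        }

        // Round-robin assignment over the live nodes; any deterministic scheme works
        // for the illustration, as long as every task ends up on a live node.
        static Dictionary<string, string> Assign(string[] tasks, List<string> nodes) =>
            tasks.Select((task, i) => (task, node: nodes[i % nodes.Count]))
                 .ToDictionary(x => x.task, x => x.node);

        static void Print(Dictionary<string, string> assignment)
        {
            foreach (var kvp in assignment)
                Console.WriteLine($"{kvp.Key} -> node {kvp.Value}");
            Console.WriteLine();
        }
    }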

In fact, because we have a node failure, the cluster has assigned C to be the mentor of node A when it comes up again, letting it know about all the changes in the database while it was down. But I’ll talk about that feature in another post.
