Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 4 min | 761 words

The feature outlined in this post is hidden behind a small button in a relatively obscure part of the RavenDB Studio (Database > Settings > Document Revisions). You can see what it looks like on the right. Despite its unassuming appearance, this is a pretty important feature. Revision Revert is a feature that we wish no one would ever have to use, though, which makes it an interesting one.

Revision Revert allows you to take your entire database back to a particular moment in time. Document changes will be undone, deleted documents will be restored, new documents will be removed, etc.

image

So far, this isn’t a surprising feature; being able to restore to a point in time is something that many other databases have. How is this feature different? In most systems, a point in time restore requires you to… well, restore. In a large database, that can take a lot of time. Revision Revert is an alternative to that. Instead of restoring from scratch, it utilizes the revisions feature in RavenDB to allow you to just hit the time machine button and go back to the desired time.

The common use case for that is immediately after the “Oops” moment. You have run a query without specifying a where clause, deployed a bad version of your app that removed important fields, etc.

Revision Revert is an online operation, so you don’t need to take the database down. In fact, you can still serve reads while the process is going on. Since you’ll typically need to go back in time only a relatively short period, this is a very quick process.

In a distributed system, the admin will invoke this process on one of the nodes in the system and the reverts will be applied on that node and then replicated from there to all the other nodes in the system. We have made every attempt to make what is likely to be a pretty stressful event as easy as possible.

You might have noticed the Window configuration in the screen above. What is that about?

To be honest, this is something that we expect most users to never really care about. It is there for correctness’ sake in a distributed environment. Let’s dig a little deeper into this feature.

First thing we need to talk about is time. The point in time that we’ll restore to is the user’s local time. This is converted into UTC internally and used to compute the cutoff point for the revert. In a distributed system, it is possible (even likely) that different machines on the network will have different clocks. (Note that while RavenDB will work just fine and do the Right Thing if your nodes have different timezones, we have found that really confusing. Better to keep all nodes on the same timezone and clock sync system).

This means that one problem for this feature is that changes happening on two machines at the same time may have different timestamps (in UTC; the local time is not relevant). You need to take that into account when using Revision Revert, since the timestamp is what RavenDB uses to decide what stays and what goes.
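To make the time handling concrete, here is a minimal Python sketch of turning the user’s local time into a UTC cutoff and comparing revision timestamps against it. The names `to_utc_cutoff` and `is_kept` and the fixed UTC offset are assumptions made for illustration; this is not RavenDB’s API.

```python
from datetime import datetime, timedelta, timezone

def to_utc_cutoff(local_time: datetime, utc_offset_hours: int) -> datetime:
    """Interpret a naive local time at the given UTC offset and return it in UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return local_time.replace(tzinfo=local_zone).astimezone(timezone.utc)

def is_kept(revision_utc: datetime, cutoff_utc: datetime) -> bool:
    """A revision stamped at or before the cutoff survives the revert (assumed rule)."""
    return revision_utc <= cutoff_utc

# The admin picks 14:30 local time on a machine that is two hours ahead of UTC.
cutoff = to_utc_cutoff(datetime(2019, 3, 19, 14, 30), utc_offset_hours=2)

# Revisions carry UTC timestamps, so the zone of the node that wrote them is irrelevant.
before = datetime(2019, 3, 19, 12, 29, tzinfo=timezone.utc)
after = datetime(2019, 3, 19, 12, 31, tzinfo=timezone.utc)
print(cutoff.isoformat(), is_kept(before, cutoff), is_kept(after, cutoff))
```

Note that a node whose clock runs a minute fast would stamp the same change differently, which is exactly the skew the paragraph above warns about.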

The second problem is that just because two updates happened at the same time, it doesn’t mean that we learned about them at the same time. What I mean here is that two changes that happened at the same time on different machines may have reached a particular node at very different times. That is where the Window option comes into play. We scan the revisions log for all changes to the system, and we scan them in the order that we learned about them. By default, we’ll go back 4 days until we are sure that there aren’t any revisions that we got out of order and missed.
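One way to picture the Window is the sketch below. It assumes a revisions log ordered by arrival on this node and a scan that runs from the newest entry backwards; the `Revision` type and `revisions_to_revert` function are hypothetical, and the real scan inside RavenDB may differ in its details.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Revision:
    doc_id: str
    last_modified: datetime  # UTC time stamped by the node that made the change

def revisions_to_revert(log, cutoff, window=timedelta(days=4)):
    """Scan the log (ordered by arrival, oldest first) from the newest entry backwards."""
    to_revert = []
    for rev in reversed(log):
        if rev.last_modified < cutoff - window:
            break  # far enough back: assume nothing relevant arrived before this
        if rev.last_modified > cutoff:
            to_revert.append(rev)  # changed after the cutoff, needs to be undone
    return to_revert

cutoff = datetime(2019, 3, 19, 12, 0, tzinfo=timezone.utc)
log = [
    Revision("A", cutoff - timedelta(days=10)),
    Revision("B", cutoff + timedelta(hours=1)),  # made after the cutoff...
    Revision("C", cutoff - timedelta(hours=1)),  # ...but an older change arrived later
    Revision("D", cutoff + timedelta(hours=2)),
]
```

With the default window, the scan keeps going past `C` and finds `B`. With a zero window it would stop at `C` and miss `B`, which is the out of order case the Window exists to cover.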

A few additional things about this feature. Obviously, it requires that you have revisions enabled (and have enough revisions capacity to go back far enough in time, naturally). It supports live restores and operates nicely in a distributed environment. Note that if you are doing a Revision Revert and not all your documents have revisions enabled, only those that have revisions will be reverted.

Currently we apply this revert globally. We are considering allowing you to select specific collections to revert, but I’m not sure how useful that would be in practice.

time to read 3 min | 405 words

The most common network topology for RavenDB replication is a full mesh. For example, if you have three nodes in your cluster and a database that resides on all three nodes, you’ll have a replication topology that will look like this:

image

This works great when the number of nodes that you have in your cluster is reasonably small. However, we recently got a customer question about a different kind of topology. They have a bunch of nodes, on the order of a few dozen, which cooperate to perform some non trivial task. A key part of this is that the nodes are transient and identical. So a new node may pop up, live for a while (days, weeks, months) and then go away. At any given time you might have a few dozen nodes. That kind of environment won’t really work with a full mesh topology. If we tried, it would look something like this (fully connected network with 40 nodes):

image

This has a total of 780 connections(!) in it. You can create a topology like that, but a lot of the processing power in the network is going to be dedicated to just maintaining these connections. And you don’t actually need it. RavenDB’s replication algorithm is actually a gossip algorithm, and as you grow the number of nodes that take part in the replication, the fewer connections you need between nodes. In this case, we can take each of the live nodes and connect each of them to four other (random) nodes. The result would look like so:

image

Remember, each of the nodes is actually connected to a random four other nodes. RavenDB’s replication will ensure that a change to any document in any of the nodes under these conditions will propagate to all the other nodes efficiently.
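The arithmetic behind these two topologies can be sketched in a few lines of Python. The 780 figure is n(n-1)/2 for n = 40. The helpers `random_neighbors` and `reachable` are illustrative names, and the sketch assumes each node pushes changes one way to the four peers it picked; it is a toy model, not RavenDB’s replication code.

```python
import random

def full_mesh_links(n: int) -> int:
    """Number of connections when every pair of nodes is linked."""
    return n * (n - 1) // 2

def random_neighbors(n: int, k: int, seed: int = 0) -> dict[int, list[int]]:
    """Give every node k distinct random peers (never itself)."""
    rng = random.Random(seed)
    return {
        node: rng.sample([peer for peer in range(n) if peer != node], k)
        for node in range(n)
    }

def reachable(neighbors: dict[int, list[int]], start: int) -> set[int]:
    """Nodes that a change written on `start` can eventually reach."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for peer in neighbors[node]:
            if peer not in seen:
                seen.add(peer)
                frontier.append(peer)
    return seen

topology = random_neighbors(40, 4)
print(full_mesh_links(40))                     # full mesh for 40 nodes
print(sum(len(p) for p in topology.values()))  # four links per node
print(len(reachable(topology, 0)), "of 40 nodes reached from node 0")
```

A random graph like this is very likely, but not guaranteed, to reach every node, so a real deployment would want to run a check like `reachable` on the neighbor lists it actually hands out.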

This approach will also transparently handle any intermediary failures and be robust for nodes coming and leaving on the fly. RavenDB doesn’t implement gossip membership, mostly because that is very heavily dependent on the application and deployment pattern, but once you tell a node who its neighbors are, everything will proceed on its own.

time to read 2 min | 334 words

I’m happy to let you know that as of last week, RavenHQ has been updated with full general availability of managed RavenDB 4.1 in the cloud.

If you aren’t familiar with RavenHQ, this update gives you the ability to get a managed RavenDB cluster in minutes. In this case, you get all the usual benefits of RavenDB as well as the peace and quiet of someone else managing all the other details for you.

We have been steadily working on making RavenDB a self managed database, but there are still things that it can’t do on its own. RavenHQ closes the loop. For example, if your database is large and is about to run out of disk space, RavenHQ can automatically increase the allocated storage. A machine went down? RavenHQ will transparently replace it and allow the cluster to recover without having any impact on your client code.

If you have used RavenHQ in the past, there are important changes. Previously, you would use RavenHQ to provision a database. That has changed: you now provision a cluster of nodes and you can create as many databases as you want. If you need a test database for CI, you can just create one on the fly, with no extra charges and no need to use additional APIs beyond the RavenDB native client.

All the usual management features of RavenDB are available as well, including the ability to extend RavenDB on the fly using index extensions and analyzers. We went over all the feedback that we got from the community and users and have done the same thing for RavenHQ that was done in the move from RavenDB 3.5 to RavenDB 4.0. Everything should be better, in the case of RavenDB, we had a rule that things should be at least 10x better, and I believe that you’ll find the RavenHQ experience similar.

As always, we would really love your feedback.

time to read 4 min | 735 words

About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB uses X509 certificates for authentication. These are highly secure and are a good answer for our clients who need to host sensitive information or are working in highly regulated environments. However, certificates have a problem: they expire. In particular, if you are following common industry best practices, you’ll replace your certificates every 2 – 3 months. In fact, the common setup of using RavenDB with Let’s Encrypt will do just that. Certificates will be replaced on the fly by RavenDB without the need for administrator involvement.

If you are running inside a single cluster, that isn’t something you need to worry about. RavenDB will coordinate the certificate update between the nodes in such a way that it won’t cause any disruption in service. However, it is pretty common in RavenDB to have multi cluster topologies. Either because you are deployed in a geo-distributed manner or because you are running using complex topologies (edge processing, multiple cooperating clusters, etc). That means that when cluster A replaces its certificate, we need to have a good story for cluster B still allowing it access, even though the certificate has changed.

I outlined our thinking in the previous post, and I got a really good hint: 13xforever suggested that we look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, ran into issues and was replaced (mostly by certificate transparency). With this hint, I started to investigate this further. Here is what I learned:

  • A certificate is composed of some metadata, the public key and the signature of the issuer (skipping a lot of stuff here, obviously).
  • Keys for certificates can be either RSA or ECDSA. In both cases, there is a 1:1 relationship between the public and private keys (in other words, each public key has exactly one private key).

Given these facts, we can rely on that to avoid the issues with certificate expiration, distributing new certificates, etc.

Whenever a cluster needs a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client has the private key during the handshake), even if the certificate itself changed, we can verify that the other side knows the actual secret, the private key.

In other words, we slightly changed the trust model in RavenDB. From trusting a particular certificate, we trust that certificate’s private key. That is what grants access to RavenDB. In this way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.

This feature means that you can drastically reduce the amount of work that an admin has to do, and it leads you to a system that you set up once and that just keeps working.

There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just having the user generate a new certificate with the same private key isn’t really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate is the same. Actually, to be rather more exact, it verifies that the original (trusted by the admin) certificate and the new certificate that was just presented to us are signed by the same chain of public key hashes.

In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we’ll reject that. The model that we have in mind here is trusting a driver’s license. If you have an updated driver’s license from the same source, that is considered just as valid as the original one on file. If the driver’s license is from Toys R Us, not so much.
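The trust rule described above can be sketched as a comparison of public key hashes. The `Cert` type and `is_trusted_renewal` function below are hypothetical stand-ins, with raw bytes playing the role of real keys. Actual signature verification and the proof of private key possession during the TLS handshake are assumed to happen elsewhere and are left out.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Cert:
    serial: str
    public_key: bytes                     # the certificate's own public key
    issuer_public_keys: tuple[bytes, ...]  # keys of the signing chain, nearest issuer first

def key_hash(public_key: bytes) -> str:
    """Pin on a hash of the public key rather than on the certificate itself."""
    return hashlib.sha256(public_key).hexdigest()

def is_trusted_renewal(trusted: Cert, presented: Cert) -> bool:
    """Accept a new certificate only if both the key and the issuer chain match."""
    same_key = key_hash(trusted.public_key) == key_hash(presented.public_key)
    same_chain = (
        [key_hash(k) for k in trusted.issuer_public_keys]
        == [key_hash(k) for k in presented.issuer_public_keys]
    )
    return same_key and same_chain

original = Cert("01", b"cluster-a-key", (b"intermediate-ca-key", b"root-ca-key"))
renewed = Cert("02", b"cluster-a-key", (b"intermediate-ca-key", b"root-ca-key"))
self_issued = Cert("03", b"cluster-a-key", (b"cluster-a-key",))  # same key, different issuer

print(is_trusted_renewal(original, renewed))      # renewed by the same issuer
print(is_trusted_renewal(original, self_issued))  # the "Toys R Us" license
```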

Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.

As usual, we’ll welcome your feedback. The previous version of this post got us a great feature, after all.

time to read 2 min | 301 words

Reddit’s front page contains a list of recent posts from all communities. In most cases, you want to show posts from communities that the user is subscribed to, but at the same time, you want to avoid flooding the front page with posts from any single community. You also need this to be really fast.

It turns out that doing this in RavenDB is actually very easy. We are going to create a map/reduce index that will aggregate the few most recent posts per community, like so:

image

What this index will do is provide us with the five most recent posts in each community, as well as their date. This is an interesting example of a map/reduce index, because we are using both aggregation and fanout in the index.

The nice thing about this index is that we can project the results directly from it to the user. Let’s see what the queries will look like:

image

This is a simple query that does quite a lot. It gives us the most recent 15 posts across all the communities that the user cares about, with no single community able to contribute more than 5 posts. It sorts them by the posted date and fetches the actual posts in the same query. This is going to give you consistent performance regardless of how much data you have and how many updates you experience. The actual Reddit front page is a lot more complex, I’m sure, but this serves as a nice example of how you can do non trivial stuff in RavenDB’s indexes that simplifies your life by a lot.
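Since the index and query screenshots did not survive here, the following Python sketch mimics the idea in memory. It is not the original index definition or RQL query; `Post`, `recent_posts_index` and `front_page` are made up names, and the field names are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Post:
    id: str
    community: str
    posted_at: int  # e.g. a Unix timestamp

def recent_posts_index(posts, per_community=5):
    """Map: emit (community, post). Reduce: keep the newest few per community."""
    by_community = defaultdict(list)
    for post in posts:
        by_community[post.community].append(post)
    return {
        community: sorted(items, key=lambda p: p.posted_at, reverse=True)[:per_community]
        for community, items in by_community.items()
    }

def front_page(index, subscriptions, limit=15):
    """Fan out the per-community entries, sort by date, take the newest `limit`."""
    candidates = [post for c in subscriptions for post in index.get(c, [])]
    return sorted(candidates, key=lambda p: p.posted_at, reverse=True)[:limit]

posts = (
    [Post(f"a/{i}", "a", 100 + i) for i in range(10)]
    + [Post(f"b/{i}", "b", 50 + i) for i in range(3)]
    + [Post(f"c/{i}", "c", 200 + i) for i in range(20)]
)
index = recent_posts_index(posts)
page = front_page(index, ["a", "b"])
```

The point of doing the work in the index is that the query only ever touches at most five entries per subscribed community, no matter how many posts exist.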

time to read 1 min | 153 words

I wanted to point out the RavenDB Customers Portal website, because it has a very important function that may not seem obvious.

As part of the process of setting up RavenDB, we provide our users with a domain name so they can run their clusters securely. This is pretty easy and has been used by thousands of our users.

However, advanced scenarios, such as adding a node to a cluster or changing a node’s IP, required you to re-run the setup and weren’t convenient. We have now made this even simpler: you can use the customers portal to edit your cluster’s DNS configuration.

Here is what this looks like:

image

This is available to customers who purchased a commercial license as well as users running on the community edition. As usual, we would love to get your feedback.

time to read 6 min | 1018 words

I got a great comment on my previous post about using Map/Reduce indexes in RavenDB for event sourcing. The question was how to handle time sensitive events or ordered events in this manner. The simple answer is that you can’t: RavenDB intentionally doesn’t expose anything about the ordering of the documents to the index. In fact, given the distributed nature of RavenDB, even the notion of ordering documents by time becomes really hard.

But before we close the question as “cannot do that by design”, let’s see why we want to do something like that. Sometimes, this really is just the developer wanting to do things in the way they are used to, and there is no need to actually enforce the ordering of documents. But in other cases, you want to do this because there is a business meaning behind these events. In those cases, however, you need to handle several things that are a lot more complex than they appear, because you may be informed of an event long after it actually happened, and you need to handle that.

Our example for this post is going to be mortgage payments. This is a good example of a system where time matters. If you don’t pay your payments on time, that matters. So let’s see how we can model this as an event based system, shall we?

A mortgage goes through several stages, but the only two that are of interest for us right now are:

  • Approval – when the terms of the loan are set (how much money, what is the collateral, the APR, etc).
  • Withdrawal – when money is actually withdrawn, which may happen in installments.

Depending on the terms of the mortgage, we need to compute how much money should be paid on a monthly basis. This depends on a lot of factors. For example, if the principal is tied to some base line, changes to the base line will change the amount of the principal. Other factors include whether only some of the amount was withdrawn, whether there are late fees, a balloon payment, etc. Because of that, on a monthly basis, we are going to run a computation for the expected amount due for the next month.

And, obviously, we have the actual payments that are being made.

Here is what the (highly simplified) structure looks like:

image

This includes all the details about the mortgage, how much was approved, the APR, etc.

The following is what the expected amount to be paid looks like:

image

And here we have the actual payment:

image

All pretty much bare bones, but sufficient to explain what is going on here.

With that in place, let’s see how we can actually make use of it, shall we?

Here are the expected payments:

image

Here are the mortgage payments:

image

The first thing we want to do is to aggregate the relevant operations on a monthly basis, since this is how mortgages usually work. I’m going to use a map/reduce index to do so, and as usual in this series of posts, we’ll use JavaScript indexes to do the deed.

Unlike previous examples, we now have real business logic in the index, specifically the allocation of funds for partial payments. If the amount of money paid is less than the expected amount, we first apply it to the interest, and only then to the principal.
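The allocation rule itself is small enough to show directly. This is a Python sketch of the rule as stated above (interest first, then principal), not the JavaScript index from the post; the function name and the returned field names are assumptions.

```python
from decimal import Decimal

def allocate_payment(paid: Decimal, interest_due: Decimal, principal_due: Decimal) -> dict:
    """Apply a payment to interest first, and only then to the principal."""
    to_interest = min(paid, interest_due)
    to_principal = min(paid - to_interest, principal_due)
    past_due = (interest_due + principal_due) - (to_interest + to_principal)
    return {"interest": to_interest, "principal": to_principal, "past_due": past_due}

# A partial payment of 500 against 300 interest and 700 principal.
print(allocate_payment(Decimal("500"), Decimal("300"), Decimal("700")))
```

Anything paid beyond the total due is simply ignored in this sketch; a real system would have to decide what happens to an overpayment.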

Here are the results of this index:

image

You can clearly see the mistakes that were made in the payments. In March, the amount due for the loan increased (another installment was taken from the mortgage) but the payments were still made based on the old amount.

We aren’t done yet, though. So far we have the status of the mortgage on a monthly basis, but we want to have a global view of the mortgage. In order to do that, we need to take a few steps. First, we need to define an Output Collection for the index, which will allow us to further process the results of this index.

In order to compute the current status of the mortgage, we aggregate both the mortgage status over time and the amount paid by the bank for the mortgage, so we have the following index:

Which gives us the following output:

image

As you can see, we have a PastDue marker on the loan. At this point, we can make another payment on the mortgage, to close the missing amount, like so:

image

This will update the monthly mortgage status and then the overall status. Of course, in a real system (I mentioned this is highly simplified, right?) we’ll need to take into account payments made at one time but applied to a different time (which we can handle with an AppliedTo property), and a lot of the actual core logic isn’t in indexes. Please don’t do mortgage logic in RavenDB indexes; that stuff deserves its own handling, in your own code. And most certainly don’t do that in JavaScript. The idea behind this post is to explore how we can handle non trivial event projection using RavenDB. The example was chosen because I assume most people will be familiar with it and it wasn’t immediately obvious how to go about actually solving it.

If you want to play with this, you can import the following file (Settings > Import Data) to get the documents and index definitions.

time to read 10 min | 1813 words

This is a sordid tale of chance and mystery and the nasty tricks that Murphy can play on you.

A few customers reported an error similar to the following one:

Invalid checksum for page 1040, data file Raven.voron might be corrupted, expected hash to be 0 but was 16099259854332889469

One such case might be a disk corruption, but multiple customers reporting it is an indication of a much bigger problem. That was a trigger for a STOP SHIP reaction. We consider data safety a paramount goal of RavenDB (part of the reason why I’m doing this Production Postmortem series), and we put some of our most experienced people on it.

The problem was, we couldn’t find it. Having access to the corrupted databases showed that the problem occurred at random. We use Voron in many different capacities (indexing, document storage, configuration store, distributed log, etc) and these incidents happened across the board. That narrowed the problem to Voron specifically, and not bad usage of Voron. This reduced the problem space considerably, but not enough for us to be able to tell what was going on.

Given that we didn’t have a lead, we started by recognizing what the issue was and added additional guards against it. In fact, the error itself was a guard we added, validating that the data on disk is the same data that we have written to it. The error above indicates that there has been a corruption in the data because the expected checksum doesn’t match the actual checksum from the data. This gives us an early warning system for data errors and prevents us from proceeding on erroneous data. We added this primarily because we were worried about physical disk corruption of data, but it turns out that this is also a great early warning system for when we mess up.
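As an illustration of this kind of guard, here is a minimal Python sketch of a per page checksum. The 8 byte header layout and the use of BLAKE2b as the hash are assumptions chosen to keep the example in the standard library; they are not Voron’s actual page format or hash function.

```python
import hashlib

def page_checksum(payload: bytes) -> int:
    """64-bit checksum of the page contents (BLAKE2b stands in for the real hash)."""
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "little")

def write_page(payload: bytes) -> bytes:
    """Store the checksum in an 8-byte header in front of the payload."""
    return page_checksum(payload).to_bytes(8, "little") + payload

def read_page(page_number: int, raw: bytes) -> bytes:
    """Refuse to hand out data whose stored checksum does not match its contents."""
    expected = int.from_bytes(raw[:8], "little")
    payload = raw[8:]
    actual = page_checksum(payload)
    if expected != actual:
        raise ValueError(
            f"Invalid checksum for page {page_number}, "
            f"expected hash to be {expected} but was {actual}"
        )
    return payload

raw = write_page(b"some page contents")
assert read_page(1040, raw) == b"some page contents"
```

In this model, an “expected hash to be 0” error like the one above would correspond to a page whose header was never written, while its contents are not what a blank page should hold.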

The additional guards were primarily additional checks for the safety of the data in various locations on the pipeline. Given that we couldn’t reproduce the issue ourselves, and none of the customers affected were able to reproduce this, we had no idea where to go from there. Therefore, we had a team that kept on trying different steps to reproduce this issue and another team that added additional safety measures for the system to catch any such issue as early as possible.

The additional safety measures went into the codebase for testing, but we still didn’t have any luck in figuring out what was going on. We went from trying to reproduce this by running various scenarios to analyzing the code and trying to figure out what was going on. Everything pointed to it being completely impossible for this to happen, obviously.

We got a big break when the repro team managed to reproduce this error when running a set of heavy tests on 32-bit machines. That was really strange, because none of the occurrences to date had been on 32 bits.

It turns out that this was a really lucky break, because the problem wasn’t related to 32 bits at all. What was going on there is that under 32 bits, we run in a heavily constrained address space, which, under load, can cause us to fail to allocate memory. If this happens at certain locations, this is considered to be a catastrophic error and requires us to close the database and restart it to recover. So far, this is pretty standard and both the expected and the desired reaction. However, it looked like sometimes, this caused an issue. This also tied in with some observations from customers about the state of the system when this happened (low memory warnings, etc).

The very first thing we did was to test the same scenario on the codebase with the new checks added. So far, the repro team worked on top of the version that failed at the customers’ sites, to prevent any other code change from masking the problem. With the new checks, we were able to confirm that they actually triggered and caught the situation early. That was a great confirmation, but we still didn’t know what was going on. Luckily, we were able to add more and more checks to the system and run the scenario. The idea was to trip over a guard rail as early as possible, to allow us to inspect what actually caused it.

Even with a reproducible scenario, that was quite hard. We didn’t have a reliable method of reproducing it; we had to run the same set of operations for a while to hopefully reproduce this scenario. That took quite a bit of time and effort. Eventually, we figured out what the root cause of the issue was.

In order to explain that, I need to give you a refresher on how Voron is handling I/O and persistent data.

Voron uses an MVCC model, in which any change to the data is actually done on a scratch buffer. This allows us to have snapshot isolation at very little cost and gives us a drastically simplified model for working with Voron. Other important factors include the need to be transactional, which means that we have to make durable writes to disk. In order to avoid doing random writes, we use a Write Ahead Journal. For these reasons, I/O inside Voron is basically composed of the following operations:

  • Scratch (MEM) – copy on write data for pages that are going to be changed in the transaction. Usually purely in memory. This is how we maintain the Isolated and Atomic aspects of ACID.
  • Journal (WAL) – sequential, unbuffered writes that include all the modifications made in the transaction. This is how we maintain the Atomic and Durable aspects of ACID.
  • Flush (MMAP) – copy data from the scratch buffers to the data file, which allows us to reuse space in the scratch file.
  • Sync (FSYNC) – ensure that the data from a previous flush is stored on a durable medium, which allows us to delete old journal files.

In Voron 3.5, we had Journal writes (which happen on each transaction commit) at one side of the I/O behavior and flush & sync as the other side. In Voron 4.0, we actually split it even further, meaning that journal writes, data flush and file sync are all independent operations which can happen independently.

Transactions are written to the journal file one at a time, until it reaches a certain size (usually about 256MB), at which point we’ll create a new journal file. Flush will move data from the scratch buffers to the data file and sync will ensure that the data that was moved to the data file is durably stored on disk, at which point you can safely delete the old journals.

In order to trigger this bug, you needed to have the following sequence of events:

  • Have enough transactions happen quickly enough that the flush / sync operations are lagging by more than a single file behind the transaction rate.
  • Have a transaction start a new journal file while the flush operation was in progress.
  • Have, concurrently, the sync operation complete an operation that includes that last journal file. Sync can take a lot of time.
  • Have another flush operation go on while the sync is in progress, which will move the flush target to the new journal file.
  • Have the sync operation complete, having only synced some of the changes that came from that journal, but, because the new flush (which we didn’t sync yet) already moved on from that journal, mistakenly believe that this journal file is completely done and delete it.
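The heart of that sequence is the decision about when a journal file may be deleted. The Python sketch below is a deliberately simplified model of that decision, with made up names (`Journal`, `deletable_buggy`, `deletable_safe`) and made up transaction numbers; it is not Voron’s code, only an illustration of why the flush position is the wrong thing to look at.

```python
from dataclasses import dataclass

@dataclass
class Journal:
    number: int
    first_tx: int
    last_tx: int

def deletable_buggy(journals, flush_journal: int):
    """Mistaken rule: any journal the flush has moved past is considered done."""
    return [j.number for j in journals if j.number < flush_journal]

def deletable_safe(journals, last_synced_tx: int):
    """Safe rule: a journal is done only once all of its transactions were synced."""
    return [j.number for j in journals if j.last_tx <= last_synced_tx]

journals = [Journal(1, first_tx=1, last_tx=100), Journal(2, first_tx=101, last_tx=200)]

# The sync that just finished only covered data flushed up to tx 80 (journal 1),
# while a later flush has already moved on to journal 2.
print(deletable_buggy(journals, flush_journal=2))  # journal 1 is deleted too early
print(deletable_safe(journals, last_synced_tx=80))  # nothing can go yet
print(deletable_safe(journals, last_synced_tx=120))  # after the next sync, journal 1 can go
```

In this model, transactions 81 through 100 live only in journal 1 until the next sync completes, which is the window in which a crash loses them.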

All of these steps are just the setup for the actual problem, mind you.

At this point, we are set up to have this issue, but we have yet to actually experience it. This is because what has happened is that the persistent state (on disk) of the database is now suspect: if a crash happens, we will be missing the oldest journal, which still has transactions that haven’t been properly persisted to the data file.

Once you have set up the system properly, you still aren’t done, in terms of reproducing this issue. We now have a race, because the next flush / sync cycle is going to fix this issue. So you need to have a restart of the database within a very short period of time.

For additional complexity, the series of steps above will cause a problem, but even if you crash in just the right location, there are still some mitigating circumstances. In many cases, you are modifying the same set of pages in multiple transactions, and if the transactions that were lost because of the early deletion of the journal file had pages that were modified in future transactions, these transactions will fill in the missing details and there will be no issue. That was one of the issues that made it so hard to figure out what was going on. We needed to have a very specific set of timings between three separate threads (journal, flush, sync) that create the hole, then another race to restart the database at this point before Voron fixes itself in the next cycle, all happening just at the stage that Voron moves between journal files (typically every 256MB of compressed transactions, so not very often at all) and with just the right mix of writes to different pages on transactions that span multiple journal files.

These are some pretty crazy requirements for reproducing such an issue, but as the saying goes: One in a million is next Tuesday.

What made this bug even nastier was that we didn’t catch it earlier. We take the consistency guarantees of Voron pretty seriously and we most certainly have code to check if we are missing transactions during recovery. However, we had a bug in this case. Because obviously there couldn’t be a transaction previous to Tx #1, we aren’t checking for a missing transaction at that point. At least, that was the intention of the code. What was actually executing was a check for missing transactions on every transaction except for the first transaction on the first journal file during recovery. So instead of ignoring the check just on Tx #1, we ignored it on the first tx on all recoveries.
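The difference between the intended check and the one that actually ran can be sketched as follows. This is a simplified Python model with hypothetical function names; it assumes recovery sees a list of transaction ids read from the journals and knows the last transaction that was durably persisted to the data file (0 for a brand new database).

```python
def find_gap_buggy(recovered_tx_ids, last_persisted_tx):
    """Skips the check on the first transaction read, whatever its id is."""
    previous = None
    for tx in recovered_tx_ids:
        if previous is not None and tx != previous + 1:
            return tx  # first transaction that does not follow its predecessor
        previous = tx
    return None

def find_gap_fixed(recovered_tx_ids, last_persisted_tx):
    """Every transaction, including the first one read, must follow the previous one."""
    previous = last_persisted_tx  # 0 for a new database, so Tx #1 passes naturally
    for tx in recovered_tx_ids:
        if tx != previous + 1:
            return tx
        previous = tx
    return None

# The data file is durable up to tx 80, but the journal holding tx 81..100 was deleted,
# so recovery starts reading at tx 101.
print(find_gap_buggy([101, 102, 103], last_persisted_tx=80))  # gap goes unnoticed
print(find_gap_fixed([101, 102, 103], last_persisted_tx=80))  # gap is reported at tx 101
```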

Of course, this is the exact state that we have caused in this bug.

Sigh.

We added all the relevant checks, tightened the guard rails a few more times to ensure that a repeat of this issue will be caught very early and provided a lot more information in case of an error.

Then we fixed the actual problems and subjected the database to what in humans would be called enhanced interrogation techniques. Hammers were involved, as well as an irate developer with a penchant for pulling the power cord at various stages just to see what would happen.

We have released the fix in the RavenDB 4.1.4 stable release and we encourage all users to upgrade as soon as possible.

time to read 3 min | 454 words

If you are tracking the nightlies of RavenDB, the Pull Replication feature has fully landed. You now have three options to choose from when you define replication in your systems.

image

External Replication is meant to go from the current database to another database (usually in a different cluster). It is a way to share data with another location. The owner of the replication is the current database, which initiates the connection and sends the data to the other side.

Pull Replication reverses this behavior. The first thing you’ll need to do to get Pull Replication working is to define the Pull Replication Hub.

image

As you can see, there isn’t much to do here. We give the hub a name and minimal configuration (how far back this should go, basically). In this case, we are allowing sinks to get the data from the database, with a 20 minute delay built into the loop. You can also export the sink configuration from this view. We also define a certificate that provides access to this Pull Replication Hub. This certificate allows access only to this Pull Replication Hub; it grants no additional permissions. In this way, you may have one certificate that provides access to a delayed public stock ticker and another that provides immediate access to the data.

The next step is to go to the other side, the sink. There, we either manually define the details of the hub or (more likely) import the configuration. The sink will then connect to the hub and start pulling the data from it. Here is what this looks like:
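The effect of the delay can be pictured with a tiny Python sketch. It models only the observable behavior (a sink never sees changes younger than the configured delay); `visible_to_sink` is a hypothetical helper and says nothing about how RavenDB implements or configures the hub.

```python
from datetime import datetime, timedelta, timezone

def visible_to_sink(changes, now, delay=timedelta(minutes=20)):
    """Return the ids of changes that are old enough to be sent to a delayed sink."""
    return [doc_id for doc_id, modified_at in changes if modified_at <= now - delay]

now = datetime(2019, 2, 1, 12, 0, tzinfo=timezone.utc)
changes = [
    ("ticks/1", now - timedelta(minutes=30)),
    ("ticks/2", now - timedelta(minutes=5)),
]

print(visible_to_sink(changes, now))                      # the delayed public ticker
print(visible_to_sink(changes, now, delay=timedelta(0)))  # a hub with no delay
```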

image

The idea is that you are very likely to have a lot more sinks than hubs. That is why we make it easy to define the sink just by importing (although in practical terms we expect that this will just be part of a shared image that is deployed many times).

Once we have defined the Sink Pull Replication, it will connect to the Hub and start accepting data. You can track how this works from the studio:

image

On the other side, you can track the connected sinks on the Hub:

image

And this is all you need to set up Pull Replication yourself.
