Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969



time to read 1 min | 180 words

We were asked about best practices for managing the RavenDB session (unit of work) in a .NET Core MVC application. I thought it was interesting enough to warrant its own post.

RavenDB’s client API is divided into the Document Store, which holds the overall configuration required to access RavenDB, and the Document Session, which is a short-lived object implementing the Unit of Work pattern and typically used for only a single request.

We’ll start by adding the RavenDB configuration to the appsettings.json file and binding it to a strongly typed configuration class:
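A minimal sketch of what this might look like; the section name, keys and class name are assumptions rather than the exact code from the post:

// appsettings.json (assumed shape):
// {
//   "Raven": {
//     "Urls": [ "http://live-test.ravendb.net" ],
//     "Database": "Northwind"
//   }
// }

public class RavenSettings
{
    public string[] Urls { get; set; }
    public string Database { get; set; }
}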

The last thing to do is to register this with the container for dependency injection purposes:
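A sketch of that registration, following the usual .NET Core Startup conventions (again, the names are assumptions rather than the post’s exact code):

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;
using Raven.Client.Documents;
using Raven.Client.Documents.Session;

public class Startup
{
    public Startup(IConfiguration configuration) => Configuration = configuration;
    public IConfiguration Configuration { get; }

    public void ConfigureServices(IServiceCollection services)
    {
        services.Configure<RavenSettings>(Configuration.GetSection("Raven"));

        // A single Document Store for the whole application.
        services.AddSingleton<IDocumentStore>(sp =>
        {
            var settings = sp.GetRequiredService<IOptions<RavenSettings>>().Value;
            var store = new DocumentStore
            {
                Urls = settings.Urls,
                Database = settings.Database
            };
            return store.Initialize();
        });

        // A new Document Session per request (scoped).
        services.AddScoped<IAsyncDocumentSession>(sp =>
            sp.GetRequiredService<IDocumentStore>().OpenAsyncSession());

        services.AddMvc();
    }
}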

We register both the Document Store and the Document Session in the container, but note that the session is registered in scoped mode, so each request will get a new session.

Finally, let’s make actual use of the session in a controller:
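A hypothetical controller using the scoped session; the entity and action are mine, but the explicit SaveChangesAsync call matches the recommendation below:

using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Raven.Client.Documents.Session;

public class Order
{
    public string Id { get; set; }
    public decimal Total { get; set; }
}

public class OrdersController : Controller
{
    private readonly IAsyncDocumentSession _session;

    public OrdersController(IAsyncDocumentSession session) => _session = session;

    [HttpPost]
    public async Task<IActionResult> Create([FromBody] Order order)
    {
        await _session.StoreAsync(order);
        await _session.SaveChangesAsync(); // explicit, rather than having a filter do it for us
        return Ok(order.Id);
    }
}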

Note that we used to recommend having SaveChangesAsync run for you automatically, but at this time, I think it is probably better to do this explicitly.

time to read 6 min | 1051 words

A process running on your system is typically a black box. You don’t have a lot of insight into what is going on inside it. Oh, there are all sorts of tools you can use to infer things (looking at system calls, memory consumption, network connections, etc.), but by default… it is a mystery.

RavenDB is a database. It is meant to run unattended for long durations and is designed to mostly run itself. That means that when you look at it, you want to be able to figure out exactly what is going on with the system as soon as possible. To that end, we have included a lot of features inside RavenDB that expose the internal state of the system. From tracing each I/O and its duration to providing detailed statistics about costs and amount of effort invested in various tasks.

These features are invaluable to figure out exactly what is going on in RavenDB at a particular point in time. Of course, nothing beats the ability to open a debugger and inspect the state of the system. But that is something that you can only really do in development. It is not something that can be done in production, obviously. Or can it?

Since RavenDB 3.0, we have actually had just this feature: the ability to ask RavenDB to capture and display its own state in a format that should be very familiar to developers. When we created RavenDB 4.0, we were able to carry this feature over on Windows (at some cost), but it was a complete non-starter on Linux.

On Windows, a process can debug another process if they belong to the same user (somewhat of an oversimplification, but good enough). On Linux, the situation is a lot more complex. A process can usually only debug another process if the debugger is running as root or is the parent process of the debuggee process.

Another complication was that we are using ClrMD, a wonderful library that allows us to introspect live processes (among many other things). It did not have support for Linux until about a month ago. As soon as we had the most basic support there, we jumped into action to see how we could bring this feature to Linux as well. A lot of our users are running production systems on Linux, and the ability to look at the system, go “Hmm, I wonder what this is doing”, and then actually be able to tell is something that we consider a major boost to RavenDB.

It took a lot of fighting and learning a lot more about how debugging permissions work on Linux than I ever wanted to know. But we got it working (details below). You can see how this looks on a live Linux server:

[Screenshot: stack traces captured live from a RavenDB server running on Linux]

As you can see, there is an indexing thread here doing some work on spatial data. We are going to enhance this view further with the ability to see CPU times as well as job names. The idea is that this is something that you will look at and get enough insight to not need to check the logs or try to infer what is going on. You could just tell.

Now, for the gory details of how this works. We changed the implementation on both Windows and Linux to use passive attach to the process, which is much faster. The first thing we tried, once we moved to passive attach, was to debug ourselves.

This is a nice enough feature, and quite elegant. We debug ourselves, pull the stack traces and display the data. Unfortunately, this doesn’t work on Linux. A process cannot debug itself. All debugging on Linux is based on the ptrace() system call, and the permissions for that are as described above. I can’t imagine the security implications of letting a process debug itself. After all, it can already do anything the process can do, because it is the process. But I guess that this is an esoteric enough scenario that no one noticed, and when they did, the reaction was: use a workaround.

The usual workaround is to have a parent process spawn RavenDB, which would then be able to debug it. That is… possible, but it would be a major shift in how we deploy, and not something that I wanted to do. There is also the ptrace_scope flag, which is supposed to control this behavior. In my tests, at least, disabling the security checks via this flag did absolutely nothing.

Running as root worked just fine, of course. And then the process crashed. On Linux, when trying to debug your own process, there seems to be an interesting interaction between the debugger and the debuggee when an exception is thrown, to the point where it will corrupt the CoreCLR state and kill the process. That was a fun bug to trace, sort of. Linux has an escape hatch in the form of the PR_SET_PTRACER option that can be used. However, you can’t designate your own process, unfortunately. That, combined with the hard crashes, made self-debugging a non-starter.

But I still want this feature, and without changing too much about how we are doing things.

Here is what we ended up doing. We have a separate process just to capture the stack trace. When you ask RavenDB to get its stack trace, it will spawn this process, but ask it to wait. It will then grant the new process the permissions necessary to debug RavenDB and signal it to continue. At this point, the debugger child process will capture the stack trace and send it back to RavenDB. RavenDB will reset the permission, enhance the stack trace with additional information that we can provide from inside the process and display it to the user.
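Here is a rough sketch of the parent side of that dance on Linux. This is not RavenDB’s actual code (which also relies on setcap, as noted below), and the helper’s command line arguments are made up, but prctl(PR_SET_PTRACER) is the real mechanism for letting a specific process ptrace you:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class StackTraceCapture
{
    private const int PR_SET_PTRACER = 0x59616d61; // Yama's "allow this pid to trace me" option

    [DllImport("libc", SetLastError = true)]
    private static extern int prctl(int option, ulong arg2, ulong arg3, ulong arg4, ulong arg5);

    public static string CaptureOwnStackTraces(string debuggerPath)
    {
        var myPid = Process.GetCurrentProcess().Id;

        // 1. Spawn the dedicated debugger process and ask it to wait for our signal.
        var child = Process.Start(new ProcessStartInfo(debuggerPath, $"--wait --attach-to {myPid}")
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true
        });

        // 2. Grant that specific child permission to ptrace() us.
        if (prctl(PR_SET_PTRACER, (ulong)child.Id, 0, 0, 0) != 0)
            throw new InvalidOperationException("prctl(PR_SET_PTRACER) failed");

        // 3. Signal it to continue, then read back the captured stack traces.
        child.StandardInput.WriteLine("go");
        string stackTraces = child.StandardOutput.ReadToEnd();
        child.WaitForExit();

        // 4. Reset the permission once we are done.
        prctl(PR_SET_PTRACER, 0, 0, 0, 0);
        return stackTraces;
    }
}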

The actual debugger process is also marked with setcap to provide it the additional permissions it needs. This separation means that we isolate these permissions to a single-purpose tool that can be invoked and closed, without increasing the attack surface of RavenDB.

The end result is that you can walk to a production RavenDB server, running on Windows or Linux, and get better information than if you just attached to it with the debugger.

time to read 8 min | 1502 words

One of the quiet side effects of people moving to the cloud is that it makes the relation between performance and $$$ costs evident. In your own data center, it is easy to lose track of the connection between the yearly DC budget and the application performance. But on the cloud, you can track it quite easily.

This immediately leads people to try to optimize their costs and use the minimal amount of resources that they can get away with. This is great for people who believe in efficient software.

One of the results of the focus on price and performance has been the introduction of burstable cloud instances. These instances allow you to host workloads that do not need the full machine resources to run effectively. There are many systems whose normal runtime cost is minimal, with only occasional spikes. In AWS, you have the T series, and Azure has the B series. Given how often RavenDB is deployed on the cloud, it shouldn’t surprise you that we are frequently run on such instances. The cost savings can be quite attractive, around 15% in most cases, and in the case of small instances, even more impressive.

RavenDB can run quite nicely on a Raspberry Pi, or a resource starved container, and for many workloads, that makes sense. Looking at AWS in this case, consider the t3a.medium instance with 2 cores and 4 GB at $27.4 / month vs. a1.large (the smallest non burstable instance) with the same spec (but an ARM machine) at $37.2 per month. For that matter, a t3a.small with 2 cores and 2 GB of memory is $13.7. As you can see, the cost savings add up, and it makes a lot of sense to want to use the most cost effective solution.

Enter the problem with burstable instances: they are bursty. That means that you have two options when you need more resources. Either you end up with throttling (reducing the amount of CPU that you can use), or you are charged for the extra CPU power you used.

Our own production systems, running all of our sites and backend systems, are running on a cluster of 3 t3.large instances. As I mentioned, the cost savings are significant. But what happens when you go above the limit? In our production systems, for the past 6 months, we have never had an instance where RavenDB used more than the provided burstable performance. It helps that the burstable machines allow us to accrue CPU time when we aren’t using it, but overall it means that we are doing a good job of handling requests efficiently. Here are some metrics from one of the nodes in the cluster during a somewhat slow period (the range is usually 20 – 200 requests / sec).

[Screenshot: request latency and resource utilization metrics from one of the production nodes]

So we are pretty good in terms of latency and resource utilization; that’s great.

However, the immediate response to seeing that we aren’t hitting the system limits is… to reduce the machine size again, to pay even less. There is a lot of material on cost optimization in the cloud that you can read, but that isn’t the point of this post. One of the more interesting choices you can make with burstable instances is to ask not to go over the limit. If your system is using too much CPU, it is simply taken away until you are back within budget. Some workloads are just fine with this, because there is no urgency. Some workloads, such as servicing requests, take less kindly to this sort of setup. Regardless, this is a common enough deployment model that we have to take it into account.

Let’s consider the worst case scenario from our perspective. We have a user that runs the cheapest possible instance, a t3a.nano with 2 cores and 512 MB of RAM, costing just $3.4 a month. The caveat with this instance is that you have just 6 CPU credits / hour to your name. A CPU credit is basically 100% CPU for 1 minute. Another way of looking at this is that a t3a.nano instance has a budget of 360 CPU seconds per hour. If it uses more than that, it is charged a hefty fee (roughly ten times the machine’s hourly cost). So we have users that disable the ability to go over the budget.

Now, let’s consider what is going to happen to such an instance when it hits an unexpected load. In the context of RavenDB, it can be that you created a new index on a big collection. But something seemingly simple, such as generating a client certificate, can also have a devastating impact on such an instance. RavenDB generates client certificates with 4096 bit keys. On my machine (a nice, powerful dev machine), that can take anything from 300 – 900 ms of CPU time and cause a noticeable spike in one of the cores:

[Screenshot: CPU spike on a single core while generating a client certificate]

On the nano machine, we have measured key creation time of over six minutes.

The answer isn’t that my machine is 800 times more powerful than the cloud machine. The problem is that this takes enough CPU time to consume all the available credits, which causes throttling. At this point, we are in a sad state. Any additional CPU credits (and time) that we earn go directly to the already existing computation. That means that any other CPU time is pushed back. Way back. This happens at a level below the operating system (at the hypervisor), so the OS isn’t even aware of it. What is happening from the point of view of the OS is that the CPU is suddenly much slower.

All of this makes sense and is expected, given the burstable nature of the system. But what is the observed behavior?

Well, the instance appears to be frozen. All the cloud metrics will tell you that everything is green, but you can’t access the system (no CPU to accept new connections), you can’t SSH into it (same), and if you have an existing SSH connection, it appears to have frozen. Measuring performance from inside the machine shows that everything is cool. One CPU is busy (expected: generating the certificate, doing indexing, etc). Another is idle. But the system behaves as if it has no CPU available, which is exactly what is going on, except in a way that we can’t tell.

RavenDB goes to a lot of trouble to be a nice citizen and a good neighbor. This includes scheduling operations so the underlying OS will always have the required resources to handle additional load. For example, background tasks are run with lowered priority, etc. But when running on a burstable machine that is being throttled… well, from the outside it looked like certain trivial operations would just take the entire machine down, and it wouldn’t be recoverable short of a hard reboot.

Even when you know what is going on, it doesn’t really help you, because from inside the machine there is no way to access the cloud metrics with good enough precision to take action.

We have a pretty strong desire not to get into this kind of situation, so we implemented what I can only refer to as countermeasures. When you are running on a burstable instance, you can let RavenDB know what your CPU credits situation is, at which point RavenDB will actively monitor the machine and compute its own tally of the outstanding CPU credits. When we notice that the CPU credits are running short, we’ll start proactively halting internal processes to give the machine more room to recover from the CPU credits shortage.
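To make the accounting concrete, here is a toy model of such a tally; the numbers are illustrative for a t3a.nano and the thresholds are mine, not RavenDB’s actual implementation:

using System;

class CpuCreditsBudget
{
    public double CreditsPerHour = 6;   // a t3a.nano earns 6 credits/hour = 360 CPU-seconds/hour
    public double MaxCredits = 144;     // credits can accrue up to roughly a day's worth
    public double Remaining = 144;

    // Call periodically with the wall clock time elapsed and the CPU time consumed since the last sample.
    public void Sample(TimeSpan elapsed, TimeSpan cpuTimeUsed)
    {
        Remaining += CreditsPerHour * elapsed.TotalHours;  // earn credits as time passes
        Remaining -= cpuTimeUsed.TotalMinutes;             // one credit is one vCPU running flat out for a minute
        Remaining = Math.Clamp(Remaining, 0, MaxCredits);
    }

    // Illustrative thresholds: first shed background work, then start failing requests over to other nodes.
    public bool ShouldPauseBackgroundWork => Remaining < MaxCredits * 0.25;
    public bool ShouldRejectRequests => Remaining < MaxCredits * 0.05;
}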

We’ll move ETL processes and ongoing backups to other nodes in the cluster and even pause indexing in order to recover the CPU time we need. Here is an example of what you’ll see in such a case:

[Screenshot: the notification RavenDB shows when it pauses work to recover CPU credits]

One of the nice things about the kind of promises RavenDB makes about its indexes is that it is able to make these kinds of decisions without impacting the guarantees that we provide.

If the situation is still bad, we’ll start proactively rejecting requests. The idea is that instead of getting to the point where we are being throttled (and effectively down, as far as the world is concerned), we’ll process a request by simply sending a 503 Service Unavailable response back. This is going to be very cheap to do, so it won’t put additional strain on our limited budget. At the same time, the RavenDB client treats this kind of error as a trigger for failover and will use another node to service the request.

This was a pretty long post to explain that RavenDB is going to give you a much nicer experience on burstable machines, even if you have disabled the ability to burst over your budget.

time to read 5 min | 960 words

What happens when you want to page through result sets of a query while the underlying data set is being constantly modified?

This seems like a tough problem, and one that you wouldn’t expect to encounter very often, right? But a really good example of just this issue is the notion of a feed in a social network. To take Twitter for simplicity, you have many people generating twits, and other users are browsing through their timeline.

What ends up happening is that the user browsing through their timeline is actually trying to page through the results, but at the same time, you are going to get more updates to the timeline while you are reading it. One of the key requirements that we have here, then, is that we want to be sure that we aren’t actually creating a lot of jitter for the user as they scroll through the timeline. Luckily, because this is a hard problem, the users are already quite familiar with some of the side effects. It would surprise no one to see the same twit multiple times in the timeline. It can be because of a retweet or a like by a user you follow, or it can be a result of the way paging is done.

Now that we understand what we want to achieve, let’s see how you can try getting there. The simplest way to handle this is to ask the database for some sort of stable reference for the query. So instead of executing the query and being done with it, you’ll have the server maintain it for a period of time and send you the pages as you need them. This is simple and easy to implement, but costly in terms of system resources. You’ll need to keep the query results in memory for each one of your users, and that can be quite a lot of memory to keep around just in case. Especially given the difference between human interaction times and the speed of modern systems.

Another way to do that is to ask the database engine to generate a way to re-create the query as it was at that time. This is sometimes called a continuation token or some such. That works great, usually, but comes with its own complications. For example, imagine that I’m doing this on the following query:

from Users order by LastName limit 5

Which gives us the following result:

[Table: the first five users, ordered by last name]

And I got the first five users, and now I want to get the next five. Between the first and second query, a user whose last name is “Aardvark” was inserted in the system. At this point, what would you expect to get from the query? We have two choices here, as you can see below:

[Image: the two possible second-page result sets, shown in orange and in green]

The problem is that from my perspective, both of those have serious issues. To start with, to compute the results shown in orange, you’ll need to jump through some serious hoops on the backend, and the result looks strange. Getting the results in green is quite easy, but it means that you missed out on seeing Aardvark.

You might have noticed that the key issue here isn’t so much the way we build the query, but the order in which we need to traverse it. We asked to sort by last name, but we also want to get results as they come in. As it turns out, the process becomes a whole lot simpler if we unify these concepts. If we issue the following query, for example, the complexity drops by a lot.

from Messages order by CreatedAt desc limit 5

This is because we now get the following results:

[Table: the five most recent messages, ordered by CreatedAt descending]

Paging through this is now pretty easy. If we want to page down, we can issue the following query to get the next page of data:

from Messages where CreatedAt < 217 order by CreatedAt desc limit 5

By making sure that we are paging and filtering on the same property, we can easily scroll through the results without having to do too much work, either in the application or in our database backend. We can also check whether there is new stuff that we missed by querying for CreatedAt > 222, of course.
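In the RavenDB client, such paging might look something like the following sketch (the entity and property names are assumed, and the async LINQ extensions come from the client API):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Raven.Client.Documents;
using Raven.Client.Documents.Session;

public class Message
{
    public string Id { get; set; }
    public DateTime CreatedAt { get; set; }
    public string Text { get; set; }
}

public static class TimelinePaging
{
    // Keyset paging: filter and sort on the same property (CreatedAt).
    public static Task<List<Message>> GetNextPageAsync(IAsyncDocumentSession session, DateTime lastSeenCreatedAt)
    {
        return session.Query<Message>()
            .Where(m => m.CreatedAt < lastSeenCreatedAt)
            .OrderByDescending(m => m.CreatedAt)
            .Take(5)
            .ToListAsync();
    }

    // Has anything newer shown up that the user hasn't seen yet?
    public static Task<bool> HasNewerAsync(IAsyncDocumentSession session, DateTime newestSeenCreatedAt)
    {
        return session.Query<Message>()
            .Where(m => m.CreatedAt > newestSeenCreatedAt)
            .AnyAsync();
    }
}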

But there is one wrinkle here. I intentionally used the CreatedAt field but put numeric values there. Did you notice that there was no 220 value? That one was created on an isolated node and hasn’t arrived yet. When it shows up in the local database, we’ll need to decide whether to give it a new value (making sure it will show up in the timeline) or store it as is, meaning that it might get lost.

These types of questions are probably more relevant at the business level. You might want to apply different behaviors based on how many likes a twit has, for example.

Another option is to have an UpdatedAt field as well, which allows you to quickly answer the question: “What items in the range I scanned have changed?”. This also allows for a simpler model for handling updates, but much of that depends on the kind of behavior you want to get. It handles updates, including updates to the parts seen and unseen, in a reasonable way and at a predictable cost.

time to read 7 min | 1301 words

Last week I posted about some timeseries work that we have been doing with RavenDB. But I haven’t actually talked about the feature in this space before, so I thought that this would be a good time to present what we want to build.

The basic idea with timeseries is that this is a set of data points taken over time. We usually don’t care that much about an individual data point but care a lot about their aggregation. Common usages for time series include:

  • Heart beats per minute
  • CPU utilization
  • Central bank interest rate
  • Disk I/O rate
  • Height of ocean tide
  • Location tracking for a vehicle
  • USD / Bitcoin closing price

As you can see, the list of stuff that you might want to apply this to is quite diverse. In a world that keeps getting more and more IoT devices, timeseries storing sensor data are becoming increasingly common. When we set out to design and build timeseries support for RavenDB, we looked into quite a few timeseries databases to figure out what needs they serve.

RavenDB is a document database, and we envision timeseries support as something that you use at the document boundary. A good example of that would be the heartrate example. Each person has their own timeseries that records their heartrate over time. In RavenDB, you would model this as a document for each person, and a heartrate timeseries on each document.

Here is how you would add a data point to my Heartrate’s timeseries:

[Screenshot: client API code appending a value to the Heartrate timeseries]
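Since the original snippet was an image, here is a sketch of roughly what that call looks like; treat the exact method names as an approximation of the API described below:

using (var session = store.OpenSession())
{
    // Appending creates the "Heartrate" timeseries on first use; the timestamp is UTC,
    // the values are an array, and the tag records the source of the measurement.
    session.TimeSeriesFor("users/ayende", "Heartrate")
        .Append(DateTime.UtcNow, new double[] { 84 }, "watches/fitbit");

    session.SaveChanges();
}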

I intentionally start from the Client API, because it allows me to show off several things at once.

  1. Appending a value to a timeseries doesn’t require us to create it upfront. It will be created automatically on first use.
  2. We use UTC date times for consistency and the timestamps have millisecond precision.
  3. We are able to record a tag (the source for this measurement) on a particular timestamp.
  4. The timeseries will accept an array of values for a single timestamp.

Each one of those items is quite important to the design of RavenDB timeseries, so let’s address them in order.

The first thing to address is that we don’t need to create timeseries ahead of time. Doing so would introduce a level of schema to the database, which is something that we want to avoid. We want to allow the user complete freedom and a minimum of fuss when they are building features on top of timeseries. That does lead to some complications on our end. We need to be able to support timeseries merging, allowing you to append values on multiple machines and merge them together into a coherent whole.

Given the nature of timeseries, we don’t expect to see conflicting values. While you might see the same values come in multiple times, we assume that in that case you’ll likely just get the same values for the same timestamps (duplicate writes). In the case of different writes on different machines with different values for the same timestamp, we’ll arbitrarily select the largest of those values and proceed.

Another implication of this behavior is that we need to handle out of order updates. Typically with timeseries, you’ll record values in increasing date order, but we need to be able to accept values out of order. This turns out to be pretty useful in general, not just for handling values from multiple sources, but also because it is possible that you’ll need to load archived data into an already existing timeseries. The rule that guided us here was that we wanted to allow the user as much flexibility as possible, and we’ll handle any resulting complexity.

The second topic to deal with is the time zone and precision. Given the overall complexity of time zones in general, we decided that we don’t want to deal with any of that and will store the times in UTC only. That allows you to work properly with timestamps taken from different locations, for example. Given the expected usage scenarios for this feature, we also decided to support millisecond precision. We looked at supporting only second-level precision, but that was far too limiting. At the same time, supporting finer than millisecond resolution would result in much lower storage density in most situations and is very rarely useful.

Using DateTime.UtcNow, for example, we get a resolution of 0.5 – 15 ms, so trying to represent time at a finer resolution isn’t really going to give us anything. Other platforms have similar constraints, which added to the consideration of capturing time only to millisecond granularity.

The third item on the list may be the most surprising one. RavenDB allows you to tag individual timestamps in the timeseries with a value. This gives you the ability to record metadata about the value. For example, you may want to use this to record the type of instrument that supplied the value. In the code above, you can see that this is a value that I got from a FitBit watch. I’m going to assign it a lower confidence value than a value that I got from an actual medical device, even if both of those values are going to go on the same timeseries.

We expect that the number of unique tags for values in a given time period is going to be small, and we optimize accordingly. Because of the number of weasel words in the last sentence, I feel that I must clarify. A given time period is usually on the order of an hour to a few days, depending on the number of values and their frequency. And what matters isn’t so much the number of values with a tag, but the number of unique tags. We can very efficiently store tags that we have already seen, but having each value tagged with a different tag is not something that we designed the system for.

You can also see that the tag that we have provided looks like a document id. This is not accidental. We expect you to store a document id there and use the document itself to store details about the value. For example, whether the device that captured the value is medical grade or just a hobbyist gadget. You’ll be able to filter by the tag as well as by the related tag document’s properties. But I’ll show that when I post about queries, in a different post.

The final item on the list that I want to discuss in this post is the fact that a timestamp may contain multiple values. There are actually quite a few use cases for recording multiple values for a single timestamp:

  • Longitude and latitude GPS coordinates
  • Bitcoin value against USD, EUR, YEN
  • Systolic and diastolic reading for blood pressure

In each case, we have multiple values to store for a single measurement. You can make the case that the Bitcoin vs. currencies example could be stored as standalone timeseries, but GPS coordinates and blood pressure both produce two values that are not meaningful on their own. RavenDB handles this scenario by allowing you to store multiple values per timestamp, including support for each timestamp coming with a different number of values. Again, we are trying to make this feature as easy as possible to use.

The number of values per timestamp is going to be limited to 16 or 32; we haven’t made a final decision here. Regardless of the actual maximum size, we don’t expect to have more than a few of those values per timestamp in a single timeseries.

Then again, the point of this post is to get you to consider this feature in your own scenarios and provide feedback about the kind of usage you want to have for this feature. So please, let us know what you think.

time to read 1 min | 81 words

Yesterday I talked about the design of the security system of RavenDB. Today I re-read one of my favorite papers ever about the topic.

This World of Ours by James Mickens

This is both one of the most hilarious papers I have ever read (I had someone check up on me when I was reading it, because of suspicious noises coming from my office) and a great insight into threat modeling and the kind of operating environment your system will run in.

time to read 4 min | 671 words

A decade(!) ago I wrote that you should avoid soft deletes. Today I ran into a question on the mailing list and remembered writing about this; it turns out that there was quite the discussion on this at the time.

The context of the discussion at the time was deleting data from relational systems, but the same principles apply. The question I just fielded asked how you can translate a Delete() operation inside the RavenDB client to a soft delete (IsDeleted = true) operation. The RavenDB client API supports a few ways to interact with how we are talking to the underlying database, including some pretty interesting hooks deep into the pipeline.

What it doesn’t offer, though, is a way to turn a Delete() operation into an update (or an update into a delete). We do have facilities in place that allow you to detect (and abort) invalid operations. For example, invoices should never be deleted. You can tell the RavenDB client API that it should throw whenever an invoice is about to be deleted, but you have no way of saying that we should take the Delete(invoice) and turn that into a soft delete operation.

This is quite intentional and by design.

Having a way to transform basic operations (like delete –> update) is a good way to be pretty confused about what is actually going on in the system. It is better to allow the user to enforce the required behavior (invoices cannot be deleted) and let the calling code handle this differently.
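For example, enforcing “invoices are never deleted” can be done with the client’s events; a sketch (the OnBeforeDelete event exists on the document store, though treat the argument details here as from memory):

var store = new DocumentStore
{
    Urls = new[] { "http://localhost:8080" },
    Database = "Orders"
};

// Refuse to ever delete an invoice; cancelling it is a separate, explicit business operation.
store.OnBeforeDelete += (sender, args) =>
{
    if (args.Entity is Invoice)
        throw new InvalidOperationException(
            $"Invoices are never deleted ({args.DocumentId}), cancel them instead.");
};

store.Initialize();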

The natural response here, of course, is that this places a burden on the calling code. Surely we want to follow DRY and not write conditionals when the user clicks on the delete button. But this isn’t really a case of extra duplicated code:

  • An invoice is never deleted, it is cancelled. There are tax implications to that; you need to get it right.
  • A payment is never removed, it is refunded.

You absolutely want to block deletions of those types of documents, and you need to treat them (very) differently in code.

In the ensuing decade since the blog posts at the top of this post were written, there have been a number of changes. Some of them are architecturally minor, such as the database technology of choice or the guiding principles for maintainable software development. Some of them are pretty significant.

One such change is the GDPR.

“Huh?!” I can imagine you thinking. How does the GDPR apply to an architectural discussion of soft deletes vs. business operations? It turns out that it is very relevant. One of the things that the GDPR mandates (and there are similar laws elsewhere, such as the CCPA) is the right to be forgotten. So if you are using soft deletes, you might actually run into real problems down the line. “I asked to be deleted, they told me they did, but they secretly kept my data!” The one thing that I keep hearing about the GDPR is that no one ever found it humorous. Not with the kind of penalties that are attached to it.

So when thinking about deletes in your system, you need to consider quite a few factors:

  • Does it make sense, from a business perspective, to actually lose that data? Deleting a note from a customer’s record is probably just fine. Removing the customer’s record entirely? Probably not.
  • Do I need to keep this data? Invoices are one thing that pops to mind.
  • Do I need to forget this data? That is the flip side, and what you can forget, and how, can be really complex.

At any rate, for all but the simplest scenarios, just marking IsDeleted = true is likely not going to be sufficient. And all the other arguments that have been raised (which I’m not going to repeat; read the posts, they are good ones) are still in effect.

time to read 2 min | 394 words

RavenDB allows you to tune, per transaction, what level of safety you want your changes to have. At the most basic level, you can select between a single node transaction and a cluster wide transaction.

We get questions from customers about the usage scenario for each mode. It seems obvious that we really want to always have the highest level of safety for our data, so why not make sure that all the data is using cluster wide transactions at all times?

I like to answer this question with the lottery example. In the lottery example, there are two very distinct phases for the system. First, you record lottery tickets as they are purchased. Second, you run the actual lottery and select the winning numbers. (Yes, I know that this isn’t exactly how it works, bear with me for the sake of a clear example).

While I’m recording purchased lottery tickets, I always want to succeed in recording my writes. Even if there is a network failure of some kind, I never want to lose a write. It is fine if only one node accepts this write, since it will propagate the data to the rest of the cluster once communication is restored. In this case, you can use the single node transaction mode, and rely on RavenDB replication to distribute the data to the rest of the cluster. This is also the most scalable approach, since we can operate on each node separately.

However, for selecting the winning numbers (and tickets), you never want to have any doubt or the possibility of concurrency issues. In this case, I want to ensure that there is just one lottery winner selection transaction that actually commits, and for that purpose, I’m going to use the cluster transaction mode. In this way, we ensure that a quorum of the cluster must confirm the transaction for it to go through. This is the right thing to do for high value, low frequency operations.
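In code, the two modes of the lottery example might look like this (the entities are mine; the session option is the relevant part):

// Recording a purchased ticket: single node transaction (the default),
// letting replication spread it to the rest of the cluster.
using (var session = store.OpenAsyncSession())
{
    await session.StoreAsync(new LotteryTicket { Numbers = new[] { 7, 14, 21, 28, 35, 42 } });
    await session.SaveChangesAsync();
}

// Selecting the winner: cluster wide transaction, which needs a quorum of the cluster to commit.
using (var session = store.OpenAsyncSession(new SessionOptions
{
    TransactionMode = TransactionMode.ClusterWide
}))
{
    await session.StoreAsync(new LotteryWinner { TicketId = "tickets/1-A" });
    await session.SaveChangesAsync();
}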

We also have additional settings available, beyond single node / full cluster quorum, such as writing to a single node and waiting for the transaction to propagate to some additional nodes. I don’t really have a good analogy for this use case using the lottery example, though. Can you think of one?

time to read 4 min | 735 words

About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB uses X509 certificates for authentication. These are highly secure and are a good answer for our clients who need to host sensitive information or are working in highly regulated environments. However, certificates have a problem: they expire. In particular, if you are following common industry best practices, you’ll replace your certificates every 2 – 3 months. In fact, the common setup of using RavenDB with Let’s Encrypt will do just that. Certificates will be replaced on the fly by RavenDB without the need for administrator involvement.

If you are running inside a single cluster, that isn’t something you need to worry about. RavenDB will coordinate the certificate update between the nodes in such a way that it won’t cause any disruption in service. However, it is pretty common in RavenDB to have multi cluster topologies. Either because you are deployed in a geo-distributed manner or because you are running using complex topologies (edge processing, multiple cooperating clusters, etc). That means that when cluster A replaces its certificate, we need to have a good story for cluster B still allowing it access, even though the certificate has changed.

I outlined our thinking in the previous post, and I got a really good hint: 13xforever suggested that we look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, ran into issues and was replaced (mostly by certificate transparency). With this hint, I started to investigate further. Here is what I learned:

  • A certificate is composed of some metadata, the public key and the signature of the issuer (skipping a lot of stuff here, obviously).
  • Keys for certificates can be either RSA or ECDSA. In both cases, there is a 1:1 relationship between the public and private keys (in other words, each public key has exactly one private key).

Given these facts, we can rely on them to avoid the issues with certificate expiration, distributing new certificates, etc.

Whenever a cluster needs a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client has the private key during the handshake), even if the certificate itself changed, we can verify that the other side knows the actual secret, the private key.

In other words, we slightly changed the trust model in RavenDB. From trusting a particular certificate, we trust that certificate’s private key. That is what grants access to RavenDB. In this way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.
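The core of the check is comparing public keys rather than certificate thumbprints. A simplified sketch (RavenDB’s real validation also compares the signing chain, as described below):

using System.Linq;
using System.Security.Cryptography.X509Certificates;

static bool HasSamePublicKey(X509Certificate2 trusted, X509Certificate2 presented)
{
    // The same key pair means the presenter holds the same private key,
    // even if the certificate itself (validity dates, serial, signature) has changed.
    return trusted.GetPublicKey().SequenceEqual(presented.GetPublicKey());
}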

This feature means that you can drastically reduce the amount of work that an admin has to do, and it leads you to a system that you set up once and that just keeps working.

There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just having the user re-generate a new certificate with the same private key isn’t really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate is the same. Actually, to be more exact, it verifies that the original (trusted by the admin) certificate and the new certificate that was just presented to us are signed by the same chain of public key hashes.

In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we’ll reject that. The model that we have in mind here is trusting a driver’s license. If you have an updated driver’s license from the same source, that is considered just as valid as the original one on file. If the driver license is from Toys R Us, not so much.

Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.

As usual, we’ll welcome your feedback, the previous version of this post got us a great feature, after all.

time to read 4 min | 655 words

This post really annoyed me. Feel free to go ahead and go through it, I’ll wait. The gist of the post, titled “WAL usage looks broken in modern Time Series Databases?”, is that time series DBs that use a Write Ahead Log are broken, and that their system, which isn’t using a WAL (but uses Log-Structured Merge, LSM) is also broken, but no more than the rest of the pack.

This post annoyed me greatly. I’m building databases for a living, and for over a decade or so, I have been focused primarily on building a distributed, transactional (ACID) database. A key part of that is actually knowing what is going on in the hardware beneath my software and how to best utilize it. This post was annoying because it makes quite a few really bad assumptions and then builds upon them. I particularly disliked the outright dismissal of direct I/O, mostly because they seem to be doing that based on very partial information.

I’m not familiar with Prometheus, but doing fsync() every two hours basically means that it isn’t on the same plane of existence as far as ACID and transactions are concerned. Cassandra is usually deployed in cases where you either don’t care about some data loss or, if you do, you use multiple replicas and rely on that. So I’m not going to touch that one either.

InfluxDB is doing the proper thing and calling fsync after each write. Because fsync is slow, they very reasonably recommend batching writes. I consider this to be something that the database should do, but I do see where they are coming from.

Postgres, on the other hand, I’m quite familiar with, and the description in the post is inaccurate. You can configure Postgres to behave in this manner, but you shouldn’t, if you care about your data. Usually, when using Postgres, you won’t get a confirmation on your writes until the data has been safely stored on disk (after some variant of fsync was called).

What really got me annoyed was the repeated insistence on “data loss or corruption”, which shows a remarkable lack of understanding of how a WAL actually works. Because of the very nature of a WAL, the people who build them all have to consider the case of a partial WAL write, and there are mechanisms in place to handle it (usually by considering that particular transaction invalid and rolling it back).

The solution proposed in the post is to use an SSTable (sorted strings table), which is usually a component in LSM systems. Basically, buffer the data in memory (they use 1 second intervals to write it to disk) and then write it in one go. I’ll note that they make no mention of actually writing to disk safely. So no direct I/O or calls to fsync. In other words, a system crash may leave you a lot worse off than merely 1 second of lost data. In fact, it is possible that you’ll have some data there, and some not, and not necessarily in the order of arrival.

A proper database engine will:

  • Merge multiple concurrent writes into a single disk operation (see the sketch after this list). In this way, we can handle > 100,000 separate writes per second (document writes, so significantly larger than the typical time series data points) on commodity hardware.
  • Ensure that if any write was confirmed, it actually hit durable storage and can never go away.
  • Properly handle partial writes or corrupted files, in such a way that none of the system’s invariants are violated.
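To make the first point concrete, here is a toy sketch of merging concurrent writes into a single durable disk operation (group commit); a real engine is far more involved, but the shape is the same:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class GroupCommitJournal
{
    private readonly BlockingCollection<(byte[] Data, TaskCompletionSource<bool> Done)> _pending =
        new BlockingCollection<(byte[] Data, TaskCompletionSource<bool> Done)>();
    private readonly FileStream _journal;

    public GroupCommitJournal(string path)
    {
        _journal = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read, 4096,
            FileOptions.WriteThrough);
    }

    // Callers get a task that completes only once their write is durably on disk.
    public Task WriteAsync(byte[] data)
    {
        var tcs = new TaskCompletionSource<bool>();
        _pending.Add((data, tcs));
        return tcs.Task;
    }

    // A single background loop: drain whatever accumulated, write it, fsync once, then confirm.
    public void CommitLoop()
    {
        while (true)
        {
            var batch = new List<(byte[] Data, TaskCompletionSource<bool> Done)> { _pending.Take() };
            while (_pending.TryTake(out var more))
                batch.Add(more);

            foreach (var (data, _) in batch)
                _journal.Write(data, 0, data.Length);

            _journal.Flush(flushToDisk: true); // one fsync covers the whole batch

            foreach (var (_, done) in batch)
                done.SetResult(true);
        }
    }
}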

I’m leaving aside major issues with LSM and SSTables, such as write amplification and the inability to handle sustained high loads (because there is never a break in which you can do bookkeeping). Just the portions on WAL usage (which show broken and inefficient use) being used to justify another broken implementation are quite enough for me.
