Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,162 | Comments: 50,148

Privacy Policy Terms
filter by tags archive
time to read 2 min | 325 words

Yesterday I asked about dealing with livelihood detection of nodes running in AWS. The key aspect is that this need to be simple to build and easy to explain.

Here are a couple of ways that I came up with, nothing ground breaking, but they do the work while letting someone else do all the heavy lifting.

Have a well known S3 bucket that each of the nodes will write an entry to. The idea is that we’ll have something like (filename –  value):

  • i-04e8d25534f59e930 – 2021-06-11T22:01:02
  • i-05714ffce6c1f64ad – 2021-06-11T22:00:49

The idea is that each node will scan the bucket and read through each of the files, getting the last seen time for all the nodes. We’ll consider all the nodes whose timestamp is within the last 1 minute to be alive and any other node is dead.  Of course, we’ll also need to update the node’s file on S3 every 30 seconds to ensure that other nodes know that we are alive.

The advantage here is that this is trivial to explain and implement and it can work quite well in practice.

The other option is to actually piggy back on top of the infrastructure that is dedicated for this sort of scenario. Create an elastic load balancer and setup a target group. On startup, the node will register itself to the target group and setup the health check endpoint. From this point on, each node can ask the target group to find all the healthy nodes.

This is pretty simple as well, although it requires significantly more setup. The advantage here is that we can detect more failure modes (a node that is up, but firewalled away, for example).

Other options, such as having the nodes ping each other, are actually quite complex since they need to find each other. That lead to some level of service locator, but then you’ll have to avoid each node pining all the other nodes, since that can get busy on the network.

time to read 2 min | 286 words

I’m teaching a course at university about cloud computing. That can be a lot of fun, but quite frustrating at time. The key issue for me is that I occasionally need to provide students with some way to do something that I know how to do properly, but I can’t.

Case in point, assuming that I have a distributed cluster of nodes, and we need to detect what nodes are up or down, how do you do that?

With RavenDB, we assign an observer to the cluster whose job is to do health monitoring. I can explain that to the students, but I can’t expect them to utilize this technique in their exercises, there is too much detail there. The focus of the lesson or exercise is not to build a distributed system but to make use of one, after all.

As a rule, I try to ensure that all projects that we are working on can be done in under 200 lines of Python code. That puts a hard limit to the amount of behavior I can express. Because of that, I find myself looking for ways to rely on existing infrastructure to deal with the situation. 

Each node is running the same code, and they are setup so they can talk to one another, if needed. It is important that all the live nodes will converge to agree on the active nodes in relatively short order.

The task is to find the list of active nodes in a cluster, where nodes may go up or down dynamically. We are running in AWS cloud so you can use its resources, how would you do that?

The situation should be as simple as possible and easy to explain to students.

time to read 5 min | 885 words

In the database field and information retrieval in general, there is a very common scenario. I have a list of (sorted) integers that I want to store, and I want to do that in an as efficient a manner as possible. There are dozens of methods to do this and this is a hot topic for research. This is so useful because there are so many places where you can operate on a sorted integer list and gain massive benefits. Unlike generic compression routines, we can usually take advantage of the fact that we understand the data we are trying to work with and get better results.

The reason I need to compress integers (actually, int64 values) is that I’m trying to keep track of matches for some data, so the integers that I’m tracking are actually file offsets for user’s data inside of Voron. That lead to a few different scenarios that I have to deal with:

  • There is a single result
  • There is a reasonable number of results
  • There is a boatload of results

I’m trying to figure out what is the best way to store the later two options in as efficient manner as possible.

The first stop was Daniel Lemire’s blog, naturally, since he has wrote about this extensively. I looked at the following schemes: FastPFor and StreamVByte. They have somewhat different purposes, but basically, FastPFor is using a bits stream while StreamVByte is using byte oriented mode. Theoretically speaking, you can get better compression rate from FastPFor, but StreamVByte is supposed to be faster. Another integer compression system come from the Gorilla paper from Facebook, that is a bigger scheme, which include time series values compression. However, part of that scheme talks about how you can compress integers (they use that to store the ticks of a particular operation). We are actually using that for the time series support inside of RavenDB.

I’m not going to cover that in depth, here is the paper on Gorilla compression, the relevant description is at section 4.1.1. Suffice to say that they are using a bit stream and delta of deltas computation. Basically, if you keep getting values that are the same distance apart, you don’t need to record all the value, you can compute that naturally. In the best case scenario, Gorilla compression needs a single bit per value, assuming the results are spaced similarly.

For my purpose, I want to get as high a compression rate as possible, and I need to store just the list of integers. The problem with Gorilla compression is that if we aren’t getting numbers that are the same distance apart, we need to record the amount that they are different. That means that at a minimum, we’ll need a minimum of 9 bits per value. That adds up quickly, sadly.

On the other hand, with PFor, there is a different system. PFor computes the maximum number of bits required for a batch of integer, and then record just those values. I found the Binary Packing section (2.6) in this paper to be the clearest explanation on how that works exactly.  The problem with PFor, however, is that if you have a single large outlier, you’ll waste a lot of bits unnecessarily.

I decided to see if I can do something about that and created an encoder that works on batches of 128 integers at a time. This encoder will:

  • Check the maximum number of bits required to record the deltas of the integers. That along already saves us a lot.
  • Then we check the top and bottom halves of the batch, to see if we’ll get a benefit from recording them separately. A single large value (or a group of them) that is localized to a part of the batch will be recorded independently in this case.
  • Finally, instead of only recording the meaningful bit ranges, we’ll also analyze the batch we get further. The idea is to try to find ranges within the batch that have the same distance from one another. We can encode those as repetitions instead of each independent value. That can end up saving a surprisingly amount of space.

You can look at the results of my research here. I’ll caution you that this is raw, but the results are promising. I’m able to beat (in terms of compression rate) the standard PFor implementation by a bit, with a lot less code.

I’m also looking at a compression rate of 30% – 40% for realistic data sets. And if the data is setup just right and I’m able to take advantage of the repeated delta optimization, I can pack things real tight.

Currently numbers say that I can push upward of 10,000 int64 values in an 8KB buffer without any repeated deltas. It goes to just under 500,000 int64 values in an 8KB buffer if I can take full advantage of the deltas.

The reason I mention the delta so often, it is very likely that I’ll record values that are roughly the same size, so we’ll get offsets that are the same space from one another. In that case, my encoder goes to town and the compression rate is basically crazy.

This is a small piece of a much larger work, but this is the first time in a while that I got to code at Voron’s level. This is fun.

time to read 3 min | 438 words

RavenDB has been using the Raft protocol for the past years. In fact, we have written three or four different implementations of Raft along the way. I implemented Raft using pure message passing, on top of async RPC and on top of TCP. I did that using actor model and using direct parallel programming as well as the usual spaghettis mode.

The Raft paper is beautiful in how it explain a non trivial problem in a way that is easy to grok, but it is also something that can require dealing with a number of subtleties. I want to discuss some of the ways to successfully implement it. Note that I’m assuming that you are familiar with Raft, so I won’t explain anything here.

A key problem with Raft implementations is that you have multiple concurrent things happening all at once, on different machines. And you always have the election timer waiting in the background. In order to deal with that, I divide the system into independent threads that each has their own task.

I’m going to talk specifically about the leader mode, which is the most complex aspect, usually. In this mode, we have:

  • Leader thread – responsible for determining the current progress in the cluster.
  • Follower thread – once per follower – responsible for communicating with a particular follower.

In addition, we may have values being appended to our log concurrently to all of the above. The key here is that the followers threads will communicate with their follower and push data to it. The overall structure for a follower thread looks like this:

What is the idea? We have a dedicated thread that will communicate with the follower. It will either ping the follower with an empty AppendEntries (every 1/3 of the election timeout) or it will send a batch of up to 50 entries to update the follower. Note that there is nothing in this code about the machinery of Raft, that isn’t the responsibility of the follower thread. The leader, on the other hand, listen to the notifications from the followers threads, like so:

The idea is that each aspect of the system is running independently, and the only communication that they have with each other is the fact that they can signal the other that they did some work. We then can compute whatever that work changed the state of the system.

Note that the code here is merely drafts, missing many details. For example, we aren’t sending the last commit index on AppendEntries, and committing the log is an asynchronous operation, since it can take a long time and we need to keep the system in operation.

time to read 2 min | 316 words

I run into this question on Hacker News, asking for the best computer science papers. There are a few that I keep getting back to, either because they are so fundamental or they are so useful.

Without any particular order

  • The Raft Paper – a distributed consensus algorithm that made sense to me on first read. There are a lot of subtle issues to consider, but when reading the paper, everything clicked. That is head and shoulders above what Paxos literature is about.
  • The Ubiquitous BTree – talk about a paper that I used daily. Admittedly, I didn’t get started on BTrees from this paper, but this is a very well written one and it does a great job presenting the topic. It is also from 1979, and BTree were already “ubiquitous” at that time, which tells us something.
  • Extendible Hashing – this is also from 1979, and it is well written. I implemented extendible hashing based on this article directly and I grokked it right away.
  • How Complex Systems Fail – not strictly a computer science paper. In fact, I’m fairly certain that this fits more into civil engineering, but it does an amazing job of explaining the internals of complex systems and the why and how they fail. I took a lot from this paper. It is also very short and highly readable.
  • OLTP Through the Looking Glass – discuss the internal structure of database engines and the cost and complexities of their various pieces.
  • You’re doing it wrong – discuss the implementation of Varnish proxy from the point of view of a kernel hacker. Totally different approach to the design of the system. Had a lot of influence on how I build systems.

I’m fairly certain that my criteria won’t be yours, but those are all papers that I have read multiple times and have utilized their insights in my daily work.

time to read 4 min | 663 words

This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.

GP3 disks are about 20% cheaper than their GP2 equivalent. What is more, they come with a guarantee level of performance even before you purchase additional IOPS. Consider the following two disks:

  Size IOPS MB/S Price
GP2 512GB 1,536 250 51.2 USD
GP3 512GB 3,000 125 40.9 USD
GP3 512GB 4,075 250 51.2 USD

In other words, for the same disk, we can get a much better baseline performance at a cheaper price. What isn’t there not to like?

The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower, even if they are able to run more IOPS, their latency is higher.

A really key observation from us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breath a huge sigh of relief.

RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to account for the I/O being the slowest thing around and have already mostly coded around that. An additional millisecond in disk latency doesn’t matter that much in the grand scheme of things.

However… the fact that this doesn’t provide I/O burst is a huge plus for us. RavenDB can easily deal with slow I/O, what it find it very hard to deal with is an environment that very rapidly change its operational characteristics.

Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.

Until all the I/O credits are gone and we are forced into a screeching halt. That means, for example, that we read from the network at a rate of 250MB/sec, but we are unable to write to the disk at this level. There is a negative balance of 125MB/sec that needs to be stored some where. We can buffer that in memory, of course, but that only work for so long. That means that we have to put a huge break all of a sudden, which the rest of the eco system isn’t happy with. For example, the other side that is sending us data at 250MB /sec, they are likely not going to be able to respond in time to the shift is our behavior. It is very likely that the network connection would congest and break in this case.

All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.

As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost to write to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a true traffic jam

The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.

time to read 3 min | 548 words

Voron is RavenDB’s storage engine. It is how we store data, keep transactions and in generally get a lot of our abilities. In this post, I want to talk about the way RavenDB manages disk space internally.  Voron uses a single data file to do its work, the data file is divided into 8KB pages, like so:

Voron uses eager disk allocations to reserve disk space from the operating system. Each time the space inside the file runs out, Voron will double the size of the file. That last until the file size reaches 2GB, after which RavenDB will grow by 1GB at a time. This behavior ensures that Voron gives the underlying file system enough information to provide the database with a continuous range of disk space. In other words, we grab disk space in large chunks to avoid fragmentation of the data file.  

What happens when you delete data, however? Voron mark the free space in its free list and will use that space before it will allocate more disk space from the operating system.

Why aren’t we releasing the disk space back to the operating system? The simplest reason is that it isn’t an just the data at the end of the file that is freed. In fact, like in the image above, free and busy segments are interwoven in the file. We can’t just truncate the file.

Internal references inside of RavenDB make use of the position data inside the file, so just moving the data won’t help. Instead, you have to compact the data. That forces us to re-write the entire database layout from scratch and fixes those references.  That is an offline operation, however.

For the most part, it doesn’t actually matter. RavenDB will use the internal free space as needed, so it isn’t like it is actually lost.

One feature that we are considering for version 6.0 of RavenDB is hole punching. That means that we’ll make use of advanced file system API to free the disk space allocated to RavenDB even mid file.

On Linux, that means using FALLOC_FL_PUNCH_HOLE. On Windows, that means using FSCTL_SET_ZERO_DATA.

That will have the advantage of freeing disk space back to the operating system without needing user intervention. In particular, that is going to make it so a user that delete data to free disk space see the free space reflected in the OS metrics.

There are problems with this approach, however. First, the size of the file remains the same, which leads to interesting questions. Consider:

image

Second, this defeats the purpose of wanting to optimize disk allocations. If we free disk space in this manner, when we get it back, it may no longer be continuous on the disk. That said, it is not that big a problem in the days of SSDs and NVMes as it was at the time of the rotational hard disk.

Then, you may get into a very bad situation in which Voron tries to use disk space that it had allocated mid file (but was already freed) but it can’t, because the disk is full. Right now, this is simply an impossible error, with hole punching, we need to consider how to deal with this.

time to read 2 min | 370 words

RavenDB is written in C# and .NET, unlike most of the database engines out there. The other databases are mostly written in C, C++ and sometimes Java.

I credit the fact that I wrote RavenDB in C# as a major part of the reason I was able to drive it forward to the point it is today. That wasn’t easy and there are a number of issues that we had to struggle with as a result of that decision. And, of course, all the other databases at the playground look at RavenDB strangely for being written in C#.

In RavenDB 4.0, we have made a lot of architectural changes. One of them was to replace some of the core functionality of RavenDB with a C library to handle core operations. Here is what this looks like:

However, there is still a lot to be done in this regard and that is just a small core.

Due to the COVID restrictions, I found myself with some time on my hands and decided that I can spare a few weekends to re-write RavenDB from scratch in C. I considered using Rust, but that seemed like to be over the top.

The results of that can be seen here. I kept meticulous records of the process of building this, which I may end up publishing at some point. Here is an example of how the code looks like:

The end result is that I was able to take all the knowledge of building and running RavenDB for so long and create a compatible system in not that much code. When reading the code, you’ll note methods like defer() and ensure(). I’m using compiler extensions and some macro magic to get a much nicer language support for RAII. That is pretty awesome to do in C, even if I say so myself and has dramatically reduced the cognitive load of writing with manual memory management.

An, of course, following my naming convention, Gavran is Raven in Croatian.

I’ll probably take some time to finish the actual integration, but I have very high hopes for the future of Gavran and its capabilities. I’m currently running benchmarks, you can expect them by May 35th.

time to read 7 min | 1398 words

A couple of days, a fire started in OVH’s datacenter. You can read more about this here:

They use slightly different terminology, but translating that to the AWS terminology, an entire "region” is down, with SGB1-4 being “availability zones” in the region.

For reference, there are some videos from the location that make me very sad. This is what it looked like under active fire:

https://i.imgur.com/Sbt0IoR.jpg

I’m going to assume that this is a total loss of everything that was in there.

RavenDB Cloud isn’t offering any services in any OVH data centers, but this is a good time to go over the Disaster Recovery Plan for RavenDB and RavenDB Cloud. It is worth noting that the entire data center has been hit, with the equivalent to an entire AWS region going down.

I’m not sure that this is a fair comparison, it doesn’t looks like that SBG 1-4 are exactly the same thing as AWS availability, but it is close enough to draw parallels.

So far, at least, there have been no cases where Amazon has lost an entire region. There were occasions were a whole availability zone was down, but never a complete region. The way Amazon is handling Availability Zones seems to most paranoid, with each availability zone distanced “many kilometers” from each other in the same region. Contrast that with the four SGB that all went down. For Azure, on the other hand, they explicitly call out the fact that availability zones may not provide sufficient cover for DR scenarios. Google Cloud Platform also provides no information on the matter. For that matter, we also have direct criticism on the topic from AWS.

Yesterday, on the other hand, Oracle Cloud had a DNS configuration error that took effectively the entire cloud down.  The good news is that this is just inability to access the cloud, not actual loss of a region, as was the case on OVH. However, when doing Disaster Recovery Planning, having the the entire cloud drop off the face of the earth is also something that you have to consider.

With that background out of the way, let’s consider the impact of losing two availability zones inside AWS, losing a entire region in Azure or GCP or even losing an entire cloud platform. What would be the impact on a RavenDB cluster running in that scenario?

RavenDB is designed to be resilient. Using RavenDB Cloud, we make sure that each of the nodes in the cluster is running on a separate availability zone. If we lose two zones in a region, there is still a single RavenDB instance that can operate. Note that in this case, we won’t have a quorum. That means that certain operations won’t be available (you won’t be able to create new databases, for example) but read and write operations will work normally and your application will fail over silently to the remaining server. When the remaining servers recover, RavenDB will update them with all the missing data that was modified while they were down.

The situation with OVH is actually worse than that. In this case, a datacenter is gone. In other words, these nodes aren’t coming back. RavenDB will allow you to perform emergency operations to remove the missing nodes and rebuilt the cluster from the single surviving node.

What about the scenario where the entire region is gone? In this case, if there are no more servers for RavenDB to run on, it is going to be down. That is the time to enact the Disaster Recovery Plan. In this case, it means deploying a new RavenDB Cluster to a new region and restoring from backup.

RavenDB Cloud ensures full daily backups as well as hourly incremental backups for all databases, so the amount of data loss will be minimal. That said, where are the backups located?

By default, RavenDB stores the backups in S3, in the same region as the cluster itself. Amazon S3 has the best durability guarantees in the market. This is beyond just the number of nines that they provide in terms of data durability. A standard S3 object is going to be residing in three separate availability zones. As mentioned, for AWS, we have guarantees about distance between those availability zones that we haven’t seen from other cloud providers. For that reason, when your cluster reside in AWS, we’ll store the backups on S3 in the same region. For Azure and GCP, on the other hand, we’ll also use AWS S3 for backup storage. For a whole host of reasons, we select a nearby region. So a cluster on Azure US East would store its backups on AWS S3 on US-East-1, for example. And a cluster on Azure in the Netherlands will store its backups on AWS S3 on the Frankfurt region. In addition to all other safeguards, the backups are encrypted, naturally.

The cloud has been around for 15 years already (amazing, I know) and so far, AWS has had a great history with not suffering catastrophic failures like the one that OVH has run into. Then again, until last week, you could say the same about OVH, but with 20+ years of operations. Part of the Disaster Recovery Process is knowing what risks are you willing to accept. And the purpose of this post is to expand on what we do and how we plan to react to such scenarios, be they ever so unlikely.

RavenDB actually has a number of features that are applicable for handling these sorts of scenarios. They aren’t enabled in the cloud by default, but they are important to discuss for users who need to have business continuity.

  • Multi-region or Multi-cloud clusters are available. You can setup RavenDB across multiple disparate location in several manners, but the end result is that you can ensure that you have no single point of failure, while still using RavenDB to its fullest potential. This is commonly used in large applications that are naturally geo distributed, but it also serve as a way to not put all your eggs in a single basket.
  • In addition to the default backup strategy (same AWS region on AWS or nearby AWS region for Azure or GCP), you can setup backups to additional regions.

One of the key aspects of business continuity is the issue of the speed in which you can go back to normal operations. If you are running a large database, just the time to restore from backup can be a significant amount of time. If you have a database (or databases) whose backup are in the hundreds of GB range, just the time it takes to get the backups can be measures in many hours, leaving aside the restore costs.

For that reason, RavenDB also support the notion of an offsite observer. That can be a single isolated node or a whole cluster. We can take advantage of the fact that the observer is not in active use and under provision it, in that case, when we need to switch over, we can quickly allocate additional resources to it to ramp it up to production grade. For example, let’s assume that we have a cluster running in Azure Northern Europe region, composed of 3 nodes with 8 cores each. We also have setup an observer cluster in Azure Norway East region. Instead of having to allocate a cluster of 3 nodes with 8 cores each, we can allocate a much smaller size, say 2 cores only (paying less than a third of the original cluster cost as a premium). In the case of disaster, we can respond quickly and within a matter of minutes, the Norway East cluster will be expanded so each of the nodes will have 8 cores and can handle full production traffic.

Naturally, RavenDB is likely to be only a part of your strategy. There is a lot more to ensuring that your users won’t notice that something is happening while your datacenter is on fire, but at least as it relates to your database, you know that you are in good hands.

time to read 2 min | 363 words

A few days ago I posted about looking at GitHub projects for junior developer candidates. One of the things that is very common in such scenario is to see them use string concatenation for queries, I hate that. I just reached to a random candidate GitHub profile right now and found this gem:

The number of issues that I have with this code is legion.

  • Not closing the connection or disposing the command.
  • The if can be dropped entirely.
  • And, of course, the actual SQL INJECTION vulnerability in the code.

There is a reason that I have such a reaction for this type of code, even when looking at junior developer candidates. For them, this is acceptable, I guess. They are learning and focusing mostly on what is going on, not the myriad of taxes that you have to pay in order to get something to production. This is never meant to be production code (I hope, at least). I’m not judging this on that level. But I have to very consciously remind myself of this fact whenever I run into code like this (and as I said, this is all too common).

The reason I have such a visceral reaction to this type of code is that I see it in production systems all too often. And that leads to nasty stuff like this:

And this code led to a 70GB data leak on Gab. The killer for me that this code was written by someone with 23 years of experience.

I actually had to triple check what I was seeing when I read the code the first time, because I wasn’t sure that this is actually possible. I thought maybe this is some fancy processing done to avoid SQL injection, not that this is basically string interpolation.

Some bugs are things that you can excuse. A memory leak or a double free are things that will happen to anyone who is writing in C, regardless of experience and how careful they write. They are often subtle and easy to miss, happening in corner cases of error handling.

This sort of bug is a big box of red flags. It is also on fire.

FUTURE POSTS

  1. The cost of the cloud - about one day from now
  2. Installing RavenDB on a Ubuntu machine - 3 days from now

There are posts all the way to Jun 22, 2021

RECENT SERIES

  1. Challenge (58):
    16 Jun 2021 - Detecting livelihood in a distributed cluster
  2. Webinar (4):
    11 Jun 2021 - Machine Learning and Time Series in RavenDB with Live Examples
  3. Webinar recording (13):
    24 May 2021 - The Rewards of Escaping the Relational Mindset
  4. Building a phone book (3):
    02 Apr 2021 - Part III
  5. Building a social media platform without going bankrupt (10):
    05 Feb 2021 - Part X–Optimizing for whales
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats