RavenDB has been using the Raft protocol for the past years. In fact, we have written three or four different implementations of Raft along the way. I implemented Raft using pure message passing, on top of async RPC and on top of TCP. I did that using actor model and using direct parallel programming as well as the usual spaghettis mode.
The Raft paper is beautiful in how it explain a non trivial problem in a way that is easy to grok, but it is also something that can require dealing with a number of subtleties. I want to discuss some of the ways to successfully implement it. Note that I’m assuming that you are familiar with Raft, so I won’t explain anything here.
A key problem with Raft implementations is that you have multiple concurrent things happening all at once, on different machines. And you always have the election timer waiting in the background. In order to deal with that, I divide the system into independent threads that each has their own task.
I’m going to talk specifically about the leader mode, which is the most complex aspect, usually. In this mode, we have:
- Leader thread – responsible for determining the current progress in the cluster.
- Follower thread – once per follower – responsible for communicating with a particular follower.
In addition, we may have values being appended to our log concurrently to all of the above. The key here is that the followers threads will communicate with their follower and push data to it. The overall structure for a follower thread looks like this:
What is the idea? We have a dedicated thread that will communicate with the follower. It will either ping the follower with an empty AppendEntries (every 1/3 of the election timeout) or it will send a batch of up to 50 entries to update the follower. Note that there is nothing in this code about the machinery of Raft, that isn’t the responsibility of the follower thread. The leader, on the other hand, listen to the notifications from the followers threads, like so:
The idea is that each aspect of the system is running independently, and the only communication that they have with each other is the fact that they can signal the other that they did some work. We then can compute whatever that work changed the state of the system.
Note that the code here is merely drafts, missing many details. For example, we aren’t sending the last commit index on AppendEntries, and committing the log is an asynchronous operation, since it can take a long time and we need to keep the system in operation.
I run into this question on Hacker News, asking for the best computer science papers. There are a few that I keep getting back to, either because they are so fundamental or they are so useful.
Without any particular order
- The Raft Paper – a distributed consensus algorithm that made sense to me on first read. There are a lot of subtle issues to consider, but when reading the paper, everything clicked. That is head and shoulders above what Paxos literature is about.
- The Ubiquitous BTree – talk about a paper that I used daily. Admittedly, I didn’t get started on BTrees from this paper, but this is a very well written one and it does a great job presenting the topic. It is also from 1979, and BTree were already “ubiquitous” at that time, which tells us something.
- Extendible Hashing – this is also from 1979, and it is well written. I implemented extendible hashing based on this article directly and I grokked it right away.
- How Complex Systems Fail – not strictly a computer science paper. In fact, I’m fairly certain that this fits more into civil engineering, but it does an amazing job of explaining the internals of complex systems and the why and how they fail. I took a lot from this paper. It is also very short and highly readable.
- OLTP Through the Looking Glass – discuss the internal structure of database engines and the cost and complexities of their various pieces.
- You’re doing it wrong – discuss the implementation of Varnish proxy from the point of view of a kernel hacker. Totally different approach to the design of the system. Had a lot of influence on how I build systems.
I’m fairly certain that my criteria won’t be yours, but those are all papers that I have read multiple times and have utilized their insights in my daily work.
This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.
GP3 disks are about 20% cheaper than their GP2 equivalent. What is more, they come with a guarantee level of performance even before you purchase additional IOPS. Consider the following two disks:
In other words, for the same disk, we can get a much better baseline performance at a cheaper price. What isn’t there not to like?
The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower, even if they are able to run more IOPS, their latency is higher.
A really key observation from us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breath a huge sigh of relief.
RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to account for the I/O being the slowest thing around and have already mostly coded around that. An additional millisecond in disk latency doesn’t matter that much in the grand scheme of things.
However… the fact that this doesn’t provide I/O burst is a huge plus for us. RavenDB can easily deal with slow I/O, what it find it very hard to deal with is an environment that very rapidly change its operational characteristics.
Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.
Until all the I/O credits are gone and we are forced into a screeching halt. That means, for example, that we read from the network at a rate of 250MB/sec, but we are unable to write to the disk at this level. There is a negative balance of 125MB/sec that needs to be stored some where. We can buffer that in memory, of course, but that only work for so long. That means that we have to put a huge break all of a sudden, which the rest of the eco system isn’t happy with. For example, the other side that is sending us data at 250MB /sec, they are likely not going to be able to respond in time to the shift is our behavior. It is very likely that the network connection would congest and break in this case.
All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.
As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost to write to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a true traffic jam
The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.
Voron is RavenDB’s storage engine. It is how we store data, keep transactions and in generally get a lot of our abilities. In this post, I want to talk about the way RavenDB manages disk space internally. Voron uses a single data file to do its work, the data file is divided into 8KB pages, like so:
Voron uses eager disk allocations to reserve disk space from the operating system. Each time the space inside the file runs out, Voron will double the size of the file. That last until the file size reaches 2GB, after which RavenDB will grow by 1GB at a time. This behavior ensures that Voron gives the underlying file system enough information to provide the database with a continuous range of disk space. In other words, we grab disk space in large chunks to avoid fragmentation of the data file.
What happens when you delete data, however? Voron mark the free space in its free list and will use that space before it will allocate more disk space from the operating system.
Why aren’t we releasing the disk space back to the operating system? The simplest reason is that it isn’t an just the data at the end of the file that is freed. In fact, like in the image above, free and busy segments are interwoven in the file. We can’t just truncate the file.
Internal references inside of RavenDB make use of the position data inside the file, so just moving the data won’t help. Instead, you have to compact the data. That forces us to re-write the entire database layout from scratch and fixes those references. That is an offline operation, however.
For the most part, it doesn’t actually matter. RavenDB will use the internal free space as needed, so it isn’t like it is actually lost.
One feature that we are considering for version 6.0 of RavenDB is hole punching. That means that we’ll make use of advanced file system API to free the disk space allocated to RavenDB even mid file.
On Linux, that means using FALLOC_FL_PUNCH_HOLE. On Windows, that means using FSCTL_SET_ZERO_DATA.
That will have the advantage of freeing disk space back to the operating system without needing user intervention. In particular, that is going to make it so a user that delete data to free disk space see the free space reflected in the OS metrics.
There are problems with this approach, however. First, the size of the file remains the same, which leads to interesting questions. Consider:
Second, this defeats the purpose of wanting to optimize disk allocations. If we free disk space in this manner, when we get it back, it may no longer be continuous on the disk. That said, it is not that big a problem in the days of SSDs and NVMes as it was at the time of the rotational hard disk.
Then, you may get into a very bad situation in which Voron tries to use disk space that it had allocated mid file (but was already freed) but it can’t, because the disk is full. Right now, this is simply an impossible error, with hole punching, we need to consider how to deal with this.
I’m doing some cloud work, and I am working based off the official documentation, trying to automate the creation of an AWS Lambda. In order to allow me to quickly iterate, I’m basically creating the entire thing from scratch each time.
I have the following code:
- aws iam create-role --role-name $AWS_ROLE --assume-role-policy-document file://trust-policy.json
- aws iam attach-role-policy --role-name $AWS_ROLE --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
- aws lambda create-function --function-name $FUNC_NAME --zip-file fileb://lambda.zip --handler lambda_function.lambda_handler --runtime python3.8 --role $ARN_ROLE
So far, so good, and exactly like it shows in the docs. But if you’ll run it as a script, it will fail with:
An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.
If I re-run the exact same command, however, it works properly.
There is this interesting command, which indicates that roles are using eventual consistency:
aws iam wait role-exists --role-name $AWS_ROLE
Except… that this doesn’t work. It looks like there is some additional delay between creating the role, validating that it was created and when it is actually available for Lambda to be used.
After looking around and feeling like a fool, I added a sleep for 10 seconds to the script, and the problem went away.
I’m posting this for posterity sake and in the hope that someone can tell me that there is a better way. For now, I think I need a shower.
We just published a white paper on RavenDB performance vs. Couchbase performance in a real customer scenario.
I had to check the results three times before I believed them. RavenDB is pretty awesome, but I had no idea it was that awesome.
The data set was reasonably big, 1.35 billion docs and the scenario we present is a real world one based on production load.
Some of the interesting details:
- RavenDB uses 1/3 of the disk space that Couchbase uses, but stores 3 times as much data.
- Operationally, RavenDB just worked, Couchbase needed 6 times the hardware to just scrape by. A single failure in Couchbase meant at least 15 – 45 minutes for the node to recover. Inducing failures in RavenDB brought the node back up in a few seconds.
- For queries, we pitted a Couchbase cluster with 96 cores and 384 GB RAM against single RavenDB node running on a Raspberry PI. RavenDB on the Pi was able to sustain better latencies at the 99 percentile handling twice as much load as Couchbase is able.
There are all sort of other goodies in the white paper and we went pretty deep into the overall architecture and impact of the difference design decisions.
As usual, we welcome your feedback.
You can hear me speaking at the Angular Show about using document database from the point of view of full stack or front end developers.
In this episode, panelists Brian Love, Jennifer Wadella, and Aaron Frost welcome Oren Eini, founder of RavenDB, to the Angular Show. Oren teaches us about some of the key decisions around structured vs unstructured databases (or SQL vs NoSQL in hipster developer parlance). With the boom of document-driven unstructured databases, we wanted to learn why you might choose this technology, the pitfalls and benefits, and what are the options out there. Of course, Oren has a bit of a bias for RavenDB, so we'll learn what RavenDB is all about and why it might be a good solution for your Angular application.
One of the “fun” aspects of running in the cloud is the fact that certain assumptions that you take for granted are broken, sometimes seriously so. Today post is about an issue a customer run into in the cloud. They were seeing some cases of high latency of operations from RavenDB. In the cloud, the usual answer is to provision more resources, but we generally recommend that only when we can show that the load is much higher than expected to be handled on the hardware.
The customer was running on a cluster with disk that were provisioned with 1,000 IOPS and 120 MB /sec, that isn’t a huge amount, but it is certainly respectable. Looking at the load, we can see fairly constant writes and the number of indexes is around 30. Looking at the disk, we can see that we are stalling there, the queue length is very high and the disk latency has a user visible impact.
All told, we would expect to see a significant amount of I/O operations as a result of that, but the fact that we hit the limits of the provisioned IOPS was worth a second look. We started pulling at the details and it became clear that there was something that we could do about it. During indexing, we create some temporary files to store the Lucene segments before we commit them to the index. Each indexing run can create between four and six such files. When we create them, we do that with the flag DeleteOnClose, this is a flag that exists on Windows, but not on Linux. On Linux, we are running on ext4 with journaling enabled, which means that each file system metadata modification requires a journal write at the file system level. Those temporary files live for a very short amount of time, however. We delete them on close, after all, and the indexing run is very short.
6 files per index times 30 indexes means 180 files. Each one of those will be created and destroyed (generating a journal event each time) and there is a constant low volume of writes. That means that there are 360 IOPS at the file system level just because of this issue.
The fix for that was two folds. First, for small files, under 128KB, we never hit the disk. We can keep them completely in memory. For larger file, we want to avoid using too much memory, so we spill them to disk, but instead of creating new files each time, we’ll reuse them between indexing run.
The end result is that we are issuing fewer I/O operations, reducing the amount of trivial IOPS we consume and can get a lot more done with the same hardware. The actual fix is fairly small and targeted, but the impact is felt across the entire system.
In the past two posts in the series, I talked about ways to store phone book records in a file. During the candidates review process, I noticed that many candidates failed to make their lives significantly easier by placing limits on themselves.
- Using variable length records.
- Using a single file.
- Choosing simple algorithm to do the whole task.
If we force fixed length records, either directly or via record splitting (if each record is 64 bytes, a record that is bigger than that would reside in some multiple of the record size), the task become much easier. I’ve mostly ignored that in my code so far because I’m using binary offsets, but it can really make the code a lot simpler.
Using a single file lead to complications, because you have to do internal space management (where do the records live, where is the metadata?). It also make it much harder to recover used space in many cases.
The last one is probably the most interesting limitation, and not something that I would expect a junior developer to figure out. The use of a single option is typically limiting you to whatever a particular algorithm is providing, but you can extend on that significantly.
Let’s see another approach to building a persistent phone book. I’m going to effectively build an LSM here. You can see the code here.
I called it a pretty horrible LSM (Log Structure Merge), but all the relevant pieces are there. It is just horribly inefficient. The key problem, by the way, is around the number of times it will open a file handle. That can be really slow on Windows and end up being a real issue for any significant size.
There are also probably a lot of other bugs there, but also enough to figure out how this is actually built.
And with this post, I can can say that I explicitly scratched this itch.
A fun task to take this further, by the way, is to try to implement a persistent trie for the phone book.