Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,154 | Comments: 50,132

Privacy Policy Terms
filter by tags archive
time to read 2 min | 298 words

Yesterday I had presented a webinar on using Machine Learning and Time Series in RavenDB. One of the things that I love about doing those webinars is that I get to field questions from the audience and have to really think on my feet.

In almost all cases, I think that I am able to provide good answers, and I usually accompany these with a live demo showing what is going on.

Yesterday, that wasn’t the case. During the demo, I managed to run into a fairly obscure bug very deep in the internals of RavenDB and got the wrong results. Then I got stuck on that and couldn’t figure out what is a proper workaround until just after the webinar concluded.

Hugely embarrassing, but at least I can take comfort that it wasn’t the first time that a live demo failed and probably won’t be the last time.

The good news, on the other hand, is that I created an issue to fix this problem. Today I made myself a cup of coffee and was about to dig into the problem when I realized that it has already been fixed. Time from opening the bug to it getting fixed, under 6 hours. That is without rushing it, mind you. I think that given the turnaround time, it is a good thing overall that I run into this.

The actual problem, for what it is worth, is that we had lazily evaluated a collection during its enumeration. However, when using GroupBy(), we effectively enumerated the value twice, fetching different values each time. The second time, we would use the last value from the collection, since it was lazily evaluated. You can check the pull request if you are that much of a nerd Smile.

time to read 1 min | 103 words

You can now get a Grafana dashboard that would monitor your RavenDB instances. You can use Telegraf 1.18.0+, which includes a RavenDB input plugin, which will expose all the relevant details.

You can see how that would look like:

image

This is part of the work we want to do in order to make it easier and smoother to operate your RavenDB clusters. It joins the work we do on the cluster dashboard as well as a host of other (mostly minor) changes.

image

time to read 1 min | 93 words

In this talk, Oren Eini, founder of RavenDB, is going to take apart a database engine on stage. We are going to inspect all the different pieces that make for an industrial-grade database engine, from the way the data is laid out on disk to how the database is ensuring that transactions are durable. We'll explore algorithms such as B+Tree, write-ahead logs, discuss concurrency strategies and how different features of the database work together to achieve the end goals.

time to read 1 min | 70 words

From the very start, most of the RavenDB community communication was handled via the mailing list.

Some members of the community mentioned that this workflow is outdated and wanted to move to a more modern options. As a result, we opened up the GitHub Discussions board and we welcome the community there as well.

Ask your questions, interact and let us know what you are doing with RavenDB.

time to read 3 min | 438 words

RavenDB has been using the Raft protocol for the past years. In fact, we have written three or four different implementations of Raft along the way. I implemented Raft using pure message passing, on top of async RPC and on top of TCP. I did that using actor model and using direct parallel programming as well as the usual spaghettis mode.

The Raft paper is beautiful in how it explain a non trivial problem in a way that is easy to grok, but it is also something that can require dealing with a number of subtleties. I want to discuss some of the ways to successfully implement it. Note that I’m assuming that you are familiar with Raft, so I won’t explain anything here.

A key problem with Raft implementations is that you have multiple concurrent things happening all at once, on different machines. And you always have the election timer waiting in the background. In order to deal with that, I divide the system into independent threads that each has their own task.

I’m going to talk specifically about the leader mode, which is the most complex aspect, usually. In this mode, we have:

  • Leader thread – responsible for determining the current progress in the cluster.
  • Follower thread – once per follower – responsible for communicating with a particular follower.

In addition, we may have values being appended to our log concurrently to all of the above. The key here is that the followers threads will communicate with their follower and push data to it. The overall structure for a follower thread looks like this:

What is the idea? We have a dedicated thread that will communicate with the follower. It will either ping the follower with an empty AppendEntries (every 1/3 of the election timeout) or it will send a batch of up to 50 entries to update the follower. Note that there is nothing in this code about the machinery of Raft, that isn’t the responsibility of the follower thread. The leader, on the other hand, listen to the notifications from the followers threads, like so:

The idea is that each aspect of the system is running independently, and the only communication that they have with each other is the fact that they can signal the other that they did some work. We then can compute whatever that work changed the state of the system.

Note that the code here is merely drafts, missing many details. For example, we aren’t sending the last commit index on AppendEntries, and committing the log is an asynchronous operation, since it can take a long time and we need to keep the system in operation.

time to read 2 min | 316 words

I run into this question on Hacker News, asking for the best computer science papers. There are a few that I keep getting back to, either because they are so fundamental or they are so useful.

Without any particular order

  • The Raft Paper – a distributed consensus algorithm that made sense to me on first read. There are a lot of subtle issues to consider, but when reading the paper, everything clicked. That is head and shoulders above what Paxos literature is about.
  • The Ubiquitous BTree – talk about a paper that I used daily. Admittedly, I didn’t get started on BTrees from this paper, but this is a very well written one and it does a great job presenting the topic. It is also from 1979, and BTree were already “ubiquitous” at that time, which tells us something.
  • Extendible Hashing – this is also from 1979, and it is well written. I implemented extendible hashing based on this article directly and I grokked it right away.
  • How Complex Systems Fail – not strictly a computer science paper. In fact, I’m fairly certain that this fits more into civil engineering, but it does an amazing job of explaining the internals of complex systems and the why and how they fail. I took a lot from this paper. It is also very short and highly readable.
  • OLTP Through the Looking Glass – discuss the internal structure of database engines and the cost and complexities of their various pieces.
  • You’re doing it wrong – discuss the implementation of Varnish proxy from the point of view of a kernel hacker. Totally different approach to the design of the system. Had a lot of influence on how I build systems.

I’m fairly certain that my criteria won’t be yours, but those are all papers that I have read multiple times and have utilized their insights in my daily work.

time to read 4 min | 663 words

This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.

GP3 disks are about 20% cheaper than their GP2 equivalent. What is more, they come with a guarantee level of performance even before you purchase additional IOPS. Consider the following two disks:

  Size IOPS MB/S Price
GP2 512GB 1,536 250 51.2 USD
GP3 512GB 3,000 125 40.9 USD
GP3 512GB 4,075 250 51.2 USD

In other words, for the same disk, we can get a much better baseline performance at a cheaper price. What isn’t there not to like?

The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower, even if they are able to run more IOPS, their latency is higher.

A really key observation from us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breath a huge sigh of relief.

RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to account for the I/O being the slowest thing around and have already mostly coded around that. An additional millisecond in disk latency doesn’t matter that much in the grand scheme of things.

However… the fact that this doesn’t provide I/O burst is a huge plus for us. RavenDB can easily deal with slow I/O, what it find it very hard to deal with is an environment that very rapidly change its operational characteristics.

Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.

Until all the I/O credits are gone and we are forced into a screeching halt. That means, for example, that we read from the network at a rate of 250MB/sec, but we are unable to write to the disk at this level. There is a negative balance of 125MB/sec that needs to be stored some where. We can buffer that in memory, of course, but that only work for so long. That means that we have to put a huge break all of a sudden, which the rest of the eco system isn’t happy with. For example, the other side that is sending us data at 250MB /sec, they are likely not going to be able to respond in time to the shift is our behavior. It is very likely that the network connection would congest and break in this case.

All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.

As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost to write to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a true traffic jam

The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.

time to read 3 min | 548 words

Voron is RavenDB’s storage engine. It is how we store data, keep transactions and in generally get a lot of our abilities. In this post, I want to talk about the way RavenDB manages disk space internally.  Voron uses a single data file to do its work, the data file is divided into 8KB pages, like so:

Voron uses eager disk allocations to reserve disk space from the operating system. Each time the space inside the file runs out, Voron will double the size of the file. That last until the file size reaches 2GB, after which RavenDB will grow by 1GB at a time. This behavior ensures that Voron gives the underlying file system enough information to provide the database with a continuous range of disk space. In other words, we grab disk space in large chunks to avoid fragmentation of the data file.  

What happens when you delete data, however? Voron mark the free space in its free list and will use that space before it will allocate more disk space from the operating system.

Why aren’t we releasing the disk space back to the operating system? The simplest reason is that it isn’t an just the data at the end of the file that is freed. In fact, like in the image above, free and busy segments are interwoven in the file. We can’t just truncate the file.

Internal references inside of RavenDB make use of the position data inside the file, so just moving the data won’t help. Instead, you have to compact the data. That forces us to re-write the entire database layout from scratch and fixes those references.  That is an offline operation, however.

For the most part, it doesn’t actually matter. RavenDB will use the internal free space as needed, so it isn’t like it is actually lost.

One feature that we are considering for version 6.0 of RavenDB is hole punching. That means that we’ll make use of advanced file system API to free the disk space allocated to RavenDB even mid file.

On Linux, that means using FALLOC_FL_PUNCH_HOLE. On Windows, that means using FSCTL_SET_ZERO_DATA.

That will have the advantage of freeing disk space back to the operating system without needing user intervention. In particular, that is going to make it so a user that delete data to free disk space see the free space reflected in the OS metrics.

There are problems with this approach, however. First, the size of the file remains the same, which leads to interesting questions. Consider:

image

Second, this defeats the purpose of wanting to optimize disk allocations. If we free disk space in this manner, when we get it back, it may no longer be continuous on the disk. That said, it is not that big a problem in the days of SSDs and NVMes as it was at the time of the rotational hard disk.

Then, you may get into a very bad situation in which Voron tries to use disk space that it had allocated mid file (but was already freed) but it can’t, because the disk is full. Right now, this is simply an impossible error, with hole punching, we need to consider how to deal with this.

FUTURE POSTS

  1. The performance regression odyssey - 44 minutes from now

There are posts all the way to Jun 10, 2021

RECENT SERIES

  1. Webinar recording (13):
    24 May 2021 - The Rewards of Escaping the Relational Mindset
  2. Webinar (3):
    12 May 2021 - Real Time Architecture
  3. Building a phone book (3):
    02 Apr 2021 - Part III
  4. Building a social media platform without going bankrupt (10):
    05 Feb 2021 - Part X–Optimizing for whales
  5. Production postmortem (30):
    07 Jan 2021 - The file system limitation
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats