Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,386 | Comments: 50,790

Privacy Policy Terms
filter by tags archive
time to read 5 min | 969 words

One of the big changes in RavenDB is the new search engine, Corax. We want to replace Lucene with a purpose built search engine, capable of doing everything that we can do with Lucene, but far faster.

In this post, I want to discuss a particular detail of the implementation. Managing the posting list. In information retrieval, the idea of a posting list is that we have a list of document ids that match a particular term. I’m ignoring the details of how we create or manage the list. The basic idea is that we have a need to store a potentially large number of document ids, update them on the fly as well query them. Lucene manages that by creating immutable segments, but that design decision doesn’t match the way we want to do things in Corax. The problem with immutable segments is that you’ll eventually need to run compaction, which can be really expensive.

As a database, we already have a really good data structure that matches most of those requirements, it turns out. A B+Tree can do a pretty good approximation of a posting list, but it does have a problem. It’s not efficient in terms of space usage. A document id is a 64 bits integer, and we can make a number of assumptions about it. Therefore, I create a dedicated B+Tree like structure to hold the posting list. This B+Tree is called a Set inside of Voron, and it can hold any number of uint64 values. Internally, this is managed as two separate types of pages. The Branch pages (which are fairly standard B+Tree branches) and the Leaf pages.

Here is the first version of the Set Leaf Page implementation:

image

Let’s figure out what is going on here. The page has a notion of a base. That means all the documents IDs that have the same upper 33 bits. Basically, inside a single page, all the IDs are in the same 2 billion range. That means that even though the document ids are 64 bits, in the context of a single page, we can treat them as 32 bits integers. That turns out to be important, since most integer compression routines work on 32 bits integers.

How does that work? We have a section in the page that is dedicated to raw values, and we insert values into that section until it is full. Whenever that happens, we compress the raw values section using PForDelta compression. The page will then contain multiple compressed segments and a single raw values segment. Each compressed segment is non-overlapping with the other compressed segments (but may overlap with the raw values). PForDelta is really good in compressing integers, especially if it is able to figure out patterns in your data. And documents IDs in Corax are explicitly designed to have common patterns so it will be able to take advantage of this behavior. When we read the entries from the page, we merge the compressed segments with the raw values and return the full details.

The code isn’t particularly simple, but has a fairly straightforward approach to the problem once you understand the design.

One thing that I haven’t touched is the notion of removals. That is an important concept, and we handle that in an interesting way. Remember that I said that the baseline for the page is the upper 33 bits? That is because the numbers inside the page are 31 bits in length, we reserve the top bit to mark a value as a tombstone marker.  In other words, when we write to the raw values, we can have either a new value or a removed value, distinguished using the uppermost bit.

When we fill the raw values segment, we compress it alongside the relevant compressed segments. At that point, we’ll filter out the removed values. This is very similar to the segments that Lucene is using, but we do that on a page boundary (8KB), not across all values.

We are able to push a lot of values into a single page. We see typically thousands to tens of thousands of documents IDs fitting into a single 8KB page. That is pretty amazing, since even if you have a posting list that has millions of entries, the actual disk reads are minimal.

The design was with us throughout most of the development of Corax, but we ran into a serious issue with the way it works when we started to build the query optimizer for Corax.

That is an interesting statement, I think you’ll agree. What is the relation between a (very) low-level design of the on-disk data format and the result of the query optimizer?

Well, it turns out that one of the things that we need to know for the query optimizer is: “How big is this posting list?”

This question is actually really important to be able to generate optimal queries. And given the structure we have, we can provide a good answer to that, most of the time, but not always.

Why is that?

The problem is what happens when we want to remove a value from the set, or add an existing value. If the value already exists inside a compressed segment, we don’t open the compressed segement (which will require re-writing it from scratch), so we record an addition that is actually spurious. Conversely, if we try to remove a value that isn’t in the set, we’ll wrongly decrement the number of entries in the posting list, leading to issues with a mismatch between the record number of entries and the actual number we have in the posting list.

That was very confusing to figure out, I’ll admit. It was also really hard to fix, but I’ll address that in the next post in the series.

time to read 4 min | 750 words

I’ve been calling myself a professional software developer for just over 20 years at this point. In the past few years, I have gotten into teaching university courses in the Computer Science curriculum. I have recently had the experience of supporting a non-techie as they went through a(n intense) coding bootcamp (aiming at full stack / front end roles). I’m also building a distributed database engine and all the associated software.

I list all of those details because I want to make an observation about the distinction between fundamental and transient knowledge.

My first thought is that there is so much to learn. Comparing the structure of C# today to what it was when I learned it (pre-beta days, IIRC), it is a very different language. I had literally decades to adjust to some of those changes, but someone that is just getting started needs to grasp everything all at once. When I learned JavaScript you still had browsers in the market that didn’t recognize it, so you had to do the “//<!—” trick to get things to work (don’t ask!).

This goes far beyond mere syntax and familiarity with language constructs. The overall environment is also critically important. One of the basic tasks that I give in class is something similar to: “Write a network service that would serve as a remote dictionary for key/value operations”.  Most students have a hard time grasping details such as IP vs. host, TCP ports, how to read from the network, error handling, etc. Adding a relatively simple requirement (make it secure from eavesdroppers) will take it entirely out of their capabilities.

Even taking a “simple” problem, such as building a CRUD website is fraught with many important details that aren’t really visible. Responsive design, mobile friendly, state management and user experience, to name a few. Add requirements such as accessibility and you are setting the bar too high to reach.

I intentionally choose the examples of accessibility and security, because those are “invisible” requirements. It is easy to miss them if you don’t know that they should be there.

My first website was a PHP page that I pushed to the server using FTP and updated live in “production”. I was exposed to all the details about DNS and IPs, understood exactly that the server side was just a machine in a closet, and had very low levels of abstractions. (Naturally, the solution had no security or any other –ities). However, that knowledge from those early experiments has served me very well for decades. Same for details such as how TCP works or the basics of operating system design.

Good familiarity with the basic data structures (heap, stack, tree, list, set, map, queue) paid itself many times over. The amount of time that I spent learning WinForms… still usable and widely applicable even in other platforms and environments. WPF or jQuery? Not so much.

Learning patterns paid many dividends and was applicable on a wide range of applications and topics.

I looked into the topics that are being taught (both for bootcamps and universities) and I understand why in many cases, those are being skipped. You can actually be a front end developer without understanding much (if at all) about networks. And the breadth of details you need to know is immense.

My own tendency is to look at the low level stuff, and given that I work on a database engine, that is obviously quite useful. What I have found, however, is that whenever I dug deep into a topic, I found ways to utilize that knowledge at a later point in time. Sometimes I was able to solve a problem in a way that would be utterly inconceivable to me previously. I’m not just talking about being able to immediately apply new knowledge to a problem. If that were the case, I would attribute that to wanting to use the new thing I just learned.

However, I’m talking about scenarios where months or years later I ran into a problem, and was then able to find the right solution given what was then totally useless knowledge.

In short, I understand that chasing the 0.23-alpha-stage-2.3.1-dev updates on the left-pad package is important, but I found that spending time deep in the stack has a great cumulative effect.

Joel Spolsky wrote about leaky abstractions, that was 20 years ago. I remember reading that blog post and grokking that. And it is true, being able to dig one or two layers down from where you usually live has a huge amount of leverage on your capabilities.

time to read 3 min | 411 words

A user came to us with an interesting scenario. They have a RavenDB cluster, which is running in a distributed manner. At some point, we have a user that creates a document inside of RavenDB as well as posts a message using SQS (Amazon queuing system) to be picked up by a separate process.

The flow of the system is shown below:

image

The problem they run into is that there is an inherent race condition in the way they work. The backend worker that picks up the messages may use a different node to read the data than the one that it was written to.

RavenDB uses asynchronous replication model, which means that if the queue and the backend workers are fast enough, they may try to load the relevant document from the server before it was replicated to it. Amusingly enough, that typically happens on light load (not a mistake, mind). On high load, the message processing time usually is sufficient to delay things for replication to happen. In light load, the message is picked up immediately, exposing the race condition.

The question is, how do you deal with this scenario? If this was just a missing document, that was one thing, but we also need to handle another scenario. While the message is waiting to be processed in the queue, it may be updated by the user.

So the question now is, how do we handle distributed concurrency in a good manner using RavenDB.

The answer to this question is the usage of change vectors.  A change vector is a string that represents the version of a document in a distributed environment. The change vector is used by RavenDB to manage optimistic concurrency.

This is typically used to detect changes in a document behind our backs, but we can use that in this scenario as well. The idea is that when we put the message on the queue, we’ll include the change vector that we got from persisting the document. That way, when the backend worker picks up the message and starts using it, the worker can compare the change vectors.

If the document doesn’t exist, we can assume that there is a replication delay and try later. If the document exists and the change vector is different, we know the document was modified, and we may need to use different logic to handle the message in question.

time to read 2 min | 337 words

A database indexing strategy is a core part of achieving good performance. About 99.9% of all developers have a story where adding an index to a particular query cut the runtime from seconds or minutes to milliseconds. That percentage is 100% for DBAs, but the query was cut from hours or days to milliseconds.

The appropriate indexing strategy is often a fairly complex balancing act between multiple competing needs. More indexing means more I/O and cost on writes, but faster reads. RavenDB has a query optimizer engine that will analyze your queries and generate the appropriate set of indexes on the fly, without you needing to think much about it. That means that RavenDB will continuously respond to your operational environment and changes in it. The end result is an optimal indexing strategy at all times.

This automatic behavior applies only to automatic indexes, however. RavenDB also allows you to define your own indexes and many customers run critical business logic in those indexes. RavenDB now has a feature that aims to help you manage/organize your indexes by detecting redundant definitions & unqueried indexes which can be removed or merged.

The index cleanup feature is now exposed in the Studio (since build 5.4.5):

image

When you select it, the Studio will show you the following options:

image

You can see that RavenDB detected that two indexes can be merged into a single one, and additionally there are some indexes that haven’t been used in a while or have been completely superseded by other indexes.

RavenDB will even go ahead and suggest the merged index for you:

image

The idea is to leverage RavenDB’s smarts so you won’t have to spend too much time thinking about index optimization and can focus on the real value-added portions of your system.

time to read 5 min | 998 words

A long while ago, I was involved in a project that dealt with elder home care. The company I was working for was contracted by the government to send care workers to the homes of elderly people to assist in routine tasks. Depending on the condition of the patient in question, they would spend anything from 2 – 8 hours a day a few days a week with the patient. Some caretakers would have multiple patients that they would be in charge of.

There were a lot of regulations and details that we had to get right. A patient may be allowed a certain amount of hours a week by the government, but may also pay for additional hours from their own pocket, etc. Quite interesting to work on, but not the topic of this post.

A key aspect of the system was actually paying the workers, of course. The problem was the way it was handled. At the end of each month, they would designate “office hours” for the care workers to come and submit their time sheets. Those had to be signed by the patients or their family members and were submitted in person at the office. Based on the number of hours worked, the workers would be paid.

The time sheets would look something like this:

image

A single worker would typically have 4 – 8 of those, and the office would spend a considerable amount of time deciphering those, computing the total hours and handing over payment.

The idea of the system I was working on was to automate all of that. Here is more or less what I needed to do:

image

For each one of the timesheet entries, the office would need to input the start & end times, who the care worker took care of, whether the time was approved (actually, whether this was paid by the government, privately, by the company, etc), and whether it was counted as overtime, etc.

The rules for all of those were… interesting, but we got it working and tested. And then we got to talk with the actual users.

They took one look at the user interface they had to work with and absolutely rebelled. The underlying issue was that during the end of the month period, each branch would need to handle hundreds of care workers, typically within four to six hours. They didn’t have the time to do that (pun intended). Their current process was to review the submitted time sheet with the care worker, stamp it with approved, and put just the total hours worked into the payroll system.

imageHaving to spend so much time on data entry was horrendous for them, but the company really wanted to have that level of granularity, to be able to properly track how many hours were worked and handle its own billing more easily.

Many of the care workers were also… non-technical, and the timesheet had to be approved by the patient or family worker. Having a signed piece of paper was easy to handle, trying to get them to go on a website and enter hours was a non-starter. That was also before everyone had a smartphone (and even today I would assume that would be difficult in that line of business).

As an additional twist, it turns out that the manual process allowed the office employees to better manage the care workers. For example, they may give them a 0.5 / hour adjustment for a particular month to manage a difficult client or deal with some specific issue, or approve (at their discretion) overtime pay when it wasn’t quite "proper” to do so.

One of the reasons that the company wanted to move to a modern system was to avoid this sort of “initiatives”, but they turned out to be actually quite  important for the proper management of the care workers and actually getting things done. For an additional twist, they had multiple branches, and each of those had a different methodology of how that was handled, and all of them were different from what HQ thought it should be.

The process turned out to be even more complex than we initially got, because there was a lot of flexibility in the system that was  actually crucial for the proper running of the business.

If I recall properly, we ended up having a two-stage approach. During the end of the month rush, they would fill in the gross hours and payment details. That part of the system was intentionally simplified to the point where we did almost no data validation and trusted them to put the right values.

After payroll was processed, they had to actually put in all those hours and we would run a check on the expected amount / validation vs. what was actually paid. That gave the company the insight into what was going on that they needed, the office people were able to keep on handling things the way they were used to, and discrepancies could be raised and handled more easily.

Being there in the “I’m just a techie” role, so I got to sit on the sidelines and see tug-of-war between the different groups. It was quite interesting to see, I described the process above in a bit of a chaotic manner, but there were proper procedures and processes in the offices. They were just something that the company HQ never even realized was there.

It also taught me quite a lot about the importance  of accepting “invalid” data. In many cases, you’ll see computerized solutions flat out refuse to accept values that they consider to be wrong. The problem is that often, you need to record reality, which may not agree with validation rules on your systems. And in most cases, reality wins.

time to read 5 min | 861 words

I love B+Trees, but they can be gnarly beasts, with the number of edge cases that you can run into. Today’s story is about a known difficult place, page splitting in the tree. Consider the following B+Tree, showing a three-level tree with 3 elements on each page.

image

Consider what will happen when we want to insert a new value to the tree, the value: 27. Given the current state of the tree, that should go on the page marked in red:

image

But there is no place for the new value on this page, so we have to split it. The tree will then look like so, we split the page and now we need to add the new page to the parent, but that one also doesn’t have room for it:

image

So we are now in a multi-level split process. Let’s see what this looks like when we go up the tree. This is the final state of the tree when we are done doing all the splits:

image

The reason for all of this is that we need to add 27 to the tree, and we haven’t done that yet. At this stage, we got the tree back in order and we can safely add the new value to the tree, since we made sure we have enough space.

However, note that the exact same process would apply if we were adding 27 or 29. The page that we’ll add them to, however, is different.

This can be quite complex to keep track of, because of the recursive nature of the process. In code, this looks something like this:

I am skipping on some details, but that is the gist of it. So we do the split (recursively if needed) and then after we wired the parent page properly, we find the right location for the new value.

An important aspect here is the cursor. We use that to mark our current location in the tree, so the cursor will always contain all the parent pages that we are currently searching upon. A lot of the work that we are doing in the tree is related to the cursor.

Now, look at the code and consider the behavior of this code when we insert the value 29. It will correctly generate this page:

image

However.. what happens if we’ll insert 27?

Well, when we split the page, we went up the tree. And then we had another split, and then we went down another branch. So as written, the result would be adding the 27 to the same page as we would the 29. This would look like this:

image

Look at the red markers. We put entry 27 on the wrong page.

Fixing this issue is actually pretty hard, because we need to keep track of the values as we go up and down the tree. For fun, imagine what happens in this exact scenario, but when you have 6 levels in the tree and you end up in a completely different location in the tree.

I spent a lot of time struggling with this issue, including getting help from some pretty talented people, and the final conclusion we got was “it’s complicated”.

I don’t want complications here, I need it to be as simple as possible, otherwise, we can’t make any sort of sense here. I kept spinning more and more complex systems to resolve this, when I realized that I just looked at the problem in the wrong manner all along.

The issue was that I was trying to add the new value to the tree after I sorted out the structure of the tree, but there was actually nothing that forced me to do that. Given that I already split the page at this stage, I know that I have sufficient space to add the key without doing anything else.  I can first add the key to the right page, then write the split page back to the tree. In this case, I don’t need to do any sort of backtracking or state management .

Here is what this looks like:

And with this change, the entire class of problems related to the tree structure just went away.

I’m very happy with this result, even if it is a bit ironic. Like the problem at hand, a lot of the complexity was there because I had to backtrack the implementation decisions and go on a new path to solve this.

Also, I just checked, the portion that controls page splits inside Voron has had roughly 1 change a year for the past 5 years. Given our scope and usage, that means that it has been incredibly stable in the face of everything that we could throw at it.

time to read 3 min | 582 words

The design of the X509Certificate2 is badly broken in terms of safety. If you load a certificate from the disk or a byte buffer, it will go ahead and create a file on the disk behind the scene. If you’ll dispose the instance, the file will be removed. However, if you don’t explicitly dispose the instance, that is too bad. The file remains.

A ticking time bomb, because eventually you’ll have a lot of such files on the disk. Which is then a fun state to try to recover from.

I’m not sure why this design decision was made. I assume that at the time, people didn’t need to work so much with certificates, and a lot of the issues are likely with dealing with the underlying crypto API. Regardless, it is mandatory to dispose the certificate after you use it.

And that leads to a problem. Consider the following code:

The idea is that we want to be able to switch certificates on the fly (since we need to update them before they expire, without interrupting the server). Old connections can still use the old certificate, while new ones will use the updated one.

Practically speaking, the certificate itself shouldn’t be used after the call to AuthenticateAsServerAsync(), but I don’t believe that we have any such promises. Regardless, as the async designation indicates, that can take a while. How would I know to dispose the old certificate? I have to consider multi threading here as well, if I dispose the certificate while it is being used to authenticate a request, that request will likely fail. Given that I’m racing a native API and disposing its resources while it is under use, I may open some severe issues.

Ideally, the X509Certificate2 should manage that for me. If it would have implemented a finalizer, it would dispose itself when the GC made sure that no one was looking at it. That is what I want to happen, but in this case, we have no such support.

Luckily we got options. Behold the following code:

What does this do? It uses several tricks to get what we want, attaching an external finalizer to an object that we don’t control.

First, ConditionalWeakTable will ensure that as long as there is a reference to the certificate, the cleaner will be referenced as well. When there is no reference for the certificate, we’ll need to run the finalizer for the cleaner.

Next, we have the usage of CriticalFinalizerObject, this is done to ensure that the finalizer will be called even when the process terminates. This is the same manner .NET flushes file handles, so we can be sure that we are doing the utmost to ensure that we’ll properly dispose of the files.

Finally, there is the dance with the GetValueOrDefault() call in RegisterForDisposalDuringFinalization(). We need to consider what would happen if we’ll get concurrent requests to register the certificate. If we’ll let it race, one of the cleaners will be discarded, and then the finalizer will be called on that, causing havoc.

In this manner, we let ConditionalWeakTable ensure that there is just one instance, and set the value afterward. Since the value is unique per instance, we can set it multiple times (it will always be set to the same value).

End result, it takes less than 10 lines of code to fix this (and of course, remember to call register whenever you create a certificate instance). But I would really like that to just be the default behavior. Otherwise, that is a very risky trap.

time to read 4 min | 698 words

I read this blog post from Discord, which presents a really interesting approach to a problem that they ran into. Basically, in the cloud you have the choice between fast and ephemeral disks and persistent (and much slower) disks.

Technically speaking you can try to get really fast persistent disks, but those are freaking expensive. Ultra disks on Azure and io2 disks on AWS and Extreme persistent disks for GCP. The cost for a single 1TB disk with 25K IOPS would be around 1,800 USD / month, for example. That is just for the disk!

To give some context, an r6gd.16xlarge instance with 64 cores and 512 GB (!) of RAM as well as 3.8 TB of NVMe drive will cost you less than that. That machine can also do 80K IOPS, for context.

At Discord scale, they ran into the limitations of I/O in the cloud with their database and had to come up with a solution for that.

My approach would be to… not do anything, to be honest. They are using a database (ScyllaDB) that is meant to run on commodity hardware. That means that losing a node is an expected scenario. The rest of the nodes in the cluster will be able to pitch in and recover the data when the node comes back or is replaced. That looks like the perfect scenario for running with fast ephemeral disks, no?

There are two problems with ephemeral disks. First, they are ephemeral Smile, wich means that a hardware problem can make the whole disk go poof. Not an issue, that is why you are using a replicated database, no? We’ll get to that.

Another issue that is raised by Discord is that they rely on disk snapshots for backups and other important workflows. The nice thing about snapshotting a cloud disk is that it is typically fast and cheap to do. Otherwise you need to deal with higher level backup systems at the database layer. The issue is that you probably do want that, because restoring a system from a set of disk snapshots that were taken at different times and at various states is likely to be… challenging.

Discord's solution to this is to create a set of ephemeral disks that are mirrored into persistent disks. Reads are going to the fast disks and writes are spread around. A failure on the ephemeral disks will lead to a recovery of the data from the network disk.

Their post has the gory details and a lot more explanation aside. I wanted to write this post for a simple reason. As a database guy, this system scares me to pieces.

The issue is that I don’t know what kind of data consistency we can expect in such an environment. Do I get any guarantees about the equivalence of data between the fast disks and the network disk? Can they diverge from one another? What happens when you have partial errors?

Consider a transaction that modifies three different pages that end up going to three different fast disks + the network disk. If one of the fast disks has an error during write, what is written to the persistent disk?

Can I get an out-of-date read from this system if we read from the network disk for some reason? That may mean that two pages that were written in one transaction are coming back as different versions. That will likely violate assumptions and invariants and can lead to all sorts of… interesting problems.

Given that this system is meant to handle the failure modes, it sounds really scary because it is an additional layer of complexity (and one that the database is unaware of) to deal with.

Aside from the snapshots, I assume that the reason for this behavior is to avoid the cost of rebuilding a node when there is a failure. I don’t have enough information to say what is the failure rate and the impact on the overall system, but the solution provided is elegant, beautiful, and quite frankly, pretty scary to me.

There have been quite a few unknowns that we had to deal with in the realm of storage. But this adds a whole new layer of things that can break.

time to read 5 min | 932 words

One of the high costs that we have right now in my Redis Clone is strings. That is actually a bit misleading, take a look here:

image

Strings take 12.57% of the runtime, but there is also the GC Wait, where we need to cleanup after them. That means that the manner in which we are working is pretty inefficient.

Our test scenario right now also involves solely GET and SET requests, there are no deletions, expirations, etc. I mention that because we need to consider what we’ll replace the strings with.

The simplest option is to replace that with a byte array, but that is still managed memory and incurs the costs associated with GC. We can pool those byte arrays, but then we have an important question to answer, how do we know when a buffer is no longer used?

Consider the following set of events:

Time Thread #1 Thread #2
1 SET abc  
2   GET abc
3 SET abc  
4   Use the buffer we got on #2

In this case, we have thread #2 accessing the value buffer, but we replaced that buffer. We need to let thread #2 keep using this buffer until it is done.

This little tidbit put us right back at concurrent manual memory management, which is scary. We can do things in a slightly different manner, however. We can take advantage of the GC to support us, like so:

The idea is pretty simple. We have a class that holds a buffer, and when the GC notices that it is no longer in use, it will add its buffer back to the pool. The idea is that we rely on the GC to resolve this (really hard) problem for us. The fact that this moves the cost to the finalizer means that we can not worry about this. Otherwise, you have to jump through a lot of hoops.

The ReusableBuffer class also implements GetHashCode() / Equals() which allow us to use it as a key in the dictionary.

Now that we have the backing store for keys and values, let’s see how we can read & write from the network. I’m going to go back to the ConcurrentDictionary implementation for now, so I’ll handle only a single concept at a time.

Before, we used StreamReader / StreamWriter to do the work, now we’ll use PipeReader / PipeWriter from System.IO.PIpelines. That will allow us to easily work with the raw bytes directly and it is meant for high performance scenarios.

I wrote the code twice, once using the reusable buffer model and once using PIpeReader / PipeWriter and allocating strings. I was surprised to see that my fancy reusable buffers were within 1% performance of the (much simpler) strings implementation. That is 1% in the wrong direction, by the way.

On my machine, the buffer based system was 165K ops/second while the strings based one was 166K ops/sec.

Here is the reusable buffer based approach complete source code. And to compare, here is the string based one. The string based one is about 50% shorter in terms of lines of code.

I’m guessing that the allocation pattern is really good for the kind of heuristics that the GC does. We either have long term objects (in the cache) or very short term ones.

It’s worth pointing out that the actual parsing of the commands from the network isn’t using strings. Only the actual keys and values are actually translated to strings. The rest I’m doing using raw bytes.

Here is what the code looks like for the string version under the profiler:

image

And here is the same thing using the reusable buffer:

image

There are a few interesting things to note here. The cost of ExecCommand is almost twice as high as the previous attempt. Digging deeper, I believe that the fault is here:

This is the piece of code that is responsible for setting an item in the dictionary. However, note that we are doing a read for every write? The idea here is that if we have a set on an existing item, we can avoid allocating the buffer for the key again, and reuse it.

However, that piece of code is in the critical path for this benchmark and it is quite costly. I changed it to do the allocations always, and we got a fairly consistent 1% – 3% faster than the string version. Here is what this looks like:

image

In other words, here is the current performance table (under the profiler):

  • 1.57 ms  - String based 
  • 1.79 ms - Reusable buffer based (reduce memory usage)
  • 1.04 ms - Reusable buffer (optimized lookup)

All of those numbers are under the profiler, and on my development machine. Let’s see what we get when I’m running them on the production instances, shall we?

  • String based – 1,602,728.75 ops/sec
  • Reusable buffer (with reducing memory code) – 1,866,676.53 ops/sec
  • Reusable buffer (optimized lookup) – 1,756,930.64

Those results do not match with what we see in my development machine. The likely reason is that the amount of operations is high enough and the load is sufficiently big that we are seeing a much bigger impact from the memory optimization at scale.

That is the only conclusion I can draw from the fact that the memory reduction code, which adds costs, is actually able to process more requests/seconds under such load.

time to read 4 min | 753 words

Now that I’m done with the low hanging fruits, I decided to shift the Redis implementation to use System.IO.Pipelines. That is a high performance I/O API that is meant specifically for servers that need to eke out all the performance out of the system.

The API is a bit different, but it follows a very logical pattern and makes a lot of sense. Here is the main loop of handling commands from a client:

The idea is that we get a buffer from the network, we read everything (including pipelined commands) and then we flush to the client. The more interesting things happen when we start processing the actual commands, because now we aren’t utilizing StreamReader but PipeReader. So we are working at the level of bytes, not strings.

Here is what this (roughly) looks like, I’m not showing the whole thing because I want to focus on the issue that I ran into:

The code is reading from the buffer, parsing the Redis format and then executing the commands. It supports multiple commands in the same buffer (pipelining) and it has absolutely atrocious performance.

Yes, the super speedy API that is significantly harder to get right (compared to the ease of working with strings) is far slower. And by far slower I mean the following, on my development machine:

  • The previous version clocks at around 126,017.72 operations per second.
  • This version clocks at less than 100 operations per second.

Yes, you read that right, less than one hundred operations per second compared to over hundred thousands for the unoptimized version.

That was… surprising, as you can imagine.

I actually wrote the implementation twice, using different approaches, trying to figure out what I was doing wrong. Surely, it can’t be that bad.

I took a look at the profiler output, to try to figure out what is going on:

image

It says, quite clearly, that the implementation is super bad, no? Except, that this is what you are supposed to be using. So what is going on?

The underlying problem is actually fairly simple and relates to how the Pipelines API achieves its performance. Instead of doing small calls, you are expected to get a buffer and process that. Once you are done processing the buffer you can indicate what amount of data you consumed, and then you can issue another call.

However, there is a difference between consumed data and examined data. Consider the following data:

*3
$3
SET
$15
memtier-2818567
$256
xxxxxxxxxx ... xxxxxx
*2
$3
GET
$15
memtier-7689405
*2
$3
GET
$15
memt

What you can see here is a pipelined command, with 335 bytes in the buffer.  We’ll process all of those commands in a single hit, except… look at the highlighted portion. What do we have there?

We have a partial command. In other words, we are expected to execute a GET with a key size of 15 bytes, but we only have the first 4 bytes here. That is actually expected and fine. We consumed all the bytes until the highlighted portion (thus letting the PipeReader know that we are done with them). The problem is that when we issue a call now, we’ll get the highlighted portion (which we didn’t consume), but we aren’t ready to process that. Data is missing. We indicate that to the PipeReader using the examined portion. So the PipeReader knows that it needs to read more from the network.

However… my code has a subtle bug. It will report that it examined the yellow highlight, not the green one. In other words, we tell the PipeReader that we consumed some portion of the buffer, and examined some more, but there are still bytes on the buffer that are neither consumed nor examined. That means that when we issue the read call, expecting to get data from the network, we’ll actually get the same buffer again, to do the exact same processing.

Eventually, we’ll have more data in the buffer from the other side, so the correctness of the solution isn’t impacted. But it will kill your performance.

The fix is really simple, we need to tell the PipeReader that we examined the entire buffer, so it will not do a busy wait and wait for more data from the network. Here is the bug fix:

With that change in place, we can hit 187,104.21 operations per second! That is 50% better, which is awesome. I haven’t profiled things yet properly, because I also want to address another issue, how are we going to deal with the data from the network. More on that in my next post.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Recording (6):
    17 Nov 2022 - RavenDB in a Distributed Cloud Environment
  2. RavenDB Indexing (2):
    20 Oct 2022 - exact()
  3. Production postmortem (45):
    03 Oct 2022 - Do you trust this server?
  4. Webinar recording (15):
    26 Aug 2022 - Modeling Relationships and Hierarchies in a Document Database
  5. re (32):
    16 Aug 2022 - How Discord supercharges network disks for extreme low latency
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats