Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 4 min | 749 words

I write databases for a living, which means that I’m thinking a lot about persistence. Here is a fun challenge that we went through recently. We need to store a list of keys and values and then look up a value by key. Pretty standard stuff. The keys and values are both 64-bit integers. In other words, what I would like to have is:

Dictionary<long,long> lookup;

That would be perfect, except that I have to persist the data, which means that I have to work with raw bytes. It’s easiest to think about it if we have some code in front of us. Here is the interface that I need to implement:

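The interface itself isn’t reproduced in this extract; a minimal sketch of what it might look like (the names and exact signatures here are my assumption, not the original code):

public interface IInt64Page
{
    // page is the raw 8KB buffer, all the data lives inside it
    bool TryAdd(Span<byte> page, long key, long value);              // false when the page is full
    bool TryGetValue(ReadOnlySpan<byte> page, long key, out long value);
}
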
As you can see, we have a byte buffer (8KB in size) and we want to add or lookup values from the buffer. All the data resides in the buffer, nothing is external. And we cannot unpack it in memory, because this is used for lookups, so this needs to be really fast.

The keys we are storing are file offsets, so they correlate quite nicely to the overall size of the file. That means you’ll have a lot of small values, but also large ones. Given a key, we need to be able to look up its value quickly, since we may run this lookup billions of times.

Given that I have 8KB of data, I can do the following: just treat the buffer as a sorted array, which gives me a pretty easy way to search for a particular value and a simple way to actually store things.

Theoretically, given an 8KB page and 16 bytes per (key, value) entry, we can store up to 512 entries per page. But it turns out that this is just a theory. We also need to keep track of the number of items that we have, and that takes some space. Just a couple of bytes, but it means that we don’t have those bytes available. A page can now contain up to 511 entries, and even at full capacity, we have 16 bytes left over: 2 for the number of entries and 14 that are simply unused.

Here is what this looks like in code:

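The code isn’t included in this extract; here is a rough sketch of how such a page could work, assuming a 2-byte entry count at the start of the page (this is my reconstruction, not the original implementation):

using System;
using System.Buffers.Binary;

public static class SortedPage
{
    private const int PageSize = 8192;
    private const int Header = sizeof(ushort); // the number of entries

    public static bool TryAdd(Span<byte> page, long key, long value)
    {
        int count = BinaryPrimitives.ReadUInt16LittleEndian(page);
        int idx = BinarySearch(page, count, key);
        if (idx < 0) // not found, we need to insert a new entry
        {
            if (Header + (count + 1) * 8 > PageSize - (count + 1) * 8)
                return false; // the keys and values would overlap: the page is full
            idx = ~idx;
            // keys grow up from the header, values grow down from the end of the page
            page.Slice(Header + idx * 8, (count - idx) * 8)
                .CopyTo(page.Slice(Header + (idx + 1) * 8));
            page.Slice(PageSize - count * 8, (count - idx) * 8)
                .CopyTo(page.Slice(PageSize - (count + 1) * 8));
            BinaryPrimitives.WriteUInt16LittleEndian(page, (ushort)(count + 1));
        }
        BinaryPrimitives.WriteInt64LittleEndian(page.Slice(Header + idx * 8), key);
        BinaryPrimitives.WriteInt64LittleEndian(page.Slice(PageSize - (idx + 1) * 8), value);
        return true;
    }

    public static bool TryGetValue(ReadOnlySpan<byte> page, long key, out long value)
    {
        int count = BinaryPrimitives.ReadUInt16LittleEndian(page);
        int idx = BinarySearch(page, count, key);
        if (idx < 0)
        {
            value = default;
            return false;
        }
        value = BinaryPrimitives.ReadInt64LittleEndian(page.Slice(PageSize - (idx + 1) * 8));
        return true;
    }

    private static int BinarySearch(ReadOnlySpan<byte> page, int count, long key)
    {
        int lo = 0, hi = count - 1;
        while (lo <= hi)
        {
            int mid = lo + ((hi - lo) >> 1);
            long cur = BinaryPrimitives.ReadInt64LittleEndian(page.Slice(Header + mid * 8));
            if (cur == key) return mid;
            if (cur < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return ~lo; // complement of the insertion point, like Array.BinarySearch
    }
}
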
As you can see, we are creating two arrays: the keys grow from the bottom of the page and the values grow from the top. The idea is that I can utilize the BinarySearch() method to quickly find the index of a key (or where it ought to go). From there, I can look at the corresponding values array to get the actual value. The fact that they grow separately (and toward each other) means that I don’t need to move as much memory if I’m getting values out of order.

For now, I want to set up the playground in which we’ll operate. The type of data that you write into such a system is important. I decided to use the following code to generate the test set:

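The generation code isn’t shown in this extract; here is a sketch of how such a test set could be produced (the exact distribution is my assumption):

using System;

public static class TestData
{
    public static long[] GenerateKeys(int count, int seed = 2023)
    {
        var random = new Random(seed);
        var keys = new long[count];
        for (int i = 0; i < count; i++)
        {
            // pick a power of two so the offsets span a wide range,
            // roughly 8MB (2^23) up to 512GB (2^39)
            int power = random.Next(23, 40);
            keys[i] = (long)(random.NextDouble() * (1L << power));
        }
        return keys;
    }
}
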
The idea is that we’ll generate a random set of numbers, in the given distribution. Most of the values are in the range of 8MB to 512GB, representing a pretty good scenario overall, I think.

And with that, we just need to figure out what metrics we want to use for this purpose. My goal is to push as many values as I can into the buffer, while maintaining the ability to get a value by its key as fast as possible.

The current approach, for example, does a binary search on a sorted array plus an extra lookup into the companion values array. You really can’t beat this if you need to store arbitrary keys. Here is my test bench:

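The test bench code isn’t included here either; a sketch of the idea, using the hypothetical SortedPage and TestData helpers from the sketches above:

using System;
using System.Collections.Generic;

public static class Bench
{
    public static int FillPage()
    {
        Span<byte> page = new byte[8192];
        var expected = new Dictionary<long, long>();
        foreach (long key in TestData.GenerateKeys(1024))
        {
            long value = key / 2;                       // arbitrary value for the test
            if (SortedPage.TryAdd(page, key, value) == false)
                break;                                  // the page is full
            expected[key] = value;                      // a duplicate key just updates the value

            // validate the whole structure after every insertion
            foreach (var (k, v) in expected)
            {
                if (SortedPage.TryGetValue(page, k, out long actual) == false || actual != v)
                    throw new InvalidOperationException($"Lookup failed for key {k}");
            }
        }
        return expected.Count;                          // distinct entries stored, not insertions
    }
}
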
This will insert key/value pairs into the page until it is full. Note that we allow duplicates (we’ll just update the value), so we need to keep track of the number of entries inserted, not just the number of insertions. We also validate the structure at every step of the way, to ensure that we always get the right behavior.

This code runs as expected and we can put 511 values into the page before it gives up. This approach works, it is simple to reason about, and it has very few flaws. It is also quite wasteful in terms of information density. I would like to do better than 511 entries per page. Is it possible to drop below 16 bytes per entry?

Give it some thought, I’m going to present several ways of doing just that in my next post…

time to read 3 min | 555 words

We looked into the internals of Corax’s posting list, and in the last post I mentioned that we have a problem with the Baseline of the page.

We are by no means the first people to work with posting lists, and there is a wide body of knowledge on the topic. When it came time to figure out the compression format for our posting list, we used the PFOR format (Patched Frame of Reference). It, like pretty much all other integer compression methods, uses 32-bit integers. Corax uses 64-bit integers as document ids, so that was a problem. We solved it by using a Baseline for each page. In other words, each page is able to contain values that are within 2.1 billion of one another. That is a very reasonable range, given that a page is 8KB in size.

A problem arose as we built more features into Corax. The posting list needed to store not just the document id, but also the frequency of the term in the source document. It turns out that we need 8 bits to do so, and we already had a 64-bit range, so… instead of creating another location to store the frequencies, we put them directly inside the posting list. But that meant that we reserved a range of bits. We have 64 bits overall, so not a big problem, right? Except that on a page basis, we have a lot less. Before, a page could cover a range of 2.1 billion, but we reserved 10 bits (frequency and tag, the details of which are not important to our story) and ended up with a range of about 4 million per page. That little tidbit meant that a page could only store items that were within 4MB of one another. And that was a problem. Whenever we had a posting list where two values were more than 4MB apart, we would need to split the page. And since the posting lists and the entries live in the same file, more page splits means that entries end up further apart.

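To make the numbers concrete, here is a sketch of the arithmetic (the exact bit layout is my assumption; the post only says that 10 bits are reserved):

public static class PostingListEntry
{
    // pack the document id together with an 8-bit frequency and a 2-bit tag
    public static long Pack(long docId, byte frequency, byte tag)
        => (docId << 10) | ((long)frequency << 2) | (long)(tag & 0x3);

    // With 32-bit deltas from the page Baseline, a page used to cover a range
    // of 2^32 (about 2.1 billion). Reserving 10 bits leaves 2^22 (about 4 million)
    // of actual id range per page.
    public const long OldRangePerPage = 1L << 32;
    public const long NewRangePerPage = 1L << 22;
}
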
Here is an example of what this looks like:

[Image: the index taking more disk space than the source data]

The index is taking more space than the source data, and most of that is used to store… nothing, since we ended up with very wide posting lists containing very few entries each. This is one of those cases where two independent issues compound each other very quickly.

So we changed things again: instead of limiting ourselves to a 32-bit range per page, we changed the PFor format to handle 64-bit integers directly. Once again, that led to a simplification of the codebase and greatly reduced the amount of disk space that we are actually using.

To give some context, here is the current amount of disk space taken by the same entity that previously took 800+GB:

[Image: the current disk space used by the data that previously took 800+GB]

The problem wasn’t the number of entries, but that each entry would consume 8KB of disk space on its own; in the image, you are seeing the number of posting lists, not the number of posting list entries.

time to read 3 min | 594 words

In a previous post (which went out a long time ago) I explained that we have the notion of a set of uint64 values that are used for document IDs. We build a B+Tree with different behaviors for branch pages and leaf pages, allowing us to pack a lot of document IDs (thousands or more) per page.

The problem is that this structure holds the data compressed, so when we add or remove a value, we don’t know if it exists already or not. That is a problem, because while we are able to do any relevant fixups to skip duplicates and erase removed values, we end up in a position where the number of entries in the set is not accurate. That is a Problem, with a capital P, since we use that for query optimizations.

The solution is to change the way we store the data in the leaf page. Instead of adding the data directly to the page and compressing when the raw values section overflows, we’ll use the following format:

[Image: the new leaf page format, holding only compressed segments]

Basically, I removed the raw values section from the design entirely. That means that whenever we want to add a new value, we need to find the relevant compressed segment inside the page and add to it (potentially creating a page split, etc).

Obviously, that is not going to perform well for writes, since on each addition we’d need to decompress the segment, add to it, and then compress it again.

The idea here is that we don’t need to do that. Instead of trying to store the entries in the set immediately, we are going to keep them in memory for the duration of the transaction. Just before we commit the transaction, we are going to have two lists of document IDs to go through: one of added documents and one of removed documents. We can then sort those ids, walk over the lists, find the relevant page for each section, and merge it with the compressed values.

By moving the buffering stage from the per-page model to the per-transaction model, we actually gain quite a lot of performance, since if we have a lot of changes to a single page, we can handle compression of the data only once. It is a very strange manner of working, to be honest, because I’m used to doing the operation immediately. By delaying the cost to the end of the transaction, we are able to gain two major benefits. First, we have a big opportunity for batching and optimizing work on large datasets. Second, we have a single code path for this operation. It’s always: “Get a batch of changes and apply them as a unit”. It turns out that this is far easier to understand and work with. And that is for the writes portion of Corax.
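A rough sketch of what that per-transaction buffering might look like (the type and method names here are mine, not the actual Corax code):

using System;
using System.Collections.Generic;

public class PostingListTxBuffer
{
    private readonly List<long> _additions = new();
    private readonly List<long> _removals = new();

    public void Add(long documentId) => _additions.Add(documentId);
    public void Remove(long documentId) => _removals.Add(documentId);

    // called once, just before the transaction commits
    public void Commit(Action<IReadOnlyList<long>, IReadOnlyList<long>> applySortedChanges)
    {
        _additions.Sort();
        _removals.Sort();
        // a single pass over the sorted ids lets us find the relevant page for each
        // run and decompress / merge / recompress every segment only once
        applySortedChanges(_additions, _removals);
    }
}
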

Remember, however, that Corax is a search engine, so we expect a lot of reads. For reads, we can now stream the results directly from the compressed segments. Given that we can usually pack a lot of numbers into a segment, and that we don’t need to compare to the uncompressed portion, that ended up benefiting us significantly on the read side as well, surprisingly.

Of course, there is also another issue: look at the Baseline in the Page Header. We’ll discuss that in the next post; it turned out that it wasn’t such a good idea.

time to read 3 min | 581 words

We care a lot about the performance of RavenDB, like a whole lot. To the point where we have a dedicated team that is continuously burning CPU cycles (and money) testing out all sorts of scenarios with RavenDB. You can see the performance page on the website for some of their results. It got to the point where we stock NVMe drives at the office because we go through them often enough that we need them available for replacement. The benchmark must run, and the numbers must rise, after all.

But the story today isn’t about the costs we pay to reach our performance goals. Rather, it is about a curious little snafu that we ran into when looking at the results. Here are the benchmark results; I intentionally stripped out everything that would give context to this story. What you can see is the same scenario being run on two identical machines, with the difference being the disk that is hosting the database.

[Image: benchmark results from the two machines, differing only in the disk used]

In this case, the blue line is io1 disk (high IOPS, low latency, and high costs) versus gp3 (reasonable IOPS, much higher latency, and lower costs). In this case, lower numbers are better.

If you look at the benchmark, it makes complete sense. RavenDB is a database product, we are running a benchmark, and we use the disk. It’s predictable that the disk latency will have an impact on the performance of the benchmark.

Except… in this case, we are looking at a benchmark that is read-only and meant to run completely from memory. We wrote it so the data size is less than the amount of memory on the system. So why does the disk have any impact at all in this case?

We even have a warmup phase before we actually start measuring, to ensure that everything is in memory. Something here does not line up.

After investigating deeper, we discovered the following:

  • When running the automated benchmark, the performance was always correlated to the disk type.
  • When running the same benchmark manually, performance was much better and there was no slowdown related to the slower disk.

That is a really annoying bug, because the fact that we are observing it somehow makes it go away? What is going on?

After a while, we finally figured it out. The problem was the warmup phase. Basically, the warmup is just running the benchmark itself, discarding the results.

Can you guess what the problem was?

The warmup phase runs when the system is cold (naturally), and we were hitting the server with enough requests up front that it was unable to process them all (they were queuing on the disk). That meant that a very large portion of the requests in the warmup would time out or just fail. When we started the benchmark phase, significant portions of the data were still on disk. So the benchmark became a disk-bound test, with predictable results.

When we ran it manually, we would always do that after the benchmark had already run, so our system would be warm (and fast, with no disk access).

The solution was to scale down the number of requests that the warmup phase sends, allowing gradual loading of the data into memory without overloading the hardware.

A case where our expectations and what really happened did not line up, creating some… interesting surprises in the end result.

time to read 1 min | 179 words

RavenDB has a public live test instance, and we have recently upgraded it to version 6.0. That means that you can start playing around with RavenDB 6.0 directly, including giving us feedback on any issues that you find.

Of particular interest, of course, is the sharding feature; it is right here:

[Image: the sharding option in the Studio]

And once enabled, you can see things in more detail:

[Image: the sharded database details in the Studio]

If we did things properly, the only thing that indicates you are running in sharded mode is:

[Image: the indicator that you are running in sharded mode]

Take a look, and let us know what you think.

As a reminder, at the top right of the page, there is the submit feedback option:

[Image: the submit feedback option at the top right of the page]

Use it, we are waiting for your insights.

time to read 2 min | 259 words

Let’s say that you have the following scenario, you have an object in your hands that is similar to this one:

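The class isn’t shown in this extract; something along these lines (my assumption, standing in for the actual third-party type):

using System;
using System.Runtime.InteropServices;

public class Frame : IDisposable   // comes from a third party, we cannot change it
{
    private IntPtr _buffer;        // unmanaged memory backing the frame

    public Frame(int size)
    {
        _buffer = Marshal.AllocHGlobal(size);
    }

    public void Dispose()
    {
        if (_buffer == IntPtr.Zero)
            return;
        Marshal.FreeHGlobal(_buffer);
        _buffer = IntPtr.Zero;
    }
}
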
It holds some unmanaged resources, so you have to dispose it.

However, this is used in the following manner:

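The calling code isn’t shown either; the shape of the problem is roughly this (again, an assumption on my part):

using System.Threading;

public class FrameHolder
{
    private Frame _current = new Frame(1024);

    // many threads read the current frame concurrently
    public Frame Current => Volatile.Read(ref _current);

    // a single thread occasionally swaps in a new frame
    public void Update(Frame newFrame)
    {
        var old = Interlocked.Exchange(ref _current, newFrame);
        old.Dispose();   // dangerous: readers may still hold a reference to 'old'
    }
}
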
What is the problem? This object may be used concurrently. In the past, the frame was never updated, so it was safe to read from it from multiple threads. Now there is a need to update the frame, but that is a problem. Even though only a single thread can update the frame, there may be other threads that hold a reference to it. That is a huge risk, since they’ll access freed memory. At best, we’ll have a crash; more likely, it will be a security issue. At this point in time, we cannot modify all the calling sites without incurring a huge cost. The Frame class comes from a third party and cannot be changed, so what can we do? Not disposing the frame would lead to a memory leak, after all.

Here is a nice trick to add a finalizer to a third party class. Here is how the code works:

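The actual code isn’t included in this extract; a sketch of the trick (the names are mine) could look like this:

using System.Runtime.CompilerServices;

public static class FrameLifetime
{
    private static readonly ConditionalWeakTable<Frame, Cleanup> _table = new();

    // instead of calling frame.Dispose() directly, tie a finalizable object to the
    // frame; the GC keeps it alive exactly as long as the frame itself is reachable
    public static void DisposeOnCollect(Frame frame)
    {
        _table.Add(frame, new Cleanup(frame));
    }

    private sealed class Cleanup
    {
        private readonly Frame _frame;
        public Cleanup(Frame frame) => _frame = frame;

        ~Cleanup()
        {
            // runs only once there are no more references to the frame
            _frame.Dispose();
        }
    }
}

The writer then calls FrameLifetime.DisposeOnCollect(old) instead of old.Dispose(), deferring the release to the GC.
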
The ConditionalWeakTable associates the lifetime of the disposer with the frame, so only when there are no more outstanding references to the frame (guaranteed by the GC) will the finalizer be called and the memory freed.

It’s not the best option, but it is a great one if you want to make minimal modifications to the code and get the right behavior out of it.

time to read 2 min | 262 words

Trevor Hunter from Kobo Rakuten is going to be speaking about Kobo’s usage of RavenDB in a webinar next Wednesday.

When I started at Kobo, we needed to look beyond the relational and into document databases. Our initial technology choice didn't work out for us in terms of reliability, performance, or flexibility, so we looked for something new and set on a journey of discovery, exploratory testing, and having fun pushing contender technologies to their limits (and breaking them!). In this talk, you'll hear about our challenges, how we evaluated the options, and our experience since widely adopting RavenDB. You'll learn about how we use it, areas we're still a bit wary of, and features we're eager to make more use of. I'll also dive into the key aspects of development - from how it affects our unit testing to the need for a "modern DBA" role on a development team.

About the speaker: Trevor Hunter: "I am a leader and coach with a knack for technology. I’m a Chief Technology Officer, a mountain biker, a husband, and a Dad. My curiosity to understand how things work and my desire to use that understanding to help others are the things I hope my kids inherit from me. I am currently the Chief Technology Officer of Rakuten Kobo. Here I lead the Research & Development organization where our mission is to deliver the best devices and the best services for our readers. We innovate, create partnerships, and deliver software, hardware, and services to millions of users worldwide."

You can register for the webinar here.

time to read 2 min | 246 words

RavenDB Sharding is now running as a production replica in our backend systems, and we are stepping up our testing in a real-world environment.

We are now also publishing nightly builds of RavenDB 6.0, including Sharding support.

[Image: the RavenDB 6.0 nightly builds]

There are some known (minor) issues in the Studio, which we are busy fixing, but it is already possible to create and manage sharded clusters. As usual, we would love your feedback.

Here are some of the new features that I’m excited about. You can see that we have an obese bucket here, far larger than all other items:

[Image: the buckets report, showing one obese bucket far larger than the others]

And we can drill down and find why:

[Image: drilling down into the obese bucket]

We have quite a few revisions of a pretty big document, and they are all obviously under a single bucket.

You can now also get query timings information across the entire sharded cluster, like so:

[Image: query timings across the sharded cluster]

And as you can see, this will give you a detailed view of exactly where the costs are.

You can get the nightly build and start testing it right now. We would love to hear from you, especially as you test the newest features.

time to read 9 min | 1605 words

It’s very common to model your backend API as a set of endpoints that mirror your internal data model. For example, consider a blog engine, which may have:

  • GET /users/{id}: retrieves information about a specific user, where {id} is the ID of the user
  • GET /users/{id}/posts: retrieves a list of all posts made by a specific user, where {id} is the ID of the user
  • POST /users/{id}/posts: creates a new post for a specific user, where {id} is the ID of the user
  • GET /posts/{id}/comments: retrieves a list of all comments for a specific post, where {id} is the ID of the post
  • POST /posts/{id}/comments: creates a new comment for a specific post, where {id} is the ID of the post

This mirrors the internal structure pretty closely, and it is very likely that you’ll end up with an API similar to this if you start writing a blog backend. It represents the usual set of operations very clearly and easily.

The problem is that the blog example is so attractive because it is inherently limited. There isn’t really that much going on in a blog from a data modeling perspective. Let’s consider a restaurant and what its API would look like:

  • GET /menu: Retrieves the restaurant's menu
  • POST /orders: Creates a new order
  • POST /orders/{order_id}/items: Adds items to an existing order
  • POST /payments: Allows the customer to pay their bill using a credit card

This looks okay, right?

We sit at a table, grab the menu and start ordering. From a REST perspective, we need to take into account that multiple users may add items to the same order concurrently.

That matters, because we may have bundles to take into account. John ordered the salad & juice and Jane the omelet, and Derek just got coffee. But coffee is already included in Jane’s order, so no separate charge for that. Here is what this will look like:

 ┌────┐┌────┐┌─────┐┌──────────────────────┐
 │John││Jane││Derek││POST /orders/234/items│
 └─┬──┘└─┬──┘└──┬──┘└─────┬────────────────┘
   │     │      │         │       
   │    Salad & Juice     │       
   │─────────────────────>│       
   │     │      │         │       
   │     │     Omelet     │       
   │     │───────────────>│       
   │     │      │         │       
   │     │      │ Coffee  │       
   │     │      │────────>│       

The actual record we have in the end, on the other hand, looks like:

  • Salad & Juice
  • Omelet & Coffee

In this case, we want the concurrent nature of separate requests, since each user will be ordering at the same time, but the end result should be the final tally, not just an aggregation of the separate totals.

In the same sense, how would we handle payments? Can we do that in the same manner?

 ┌────┐┌────┐┌─────┐┌──────────────────┐
 │John││Jane││Derek││POST /payments/234│
 └─┬──┘└─┬──┘└──┬──┘└────────┬─────────┘
   │     │      │            │          
   │     │     $10           │          
   │────────────────────────>│          
   │     │      │            │          
   │     │      │ $10        │          
   │     │──────────────────>│          
   │     │      │            │          
   │     │      │    $10     │          
   │     │      │───────────>│  

In this case, however, we are in a very different state. What happens in this scenario if one of those charges is declined? What happens if they pay too much? What happens if there is a concurrent request to add an item to the order while the payment is underway?

When you have separate operations, you have to manage all of that somehow. Maybe with a distributed transaction coordinator, or by trusting the operator, or by dumb luck (for a while). But this is actually an incredibly complex topic. And a lot of that isn’t inherent to the problem itself, but rather to how we modeled the interaction with the server.

Here is the life cycle of an order:

  • POST /orders: Creates a new order – returns the new order id
  • ** POST /orders/{order_id}/items: Adds or removes items on an existing order
  • ** POST /orders/{order_id}/submit: Submits all pending order items to the kitchen
  • POST /orders/{order_id}/bill: Close the order, compute the total charge
  • POST /payments/{order_id}: Handle the actual payment (or payments)

I have marked with ** the two endpoints that may be called multiple times. Everything else can only be called once.
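To make the shape of this concrete, here is a minimal sketch of those endpoints as an ASP.NET Core minimal API (OrderService, ItemChange, and PaymentRequest are hypothetical types, here only to show the structure of the calls):

// OrderService, ItemChange and PaymentRequest are hypothetical types for illustration
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<OrderService>();
var app = builder.Build();

// creates a new order and returns its id
app.MapPost("/orders", (OrderService orders) =>
    Results.Ok(new { order_id = orders.Create() }));

// may be called many times, concurrently, while the diners are ordering
app.MapPost("/orders/{orderId}/items", (string orderId, ItemChange[] changes, OrderService orders) =>
{
    orders.ApplyItemChanges(orderId, changes);
    return Results.Accepted();
});

// boundary: everything pending at this point goes to the kitchen as a single unit
app.MapPost("/orders/{orderId}/submit", (string orderId, OrderService orders) =>
    Results.Ok(orders.SubmitPending(orderId)));

// closes the order and computes bundles, discounts and the final total
app.MapPost("/orders/{orderId}/bill", (string orderId, OrderService orders) =>
    Results.Ok(orders.CloseAndComputeBill(orderId)));

// a single call that carries all the payments for the order
app.MapPost("/payments/{orderId}", (string orderId, PaymentRequest payment, OrderService orders) =>
    Results.Ok(orders.HandlePayments(orderId, payment)));

app.Run();
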

Consider the transactional behavior around this sort of interaction. Adding / removing items from the order can be done concurrently. But submitting the pending orders to the kitchen is a boundary, a concurrent item addition would either be included (if it happened before the submission) or not (and then it will just be added to the pending items).

We are also not going to make any determination on the offers / options that were selected by the diners until they actually move to the payment portion. Even the payment itself is handled via two interactions. First, we ask to get the bill for the order. This is the point where we’ll compute the order, figuring out what bundles, discounts, etc. we have. The result of that call is the final tally. Second, we have the call to actually handle the payment. Note that this is one call, and the idea is that its content is going to be something like the following:

{
  "order_id": "789",
  "total": 30.0,
  "payments": [
    {
      "amount": 15.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-3456",
      "expiration_date": "12/22",
      "cvv": "123"
    },
    {
      "amount": 10.0,
      "payment_method": "cash"
    },
    {
      "amount": 5.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-5678",
      "expiration_date": "12/23",
      "cvv": "456"
    }
  ]
}

The idea is that by submitting it all at once, we are removing a lot of complexity from the backend. We don’t need to worry about complex interactions, race conditions, etc. We can deal with just the issue of handling the payment, which is complicated enough on its own, no need to borrow trouble.

Consider the case that the second credit card fails the charge. What do we do then? We already charged the previous one, and we don’t want to issue a refund, naturally. The result here is a partial error, meaning that there will be a secondary submission to handle the remainder payment.

From an architectural perspective, it makes the system a lot simpler to deal with, since you have well-defined scopes. I probably made it more complex than I should have, to be honest. We can simply make the entire process serial and forbid actual concurrency throughout the process. If we are dealing with humans, that is easy enough, since the latencies involved are short enough that they won’t be noticed. But I wanted to add the bit about making a part of the process fully concurrent, to deal with the more complex scenarios.

In truth, we haven’t made a big change in the system, we simply restructured the set of calls and the way you interact with the backend. But the end result is that the amount of code and complexity that you have to juggle for your infrastructure needs is far smaller. On real-world systems, that also has the impact of reducing your latencies, because you are aggregating multiple operations and submitting them as a single shot. Things also become easier on the backend, because you don’t need complex transaction coordination or distributed locking.

It is a simple change, on its face, but it has profound implications.
