Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,408 | Comments: 50,842

Privacy Policy Terms
filter by tags archive
time to read 2 min | 259 words

Let’s say that you have the following scenario, you have an object in your hands that is similar to this one:

It holds some unmanaged resources, so you have to dispose it.

However, this is used in the following manner:

What is the problem? This object may be used concurrently. In the past, the frame was never updated, so it was safe to read from it from multiple threads. Now there is a need to update the frame, but that is a problem. Even though only a single thread can update the frame, there may be other threads that hold a reference to it. That is a huge risk, since they’ll access freed memory. At best, we’ll have a crash, more likely it will be a security issue. At this point in time, we cannot modify all the calling sites without incurring a huge cost. The Frame class is coming from a third party and cannot be changed, so what can we do? Not disposing the frame would lead to a memory leak, after all.

Here is a nice trick to add a finalizer to a third party class. Here is how the code works:

The ConditionalWeakTable associates the lifetime of the disposer with the frame, so only where there are no more outstanding references to the frame (guaranteed by the GC), the finalizer will be called and the memory will be freed.

It’s not the best option, but it is a great one if you want to make minimal modifications to the code and get the right behavior out of it.

time to read 9 min | 1605 words

It’s very common to model your backend API as a set of endpoints that mirror your internal data model. For example, consider a blog engine, which may have:

  • GET /users/{id}: retrieves information about a specific user, where {id} is the ID of the user
  • GET /users/{id}/posts: retrieves a list of all posts made by a specific user, where {id} is the ID of the user
  • POST /users/{id}/posts: creates a new post for a specific user, where {id} is the ID of the user
  • GET /posts/{id}/comments: retrieves a list of all comments for a specific post, where {id} is the ID of the post
  • POST /posts/{id}/comments: creates a new comment for a specific post, where {id} is the ID of the post

This mirrors the internal structure pretty closely, and it is very likely that you’ll get to an API similar to this if you’ll start writing a blog backend. This represents the usual set of operations very clearly and easily.

The problem is that the blog example is so attractive because it is inherently limited. There isn’t really that much going on in a blog from a data modeling perspective. Let’s consider a restaurant and how its API would look like:

  • GET /menu: Retrieves the restaurant's menu
  • POST /orders: Creates a new order
  • POST /orders/{order_id}/items: Adds items to an existing order
  • POST /payments: Allows the customer to pay their bill using a credit card

This looks okay, right?

We sit at a table, grab the menu and start ordering. From REST perspective, we need to take into account that multiple users may add items to the same order concurrently.

That matters, because we may have bundles to take into account. John ordered the salad & juice and Jane the omelet, and Derek just got coffee. But coffee is already included in Jane’s order, so no separate charge for that. Here is what this will look like:

 ┌────┐┌────┐┌─────┐┌──────────────────────┐
 │John││Jane││Derek││POST /orders/234/items│
 └─┬──┘└─┬──┘└──┬──┘└─────┬────────────────┘
   │     │      │         │       
   │    Salad & Juice     │       
   │─────────────────────>│       
   │     │      │         │       
   │     │     Omelet     │       
   │     │───────────────>│       
   │     │      │         │       
   │     │      │ Coffee  │       
   │     │      │────────>│       

The actual record we have in the end, on the other hand, looks like:

  • Salad & Juice
  • Omelet & Coffee

In this case, we want the concurrent nature of separate requests, since each user will be ordering at the same time, but the end result should be the final tally, not just an aggregation of the separate totals.

In the same sense, how would we handle payments? Can we do that in the same manner?

 ┌────┐┌────┐┌─────┐┌──────────────────┐
 │John││Jane││Derek││POST /payments/234│
 └─┬──┘└─┬──┘└──┬──┘└────────┬─────────┘
   │     │      │            │          
   │     │     $10           │          
   │────────────────────────>│          
   │     │      │            │          
   │     │      │ $10        │          
   │     │──────────────────>│          
   │     │      │            │          
   │     │      │    $10     │          
   │     │      │───────────>│  

In this case, however, we are in a very different state. What happens in this scenario if one of those charges were declined? What happens if they put too much. What happens if there is a concurrent request to add an item to the order while the payment is underway?

When you have separate operations, you have to somehow manage all of that. Maybe a distributed transaction coordinator or by trusting the operator or by dumb luck, for a while. But this is actually an incredibly complex topic. And a lot of that isn’t inherent to the problem itself, but instead about how we modeled the interaction with the server.

Here is the life cycle of an order:

  • POST /orders: Creates a new order – returns the new order id
  • ** POST /orders/{order_id}/items: Adds / removes items to an existing order
  • ** POST /orders/{order_id}/submit: Submits all pending order items to the kitchen
  • POST /orders/{order_id}/bill: Close the order, compute the total charge
  • POST /payments/{order_id}: Handle the actual payment (or payments)

I have marked with ** the two endpoints that may be called multiple times. Everything else can only be called once.

Consider the transactional behavior around this sort of interaction. Adding / removing items from the order can be done concurrently. But submitting the pending orders to the kitchen is a boundary, a concurrent item addition would either be included (if it happened before the submission) or not (and then it will just be added to the pending items).

We are also not going to make any determination on the offers / options that were selected by the diners until they actually move to the payment portion. Even the payment itself is handled via two interactions. First, we ask to get the bill for the order. This is the point when we’ll compute orders, and figure out what bundles, discounts, etc we have. The result of that call is the final tally. Second, we have the call to actually handle the payment. Note that this is one call, and the idea is that the content of this is going to be something like the following:

{
  "order_id": "789",
  "total": 30.0,
  "payments": [
    {
      "amount": 15.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-3456",
      "expiration_date": "12/22",
      "cvv": "123"
    },
    { 
        "amount": 10.0, 
        "payment_method": "cash" },
    {
      "amount": 5.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-5678",
      "expiration_date": "12/23",
      "cvv": "456"
    }
  ]
}

The idea is that by submitting it all at once, we are removing a lot of complexity from the backend. We don’t need to worry about complex interactions, race conditions, etc. We can deal with just the issue of handling the payment, which is complicated enough on its own, no need to borrow trouble.

Consider the case that the second credit card fails the charge. What do we do then? We already charged the previous one, and we don’t want to issue a refund, naturally. The result here is a partial error, meaning that there will be a secondary submission to handle the remainder payment.

From an architectural perspective, it makes the system a lot simpler to deal with, since you have well-defined scopes. I probably made it more complex than I should have, to be honest. We can simply make the entire process serial and forbid actual concurrency throughout the process. If we are dealing with humans, that is easy enough, since the latencies involved are short enough that they won’t be noticed. But I wanted to add the bit about making a part of the process fully concurrent, to deal with the more complex scenarios.

In truth, we haven’t done a big change in the system, we simply restructured the set of calls and the way you interact with the backend. But the end result of that is the amount of code and complexity that you have to juggle for your infrastructure needs are far more lightweight. On real-world systems, that also has the impact of reducing your latencies, because you are aggregating multiple operations and submitting them as a single shot. The backend will also make things easier, because you don’t need complex transaction coordination or distributed locking.

It is a simple change, on its face, but it has profound implications.

time to read 1 min | 108 words

The recording of my webinar showing off the new Sharding feature in RavenDB 6.0 is now live. I’m showcasing the new technology preview of RavenDB 6.0 and we have a nightly build already available for it. I think the webinar was really good, and I’m super excited to discuss all the ins & out of how this works.

Please take a look, and take the software for a spin. We have managed to get sharding down to a simple, obvious and clear process. And we are very much looking for your feedback.

time to read 2 min | 221 words

imageThis Wednesday I’m going to be doing a webinar about RavenDB & Sharding. This is going to be the flagship feature for RavenDB 6.0 and I’m really excited to be talking about it in public finally.

Sharding involves splitting your data into multiple nodes. Similar to having different volumes of a single encyclopedia.

RavenDB’s sharding implementation is something that we have spent the past three or four years working on. That has been quite a saga to get it out. The primary issue is that we want to achieve two competing goals:

  • Allow you to scale the amount of data you have to near infinite levels.
  • Ensure that RavenDB remains simple to use and operate.

The first goal is actually fairly easy and straightforward. It is the second part that made things complicated. After a lot of work, I believe that we have a really good solution at hand.

In the webinar, I’m going to be presenting how RavenDB 6.0 implements sharding, the behavior of the system at scale, and all the details you need to know about how it works under the cover.

I’m really excited to finally be able to show off the great work of the team! Join me, it’s going to be really interesting.

time to read 4 min | 749 words

incaseofemergency

Preconditions, postconditions, and invariants, oh my!

The old adage about Garbage In, Garbage Out is a really important aspect of our profession. If you try to do things that don’t make sense, the output will be nonsensical.

On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

~Charles Babbage – Inventor of the first computer

As you can see, the issue isn’t a new one. And there are many ways to deal with that. You should check your inputs, assume they are hostile, double check on every layer, etc.

Those are the principles of sound programming design, after all.

This post is about a different topic. When everything is running smoothly, you want to reject invalid operations and dangerous actions. The problem is when everything is hosed.

The concept of emergency operations is something that should be a core part of the design, because emergencies happen, and you don’t want to try to carve new paths in emergencies.

Let’s consider a scenario such as when the root certificate has expired, which means that there is no authentication. You cannot authenticate to the servers, because the auth certificate you use has also expired. You need to have physical access, but the data center won’t let you in, since you cannot authenticate.

Surely that is fiction, right? Happened last year to Facebook (bad IP configuration, not certs, but same behavior).

An important aspect of good design is to consider what you’ll do in the really bad scenarios. How do you recover from such a scenario?

For complex systems, it’s very easy to get to the point where you have cross dependencies. For example, your auth service relies on the database cluster, which uses the auth service for authentication. If both services are down at the same time, you cannot bring them up.

Part of the design of good software is building the emergency paths. When the system breaks, do you have a well-defined operation that you can take to recover?

A great example of that is fire doors in buildings. They are usually alarmed and open to the outside world only, preventing their regular use. But in an emergency, they allow the quick evacuation of a building safely, instead of creating a chokepoint.

We recently got into a discussion internally about a particular feature in RavenDB (modifying the database topology). There are various operations that you shouldn’t be able to make, because they are dangerous. They are also the sort of things that allow you to recover from disaster. We ended up creating two endpoints for this feature. One that included checks and verification. The second one is an admin-only endpoint that is explicitly meant for the “I know what I mean” scenario.

RavenDB actually has quite a bit of those scenarios. For example, you can authenticate to RavenDB using a certificate, or if you have a root access on the machine, you can use the OS authentication mechanism instead. We had scenarios where users lost their certificates and were able to use the alternative mechanism instead to recover.

Making sure to design those emergency pathways ahead of time means that you get to do that with a calm mind and consider more options. It also means that you get to verify that your emergency mechanism doesn’t hinder normal operations. For example, the alarmed fire door. Or in the case of RavenDB, relying on the operating system permissions as a backup if you are already running as a root user on the machine.

Having those procedures ahead of time, documented and verified, ends up being really important at crisis time. You don’t need to stumble in the dark or come up with new ways to do things on the fly. This is especially important since you cannot assume that the usual invariants are in place.

Note that this is something that is very easy to miss, after all, you spend a lot of time designing and building those features, never to use them (hopefully). The answer to that is that you also install sprinklers and fire alarms with the express hope & intent to never use them in practice.

The amusing part of this is that we call this: Making sure this areup to code.

You need to ensure that your product and code are up to code.

time to read 2 min | 337 words

A database indexing strategy is a core part of achieving good performance. About 99.9% of all developers have a story where adding an index to a particular query cut the runtime from seconds or minutes to milliseconds. That percentage is 100% for DBAs, but the query was cut from hours or days to milliseconds.

The appropriate indexing strategy is often a fairly complex balancing act between multiple competing needs. More indexing means more I/O and cost on writes, but faster reads. RavenDB has a query optimizer engine that will analyze your queries and generate the appropriate set of indexes on the fly, without you needing to think much about it. That means that RavenDB will continuously respond to your operational environment and changes in it. The end result is an optimal indexing strategy at all times.

This automatic behavior applies only to automatic indexes, however. RavenDB also allows you to define your own indexes and many customers run critical business logic in those indexes. RavenDB now has a feature that aims to help you manage/organize your indexes by detecting redundant definitions & unqueried indexes which can be removed or merged.

The index cleanup feature is now exposed in the Studio (since build 5.4.5):

image

When you select it, the Studio will show you the following options:

image

You can see that RavenDB detected that two indexes can be merged into a single one, and additionally there are some indexes that haven’t been used in a while or have been completely superseded by other indexes.

RavenDB will even go ahead and suggest the merged index for you:

image

The idea is to leverage RavenDB’s smarts so you won’t have to spend too much time thinking about index optimization and can focus on the real value-added portions of your system.

time to read 2 min | 329 words

When you search for some text in RavenDB, you’ll use case insensitive search by default. This means that when you run this query:

image

You’ll get users with any capitalization of “Oren”. You can ask RavenDB to do a case sensitive search, like so:

image

In this case, you’ll find only exact matches, including casing.  So far, that isn’t really surprising, right?

Under what conditions will you need to do searches like that? Well, it is usually when the data itself is case sensitive. User names on Unix are a good example of that, but you may also have Base64 data (where case matters), product keys, etc.

What is interesting is that this is a property of the field, usually.

Now, how does RavenDB handles this scenario? One option would be to index the data as is and compare it using a case insensitive comparator. That ends up being quite expensive, usually. It’s cheaper by far to normalize the text and compare it using ordinals.

The exact() method tells us how the field is supposed to be treated. This is done at indexing time. If we want to be able to query using both case-sensitive and case-insensitive manner, we need to have two fields. Here is what this looks like:

image

We indexed the name field twice, marking it as case sensitive for the second index field.

Here is what actually happens behind the scenes because of this configuration:

image

 

The analyzer used determines the terms that are generated per index field. The first index field (Name) is using the default LowerCaseKeywordAnalyzer analyzer, while the second index field (ExactName) is using the default exact KeywordAnalyzer analyzer.

time to read 5 min | 998 words

A long while ago, I was involved in a project that dealt with elder home care. The company I was working for was contracted by the government to send care workers to the homes of elderly people to assist in routine tasks. Depending on the condition of the patient in question, they would spend anything from 2 – 8 hours a day a few days a week with the patient. Some caretakers would have multiple patients that they would be in charge of.

There were a lot of regulations and details that we had to get right. A patient may be allowed a certain amount of hours a week by the government, but may also pay for additional hours from their own pocket, etc. Quite interesting to work on, but not the topic of this post.

A key aspect of the system was actually paying the workers, of course. The problem was the way it was handled. At the end of each month, they would designate “office hours” for the care workers to come and submit their time sheets. Those had to be signed by the patients or their family members and were submitted in person at the office. Based on the number of hours worked, the workers would be paid.

The time sheets would look something like this:

image

A single worker would typically have 4 – 8 of those, and the office would spend a considerable amount of time deciphering those, computing the total hours and handing over payment.

The idea of the system I was working on was to automate all of that. Here is more or less what I needed to do:

image

For each one of the timesheet entries, the office would need to input the start & end times, who the care worker took care of, whether the time was approved (actually, whether this was paid by the government, privately, by the company, etc), and whether it was counted as overtime, etc.

The rules for all of those were… interesting, but we got it working and tested. And then we got to talk with the actual users.

They took one look at the user interface they had to work with and absolutely rebelled. The underlying issue was that during the end of the month period, each branch would need to handle hundreds of care workers, typically within four to six hours. They didn’t have the time to do that (pun intended). Their current process was to review the submitted time sheet with the care worker, stamp it with approved, and put just the total hours worked into the payroll system.

imageHaving to spend so much time on data entry was horrendous for them, but the company really wanted to have that level of granularity, to be able to properly track how many hours were worked and handle its own billing more easily.

Many of the care workers were also… non-technical, and the timesheet had to be approved by the patient or family worker. Having a signed piece of paper was easy to handle, trying to get them to go on a website and enter hours was a non-starter. That was also before everyone had a smartphone (and even today I would assume that would be difficult in that line of business).

As an additional twist, it turns out that the manual process allowed the office employees to better manage the care workers. For example, they may give them a 0.5 / hour adjustment for a particular month to manage a difficult client or deal with some specific issue, or approve (at their discretion) overtime pay when it wasn’t quite "proper” to do so.

One of the reasons that the company wanted to move to a modern system was to avoid this sort of “initiatives”, but they turned out to be actually quite  important for the proper management of the care workers and actually getting things done. For an additional twist, they had multiple branches, and each of those had a different methodology of how that was handled, and all of them were different from what HQ thought it should be.

The process turned out to be even more complex than we initially got, because there was a lot of flexibility in the system that was  actually crucial for the proper running of the business.

If I recall properly, we ended up having a two-stage approach. During the end of the month rush, they would fill in the gross hours and payment details. That part of the system was intentionally simplified to the point where we did almost no data validation and trusted them to put the right values.

After payroll was processed, they had to actually put in all those hours and we would run a check on the expected amount / validation vs. what was actually paid. That gave the company the insight into what was going on that they needed, the office people were able to keep on handling things the way they were used to, and discrepancies could be raised and handled more easily.

Being there in the “I’m just a techie” role, so I got to sit on the sidelines and see tug-of-war between the different groups. It was quite interesting to see, I described the process above in a bit of a chaotic manner, but there were proper procedures and processes in the offices. They were just something that the company HQ never even realized was there.

It also taught me quite a lot about the importance  of accepting “invalid” data. In many cases, you’ll see computerized solutions flat out refuse to accept values that they consider to be wrong. The problem is that often, you need to record reality, which may not agree with validation rules on your systems. And in most cases, reality wins.

time to read 4 min | 680 words

I mentioned earlier that B+Trees are a gnarly beast to implement properly. On the face of it, this is a really strange statement, because they are a pretty simple data structure. What is so complex about the implementation? You have a fixed size page, you add to it until it is full, then you split the page, and you are done. What’s the hassle?

Here is a simple scenario for page splits, the following page is completely full. We cannot fit another entry there:

image

Now, if we try to add another item to the tree, we’ll need to split the page, and the result will be something like this (we add an entry with a key: users/050):

image

How did we split the page? The code for that  is really simple:

As you can see, since the data is sorted, we can simply take the last half of the entries from the source, copy them to the new page and call it a day. This is simple, effective, and will usually work just fine. The key word here is usually.

Given a B+Tree that uses variable size keys, with a page size of 4KB and a maximum size of 1 KB for the keys. On the face of it, this looks like a pretty good setup. If we split the page, we can be sure that we’ll have enough space to accommodate any valid key, right? Well, just as long as the data distribution makes sense. It often does not. Let’s talk about a concrete scenario, shall we? We store in the B+Tree a list of public keys.

This looks like the image below, where we have a single page with 16 entries and 3,938 bytes in use, and 158 bytes that are free. Take a look at the data for a moment, and you’ll notice some interesting patterns.

image

The data is divided into two distinct types, EdDSA keys and RSA keys. Because they are prefixed with their type, all the EdDSA keys are first on the page, and the RSA keys are last. There is a big size difference between the two types of keys. And that turns out to be a real problem for us.

Consider what will happen when we want to insert a new key to this page. We still have room to a few more EdDSA keys, so that isn’t really that interesting, but what happens when we want to insert a new RSA key? There is not enough room here, so we split the page. Using the algorithm above, we get the following tree structure post split:

image

Remember, we need to add an RSA key, so we are now going to go to the bottom right page and try to add the value. But there is not enough room to add a bit more than 512 bytes to the page, is there?

What happens next depends on the exact implementation. It is possible that you’ll get an error, or another split, or the tree will attempt to proceed and do something completely bizarre.

The key here (pun intended) is that even though the situation looks relatively simple, a perfectly reasonable choice can hide a pretty subtle bug for a very long time. It is only when you hit the exact problematic circumstances that you’ll run into problems.

This has been a fairly simple problem, but there are many such edge cases that may be hiding in the weeds of B+Tree implementations. that is one of the reasons that working with production data is such a big issue. Real world data is messy, it has unpredictable patterns and stuff that you’ll likely never think of. It is also the best way I have found to smoke out those details.

time to read 5 min | 861 words

I love B+Trees, but they can be gnarly beasts, with the number of edge cases that you can run into. Today’s story is about a known difficult place, page splitting in the tree. Consider the following B+Tree, showing a three-level tree with 3 elements on each page.

image

Consider what will happen when we want to insert a new value to the tree, the value: 27. Given the current state of the tree, that should go on the page marked in red:

image

But there is no place for the new value on this page, so we have to split it. The tree will then look like so, we split the page and now we need to add the new page to the parent, but that one also doesn’t have room for it:

image

So we are now in a multi-level split process. Let’s see what this looks like when we go up the tree. This is the final state of the tree when we are done doing all the splits:

image

The reason for all of this is that we need to add 27 to the tree, and we haven’t done that yet. At this stage, we got the tree back in order and we can safely add the new value to the tree, since we made sure we have enough space.

However, note that the exact same process would apply if we were adding 27 or 29. The page that we’ll add them to, however, is different.

This can be quite complex to keep track of, because of the recursive nature of the process. In code, this looks something like this:

I am skipping on some details, but that is the gist of it. So we do the split (recursively if needed) and then after we wired the parent page properly, we find the right location for the new value.

An important aspect here is the cursor. We use that to mark our current location in the tree, so the cursor will always contain all the parent pages that we are currently searching upon. A lot of the work that we are doing in the tree is related to the cursor.

Now, look at the code and consider the behavior of this code when we insert the value 29. It will correctly generate this page:

image

However.. what happens if we’ll insert 27?

Well, when we split the page, we went up the tree. And then we had another split, and then we went down another branch. So as written, the result would be adding the 27 to the same page as we would the 29. This would look like this:

image

Look at the red markers. We put entry 27 on the wrong page.

Fixing this issue is actually pretty hard, because we need to keep track of the values as we go up and down the tree. For fun, imagine what happens in this exact scenario, but when you have 6 levels in the tree and you end up in a completely different location in the tree.

I spent a lot of time struggling with this issue, including getting help from some pretty talented people, and the final conclusion we got was “it’s complicated”.

I don’t want complications here, I need it to be as simple as possible, otherwise, we can’t make any sort of sense here. I kept spinning more and more complex systems to resolve this, when I realized that I just looked at the problem in the wrong manner all along.

The issue was that I was trying to add the new value to the tree after I sorted out the structure of the tree, but there was actually nothing that forced me to do that. Given that I already split the page at this stage, I know that I have sufficient space to add the key without doing anything else.  I can first add the key to the right page, then write the split page back to the tree. In this case, I don’t need to do any sort of backtracking or state management .

Here is what this looks like:

And with this change, the entire class of problems related to the tree structure just went away.

I’m very happy with this result, even if it is a bit ironic. Like the problem at hand, a lot of the complexity was there because I had to backtrack the implementation decisions and go on a new path to solve this.

Also, I just checked, the portion that controls page splits inside Voron has had roughly 1 change a year for the past 5 years. Given our scope and usage, that means that it has been incredibly stable in the face of everything that we could throw at it.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Recording (8):
    17 Feb 2023 - RavenDB Usage Patterns
  2. Production postmortem (48):
    27 Jan 2023 - The server ate all my memory
  3. Answer (12):
    05 Jan 2023 - what does this code print?
  4. Challenge (71):
    04 Jan 2023 - what does this code print?
  5. RavenDB Indexing (2):
    20 Oct 2022 - exact()
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats