Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,273 | Comments: 50,512

Privacy Policy Terms
filter by tags archive
time to read 1 min | 144 words

I asked about a slow piece of code and why it is so slow. The answer is pretty simple, I messed up, take a look at this piece of code:

image

When the Value is an int, I’m creating a span from the address of a long variable (initialized to zero). That means that I have a lot of hash collisions, which means that adding to the dictionary turned into a sequential operation, which means that the actual cost here is O(N**2).

Interestingly enough, you can’t write this code without the use of unsafe. I actually didn’t know that the scope of the variable was even valid here to have its address taken. That was very hard to debug, because the problem was hidden away somewhere very far from where we looked at.

time to read 1 min | 81 words

The following code takes a long time to run. In fact, I’m writing this blog post while this is running, and I’m not sure how long that will take.

Update: This took over a minute to complete on my machine (which is pretty nice).

The killer is that this code is correct, it does the right thing, but it can be slow. I stripped a much larger scenario to ~50 lines of code, can you see what is going on? And why?

time to read 2 min | 239 words

After presenting the issue of how to return items to the array pool without creating a use after free bug, I asked you how you would fix that. There are several ways to try to do that, you can use reference counting scheme, or try to use locking, etc. All of those options are expensive, what is worse, they are expensive on a routine basis, not just for the free the buffer code path.

Instead, I changed the way we are handling returning the buffer. Take a look at the following code:

This may require some explanation. I’m using a ConditionaWeakTable here, that was added to the runtime to enable dynamic properties on objects. Basically, it creates a table that you can lookup by an object to get a key. The most important feature is that the runtime ensures that the associated reference lifetime match the key object lifetime. In other words, when we add the buffer in the eviction callback, we ensure that the ReturnBuffer we register will live at least as long as the buffer.

That means that we can let the GC do the verification job. We’ll now return the buffer back to the pool only after the GC has ensured that there are no outstanding references to it. Not a lot of code, and an elegant solution. This also ensures that we are only paying the code on eviction (likely rare), and not all the time.

time to read 2 min | 301 words

In my previous post, I discussed a bug that brought up in code review, that bug made me go into near panic mode. Here is the issue:

In order to understand this bug, you have to take into account multiple concurrent threads at the same time. Look at the ComputeHashAndPutInCache() method, where we register an eviction callback for the item in the cache. When we evict the item, we return the buffer to the buffer pool.

We want to avoid allocating memory, so that is certainly something desirable, no? However, consider what happens if I have a thread in ComputeHash(), getting a value from the cache. Before I have a chance to look at the value, however, the cache will decide to evict it. At which point the eviction callback will run.

We’ll return the buffer back to the buffer pool, where it may be used again by something else. I am also using this buffer to do other things from the caller of the ComputeHash() call. This is a somewhat convoluted use after free issue, basically.

And I find is super scary bug because of its affects:

  • Randomly, and rarely, some buffer will contain the wrong data, leading to wrong results (but hard to track it down).
  • Trying to find such a bug after the fact, however, is nearly impossible.
  • Most of the debugging techniques (repeating the operation for a particular value) will make it go away (the cache will keep the value and not evict it).

In short, this is a recipe for a really nasty debug session and an impossible to resolve bug. All from code that looks very much like an innocent bystander.

Now, I can obviously fix it by not using the array pool, but that may cause me to allocate more memory than I should. How would you approach fixing this issue?

time to read 1 min | 148 words

I was writing a fairly low level piece of code recently, this is deep in the guts of how RavenDB handles query. I’m adding a cache for some processing inside of RavenDB and the performance benefits are amazing (three orders of magnitude better). As you can imagine, this is the sort of things that I would really like to get into the main codebase. So I did all the usual taxes, created a pull request and moved to other things. Part of our process is to review all pull requests by another pair of eyes. And while there was some minor comments about the code, there was one comment asking about a particular line that had be break out in cold sweat.

I created a simpler reproduction for discussion purposes, here is the code:

Take a look, and let’s me know if you see the bug and its consequences.

time to read 8 min | 1557 words

Referencing counting is probably the oldest memory management technique in existence. It is widely used, easily to understand and explain and in most cases does the Right Thing.

There are edge cases and nasty scenarios, but for the most part, it works. At least, as long as you are running as a single threaded program. Here is what a reference counting scheme looks like:

However, the moment that we have any form of concurrency, the simplicity goes right out the window. Consider the call to add_ref vs. release. Both of them need a reference to a valid object. However, what happens if we have the following sequence of events?

  • We have an object whose reference count is set to 1.
  • Thread  #1 gets the address of the object.
  • Thread #2 now calls to release();
    • There are no more references to the object and it is freed.
  • Thread #1 now calls to add_ref();
    • It is passing the address of the recently released object, meaning that we have undefined behavior.

This is a hard problem, because we need to keep the reference count somewhere, and we also need to release the resource as soon as there are no more references to it.

The more generic term for this issue is the ABA problem.

In literature, there are a lot of attempts to solve this issue:

  • Hazzard pointers
  • GC
  • Epochs

They are complex solution to the problem, but mostly because we have two distinct steps here. First we get a pointer to the ref counted value, then we need to change that, but we need to do that safely with concurrent releases. One way of ensuring that this works is to take a lock around the entire operation (acquire the pointer & add ref) and take the same lock for release. That is a somewhat heavy weight approach.

It is also pretty much completely redundant on any modern system. Any x86/x64 CPU released in the past decade will have support this assembly instruction:

image

The cmpxchg16b instruction allow us to do an atomic operation on a value that is 128 bits long (16 bytes). That is important, because that means that we can break apart the stages above. We can store both the pointer and its counter in a single location, and operate on them atomically.

This is called the DCAS (double compare and swap), which greatly simplify the problem. To the point where there is really no reason to want to use anything else.

Except… if you care about ARM systems. There is no comparable instruction to this on ARM, and given how common those machines are, that seem to point us right back to the complexities of hazard pointers and epochs. Of course, ARM has 64 bits atomic instructions, as you can see here, for example:

image

But our pointer is also 64 bits, so that doesn’t really help us that much, does it? Interesting tidbit here, however, the pointers we use aren’t actually 64 bits in size. They are just 48 bits, in truth. That is for both x64 and for AArch64. That means that you can target only a maximum of 256TB of RAM, but there are no machines that big right now (the biggest that I’m aware of are 10% of this size and are expensive). Given that this is a CPU limit, we can probably assume that this isn’t going away soon and that when it does, ARM will likely have a 128 bits atomic instruction.

That means that a 64 bits instruction can give us a 16 bits of free space to work with. But we can do better. Let’s assume that we get our pointers from malloc(), or a similar call. We know that malloc() is required to return the data aligned on max_align_t size. For 64 bits, that is 16. In other words, we have full 20 bits to play with that we can utilize for our needs, while preserving the original pointer value. If we are using page aligned pointers, however, we can use 28(!) bits out of the 64 of the pointer value, for that matter.

Let’s see how we can take that assumption and turn that into something usable. I’m going to use Zig here, because I like the language and it gives us a succinct manner with which to work with native code. The first thing to do is to define how we are going to overall structure:

The size of this structure is 12 bytes, and I’m using Zig’s arbitrary precision integers to help us pack the data into a single u64 value, without having to write all the bit shifts manually. All the other functions that I’m showing here are going to be inside the generic structure. I’m asserting the size and that the T that we are working on is a pointer of some kind. You can see that the structure is also ready to be marked as an error, so we have three possible states:

  • Empty – no value is stored inside
  • Errored – we’ll just retain the error code
  • Value – we have a value and we keep the reference count in the references field. Note that this is a 19 bits field, so we have a maximum of 512K outstanding references for the counter. For pretty much all needs, I believe that this will be sufficient.

In addition to the data itself, we also have the notion of a version field. That one is needed because we want to allow the caller to wait for the value to become available. Let’s see how we can get a value from this?

What I’m doing here is to get the current value, do some basic checks (do we have a value, was an error registered, etc). Then we increment the reference as well as ensure that we don’t overflow it. Finally, we publish the new value using cmpxchng call. Note that the whole thing is in a while loop, to ensure that if the cmpxchng fails, we’ll retry. If we were able to update the reference count, we turn the compacted value into its original form and return that to the user. Because we get the pointer value and increment the reference count as a single step, we are ensured that you cannot end up with missing the release call.

The tryAcquire() call also have a sibling, which will wait for the value to become available, it looks like this:

If the value does not exists (if there is an error, we’ll return immediately), we’ll wait using the Futex.wait() call. This is why we need the version field. It allows us to properly wait without requiring to create kernel level objects. Let’s see how we set the new value, potentially concurrently with threads that want to get to it as well.

We have some convenience functions to make it clear what it is that we are actually setting (an error or a value) and then we get to the meat of this structure, the set() call. There isn’t much here, to be honest. We check that the value isn’t already set, and then set it properly as either an error or a value with a single reference (which is for the caller).

We again use cmpxchng() call to ensure that we are safe in regard to multiple threads (although the usage I have in mind for this calls for a single writer, not competing ones, it doesn’t cost us anything to make it safer to use). After setting the value, we increment the version field and wake any waiting threads.

You can also see how we validate and pack the pointer value to 44 bits.

All of this work, but we are still missing one part. The whole point of reference counting is to delete the value when it it no longer in use, where is that code at? Let’s take a look:

We are taking both the value that we are releasing and the destruction function. The value we want to free is there to ensure that it wasn’t modified meantime, and if there are no more references, we know that we can safely destroy it.

Note that this whole scheme relies on the fact that we are managing the reference count externally to the object itself. In other words, we are assuming that the RefCount value is going to be kept alive. In my case, I”m actually intending to use that as a cache, and we’ll keep an array of those values around for a long time. Otherwise, you run the same risk as before. If you have a reference to the RefCount value, and you may release that, you may end up with a situation where you have a reference to a released memory.

This technique of pointer packing is valid if you use manual memory management mode, you cannot use that in C#, for example, because object may move, and even if you pinned a value, the GC will not consider the value to be a valid pointer, so it will not work for you. For managed languages, there is actually a much better option. Just let go of the value and let the finalizer handle that. In other words, something else (an already existing component) will ensure that there are no live references remaining).

time to read 1 min | 149 words

Implementing a unit of work in Python can be an interesting challenge. Consider the following code:

This is about as simple a code as possible, to associate a tag to an object, right?

However, this code will fail for the following scenario:

You’ll get a lovely: “TypeError: unhashable type: 'Item'” when you try this. This is because data classes in Python has a complicated relationship with __hash__().

An obvious solution to the problem is to use:

However, the id() in Python is not guaranteed to be unique. Consider the following code:

On my machine, running this code gives me:

124597181219840
124597181219840

In other words, the id has been reused. This makes sense, since this is just the pointer to the value. We can fix that by holding on to the object reference, like so:

With this approach, we are able to implement proper reference equality and make sure that we aren’t mixing different values.

time to read 2 min | 286 words

I’m teaching a course at university about cloud computing. That can be a lot of fun, but quite frustrating at time. The key issue for me is that I occasionally need to provide students with some way to do something that I know how to do properly, but I can’t.

Case in point, assuming that I have a distributed cluster of nodes, and we need to detect what nodes are up or down, how do you do that?

With RavenDB, we assign an observer to the cluster whose job is to do health monitoring. I can explain that to the students, but I can’t expect them to utilize this technique in their exercises, there is too much detail there. The focus of the lesson or exercise is not to build a distributed system but to make use of one, after all.

As a rule, I try to ensure that all projects that we are working on can be done in under 200 lines of Python code. That puts a hard limit to the amount of behavior I can express. Because of that, I find myself looking for ways to rely on existing infrastructure to deal with the situation. 

Each node is running the same code, and they are setup so they can talk to one another, if needed. It is important that all the live nodes will converge to agree on the active nodes in relatively short order.

The task is to find the list of active nodes in a cluster, where nodes may go up or down dynamically. We are running in AWS cloud so you can use its resources, how would you do that?

The situation should be as simple as possible and easy to explain to students.

time to read 1 min | 200 words

imageI ask candidates to answer the following question. Sometimes at home, sometimes during an interview with a whiteboard.

You need to create an executable that would manage a phone book. The commands you need to support are:

  • phone-book.exe /path/to/file add [name] [phone]
  • phone-book.exe /path/to/file list [skip], [limit]

The output of the list operation must be the phone book records in lexical order. You may not sort the data during the list operation, however. All such work must be done in the add operation.

You may keep any state you’ll like in the file system, but there are separate invocations of the program for each step.  This program need to support adding 10 million records.

Feel free to constrain the problem in any other way that would make it easier for you to implement it. We’ll rate the solution on how much it cost in terms of I/O.

A reminder, we are a database company, this sort of question is incredibly relevant to the things that we do daily.

I give this question to candidates with no experience, fresh graduates,  etc. How would you rate its difficulty?

time to read 2 min | 225 words

I have a consultant that did some work for me. While the majority of our people are working in Israel and Poland, we actually have people working for us all over the map.

The consultant submitted their invoice at the end of the month and we sent a wire transfer to the provided account. So far, pretty normal and business as usual. We do double verification of account details, to avoid common scams, by the way, so we know that the details we sent were correct. Except… the money never arrived.

When we inquired, it turned out that the money transfer was reversed. The reason why? This is the address that the consultant provided (not the actual address, mind, but has the same issue). Note that the

Mr. Great Consultant
1234 Cuba Avenue
Alta Vista, Ottawa, K1G 1L7
Canada

The wire transfer was flagged as potential international sanctions violation and refused.

That was… very strange.

It appears that someone saw Cuba in the address, decided that this is a problem and refuse the transfer.  I’m not sure if I would rather that this is the case of an over active Regex or a human not applying critical thinking.

We are now on week two of trying to resolve that with the bank and it is quite annoying.

Next port of call, buying Monero on the dark web… Smile.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Implementing a file pager in Zig (13):
    21 Jan 2022 - Write behind implementation
  2. re (30):
    14 Jan 2022 - Are You Sure You Want to Use MMAP in Your Database Management System?
  3. Production postmortem (33):
    03 Jan 2022 - An error on the first act will lead to data corruption on the second act…
  4. Negative feature response (2):
    20 Dec 2021 - Protect the user from accidental collection deletion
  5. Challenge (63):
    16 Dec 2021 - Find the slow down–answer
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats