Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 548 words

One of the clearest signs of maturity that I’m looking for when reading code is the scaffolding that were used.

Just like in physical construction, it is often impossible to start by building everything properly from the get go, and you start by building temporary scaffolding to get you going. In some cases, those can be things that you need to actually build the software, but I have found that scaffolding are mostly useful in debugging and understanding issues.

For example, if I’m working on a complex data structure, it would be very useful to be able to dump it into a human readable format, so I can visually inspect it and understand how it works.

In the recent low level trie example, I have a file dedicated to doing just that, it contains some code to print the trie as well as validate the structure, and it was very useful to figure out certain things.

If the project is big enough, and mature enough, those scaffolding take on a permanent nature. In RavenDB, for example, they are debug endpoint and live visualizations that can help an administrator to figure out exactly what is going on with the system. The more complex the system, the more important the scaffolding become.

One very important consideration is what kind of scaffolding is being built. For example, if you throw a bunch pf printf all over the place while you are debugging, that is helpful, but that isn’t something that will remain over time, and in many cases, the second or third time that you find yourself having to add code to help you narrow down a problem, you want to make this sort of code a permeant part of your codebase.

One of the more complex pieces in building Voron was the B+Tree behavior, in particular when dealing with page splits and variable size data. We build a lot of code that would verify that structure and invariants for us, and it is running as part of our CI builds to ensure that stuff doesn’t sneak in.

One of the things that we teach our new hires is that one of the things that we need to have not just working code, but also all of the surrounding infrastructure. Everything that I need to work with, diagnose and manage the system in production over long periods of time. I distinctly remember a debugging session where we had to add a whole bunch of diagnostics code to figure out some really gnarly issue. I was pairing with another dev on that on his machine, and we were working on that for a long time. We finally managed to figure out what the problem was and fix that, and I left and got the final PR with the fix later in the day.

None of the diagnostics code was in it, and when I asked why the answer was: “We fixed the problem, and we didn’t need it, so I removed it.”

That is not the kind of thing that you remove, that is the kind of thing that you keep, because you can bet your bottom dollar that we’ll need it again, when the next problem shows up.

time to read 4 min | 629 words

So a customer of ours has been putting RavenDB 4.0 through it paces, they have a large database with some pretty nasty indexes. They reported that when importing their existing data, they managed to crash RavenDB. The problem was probably some interaction of indexing and documents, but this was pretty consistent.

That was lucky, because I was able to narrow down the problem to the following document.

image

Literally just a week before getting this bug report we found a bug with multi byte Unicode characters (we were counting them in chars, but using that to index into a byte array), so that was our very first suspicion. In fact, review the relevant portions of the code, we were able to identify another location with a similar problem, and resolve it.

That didn’t solve the problem, however. I kept banging my head against the Unicode issue, until I finally just gave up and went to tracing the code very carefully. And then I found it.

The underlying issue was that the data here is part of an array inside a document, and we had an index that looked like this:

image

The idea is that we gather all the titles from the items in the action, and allow to search on those. And somehow, the title from the second entry above was always pointing to invalid memory.

It took a bunch more investigation before I found out the root cause.

Here is where we found the problem:

image

But it is going to take a bit of explaining first. The second title is pretty long, when using UTF8 it is 176 bytes, which is long enough for the blittable format to try to compress it, and it was indeed successfully compressed to a smaller size.

So far, so good, and a pretty awesome feature that we have. But the problem was that when we index a document, we need to access its property, and if it is a compress property, we need to decompress it, and that means that we need to decompress it somewhere. And at the time, using the native temp buffer seemed perfect, after all, it was exactly the kind of thing that it was meant for.

But in this case, a bit further down the document we had another compressed field, one that was even larger, so we again asked for a temp buffer. But since this is a temp buffer, and the expectation is that this will be short lived usage, we returned the old buffer and allocated a new one. However, we still maintained reference to the old buffer, and when we tried to index that title, it spat out garbage (because the memory was reused) or just crashed, depending on how severe we set the memory protection mechanism.

It was actually something that we already figured out and fixed, but we haven’t yet merged that branch, so it was quite annoying when we found the root cause.

We’ll also be removed the temp buffer concept entirely, we’ll have a buffer that you can checkout and return, and not something that can vanish behind you back.

But I have to say, that bug was certainly misleading in the extreme. I was sure it had to do with Unicode handling.

time to read 1 min | 139 words

I have tried to build a low level trie impl in Rust and just gave up because it was too much work making the compiler happy. I then went ahead and wrote in in C++ in a couple of evenings. I would still like to know whatever this is possible / viable in Rust.

I’m assuming that this is the case, but I don’t know how to start. Any dear reader feels like taking this upon themselves to port the C++ code to Rust? It is all working and there are unit tests, and I think that the design should match well the Rust design philosophy, but the borrow checker actively worked to me give up doing that in Rust, and I would like to both see this implemented in Rust and hear what the experience was like.

time to read 2 min | 218 words

imageI’m happy to announce that the next version of the NHibernate and Entity Framework Profilers is ready for beta testing

You can download the new bits from here.

  • Entity Framework – Support for EF Core (on .NET Core and .Net 4.6)
  • NHibernate Profiler – Support for NHibernate 4.1
  • Persisting interesting sessions across restarts.
  • Improved filtering of SQL statements and behaviors, persisting your configuration across restarts and generally behaving as a responsible citizen.
  • The ability to configure profiling via a watched file, to enable admins to enable/disable profiling on the fly.
  • Allowing you to edit and run a SQL statement as well as compare the database query plan generated so you can get to an optimal solution.
  • Various usability improvements (auto tracking of interesting data, always on top, compact mode for quick visualizations, better keyboard navigation, etc).

Please take them for a spin and provide us with feedback on it.

image

time to read 4 min | 682 words

The low level trie challenge is storing a trie structure inside a 32KB buffer, and allowing read, write and delete operations on it. The idea is that you can have no additional data about the trie other than the buffer. The full implementation is available on github.

Because of that, you need to think very differently about how you approach the solution. Here is the header file with the main trie operations:

You’ll note that the only field that I have on the class is a 32KB buffer. All the data is stored inside it. In order to handle that, we are going to treat this buffer as our main memory and “allocate” our data inside it. I guess we could do this by using C++ allocators, but that seem to be very heavy weight and I don’t think it will work very well for this scenario. Instead, we define the following basic structures:

The trie header is located at the very beginning of the buffer, and contains the basic information about the trie. And then we have each node entry, which is about a specific entry. In particular, the offset fields in the node_header_info are actually pointers into the buffer itself. And the next_alloc field in the trie_header_info is used to “allocate” additional memory inside the buffer.

The idea is that whenever we need more memory, we just take it from the top of the buffer, until we run out of room. Here is how I add the very first entry:

I’m forcing it into a specific well known place in the buffer (line 4), and then I handle all rest of the memory assignments from there. The same holds true when we start getting into the nested structure. Each entry has a array of offsets that points to its children, which is pointed to by the children_offset value.

When we need to add a new child to a node, we aren’t modifying the old array, instead, we are allocating a completely new one, adding the new value and sorting the whole thing.

The sorting thing is actually fun, because it means that when I’m reading from the trie, I can reduce the amount of work that I have to do per level.

Here is how I read from the trie:

We first try to find a match in the trie, starting from the root node. The find_match method is then called to find if the current node is a match, and if not, whatever we can go further. It does so by comparing the key to the stored value, and if there is a partial match,recursing into the next level.

A fun part happens inside the find_child method, where we do a sorted search over the children array to find the node with the matching starting byte.

The whole idea is that because I’m using constrained memory, I can make do with very little actual work to manage the memory, I’m just using and discarding it all the time. But I am keeping track of when I can next allocate the memory, and how much of the buffer is in use. When I hit the memory limit, I’m going to defrag the trie. This is like so:

There is quite a lot going on in there, but the basic idea is that we are going to copy the buffer to a temporary buffer, reset our own buffer (this all happens in the first 4 lines), and then we are going to traverse the data in the temp buffer and insert it into the real buffer directly. One of the things that is most important here is that we properly calculate the size of the children array for each node.

I could have probably have done more here:

  • Ensure that the data is aligned
  • Merge nodes that have only a single child and no value of their own

And probably a lot of other stuff, but this has been a fun experiment.

time to read 3 min | 440 words

The low level trie question has been a favorite question of mine for a while. It is simple in concept, but the limitations placed on it make it pretty hard to actually build. Previous posts in this series outlined the approach I had for solving this, but I always got caught up with something and didn’t get around to actually sitting down and resolving this completely.

As part of learning Rust, I decided to go ahead and implement this low level trie using Rust. I have failed, it was just too much babysitting by the compiler and having to fight it. I knew exactly what I wanted to do, but I kept having to jump through hops to get it to it. Eventually, I just called it quits  and decided to abandon the attempt to use Rust.

But I still want to do something out of my comfort zone, so I decided to run this exercise using C++. Now, I used to write quite a lot of C++ (along with VB, VBScript and ASP classic). But that was in the late 90s, and very early 2000s. I heard through the grapevine that someone kicked the C++ standard committee into high gear and started actually improve the language.

The result was three evenings spent on building a low level trie impl in C++, and quite a lot of fun. I’ll have another post about the actual details of the implementation, but in this post I mostly wanted to talk about the experience of getting back to C++. And it is… strange.

On the one hand, because I’m so used to C# and have used C++ before, this is oh so comfortable. Like wearing old set of gloves that you forgot that you even had.

On the other hand, I forgotten quite a lot of details about the language and the libraries, and they changed. My old C++ code would be newing up stuff and fighting to manage memory and very likely leaking like crazy. In this codebase? I don’t have a single new call throughout. And being able to do things like lambdas in C++ feels like magic.

I’ll admit that the codebase is heavily influenced by my Rust work. To start with, I’m using snake_case convention, and I found that I’m using a lot more std::pair that I would expect myself to use.

I would appreciate any code review on this, the core code is about 400 lines or so, and I’m mostly interested to know whatever I managed to write idiomatic modern C++, and if not, how this can be improved.

time to read 5 min | 983 words

So I spent a few evenings with Rust, and I have managed to do some pretty trivial stuff with it, but then I tried to do something non trivial (low level trie that relies on low level memory manipulations). And after realize that I just have to fight the language non stop, I am not going to continue forward with this.

Here is where I gave up:

image

I have a buffer, that I want to mutate using pointers, so I allocated a buffer, with the intent to use the first few bytes for some header, and use the memory directly and efficiently. Unfortunately, I can’t. I need to mutate the buffer in multiple places at the same time (both the trie header and the actual node), but Rust refuses to let me do that because then I’ll have multiple mutable references, which is exactly what I want.

It just feels that there is so much ceremony involved in getting a Rust program to actually compile that there isn’t any time left to do anything else.  This post certainly resonated with me strongly.

That is about the language, and about what it requires. But the environment isn’t really nice either. It starts from the basic, I want to allocated some memory.

Sure, that is easy to do, right?

  • alloc::heap::allocate is only for unstable, and might change underneath you.
  • alloc::raw_vec::RawVec which give you raw memory directly is unstable and likely to remain so. Even though it is much safer to use than directly allocating memory.

We are talking about allocating memory, in a system level language, and unless you are jumping through hops, there is just no way to do that.

I’ll admit that I’m also spoiled in terms of tooling (IDEs, debuggers, etc), but the Rust environment is pretty much “grab a text editor, you’ll have syntax highlighting and maybe something a bit more”, and that is it. I tried three or four different editors, and while some of intellij-rust, for example, was able to do some code analysis, it wasn’t able to actually build anything (I think that I needed to install JDE or some such), VS Code could build and run (but not debug) and it marks every single warning with, and combined with Rust’s eagerness of warning, made it very hard to write code. Consider when all you code looks like this:

image

No debugger beyond println (and yes, I know about GDB, that ain’t a good debugging experience) is another major issue.

I really want to like Rust, and it has some pretty cool ideas, but the problem is that it is just too hard to actually get something done in any reasonable timeframe.

What is really concerning is that any time that I want to do anything really interesting you need to either go and find a crate to do it (without any assurances of quality, maintainability, etc) or you have to use a nightly version or enable various feature flags or use unstable API versions. And you have to do it for anything beyond the most trivial stuff.

The very same trie code that I tried to write in Rust I wrote in one & half evenings in C++ (including copious searches for the correct modern ways to do something), and it works, it is obvious and I don’t have to fight the compiler all the time.

Granted, I’ve written in C++ before, and C# (my main language) is similar, but the differences are staggering. It isn’t just the borrow checker, it is the sum of the language features that make it very hard to reason about what the code is doing. I mentioned before that the fact that generics are resolved on usage, which can happen quite a bit further down from the actual declaration is very confusing. It might have been different if I have been coming from an ML background, but Rust is just too much work for too little gain.

One of my core requirements, the ability to write code and iterate over behavior quickly is blocked because every time that I’m trying to compile, I’m getting weird complication errors and then I need to implement workarounds to make the compiler happy, which introduce complexity into the code for very simple tasks.

Let us take a simple example, I want to cache the result of a DNS lookup. Here is the code:

We’ll ignore the unstable API usage with lookup_host, or the fact that it literally took me over an hour to get 30 lines of code out in a shape that I can actually demonstrate the issue.

There is a lot of stuff going on here. We have a cache of boxed strings in the hash map to maintain ownership on them, and we look them up and then add a cloned key to the cache because the owner of the host string is the caller, etc. But most importantly, this code is simple, expressive and wrong. It won’t compile, because we have both immutable borrow (on line 17) and a mutable one (on line 25 & 26).

And yes, I’m aware of the entry API on the HashMap that is meant to dealt with this situation. The problem is that all those details, and making the compiler happy in a very simple code path, is adding a lot of friction to the code. To the point where you don’t get anything done other than fight the compiler all the time. It’s annoying, and it doesn’t feel like I’m accomplishing anything.

time to read 3 min | 418 words

Any self respecting database needs to be able to provide a whole host of metrics for the user.

Let us talk about something simple, like the requests / second metrics. This seems like a pretty easy metric to have, right? Every second, you have N number of requests, and you just show that.

But it turns out that just showing the latest req/sec number isn’t very useful, primarily because a lot of traffic actually have a valleys & peaks. So you want to have the req/sec not for a specific second, but for some time ago (like the req/sec over the last minute & 15 minutes).

One way to do that is to use an exponentially-weighted moving average. You can read about their use in Unix in these articles. But the idea is that as we add samples, we’ll put more weight on the recent samples, but also take into account historical data.

That has the nice property that it reacts quickly to changes in behavior, but it smooth them out that you see a gradual change over time. The bad thing about it is that it is not accurate (in the sense that this isn’t very easy for us to correlate to exact numbers) and it is smooth out changes.

On the other hand, you can take exact metrics. Going back to the req/sec number, we can allocate an array of 900 longs (so enough for 15 minutes with one measurement per second) and just use this cyclic buffer to store the details. The good thing about that is that it is very accurate, we can easily correlate results to external numbers (such as the results of a benchmark).

With the exact metrics, we get the benefit of being able to get the per second data and look at peaks & valleys and measure them. With exponentially weighted moving average, we have a more immediate response to changes, but it is never actually accurate.

It is a bit more work, but it is much more understandable code. On the other hand, it can result in strangeness. If you have a a burst of traffic, let’s say 1000 requests over 3 seconds, then the average req/sec over the last minute will stay fixed at 50 req/sec for a whole minute. Which is utterly correct and completely misleading.

I’m not sure how to handle this specific scenario in a way that is both accurate and expected by the user.

time to read 2 min | 225 words

This was a surprising shock, this code seems so simple, but it does something very different than what I would expect.

The question is why?

As it turns out, we are missing one character here:

image

Notice the lack of the comma? Let us see how the compiler is treating this code by breaking it apart into steps, shall we?

First, let us break it into two statements:

Note that we moved the name line into the first statement, since there isn’t a command here. But what is it actually doing? This looks very strange, but that is just because we have the dictionary initializer here. If we drop it, we get:

image

And that make a lot more sense. If we’ll break it all down, we have:

And that explains it all, make perfect sense, and a very nasty trap. We run into it accidently in production code, and it was near impossible to figure out what was going on or why it happened.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}