Ayende @ Rahien

time to read 3 min | 585 words

I recently ran into a bit of code that made me go: Stop! Don’t you dare go down this path!

The reason I had such a strong reaction to this code is that I have seen where such code leads you, and that is nowhere good. The code in question?
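The snippet itself was an image; reconstructed from the critique below, it was essentially this shape (the names and the retry count are my own, illustrative choices):

    // C# - a reconstruction of the offending pattern, not the original code.
    public User GetUserById(string id)
    {
        for (int i = 0; i < 3; i++)
        {
            try
            {
                using var session = _store.OpenSession();
                return session.Load<User>(id);
            }
            catch (Exception e)
            {
                // Swallow the error, write it to a log no one reads, retry.
                _log.Error("Failed to load user " + id, e);
            }
        }
        return null; // out of retries, hand the caller a null
    }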

This is a pretty horrible thing to do to your system. Let’s count the ways:

  • Queries are happening fairly deep in your system, which means that you’re now putting this sort of behavior in a place where it is generally invisible to the rest of the code.
  • What happens if the calling code also has something similar? Now we’ve got retries on retries.
  • What happens if the code that you are calling has something similar? Now we’ve got retries on retries on retries.
    • You can absolutely rely on the code you are calling to do retries already, if only because that is how TCP behaves, but also because there are usually resiliency measures implemented at the lower layers.
  • What happens if the error actually matters? No exception is ever thrown, which means that the important information goes only to the log, which no one ever reads.
  • There is no distinction between the types of errors where a retry may help and those where it won’t.
  • What if the query has side effects? For example, you may end up calling a stored procedure multiple times.
  • What happens when you run out of retries? The code will return null, which means the calling code will likely fail with a NullReferenceException.

What is worse, by the way, is that this piece of code is attempting to fix a very specific issue: being unable to reach the relevant database. For example, if you are writing a service, you may run into that on reboot; your service may have started before the database, so you need to retry a few times to let the database load. A better option would be to specify the load order of the services.

Or maybe there was some network hiccup that you had to deal with? That is probably the one case where this will sort of work. But TCP already handles that by resending packets; you are adding the same behavior on top of it, and it builds up into a nasty situation.

When there is an error, your application is going to sulk, throw strange errors and refuse to tell you what is going on. There are going to be a lot of symptoms that are hard to diagnose and debug.

To quote Release It!:

Connection timeouts vary from one operating system to another, but they’re usually measured in minutes! The calling application’s thread could be blocked waiting for the remote server to respond for ten minutes!

You added a retry on top of that, and then the system just… stops.

Let’s take a look at the usage pattern, shall we?
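The usage snippet was also an image; the gist of it, with illustrative names, is a caller that dereferences the result without checking:

    // C# - illustrative sketch of the calling code.
    var user = GetUserById(userId);
    SendWelcomeEmail(user.Email); // NullReferenceException once the retries run out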

That will fail pretty badly (and then cause a null reference exception). Let’s say that this is service code, which is called from a client that uses a similar pattern for “resiliency”.

Question – what do you think will happen the first time that there is an error? Cascading failures galore.

In general, unknown errors shouldn’t be handled locally; you don’t have a way to do that here. You should raise them up as far as possible. And yes, showing the error to the user is generally better than just spinning in place without giving the user any feedback whatsoever.

time to read 2 min | 234 words

I ran into a task that I needed to do in Go: given a PFX file, I needed to get a tls.X509KeyPair from it. However, Go doesn’t have support for PFX. RavenDB makes extensive use of PFX in general, so that made things hard for us. I looked into all sorts of options, but I couldn’t find any way to manage that properly. The nearest find was the pkcs12 package, but that supports only some DER formats and cannot handle common PFX files. That was a problem.

Luckily, I know how to use OpenSSL. But while there are countless examples of using OpenSSL to convert PFX to PEM and the other way around, all of them assume that you are using it from the command line, which isn’t what we want. It took me a bit of time, but I cobbled together one-off code that does the work. The code has a strange shape, I’m aware, because I wrote it to interface with Go, but it does the job.

Now, from Go, I can run the following:
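The actual code isn’t reproduced in this extract; the shape of the Go side is roughly this, assuming a cgo wrapper named convertPfxToPem around the OpenSSL conversion code (the wrapper’s name and signature are my own):

    // Go - sketch; convertPfxToPem is the assumed cgo wrapper, not a real API.
    import (
        "crypto/tls"
        "fmt"
    )

    func keyPairFromPfx(pfx []byte, password string) (tls.Certificate, error) {
        certPem, keyPem, err := convertPfxToPem(pfx, password)
        if err != nil {
            return tls.Certificate{}, fmt.Errorf("converting PFX to PEM: %w", err)
        }
        cert, err := tls.X509KeyPair(certPem, keyPem)
        if err != nil {
            return tls.Certificate{}, fmt.Errorf("building key pair: %w", err)
        }
        return cert, nil
    }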

As you can see, most of the code is there to manage error handling. But you can now convert a PFX to PEM and then pass that to tls.X509KeyPair easily.

That said, this seems just utterly ridiculous to me. There has got to be a better way to do that, surely.

time to read 2 min | 395 words

A user called us to ask about how they can manage to move a particular report from a legacy system to RavenDB. They need to be able to ask questions such as the following one:
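The original query isn’t shown in this extract, but from the rest of the post it was along these lines (my reconstruction, with assumed table and column names):

    -- SQL - the kind of question the legacy system was answering.
    SELECT Employee, Company, COUNT(DISTINCT City) AS Cities
    FROM Orders
    GROUP BY Employee, Company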

This is an interesting issue when you think about it from the point of view of a database engine. The distinct requirement means that we have to keep state (all the unique values) while we evaluate the query, which can be expensive. One of the design principles of RavenDB is that we want to make it hard to accidentally create expensive queries. Indeed, a query like that isn’t trivial to implement in RavenDB. We need a two-stage approach to implement this feature.

First, we’ll introduce a Map/Reduce index, which will aggregate the data on Employee, Company and City. Along the way, it will run the distinct operation on the City, because it groups by it. That gives us a model in which we get the distinct count for free, and in a highly efficient manner. Here is the index in question:
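The index itself isn’t reproduced in this extract; here is a sketch of what it likely looked like, assuming Order documents with Employee, Company and ShipTo.City fields (the field names are assumptions):

    // C# - sketch of the Map/Reduce index.
    public class Sales_ByEmployeeCompanyCity :
        AbstractIndexCreationTask<Order, Sales_ByEmployeeCompanyCity.Result>
    {
        public class Result
        {
            public string Employee;
            public string Company;
            public string City;
            public int Count;
        }

        public Sales_ByEmployeeCompanyCity()
        {
            Map = orders => from order in orders
                            select new
                            {
                                order.Employee,
                                order.Company,
                                City = order.ShipTo.City,
                                Count = 1
                            };

            // Grouping by City is what gives us the distinct cities for free.
            Reduce = results => from result in results
                                group result by new { result.Employee, result.Company, result.City }
                                into g
                                select new
                                {
                                    g.Key.Employee,
                                    g.Key.Company,
                                    g.Key.City,
                                    Count = g.Sum(x => x.Count)
                                };
        }
    }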

The interesting thing about this index is that querying it will not give us the right results. We don’t want the details broken down by Employee, Company and City; we want them just by Employee and Company. This is where the second stage comes into play. Instead of running a simple query on the index, we’ll use a faceted query. Here is what it will look like:
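The query isn’t shown in this extract either; this is my assumption of its shape, using RavenDB’s AggregateBy facet API (in practice you would likely facet on a combined Employee+Company field stored by the index):

    // C# - sketch of the faceted query; the exact facet setup is an assumption.
    var results = session
        .Query<Sales_ByEmployeeCompanyCity.Result, Sales_ByEmployeeCompanyCity>()
        .AggregateBy(facet => facet
            .ByField(x => x.Employee)
            .SumOn(x => x.Count))
        .Execute();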

What this does is aggregate the results (which were already partially aggregated by the Map/Reduce) and give us the totals.

The end result is that we are able to do most of the work at indexing time, and at query time we operate on already aggregated data. That means the queries should be much faster and there is a lot less work for the database to do.

That said, this isn’t RavenDB’s strong suit. Such queries are typically more in line with OLAP systems, to be honest. If you know what your query patterns look like, you can use this technique to easily handle such queries, but if there is a wide range of dynamic queries, you may want to use RavenDB as the system of record and then use either SQL ETL or OLAP ETL to push the data to a reporting system.

Looking into Zig

time to read 5 min | 939 words

I think it was The Pragmatic Programmer that recommended learning a new language every year. For me, in 2020 that was Rust. I read a bunch of books about the language, read a significant amount of code and wrote a non-trivial amount of code in Rust. That was sufficient for me to grok the language. I’m not a Rust developer by any means, but I walked in Rusty shoes long enough to get the feeling.

This year, I decided to look into Zig. Both Zig and Rust are more or less in the same space: replacing C. They couldn’t be more different, however. Zig is a small language. I spent a couple of evenings going through the language reference, and that was it; I had a pretty good idea of how to do things.

The learning curve is mostly flat, and that is pretty huge, especially because I can’t help but compare Zig to Rust. I spent a lot of effort understanding Rust, but I spent almost no cycles trying to grok Zig. It was simple, obvious and quite clear. In terms of power, mind you, I would rate both languages at roughly the same spot. You can write really nice code in Zig; it is just that you don’t need to bend your head into funny shapes to get the compiler to accept your code.

One of the key features of Zig is comptime. This is a feature that allows Zig to run code at compilation time. That isn’t a new or exciting feature on its own, to be honest; C++ has had it for many years. The key difference is that Zig can use this feature for code generation. For example, to create a generic list, you’ll write something like the following code:
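The sample itself isn’t included in this extract; here is a minimal sketch of the idiom (close in spirit to what std.ArrayList does, using Zig 0.8-era syntax):

    // Zig - List is an ordinary function that runs at compile time and returns a type.
    const std = @import("std");

    fn List(comptime T: type) type {
        return struct {
            const Self = @This();

            items: []T, // items.len doubles as the capacity in this sketch
            len: usize,
            allocator: *std.mem.Allocator,

            fn init(allocator: *std.mem.Allocator) Self {
                return Self{ .items = &[_]T{}, .len = 0, .allocator = allocator };
            }

            // push may fail with an allocation error, hence the !void return type.
            fn push(self: *Self, value: T) !void {
                if (self.len == self.items.len) {
                    const new_cap = if (self.items.len == 0) 4 else self.items.len * 2;
                    self.items = try self.allocator.realloc(self.items, new_cap);
                }
                self.items[self.len] = value;
                self.len += 1;
            }
        };
    }

You would then declare var list = List(i32).init(allocator); and call try list.push(42);.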

Note that we are writing a function that returns a type, which you can then use. That approach is incredibly powerful, but at the same time, it is simple and obvious.

Zig is easy to learn, because there isn’t a whole lot more that is hidden from you behind the scenes.

That actually leads to another really important aspect in the design of Zig. There isn’t anything behind the scenes. For example, you cannot just allocate memory in Zig; there is no global function that will do that for you. You need to do so using an allocator. That means that the fact that memory allocations can fail is pervasive throughout the API, the standard library and your code. There isn’t a “this may fail on rare occasions” scenario that you somehow need to handle; it is clear and in your face.

At the same time, Zig does a lot more to make things easier than C. I want to focus on a few important points:

  • Zig has the concept of Errors. In the code above, the function push() may fail because of an allocation failure. In that case, the function will fail with an error return. That is handled by the try keyword, which will abort the current function and return the error. Note that errors and regular values are separate channels in Zig (the error union is marked with ! in the function declaration).
  • Zig has support for the defer and errdefer keywords. The defer keyword works just as you would expect: at function exit, it runs all the deferred statements in reverse order. The errdefer keyword is a lot more interesting, because it will only run if the function exits with an error (see the sketch after this list). This seemingly simple change has a huge impact on code quality and on the complexity that a developer needs to keep in their head.
  • Zig has built-in testing, to the point where test is a keyword in the language.
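To make the defer/errdefer point concrete, here is a small sketch (the User type and the function are mine, purely illustrative):

    // Zig - errdefer undoes the allocation only on the error path.
    const std = @import("std");

    const User = struct {
        name: []u8,
    };

    fn createUser(allocator: *std.mem.Allocator, name: []const u8) !*User {
        const user = try allocator.create(User);
        // Runs only if a later statement makes this function return an error.
        errdefer allocator.destroy(user);

        user.name = try allocator.dupe(u8, name);
        return user;
    }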

To give you some context, when I was writing C code, I literally wrote the exact same thing (manually, with macros and nastiness) in order to help me get things done.

In the same manner, the fact that allocations are explicit and managed from the top (all types that need to allocate get the allocator from their parent) means that you get to do some really cool things with memory. It is easy to say something like “this piece of code gets 10MB of memory only” and let it run like that. It also ends up creating a more robust software system, I think, because memory allocation failures aren’t treated as a rare occurrence; they are handled all the time.

In general, Zig feels like a much better C, with no additional mental overhead. Compared to Rust, you can get working almost immediately, and the compilation speed is excellent, to the point where you don’t really need to think about it. Rust makes you feel the slow compilation cost from the get-go, and it becomes more noticeable as your system grows bigger.

Thinking about this, I actually feel that we should compare Zig to Go, because it is closer in concept to what I think Go wanted to be. In fact, looking at the most common complaints people have against Go, Zig answers them all.

If you haven’t noticed, I’m quite enjoying working with Zig.

And as an aside, the fact that a language can implement a language server and get automatic IDE support is freaking amazing. You can also debug Zig code inside VS Code, for example, with pretty much no more issues than you would have for native code. Zig is implemented on top of LLVM and gains a lot of benefits from it.

One thing that kept going through my mind when I looked at all that I got out of the package is: standing on the shoulders of giants.

time to read 2 min | 253 words

From a conceptual standpoint, a thread and a task are very similar. That is very much by design, since the Task is meant to allow you to work with asynchronous code while maintaining the illusion that you are running in a sequential manner. It is tempting to think about this in terms of the Task actually executing the work, but that isn’t actually the case.

The Task doesn’t represent the execution of whatever asynchronous process is running; the Task represents a ticket that will be punched when the asynchronous process is done. Consider the case of going to a restaurant and asking for a table: if there isn’t an available table, you cannot be seated. What the restaurant will do is hand you a pager that will buzz when the table is ready. In the same sense, a Task is just such a token. The restaurant pager doesn’t mean that someone is actively clearing a table for you. It is just something that will buzz when a table is ready.

A code sample may make things clearer:
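The sample was an image; here is a sketch in the same spirit, using a TaskCompletionSource for the restaurant pager (the class and member names are illustrative):

    // C# - the Task handed to the caller comes from a TaskCompletionSource
    // and does not start or execute anything by itself.
    using System.Threading.Tasks;

    public class Restaurant
    {
        private TaskCompletionSource<int> _waitingParty;

        public Task<int> WaitForTableAsync()
        {
            // Handing out this Task is handing out the pager; no work starts.
            _waitingParty = new TaskCompletionSource<int>();
            return _waitingParty.Task;
        }

        public void TableIsReady(int tableNumber)
        {
            // Punching the ticket: this completes the Task and wakes the awaiter.
            _waitingParty.SetResult(tableNumber);
        }
    }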

In this case, we are manually coordinating the Task using its completion source, and you can see that the Task instance that was handed out when trying to get a table doesn’t actually start anything. It is simply waiting to be completed when the time comes.

That is an important aspect of how System.Threading.Tasks.Task works, because it usually runs counter to the mental model in our heads.

time to read 2 min | 399 words

Terraform is all the rage in the DevOps world now, and we decided to take a look. In general, Terraform is written in Go, but it uses a pretty nifty plugin system to work with additional providers. A Terraform provider is an executable that you run, but it communicates with the Terraform controller using gRPC.

What happens is that Terraform will invoke the executable and then communicate with it over gRPC, with the connection details provided on the standard output. That means that you don’t need to write your provider in Go; any language will do. Indeed, Samuel Fisher did all the hard work of making that possible. Please note that this post likely makes no sense if you didn’t read Samuel’s post first.

However, his work assumes that you are running on Linux, and there are a few minor issues that you have to deal with in order to get everything working properly.

For safety, Terraform uses TLS for communication and ensures that the data is safe in transit. The provider will usually generate a self-signed key and provide the public key on the standard output. I would like to understand what security model they are trying to protect against, but at any rate, the method should be fairly safe against anything that I can think of. The problem is that there is an issue with the way the C# Terraform provider from Samuel handles the certificate on Windows. I sent a pull request to resolve it; it’s just a few lines, and it is quite silly.

The next challenge is how to make Terraform actually execute your provider. The suggestion by Samuel is to copy the data to where Terraform will cache it, and use that. I don’t really like that approach, to be honest, and it looks like there are better options.

You can save a %APPDATA%\terraform.rc file with the following content:
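The file contents were not shown in this extract; presumably something along these lines, using Terraform’s provider_installation block with the paths from the commands below:

    # terraform.rc - serve example.com providers from the local directory...
    provider_installation {
      filesystem_mirror {
        path    = "F:/terraform/providers"
        include = ["example.com/*/*"]
      }
      # ...and fetch everything else from the registry as usual.
      direct {
        exclude = ["example.com/*/*"]
      }
    }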

This will ensure that your provider will be loaded from the local directory instead of being fetched over the network. Finally, there is another challenge: Terraform expects the paths and names to match, which can be quite annoying for development.

I had to run the following commands to get it working:

cp F:\TerraformPluginDotNet\samples\SampleProvider\bin\release\net5.0\win-x64\publish\* F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64
mv F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\SampleProvider.exe F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\terraform-provider-dotnetsample.exe

What this does is ensure that the files are in the right location, with the right name, for Terraform to execute. From there on, you can continue developing your provider as usual.

time to read 10 min | 1949 words

For a database engine, controlling the amount of work that is being executed is a very important factor for the stability of the system. If we can’t control the work that is being done, it is quite easy to get to the point where we are overwhelmed.

Consider the case of a(n almost) pure computational task, which is completely CPU bound. In some cases, those tasks can easily be parallelized. Here is one such scenario:
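The snippet was an image; the scenario is essentially this (GetArraysToSort is an illustrative stand-in for wherever the data comes from):

    // C# - many independent arrays, sorted one after another on a single thread.
    int[][] arrayOfArrays = GetArraysToSort();
    foreach (var array in arrayOfArrays)
    {
        Array.Sort(array);
    }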

This example seems pretty obvious, right? This is completely CPU bound (sorting), and leaving aside that the sort itself can be done in parallel, we have many arrays that we need to sort. As you can imagine, I would much rather have this task done as quickly as possible.

One way to make this happen is to parallelize the work. Here is a pretty simple way to do that:

Parallel.ForEach(arrayOfArrays, array => Array.Sort(array));

A single one-liner, and we are going to see much better performance from the system, right?

Yep, this particular scenario will run faster; depending on the sizes, it may be much faster. However, that single one-liner is a nasty surprise for the stability of the system as a whole. What’s the issue?

Under the covers, this is going to use the .NET thread pool to do the work. In other words, this is going to be added to the global workload on the system. What else is using the same thread pool? Well, pretty much everything. For example, processing requests in Kestrel is done in the thread pool, all the async machinery uses the thread pool under the covers as well as pretty much everything else.

What is the impact of adding heavily CPU-bound work to the thread pool, one may ask? Well, the thread pool will start executing it on its threads. This is heavy CPU work, so it will likely run for a while, displacing other work. If we consider a single instance of this code, there is a limit on the amount of concurrent work placed in the thread pool. However, if multiple instances of the code above run in parallel… we are going to be placing a lot of work on the thread pool. That is going to effectively starve the rest of the system. The thread pool will react by spawning more threads, but this is a slow process, and it is easy to get into a situation where all available threads are busy, leaving nothing for the rest of the application to run.

From the outside, it looks like 100% CPU usage, with the system entirely frozen. That isn’t actually what is going on; we are simply holding up all the threads and can’t prioritize the work between request handling (important) and speeding up background work (less important). The other problem is the reverse case: you may be running the operation on an otherwise idle system, and the non-parallel code will utilize a single thread out of the many that are available.

In the context of RavenDB, we are talking here about indexing work. It turns out that there is a lot of work here that is purely computational. Analyzing and breaking apart text for full text search, sorting terms for more efficient access patterns, etc. The same problem above remains, how can we balance the indexing speed and the overall workload on the system?

Basically, we have three scenarios that we need to consider:

  1. Busy processing requests – background work should be done in whatever free time the system has (avoiding starvation), with as little impact as possible.
  2. Busy working on many background tasks – concurrent background tasks should not compete with one another and step on each other’s toes.
  3. Busy working on a single large background task – which should utilize as much of the available computing resources that we have.

For the second and third options, we need to take into account the fact that the absence of current request processing work doesn’t matter if there is incoming work. In that case, the system should prioritize request processing over background work.

Another important factor that I haven’t mentioned is that it would be really good for us to be able to tell what work is taking the CPU time. If we are running a set of tasks on multiple threads, it would be great to be able to see what tasks they are running in a debugger / profiler.

This sounds very much like a class in operating systems; in fact, scheduling work is a core function of an operating system. The question here is how we can build a system that meets all the above requirements, given that we can’t actually schedule CPU work directly.

We cannot use the default thread pool, because there are too many existing users there that can interfere with what we want. For that matter, we don’t actually want a dynamic thread pool; we want to maximize the amount of work we do for computationally heavy workloads. Instead, for the entire process, we will define a set of threads to handle work offloading, like this:
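The snippet isn’t included in this extract; here is a minimal sketch of what it describes (the queue type and its capacity are my assumptions, and the worker loop body is shown further down):

    // C# - one lowest-priority offload worker per core, plus the process-wide queue.
    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    public static class WorkOffloading
    {
        // Bounded, process-wide collection of per-task work queues; each entry
        // is one task's private queue of work items (capacity is an assumption).
        public static readonly BlockingCollection<ConcurrentQueue<Action>> _globalWorkQueue =
            new(boundedCapacity: 128);

        public static void StartOffloadWorkers()
        {
            for (int i = 0; i < Environment.ProcessorCount; i++)
            {
                new Thread(OffloadWorkerLoop)
                {
                    IsBackground = true,
                    Name = "Offload Worker #" + i,
                    // Lowest priority: anything else that is ready to run goes first.
                    Priority = ThreadPriority.Lowest
                }.Start();
            }
        }

        internal static void OffloadWorkerLoop()
        {
            // Body sketched later in the post ("The workers all run very similar code").
        }
    }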

This creates a set of threads, one for each CPU core on the machine. It is important to note that these threads run at the lowest priority: if there is anything else that is ready to run, it will take priority. In order to do some background work, such as indexing, we’ll use the following mechanism:
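Again a sketch, since the snippet isn’t included here; index, with its Run method and Name, is an illustrative stand-in for the background work:

    // C# - each index gets a dedicated background thread at below-normal priority.
    var indexingThread = new Thread(index.Run)
    {
        IsBackground = true,
        Name = "Indexing of " + index.Name,
        Priority = ThreadPriority.BelowNormal
    };
    indexingThread.Start();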

Because indexing is a background operation in RavenDB, we do it in a background thread, and we set the priority to below normal. Request processing threads run at normal priority, which means that we can rely on the operating system to run them first; at the same time, the starvation prevention that already exists in the OS scheduler will run our indexing code even under extreme load on the system.

So far, so good, but what about those offload workers? We need a way to pass work from the indexing code to the offload workers. This is done in the following manner:
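A sketch of the hand-off, reusing _globalWorkQueue from the earlier sketch (the completion tracking here is deliberately simplified):

    // C# - members of the (illustrative) indexing class.
    private readonly ConcurrentQueue<Action> _localWork = new();
    private int _pending; // items queued but not yet finished executing

    public void SortAllArrays(int[][] arrayOfArrays)
    {
        foreach (var array in arrayOfArrays)
        {
            var toSort = array;
            Interlocked.Increment(ref _pending);
            _localWork.Enqueue(() =>
            {
                Array.Sort(toSort);
                Interlocked.Decrement(ref _pending);
            });
            // Register our queue in the bounded global queue; if it is full,
            // that's fine - we simply won't get any help for this item.
            WorkOffloading._globalWorkQueue.TryAdd(_localWork);
        }

        // The indexing thread drains its own queue; it never relies on the workers.
        while (_localWork.TryDequeue(out var work))
            work();

        // Wait for items that an offload worker grabbed but hasn't completed yet.
        while (Volatile.Read(ref _pending) != 0)
            Thread.Yield();
    }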

Note that the _globalWorkQueue is process wide. For now, I’m using the simple sort-an-array example because it makes things easier; the real code would need a bit more complexity, but that is not important for explaining the concept. The global queue contains a queue for each task that needs to be run.

The index itself will have a queue of work that it needs to complete. Whenever it needs to start doing the work, it will add the items to its own queue and try to publish that queue to the global queue. Note that the size of the global queue is limited, so we won’t use too much memory there.

Once we have published the work we want to do, the indexing thread will start working on the work itself, soliciting the aid of the other workers occasionally. Finally, once the indexing thread is done processing all the remaining work, we need to wait for any pending work that is currently being executed by the workers. That done, we can use the results.

The workers all run very similar code:
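Roughly like so; this is the body of the OffloadWorkerLoop stub from the first sketch:

    // C# - take one task's queue from the global queue and drain it to
    // completion before picking up the next one.
    internal static void OffloadWorkerLoop()
    {
        foreach (var taskQueue in _globalWorkQueue.GetConsumingEnumerable())
        {
            while (taskQueue.TryDequeue(out var work))
                work();
        }
    }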

In other words, they pull a queue of tasks from the global task queue and work on it exclusively. Once they are done processing a single index’s queue to completion, the offload worker will try to pick up another from the global queue, and so on.

The whole code is small and fairly simple, but there is a lot of behavior hidden here. Let me try to explore all of that now.

The indexing background work pushes all its work items to the local queue, and it registers the queue itself in the global queue for the offloading threads to process. The indexing queue may be registered multiple times, so multiple offloading threads can take part in this. The indexing code, however, does not rely on that and will also process its own queue. The idea is that if there are offloading threads available, they will help, but we do not rely on them.

The offloading thread, for its part, will grab an index’s queue and process all the items from that queue until it is done. For sorting arbitrary arrays, it doesn’t matter much, but for other workloads we’ll likely get much better locality in terms of task execution by issuing the same operation over a large set of data.

The thread priorities here are also important, mind. If there is nothing else to do, the OS will schedule the offloading threads and give them work to do. If there is a lot of other work happening, they will not be scheduled often. This is fine; we are merely being assisted by the offloading threads, they aren’t mandatory.

Let’s consider the previous scenarios in light of this architecture and its impact.

If there are a lot of requests, the OS is going to mostly schedule the request processing threads; the indexing threads will also run, but mostly when there is nothing else to do. The offload threads will get their chance, but mostly that will not have any major impact. That is fine; we want most of the processing power to go to request processing.

If there is a lot of background work, on the other hand, the situation is different. Let’s say that there are 25 indexes running and 16 cores available on the machine. In this case, the indexes are going to compete with one another. The offloading threads are again not going to get much of a chance to run, because adding more threads in this context doesn’t help; there is already competition between the indexing threads over the CPU resources. However, the offloading threads are going to be of some help. Indexing isn’t pure computation; there are enough cases where it needs to do I/O, and in those cases a CPU core is available. The offloading threads can then take advantage of the free cores (if there aren’t any other indexing threads ready to run instead).

It is the case of a single task running on a mostly idle system where this approach really shines. The index is going to submit all its work to the global queue, at which point the offloading threads will kick in and help it complete the task much sooner than it would be able to otherwise.

There are other things that we need to take into account in this system:

  • It is, by design, not fair. An index that manages to put its queue into the global queue first may have all the offloading threads busy working on its behalf. All the other indexes are on their own. That is fine; when that index completes all its work, the offloading threads will assist another index. We care about overall throughput, not fairness between indexes.
  • There is an issue with this design under certain loads. An offloading thread may be busy doing work on behalf of an index, while that index has completed all its other work and is now waiting for the last item to be processed. However, the offloading thread has the lowest priority, so it will rarely get to run. I don’t think we’ll see a lot of costs here, to be honest; that requires the system to be at full capacity (rare, and admins usually hate that, so they prevent it) for this to be a real issue for any length of time. If needed, we can play with thread priorities dynamically, but I would rather avoid it.

This isn’t something that we have implemented, rather something that we are currently playing with. I would love to hear your feedback on this design.

time to read 2 min | 316 words

I ran into this question on Hacker News, asking for the best computer science papers. There are a few that I keep coming back to, either because they are so fundamental or because they are so useful.

In no particular order:

  • The Raft Paper – a distributed consensus algorithm that made sense to me on first read. There are a lot of subtle issues to consider, but when reading the paper, everything clicked. That is head and shoulders above the Paxos literature.
  • The Ubiquitous B-Tree – talk about a paper that I use daily. Admittedly, I didn’t get started on B-Trees from this paper, but it is very well written and does a great job presenting the topic. It is also from 1979, and B-Trees were already “ubiquitous” at that time, which tells us something.
  • Extendible Hashing – this is also from 1979, and it is well written. I implemented extendible hashing directly from this article and grokked it right away.
  • How Complex Systems Fail – not strictly a computer science paper. In fact, I’m fairly certain that this fits more into civil engineering, but it does an amazing job of explaining the internals of complex systems and why and how they fail. I took a lot from this paper. It is also very short and highly readable.
  • OLTP Through the Looking Glass – discusses the internal structure of database engines and the costs and complexities of their various pieces.
  • You’re doing it wrong – discusses the implementation of the Varnish proxy from the point of view of a kernel hacker. A totally different approach to the design of the system; it had a lot of influence on how I build systems.

I’m fairly certain that my criteria won’t be yours, but those are all papers that I have read multiple times and have utilized their insights in my daily work.

time to read 1 min | 85 words

Every now and then I need to do some work with text, and the Enron dataset is one of the most well-known corpora for that.

I ended up writing the parsing code for it so many times that it isn’t even funny. Therefore, I decided to make my life easier and just post it somewhere I can refer back to it.

This code simply unpacks the Enron dataset into .NET objects, from which you can start processing the text in interesting ways.
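The code itself isn’t embedded in this extract; here is a minimal sketch of the idea, assuming the dataset has been extracted to disk in its usual maildir-style layout (header continuation lines are ignored for brevity):

    // C# - walk the extracted Enron maildir and split each message into
    // headers and body.
    using System;
    using System.Collections.Generic;
    using System.IO;

    public record EnronMessage(Dictionary<string, string> Headers, string Body);

    public static class EnronParser
    {
        public static IEnumerable<EnronMessage> ReadMessages(string rootDir)
        {
            foreach (var file in Directory.EnumerateFiles(rootDir, "*", SearchOption.AllDirectories))
            {
                var lines = File.ReadAllLines(file);
                var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
                int i = 0;
                for (; i < lines.Length && lines[i].Length > 0; i++)
                {
                    int colon = lines[i].IndexOf(':');
                    if (colon > 0)
                        headers[lines[i][..colon]] = lines[i][(colon + 1)..].Trim();
                }
                var body = string.Join("\n", lines[Math.Min(i + 1, lines.Length)..]);
                yield return new EnronMessage(headers, body);
            }
        }
    }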


time to read 2 min | 295 words

I have been talking a lot about candidates and the hiring process we are going through right now. I thought it would only be fair to share a story about an interview task that I failed.

That was close to 20 years ago, and I was looking for my first job. Absolutely no professional experience and painfully aware of that. I did have a few years of working on Open Source projects, so I was confident that I had a good way to show my abilities.

The question was simple, write the code to turn the contents of this table into a hierarchical XML file:

[Table of rows, each with an id, a parent id and a value; a null parent id marks the root.]

In other words, they wanted:
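The expected output isn’t reproduced in this extract; the shape was hierarchical XML, with each row nested under its parent, roughly like this (element names are illustrative):

    <root>
      <child-1>
        <grandchild-1 />
        <grandchild-2 />
      </child-1>
      <child-2 />
    </root>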

To answer the question, I was given pen and paper, by the way. That made my implementation choices quite hard, since I had to write it all out longhand. I have tried to reproduce it from memory, and it looks like this:
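This is my best reconstruction of that answer as a sketch; the table and column names (Items, Id, ParentId, Name) are assumptions:

    // C# - single query; relies on NULL parents sorting first on SQL Server,
    // and on a parent row appearing before its children.
    using System.Collections.Generic;
    using System.Xml.Linq;
    using Microsoft.Data.SqlClient;

    static XElement TableToXml(string connectionString)
    {
        var elements = new Dictionary<int, XElement>();
        XElement root = null;

        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(
            "SELECT Id, ParentId, Name FROM Items ORDER BY ParentId", conn);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            var element = new XElement((string)reader["Name"]);
            elements[(int)reader["Id"]] = element;
            if (reader["ParentId"] is int parentId)
                elements[parentId].Add(element); // the parent was already read
            else
                root = element;
        }
        return root;
    }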

This is notepad code, and I wrote it using modern APIs. At the time, I was using ADO.NET and the XmlDocument. The idea is the same, however, and it will spare you going through a mass of really uninteresting details.

I got so many challenges to this answer, though. I relied on null being sorted first on SQL Server, and on the fact that a parent must exist before its children. Aside from these assumptions, which I feel are fairly safe to make, I couldn’t figure out what the big deal was.

Eventually it turned out that the interviewers were trying to guide me toward a recursive solution. It never even occurred to me, since I was doing that with a single query and a recursive solution would need many such queries.
