Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 5 min | 955 words

The Open Closed Principle is part of the SOLID principles. It isn’t new or anything exciting, but I wanted to discuss this today in the context of using that not as a code artifact but as part of your overall architecture.

The Open Closed Principle states that the code should be opened for extension, but closed for modification. That is a fancy way to say that you should spend most of your time writing new code, not modifying old code. Old code is something that is known to be working, it is stable (hopefully), but messing around with old code can break that. Adding new code, on the other hand, carry far less risk. You may break the new things, but the old stuff will continue to work.

There is also another aspect to this, to successfully add new code to a project, you should have a structure that support that. In other words, you typically have very small core of functionality and then the entire system is built on top of this. 

Probably the best example of systems that follow the Open Closed Principle is the vast majority of PHP applications.

imageHold up,I can hear you say. Did you just called out PHP as an architectural best practice? Indeed I did, and more than that, the more basic the PHP application in question, the closer it is to the ideal of Open Closed Principle.

Consider how you’ll typically add a feature to a PHP application. You’ll create a new script file and write the functionality there. You might need to add links to that (or you already have this happen automatically), but that is about it. You aren’t modifying existing code, you are adding new one. The rest of the system just know how to respond to that and handle that appropriately.

Your shared component might be the site’s menu, a site map and the like. Adding a new functionality may occasionally involve adding a link to a new page, but for the most parts, all of those operations are safe, they are isolated and independent from one another.

In C#, on the other hand, you can do the same by adding a new class to a project. It isn’t at the same level of not even touching anything else, since it all compiles to a single binary, but the situation is roughly the same.

That is the Open Closed Principle when it applies to the code inside your application. What happens when you try to apply the same principle to your overall architecture?

I think that Terraform is a great example of doing just that. They have a plugin system that they built, which spawns a new process (so completely independent) and then connect to it via gRPC. Adding a new plugin to Terraform doesn’t involve modifying any code (you do have to update some configuration, but even that can be automated away). You can write everything using separate systems, runtime and versions quite easily.

If we push the idea a bit further, we’ll discover that Open Closed Principle at the architecture level is the Service Oriented Architecture.  Note that I explicitly don’t count Microservices in this role, because they are usually intermixed (yes, I know they aren’t supposed to, I’m talking about what is).

In those situations, adding a new feature to the system would involve adding a new service. For example, in a banking system, if you want to add a new feature to classify fraudulent transactions, how would you do it?

One way is to go to the transaction processing code and write something like:

That, of course, would mean that you are going to have to modify existing code, that is not a good idea. Welcome to six months of meeting about when you can deploy your changes to the code.

On the other hand, applying the Open Closed Principle to the architecture, we won’t ever touch the actual system that process transactions. Instead, we’ll use a side channel. Transactions will be written to a queue and we’ll be able to add listeners to the queue. In such a way, we’ll have the ability to add additional processing seamlessly. Another fraud system will just have to listen to the stream of messages and react accordingly.

Note that there is a big difference here, however, unlike with modifying the code directly, we can no longer just throw an exception to stop the process. By the time that we process the message, the transaction has already been applied. That requires that we’ll build the system in such a way that there are ways to stop transactions after the fact (maybe by actually submitting them to the central bank after a certain amount of time, or releasing them to the system only after all the configured endpoints authorized it).

At the architecture level, we are intentionally building something that is initially more complex, because we have to take into account asynchronous operations and work that happens out of band, including work that we couldn’t expect. In the context of a bank, that means that we need to provide the mechanisms for future code to intervene. For example, we may not know what we’ll want the additional code to do, but we’ll have a way to do things like pause a transaction for manual review, add additional fees, raise alerts, etc.  Those are the capabilities of the system, and the additional behavior would be policy around building that.

There are other things that make this very attractive, you don’t have to run everything at the same time, you can independently upgrade different pieces and you have clear lines of demarcation between the different pieces of your system.

time to read 2 min | 253 words

imageFrom a conceptual model, a thread and a task are very similar. That is very much by design, since the Task is meant to allow you to work with asynchronous code while maintaining the illusion that you are running in a sequential manner. It is tempting to think about this in terms of the Task actually executing the work, but that isn’t actually the case.

The Task doesn’t represent the execution of whatever asynchronous process is running, the Task represent a ticket that will be punched when the asynchronous process is done. Consider the case of going to a restaurant and asking for a table, if there isn’t an available table, you cannot be seated. What the restaurant will do is hand you a pager that will buzz when the table is ready. In the same sense, a Task is just such a token. The restaurant pager doesn’t means that someone is actively clearing a table for you. It is just something that will buzz when a table is ready.

A code sample may make things clearer:

In this case, we are manually coordinating the Task using its completion source and you can see that the Task instance that was handed when trying to get a table doesn’t actually start anything. It is simply waiting to be raised when called.

That is an important aspect of how System.Threading.Tasks.Task works, because it is usually in contrast to the mental model in our head.

time to read 4 min | 738 words

See the source imageIf you build any kind of non trivial system, one of the absolutely best things that you can do for the long term health of your system is to move all significant processing to sit behind a queue.  That is one of those things that is going to pay massive dividends down the line as your system grows. Basically, any time that you want to do something that isn’t “pull data and show it to the user” or “store the data immediately”, throwing a queue at the problem is going to make things easier in the long run.

Yes, this is a bold statement, and I’m sure that you can see that this may be abused. Nevertheless, if you’ll follow this one rule, even if you misuse it, you are likely going to be better off than if you don’t. I should probably explain.

When I’m talking about using a queue, I’m talking about moving actual processing of an operation from the request handler (controller / action / web sever ) to popping a message from a queue and processing that. The actual queue implementation (SQS, Kafka, MSMQ, ring buffer) doesn’t actually matter. It also doesn’t matter if you are writing to the queue in the same process and machine or a distributed system. What matter is that you can created a break in the system between three very important aspects of command processing:

  1. Accepting a request.
  2. Processing the request.
  3. Sending the result of the request back.

A system without a queue will do all of that inline, in this manner:

What is the problem here? If the processing of the request is complex or takes some time, you have an inherent clock here. At some point, the client is going to timeout on the request, which lead to things like this:

image

On the other hand, if you put a queue in the middle, this looks like this:

Note that there is a separation in the processing of the request and sending the accepted answer to the customer.

What is the impact of this change?

Well, it is a bit more complex to manage in the user interface. Instead of getting the response for the request immediately, we have to fetch it in a separate operation. I’m typically really strict on policing the number of remote calls, why am I advocating for an architectural pattern that requires more remote calls?

The answer is that we build, from the first step, the ability of the system to delay processing. The user interface no longer attempts to pretend that the system reacts instantly, and have far more freedom to change what we do behind the scenes.

Just putting the operation on a queue gives us the ability to shift the processing, which means that we can:

  • Maintain speedy responses and responsive system to the users.
  • Can easily bridge spikes in the system by having the queue flatten them.
  • Scale up the processing of the operations without needing to do anything in the front end.
  • Go from a local to a distribute mechanism without changing the overall architecture (that holds even if you previously held the queue in memory and processed that with a separate thread).
  • Monitor the size of the queue to get a really good indication about where we are at in terms of load.
  • Gain the ability to push updates to the backend seamlessly.

At more advanced levels:

  • Can copy message to an audit log, which gives great debugging abilities.
  • Can retry messages that failed.

There are whole patterns of message based operations that are available to you, but the things that I list above are what you get almost for free. The reason that I think you should do that upfront is that your entire system would already be used to that. Your UI (and users’ expectations) would already be set to handle potential delays. That gives you a far better system down the line. And you can play games on the front end to present the illusion that operations are accepted on the server (in pending status) without compromising the health of the system.

In short, for the health of your system, put a queue on that, your future self will thank you later.

One final word of warning, this apply to operations, not queries. Don’t bother putting queries through the queue unless they are intended to be very long lived / complex ones.

time to read 6 min | 1114 words

I had a long conversation with a dev team that are building a non trivial business system. One of the chief problems that they have to deal with is that the “business logic” that they are asked to work with is extremely mutable, situation dependent and changes frequently. That isn’t a new compliant, of course, but given that I have run into this in the past, I can deeply emphasize. The key issue is that the business rules (I refuse to call it logic) are in a constant state of flux. Trying to encode them into the software itself leads to a great deal of mess in both the code and architecture.

For example, consider the field of insurance. There are certain aspects of the insurance field that are pretty much fixed in stone (and codified into law). But there are (many) others that are much more flexible, because they relate to the business of selling insurance rather than the actual insurance itself. Because certain parts of the system cannot change (by law), all the modifications happen in other places, and those places see a lot of churn.

A marketing executive came up with a brilliant idea, let’s market a health insurance policy for young athletic people. This is the same as the usual young policy, but you get a 5% discount on the monthly premium if you have over 20 days in the month with over 10,000 steps recorded. Conversely, you get penalized with 5% surcharge if you don’t have at least 15 days of over 10,000 steps recorded. Please note that this is a real service and not something that I just made up.

Consider what such a feature means? We have to build the integration with FitBit, the ability to pull the data in, etc. But what happens next? You can be sure that there are going to be a lot more plans and offers that will use those options. You can envision another offer for a policy that gives discounts for 40+ who jogs regularly, etc.

What does this kind of thing looks like in your code? The typical answer is that this can be one of a few options:

  1. Just Say No – in some IT organizations, this is just rejected. They don’t have the capacity or ability to implement such options, therefor the business won’t be able to implement it.
  2. Yes man – whatever the business wants, the business gets. And if the code gets a little ugly, well, that is life, isn’t it?
  3. Structured – those organizations were able to figure out how to divide the static pieces and the frequently changing parts in such a way that they can ensure long term maintainability of the system.

In many cases, organizations start as the 2nd option and turn into the 1st.

In the early 2000, cellular phones plans in Israel were expensive. A family plan could cost as much as a mortgage payment. I’m not kidding, it was really that bad. One of the cellular companies had an inherent advantage, however. They were able to make new offers and plans so much faster than the companies.

  • Summer Vacation plan for your teenagers – speak free after midnight with 50 texts a week.
  • Off hours dedicated phones discounts – you can talk to 5 phone numbers between 20:00 – 08:00 and on weekends for fixed price.

All sort of stuff like that, and that worked. Some people would switch plans on a regular basis, trying to find the optimal option. The reason that this company was able to do that had to do with the manner in which they did billing.

What they did was quite amazing, even decades later. Their billing systems aggregated all the usage of a particular plan based and pushed that into a report. Then there was a directory filled with VBScript files that they would run over the report. The VBScripts were responsible for apply the logics for the plans. The fact that they wrote them in VBScript meant that they had a very well defined structure. There was all the backend work that gathered the data, then they applied the policies in the scripts. Making those kind of changes and introducing new policies was easy.

If the technique is familiar to you, that is because I talked about this in the past. In fact, I wrote a book about it. But this isn’t the time to talk about a book a dozen years old or a story from twenty years ago. Let’s talk about how we can apply this today, shall we?

For scripting, I’m going to use MoonSharp, which is a managed Lua implementation. Lua is a great scripting language, it is quite capable and at the same time usable for people without too much technical knowledge. Among other things, it also offers builtin debugging support, which can be a crucial feature for large scale systems.

At any rate, let’s consider the following logic:

As you can see, this script raise the monthly rate for a house insurance policy in a particular location. To execute this code, you’ll need something like:

Let’s look at a slightly more complex example, implementing the FitBit discount:

Those are the mechanics of how this works. How you can use MoonSharp to apply arbitrary logic to a policy. As I mentioned, I literally wrote a book about this, discussing many of the details you need to create such a system. Right now, I want to focus on the architecture impact.

The kind of code we’ll write in those scripts is usually pretty boring. Those are business rules in all their glory, quite specific and narrow, carefully applied. They are simple to understand in isolation, and as long as we keep them this way, we can reduce the overall complexity on the system.

Let’s consider how we’ll actually use them, shall we? Here is what the user will work with to customize the system. I’m saying user, but this is likely going to be used by integrators, rather than the end user.

image

That data is then going to be stored directly in our Policy object, and we can apply it as needed. A more complex solution may have the ability to attach multiple scripts to various events in the system.

This change the entire manner of your system, because you are focused on enabling the behavior of the system, rather than having to figure out how to apply the rules. The end result is that there is a very well defined structure (there has to be, for things to work) and in general an overall more maintainable system.

time to read 4 min | 659 words

About twenty years ago, I remember looking at a typical business application and most of the code was basically about massaging data to and from the database. The situation has changed, but even the most sophisticated of applications today spent an inordinate amount of time just shuffling data around. It may require lot less code, but CRUD still makes the most of the application codebase. On the one hand, that is pretty boring code to write, but on the other hand, that is boring code to write. That is an excellent property. Boring code is predictable, it will work without issues, it is easy to understand and modify and in general it lacks surprises. I love surprises when it comes to birthday parties and refunds from the IRS, I like them a lot less in my production code.

Here is a typical example of such code:

As you can see, we are using a command handling pattern and here we can choose one of a few architectural options:

  • Write the code and behavior directly inside the command handlers.
  • The command handlers are mostly about orchestration and we’ll write the business logic inside the business entities (such as the example above).

There is another aspect to the code here that is interesting, however. Take a look at the first line of code. We define there a record, a data class, that we use to note that an event happened.

You might be familiar with the notion of event sourcing, where we are recording the incoming events to the system so we’ll be able to replay them if our logic changes. In this case, that is the exact inverse of that, our code emits business events that can be processed by other pieces of the system.

The nice thing in the code above is that the business event in this case is simply writing the data record to the database. In this manner, we can participate in the overall transaction and seamlessly integrate into the overarching architecture. There isn’t much to do here, after all. You can utilize this pattern to emit whenever something that is interesting or potentially interesting happens in your application.

Aside from holding up some disk space, why exactly would you want to do this?

Well, now that you have the business events in the database, you can start operating on them. For example, we can create a report based on the paid policies by policy types and month. Of far greater interest, however, is the ability to handle such events in code. You can do that using RavenDB subscriptions.

That gives us a very important channel for extending the behavior for the system. Given the code above, let’s say that we want to add a function that would send a note to the user if their policy isn’t paid in full. I can handle that by writing the following subscription:

And then we can write a script to process that:

I intentionally show the example here as a Python script, because that doesn’t have to be a core part of the system. That can be just something that you add, either directly or as part of the system customization by an integrator.

The point is that this isn’t something that was thought of and envision by the core development team. Instead, we are emitting business events and using RavenDB subscriptions to respond to them and enhance the system with additional behavior.

One end result of this kind of system is that we are going to have two tiers of the system. One, where most of the action happens, is focused almost solely on the data management and the core pieces of the system. All the other behavior in the system is done elsewhere, in a far more dynamic manner. That gives you a lot of flexibility. It also means that there is a lot less hidden behavior, you can track and monitor all the parts much more easily, since everything is done in the open.

time to read 3 min | 496 words

Tracking down a customer’s performance issue, we eventually tracked things down to a single document modification that would grind the entire server to a halt. The actual save was working fine, it was when indexing time came around that we saw the issues. The entire system would spike in terms of memory usage and disk I/O, but CPU utilization wasn’t too bad.

We finally tracked it down to a fairly big document. Let’s assume that we have the following document:

Note that this can be big. As in, multiple megabyte range in some cases, with thousands of reviews. The case we looked at, the document was over 5MB in size and had over 1,500 reviews.

That isn’t ideal, and RavenDB will issue an performance hint when dealing with such documents, but certainly workable.

The problem was with the index, which looked like this:

This index is also setup to store all the fields being indexed. Take a look at the index, and read it a few times. Can you see what the problem is?

This is a fanout index, which I’m not a fan of, but that is certainly something that we can live with. 1,500 results from a single index isn’t even in the top order of magnitude that we have seen. And yet this index will cause RavenDB to consume a lot of resources, even if we have just a single document to index.

What is going on here?

Here is the faulty issue:

image

Give it a moment to sink in, please.

We are indexing the entire document here, once for each of the reviews that you have in the index. When RavenDB encounters a complex value as part of the indexing process, it will index that as a JSON value. There are some scenarios that call for that, but in this case, what this meant is that we would, for each of the reviews in the document:

  • Serialize the entire document to JSON
  • Store that in the index

5MB times 1,500 reviews gives us a single document costing us nearly 8GB in storage space alone. And will allocate close to 100GB (!) of memory during its operation (won’t hold 100GB, just allocate it). Committing such an index to disk requires us to temporarily use about 22GB of RAM and force us to do a single durable write that exceed the 7GB mark. Then there is the background work to clean all of that.

The customer probably meant to index book_id, but got this by mistake, and then we ended up with extreme resource utilization every time that document was modified. Removing this line meant that indexing the document went from ~8GB to 2MB. That is three orders of magnitude difference.

We are going to be adding some additional performance hints to make it clear that something is wrong in such a scenario. We had a few notices, but it was hard to figure out exactly what was going on there.

time to read 2 min | 399 words

Terraform is all the rage on the DevOps world now, and we decided to take a look. In general, Terraform is written in Go, but it uses a pretty nifty plugins system to work with additional providers. A terraform provider is an executable that you run, but it communicates with the terraform controller using gRPC.

What happens is that Terraform will invoke the executable, and then communicate with it over gRPC whose details are provided in the standard output. That means that you don’t need to write in Go, any language will do. Indeed, Samuel Fisher did all the hard work in making it possible. Please note that this post likely makes no sense if you didn’t read Samuel’s post first.

However, his work assumes that you are running on Linux, and there are a few minor issues that you have to deal with in order to get everything working properly.

For safety, Terraform uses TLS for communication, and ensures that the data is safe in transit. The provider will usually generate a self signed key and provide the public key on the standard output. I would like to understand what the security model they are trying to protect from, but at any rate, that method should be fairly safe against anything that I can think of. The problem is that there is an issue with the way the C# Terraform provider from Samuel handle the certificate in Windows. I sent a pull request to resolve it, it’s just a few lines, and it quite silly.

The next challenge is how to make Terraform actually execute your provider. The suggestion by Samuel is to copy the data to where Terraform will cache it, and use that. I don’t really like that approach, to be honest, and it looks like there are better options.

You can save a %APPDATA%\terraform.rc file with the following content:

This will ensure that your provider will be loaded from the local directory, instead of fetched over the network. Finally, there is another challenge, Terraform expects the paths and names to match, which can be quite annoying for development.

I had to run the following code to get it working:

cp F:\TerraformPluginDotNet\samples\SampleProvider\bin\release\net5.0\win-x64\publish\* F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64
mv F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\SampleProvider.exe F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\terraform-provider-dotnetsample.exe

What this does is ensure that the files are in the right location with the right name for Terraform to execute it. From there on, you can go on as usual developing your provider.

time to read 6 min | 1027 words

A RavenDB customer called us with an interesting issue. Every now and then, RavenDB will stop process any and all requests. These pauses could last for as long as two to three minutes and occurred on a fairly random, if frequent, basis.

A team of anteaters was dispatched to look at the issue (best bug hunters by far), but we couldn’t figure out what was going on. During these pauses, there was absolutely no activity on the machine. There was hardly any CPU utilization, there was no network or high I/o load and RavenDB was not responding to requests, it was also not doing anything else. The logs just… stopped for that duration. That was something super strange.

We have seen similar pauses in the past, I’ll admit. Around 2014 / 2015 we had a spate of issues very similar, with RavenDB paused for a very long time. Those issues were all because of GC issues. At the time, RavenDB would do a lot of allocations and it wasn’t uncommon to end up with the majority of the time spent on GC cleanup. The behavior at those time, however, was very different. We could see high CPU utilization and all metrics very clearly pointed out that the fault was the GC. In this case, absolutely nothing was going on.

Here is what such a pause looked like when we gathered the ETW metrics:

image004

Curiouser and curiouser, as Alice said.

This was a big instance, with quite a bit of work going on, so we spent some time analyzing the process behavior. And absolutely nothing appeared to be wrong. We finally figured out that the root cause is the GC, as you can see here:

image

The problem is that the GC is doing absolutely nothing here. For that matter, we spend an inordinate amount of time making sure that the GC won’t have much to do. I mentioned 2014/2015 earlier, as a direct result of those issues, we have fixed that by completely re-architecting RavenDB. The database uses a lot less managed memory in general and is far faster. So what the hell is going on here? And why weren’t we able to see those metrics before? It took a lot of time to find this issue, and GC is one of the first things we check.

In order to explain the issue, I would like to refer you to the Book of the Runtime and the discussion of threads suspension. The .NET GC will eventually need to run a blocking collection, when that happens, it needs to ensure that the heap is in a known state. You can read the details in the book, but the short of it is that there are what is known as GC Safe Points. If the GC needs to run a blocking collection, all managed threads must be as a safe point. What happens if they aren’t, however? Well, the GC will let them run until they reach such a point. There is a whole machinery in place to make sure that this happens. I would also recommend reading the discussion here. And Konard’s book is a great resource as well.

Coming back to the real issue, the GC cannot start until all the managed threads are at a safe point, so in order to suspend the threads, it will let them run to a safe point and suspend them there. What is a safe point, it is a bit complex, but the easiest way to think about it is that whenever there is a method call, the runtime ensures that the GC would have stable information. The distance between method calls is typically short, so that is great. The GC is not likely to wait for long for the thread to come to a safe point. And if there are loops that may take a while, the JIT will do the right thing to ensure that we won’t wait too long.

In this scenario, however, that was very much not the case. What is going on?

image

We got a page fault, which can happen anywhere, and until we return from the page fault, we cannot get to the GC Safe Point, so all the threads are suspended waiting for this page fault to complete.

And in this particular case, we had a single page fault, reading 16KB of data, that took close to two minutes to complete.

image

So the actual fault is somewhere in storage, which is out of scope for RavenDB, but a single slow write had a cascading effect to pause the whole server. The investigation continues and I’ll probably have another post on the topic when I get the details.

For what it is worth, this is a “managed language” issue, but a similar scenario can happen when we are running in native code. A page fault while holding the malloc lock would soon have the same scenario (although I think that this would be easier to figure out).

I wanted to see if I can reproduce the same issue on my side, but run into a problem. We don’t know what caused the slow I/O, and there is no easy way to do it in Windows. On the other hand, Linux has userfaultfd(), so I decided to use that.

The userfaultfd() doesn’t have a managed API, so I wrote something that should suffice for my (very limited) needs. With that, I can write the following code:

If you’ll run this with: dotnet run –c release on a Linux system, you’ll get the following output:

139826051907584 about to access
Got page fault at 139826051907584
Calling GC...

And that would be… it. The system is hang. This confirms the theory, and is quite an interesting use of the Linux features to debug a problem that happened on Windows.

time to read 3 min | 571 words

RavenDB 5.2 introduce a new concept for deploying indexes: Rolling indexes deployment.

Typically, deploying an index to production on a loaded database is something that you do only with great trepidation. There are many horror stories about creating a new index and resulting in the entire system locking down for a long period of time.

RavenDB was specifically designed to address those concerns. Deploying a new index in production is something that you are expected to do. In fact, RavenDB will create indexes for you on the fly as needed, in production.

To be perfectly honest, the two aspects are very closely tied together. The fact that we expect to be able to create an index without disruption of  service feeds into us being able to create indexes on the fly. And creating indexes on the fly ensures that we’ll need to keep being able to create indexes without putting too much load on a running system.

RavenDB limits the amount of CPU time that a new index can consume and will control the amount of memory and I/O that is used by an index to prioritize user queries over background work.

In version 5.2, we have extended this behavior to allow RavenDB further. We now allow users to deploy their own code as part of RavenDB indexes, which make it much harder to control what exactly is going on inside RavenDB during indexing. For example, you may have choose to run something like Parallel.For(), which may use more CPU than RavenDB accounts for. The situation is a bit more complex in the real world, because we need to worry about other factors as well (memory, I/O, CPU and network comes to mind).

Consider what happens if a user does something like this in an index:

RavenDB has no way to control what is actually going on there, and this code will use 1GB of RAM and quite a bit of CPU (over multiple cores) without the ability to control that. This is a somewhat contrived example, I’ll admit (can’t think of any reason you’ll want to do this sort of thing in an index). It is far more common to want to do Machine Learning Predictions in indexes now, which can have similar affects.

When pushing a large number of documents through such an index, such as the scenario of deploying a new index, that can put a lot of strain on the system. Enter: Rolling index deployment.

This is an index deployment mode where RavenDB will not immediately deploy the index to all the nodes. Instead, it will choose the least busy node and get it to run the index. At any time, only a single node in the cluster is going to run the index, and that node is going to be in the back of the line for any other work the cluster has for it. Once that node is completed, RavenDB will select the next node (and reassign work as needed).

The idea is that even if you deploy an index that has a negative impact on the system behavior, you have mitigated the potential impact to a single (hopefully unused) node.

The cost of that, of course, is that the indexes are now going to run in a serial fashion, one node at a time. That means that they will take longer to deploy to the entire system, of course, but the resource utilization is going to be far more stable.

time to read 10 min | 1949 words

For a database engine, controlling the amount of work that is being executed is a very important factor for the stability of the system. If we can’t control the work that is being done, it is quite easy to get to the point where we are overwhelmed.

Consider the case of a(n almost) pure computational task, which is completely CPU bound. In some cases, those tasks can easily be parallelized. Here is one such scenario:

This example seems pretty obvious, right? This is complete CPU bound (sorting), and leaving aside that sort itself can be done in parallel, we have many arrays that we need to sort. As you can imagine, I would much rather this task to be done as quickly as possible.

One way to make this happen is to parallelize the work. Here is a pretty simple way to do that:

Parallel.ForEach(arrayOfArrays, array => Array.Sort(array));

A single one liner and we are going to see much better performance from the system, right?

Yep, this particular scenario will run faster, depending on the sizes, that may be much faster. However, that single one liner is a nasty surprise for the system stability as a whole. What’s the issue?

Under the covers, this is going to use the .NET thread pool to do the work. In other words, this is going to be added to the global workload on the system. What else is using the same thread pool? Well, pretty much everything. For example, processing requests in Kestrel is done in the thread pool, all the async machinery uses the thread pool under the covers as well as pretty much everything else.

What is the impact of adding a heavily CPU bounded work to the thread pool, one may ask? Well, the thread pool would start executing these on its threads. This is heavy CPU work, so likely will run for a while, displacing other work. If we consider a single instance of this code, there is going to be a limit of the number of concurrent work that is placed in the thread pool. However, if we consider whatever we run the code above in parallel… we are going to be placing a lot of work on the thread pool. That is going to effectively starve the rest of the system. The thread pool will react by spawning more threads, but this is a slow process, and it is easy to get into a situation where all available threads are busy, leaving nothing for the rest of the application to run.

From the outside, it looks like a 100% CPU status, with the system being entirely frozen. That isn’t actually what is going on, we are simply holding up all the threads and can’t prioritize the work between request handling (important) and speeding up background work (less important). The other problem is that you may be running the operation in an otherwise idle system, and the non parallel code will utilize a single thread out of the many that are available.

In the context of RavenDB, we are talking here about indexing work. It turns out that there is a lot of work here that is purely computational. Analyzing and breaking apart text for full text search, sorting terms for more efficient access patterns, etc. The same problem above remains, how can we balance the indexing speed and the overall workload on the system?

Basically, we have three scenarios that we need to consider:

  1. Busy processing requests – background work should be done in whatever free time the system has (avoiding starvation), with as little impact as possible.
  2. Busy working on many background tasks – concurrent background tasks should not compete with one another and step on each other’s toes.
  3. Busy working on a single large background task – which should utilize as much of the available computing resources that we have.

For the second and third options, we need to take into account that the fact that there isn’t any current request processing work doesn’t matter if there is incoming work. In that case, the system should prioritize the request processing over background work.

Another important factor that I haven’t mentioned is that it would be really good for us to be able to tell what work is taking the CPU time. If we are running a set of tasks on multiple threads, it would be great to be able to see what tasks they are running in a debugger / profiler.

This sounds very much like a class in operating systems, in fact, scheduling work is a core work for an operating system. The question we have here, how do we manage to build a system that would meet all the above requirements, given that we can’t actually schedule CPU work directly.

We cannot use the default thread pool, because there are too many existing users there that can interfere with what we want. For that matter, we don’t actually want to have a dynamic thread pool. We want to maximize the amount of work we do for computational heavy workload. Instead, for the entire process, we will define a set of threads to handle work offloading, like this:

This creates a set of threads, one for each CPU core on the machine. It is important to note that these threads are running with the lowest priority, if there is anything else that is ready to run, it will get a priority. In order to do some background work, such as indexing, we’ll use the following mechanism:

Because indexing is a background operation in RavenDB, we do that in a background thread and we set the priority to below normal. Request process threads are running at normal priority, which means that we can rely on the operating system to run them first and at the same time, the starvation prevention that already exists in the OS scheduler will run our indexing code even if there is extreme load on the system.

So far, so good, but what about those offload workers? We need a way to pass work from the indexing to the offload workers, this is done in the following manner:

Note that the _globalWorkQueue is process wide. For now, I’m using the simple sort an array example because it make things easier, the real code would need a bit more complexity, but it is not important to explain the concept. The global queue contains queues for each task that needs to be run.

The index itself will have a queue of work that it needs to complete. Whenever it needs to start doing the work, it will add that to its own queue and try to publish to the global queue. Note that the size of the global queue is limited, so we won’t use too much memory there.

Once we published the work we want to do, the indexing thread will start working on the work itself, trying to solicit the aim of the other workers occasionally. Finally, once the indexing thread is done process all the remaining work, we need to wait for any pending work that is currently being executed by the workers. That done, we can use the results.

The workers all run very similar code:

In other words, they pull a queue of tasks from the global tasks queue and start working on that exclusively. Once they are done processing a single index queue to completion, the offload worker will try pick another from the global queue, etc.

The whole code is small and fairly simple, but there are a lot of behavior that is hidden here. Let me try to explore all of that now.

The indexing background work push all the work items to its local queue, and it will register the queue itself in the global queue for the offloading threads to process. The indexing queue may be registered multiple times, so multiple offloading threads will take part in this. The indexing code, however, does not rely on that and will also process its own queue. The idea is that if there are offloading threads available, they will help, but we do not rely on them.

The offloading thread, for its part, will grab an indexing thread queue and start processing all the items from the queue until it is done. For sorting arbitrary arrays, it doesn’t matter much, but for other workloads, we’ll likely get much better locality in terms of task execution by issuing the same operation over a large set of data.

The threads priority here is also important, mind. If there is nothing to do, the OS will schedule the offloading threads and give them work to do. If there is a lot of other things happening, it will not be scheduled often. This is fine, we are being assisted by the offloading threads, they aren’t mandatory.

Let’s consider the previous scenarios in light of this architecture and its impact.

If there are a lot of requests, the OS is going to mostly schedule the request processing threads, the indexing threads will also run, but it is mostly going to be when there is nothing else to do. The offload threads are going to get their chance, but mostly that will not have any major impact. That is fine, we want most of the processing power to be on the request processing.

If there is a lot of work, on the other hand, the situation is different. Let’s say that there are 25 indexes running and there are 16 cores available for the machine. In this case, the indexes are going to compete with one another. The offloading threads again not going to get a lot of chance to run, because there isn’t anything that adding more threads in this context will do. There is already competition between the indexing threads on the CPU resources. However, the offloading threads are going to be of some help. Indexing isn’t pure computation, there are enough cases where it needs to do I/O, in which case, the CPU core is available. The offloading threads can then take advantage of the free cores (if there aren’t any other indexing threads that are ready to run instead).

It is in the case that there is just a single task running on a mostly idle system that this really shines. The index is going to submit all its work to the global queue, at which point, the offloading threads will kick in and help it to complete the task much sooner than it would be able to other wise.

There are other things that we need to take into account in this system:

  • It is, by design, not fair. An index that is able to put its queue into the global queue first may have all the offloading threads busy working on its behalf. All the other threads are on their own. That is fine, when that index will complete all the work, the offloading threads will assist another index. We care about overall throughput, not fairness between indexes.
  • There is an issue with this design under certain loads. An offloading thread may be busy doing work on behalf of an index. That index completed all other work and is now waiting for the last item to be processed. However, the offloading thread has the lower priority, so will rarely get to run. I don’t think we’ll see a lot of costs here, to be honest, that requires the system to be at full capacity (rare, and admins usually hate that, so prevent it) to actually be a real issue for a long time. If needed, we can play with thread priorities dynamically, but I would rather avoid it.

This isn’t something that we have implemented, rather something that we are currently playing with. I would love to hear your feedback on this design.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}