I had a chat with Shawn and Wai about building high performance databases.
I think we had a really interesting discussion, and as usual, your feedback is welcome.
I had a chat with Shawn and Wai about building high performance databases.
I think we had a really interesting discussion, and as usual, your feedback is welcome.
An interesting question was raised in the mailing list, how do we handle non trivial transactions within a RavenDB cluster? The scenario in the question is money transfer between accounts, which is almost the default example for transactions.
The catch here is that now we have a set of business rules to apply.
Let’s see how this can look like in code, shall we?
I’m assuming am model similar to bitcoin, where you can break apart coins, but not merge them, by the way. What we can see is that we have non trivial logic going on here, especially as we are hiding all the actual work involved here behind the scenes. The code, as written, will work great, as long as we have a single thread of execution. We can change it easily to support concurrent work using:
With this call: session.Advanced.UseOptimisticConcurrency = true; we tell RavenDB that it should do an optimistic concurrency check on the documents when we save them. If any of the documents has changed while the code is running a concurrency exception will be raised and the whole transaction will be aborted.
This works, until you are running on a distributed cluster and have to support transactions running on different nodes. Luckily, RavenDB has an answer for that as well, with cluster wide transactions.
It is important to understand that with cluster wide transactions, we have two separate layers here. The documents are part of the overall transaction, but do not impact it. We need to use compare exchange values to allow optimistic concurrency on the whole operation. Here is the code, it is a bit dense, but the key here is that we added a compare exchange value that mirrors every account. Whenever we modify the account, we also modify the compare exchange value. Then we can use a cluster wide transaction like so:
With this in place, RavenDB will ensure that this transaction will only be accepted if both compare exchange values are the same as they were when we evaluated them. Because any change touch both the document and the compare exchange values, and because they are modified in a cluster wide transaction, we are safe even in a clustered environment. If there is a change to the data, we’ll get a concurrency exception and have to re-compute.
Important: You’ll note that we are comparing the documents and the compare exchange values to make sure that they have the same funds. Why do we need to do that? Compare exchange and documents live on separate stores and are replicated across the cluster using different mechanisms. This is because compare exchange values are using consensus and documents are using gossip for replication. It is possible that they won’t be in sync when we read them, which is why we do the extra check to ensure that we start from a level playing field.
Note that I’m using API 5.0 here, with 4.2, you need to explicitly call UpdateCompareExchangeValue, but in 5.0, we have fixed things so we do change tracking on the values. That means that it is important that the values would actually change, of course.
And this is how you write a distributed transaction with RavenDB to transfer funds in a clustered environment without needing to really worry about any of the details. Handle & retry a concurrency exception and you are pretty much done.
This post is here because we recently had to add this code to RavenDB:
Yes, we added a sleep to RavenDB, and we did it to increase performance.
The story started out with a reported performance regression. On a previous version of RavenDB, the user was able to insert 32,000 documents per second. Same code, same machine, new version of RavenDB, but the performance is 13,000 documents per second.
That is, as we call it internally, and Issue. More specifically issue: RavenDB-14777.
Deeper investigation revealed that the problem was that we are too fast, therefor we are too slow. Adding a sleep fixed the being too fast thing, so we were faster again.
You might need to read the previous paragraph a few times to make sense of it, I’m particularly proud of it. Here is what actually happened. Our bulk insert code is reading from the network and as soon as we have some data, we start parallelizing the write to disk and the read from the network. The idea is that we want to be reduce the user time, so we maximize the amount of work we do. This is a fairly standard optimization for us and has paid many dividends in performance. The way it works, we read from the network until there is nothing available in memory and we have to wait for I/O, at which point we start writing to the disk and wait for the network I/O to resume the operation.
However, the issue is that the versions that the user was trying also included a runtime change. The old version run on .NET Core 2.2 and the new version run on .NET Core 3.1. There has been many optimizations as a result of this change, and it seems that the read from network path has benefited from these.
As a result, we would be able to read the data from the buffer more quickly, which meant that we would end up faster with waiting for network I/O. And that meant that we would do a lot more disk writes because we were better in reading from the network. And that, in turn, slowed down the whole thing enough to be noticeable.
Our change means that we’ll only queue a new disk operation if there has been 5 milliseconds with no new network traffic (or a bunch of other conditions that you don’t really care about). This way, we retain the parallel work and not saturate the disk with small writes.
As I said earlier, we had to pump the brakes to get into real high speed.
The recording for my webinar about event sourcing and RavenDB is up.
Let me know what you think about the techniques outlined there.
If you have a topic you think would be interesting to have a webinar on, let me know, I’m looking for more things to talk about.
The RavenDB Cloud team has been silently working on a whole bunch of features. Many of them are backend features that as a user, you don’t really care about. But we finally got around to unveiling a lot of the stuff that is customer facing.
RavenDB Cloud now has a public API so you can automate your cluster and deployments. You can go into your cloud account, create an API key and you are done. We provide APIs for .NET and Node.js out of the box, or you can use Swagger to generate a client for your own platform.
Here is how you can use it to create a new cluster:
What might be of more interest to you is the ability to change the cluster type (which impact its capabilities and costs) on the fly. Here is how we can upgrade our previous cluster to a higher tier:
RavenDB Cloud is able to do such operation without taking anything down, and upgrading or downgrading the cluster is a live operation that takes up to 15 minutes.
In other news, as you might have noticed that I’m deploying this to Google Cloud. We are now fully live in GCP, not longer in beta.
On Amazon’s side, we have added support for the new Africa region. We are also planning on adding a whole bunch of regions on Azure side of the pond.
Please let us know what you are doing with the API, I’m sure your ingenuity is going to exceed our expectations.
I’ve started the work to update the Inside RavenDB book to version 5.0, covering all the new things that happened in RavenDB since the 4.0 release.
I have just completed a major milestone, I finished the chapter discussing time series and counters, and I would really appreciate any comments you have to make.
You can find the new chapter here, look at Chapter 5, I removed all the rest.
Please let me know what you think.
I’m going to be talking about event sourcing in RavenDB tomorrow. The registration link and full details are here.
The task today is to do some fraud detection. We have a set of a few million records of transactions, each one of them looking like this:
We want to check a new transaction against the history of data for this particular customer. The problem is that we have a very limited time frame to approve / reject a transaction. Trying to crawl through a lot of data about the customer may very well lead to the transaction timing out. So we prepare in advance. Here is the index I created, which summarize information about a particular customer:
I’ll have to admit that I’m winging all the actual logic here. I’m not sure what you’ll need for approving a transaction, but I pulled up a few things. Here is the result of the index for a particular customer:
What is most important, however, is the time it takes to get this result:
The idea is that we are able to gather information that can quickly determine if the next transaction is valid. For example, we can check if the current merchant is in the list of good destinations. If so, that is a strong indication it is good.
We can also check if the amount of the transaction matches previous amounts for the transaction, as well as many other checks. The key here is that we can get RavenDB to do all the work in the background and immediately produce the right result for us when needed.
I got an interesting question while discussing timeseries, and I thought it would make for a good blog post. Consider the timeseries that we have on the right. It shows a particular sensor value over a period of time.
The sensor in question reports data only when the value changes, but we want to get the data on every 30 minutes scale. How can we do that? It turns out that this is pretty easy to do by just asking RavenDB to do so. No, there isn’t a specific feature for that, but we don’t need to.
All we have to do here is just write the logic as a query function. I’ll admit that date handling in JS isn’t nice, but it works. And it gives us the results in the right tempo for us to do something about it.
Here is how the final data looks like:
Not very exciting, I know, but RavenDB it the kind of database that aims to sit there do stuff, not get you all hot and bothered watching your production systems melt.
There have been a couple of cases where I was working on a feature, usually a big and complex one, that made me realize that I’m just not smart enough to actually build it.
A long time ago (five or six years), I had to tackle free space handling inside of Voron. When an operation cause a page to be released, we need to register that page somewhere, so we’ll be able to reuse it later on. This doesn’t seem like too complex a feature, right? It is just a free list, what is the issue?
The problem was that the free list itself was managed as pages, and changes to the free list might cause pages to be allocated or de-allocated. This can happen while the free list operation is running. Basically, I had to make the free list code re-entrant. That was something that I tried to do for a long while, before just giving up. I wasn’t smart enough to handle that scenario properly. I could plug the leaks, but it was too complex and obvious that I’m going to run into additional cases where this would cause issues.
I had to use another strategy to handle this. Instead of allowing the free list to do dynamic allocations, I allocated the free list once, and then used a bitmap mode to store the data itself. That means that modifications to the free list couldn’t cause us to mutate the structure of the free list itself. The crisis was averted and the feature was saved.
I just checked, and the last meaningful change that happened for this piece of code was in Oct 2016, so it has been really stable for us.
The other feature where I realized I’m not smart enough to handle came up recently. I was working on the documents compression feature. One of the more important aspects of this feature is the ability to retrain, which means that when compressing a document, we’ll use a dictionary to reduce the size, and if the dictionary isn’t well suited, we’ll create a new one. The first implementation used a dictionary per file range. So all the documents that are placed on a particular location will use the same dictionary. That has high correlation to the documents written at the same time and had the benefit of ensuring that new updates to the same document will use its original dictionary. That is likely to result in good compression rate over time.
The problem was that during this process, we may find out that the dictionary isn’t suited for our needs and that we need to retrain. At this point, however, we were already computed the required space. But… the new dictionary may compress the data different (both higher and lower than what was expected). The fact that I could no longer rely on the size of the data during the save process lead to a lot of heartache. The issue was worse because we first try to compress the value using a specific dictionary, find that we can’t place it in the right location and need to put it in another place.
However, to find the new place, we need to be know what is the size that we need to allocate. And the new location may have a different dictionary, and there the compressed value is different…
Data may move inside RavenDB for a host of reasons, compaction, defrag, etc. Whenever we would move the data, we would need to recompress it, which led to a lot of complexity. I tried fighting that for a while, but it was obvious that I can’t manage the complexity.
I changed things around. Instead of having a dictionary per file range, I tagged the compressed value with a dictionary id. That way, each document could store the dictionary that it was using. That simplified the problem space greatly, because I only need to compress the data once, and afterward the size of the data remains the same. It meant that I had to keep a registry of the dictionaries, instead of placing a dictionary at the end of the specified file range, and it somewhat complicates recovery, but the overall system is ever so much simpler and it took a lot less effort and complexity to stabilize.
I’m not sure why just these problems shown themselves to be beyond my capabilities. But it does bring to mind a very important quote:
When I Wrote It, Only God and I Knew the Meaning; Now God Alone Knows
I also have to admit that I have had no idea that this quote predates code and computing. The earlier known reference for it is from 1826.
No future posts left, oh my!