For a new feature in RavenDB, I needed to associate each transaction with a source ID. The underlying idea is that I can aggregate transactions from multiple sources in a single location, but I need to be able to distinguish between transactions from A and B.
Luckily, I had the foresight to reserve space in the Transaction Header: a whole 16 bytes available to me. Separately, each Voron database (the underlying storage engine that we use) has a unique Guid identifier. And a Guid is 16 bytes… so everything is pretty awesome.
There was just one issue. I needed to be able to read transactions as part of the recovery of the database, but we stored the database ID inside the database itself. I figured out that I could also put a copy of the database ID in the global file header and was able to move forward.
This is part of a much larger change, so I was going full steam ahead when I realized something pretty awful. That database Guid that I was relying on was already being used as the physical identifier of the storage as part of the way RavenDB distributes data. The reason it matters is that under certain circumstances, we may need to change that.
If we change the database ID, we lose the association with the transactions for that database, leading to a whole big mess. I started sketching out a design for figuring out that the database ID has changed, re-writing all the transactions in storage, and… a colleague said: why don’t we use another ID?
It hit me like a ton of bricks. I was using the existing database Guid because it was already there, so it seemed natural to want to reuse it. But there was no benefit in doing that. Instead, it added a lot more complexity because I was adding (many) additional responsibilities to the value that it didn’t have before.
Creating a Guid is pretty easy, after all, and I was able to dedicate one, which I called the Journal ID, to this purpose. The existing Database ID is still there, and the two are completely unrelated. Changing the Database ID has no impact on the Journal ID, so the problem space is radically simplified.
I had to throw away heaps of complexity because of a single comment. I used the Database ID because it was there, never considering having a dedicated value for it. That single suggestion led to a better, simpler design and faster delivery.
It is funny how you can sometimes be so focused on the problem at hand, when a step back will give you a much wider view and a better path to the solution.
We ran into a memory issue recently in RavenDB, which had a pretty interesting root cause. Take a look at the following code and see if you can spot what is going on:
The code handles flushing data to disk based on the maximum transaction ID. Can you see the memory leak?
If we have a lot of load on the system, this will run just fine. The problem is when the load is over. If we stop writing new items to the system, it will keep a lot of stuff in memory, even though there is no reason for it to do so.
The reason for that is the call to TryPeek(). You can read the source directly, but the basic idea is that when you peek, you have to guard against concurrent TryTake(). If you are not careful, you may encounter something called a torn read.
Let’s try to explain it in detail. Suppose we store a large struct in the queue and call TryPeek() and TryTake() concurrently. The TryPeek() starts copying the struct to the caller at the same time that TryTake() does the same and zeros the value. So it is possible that TryPeek() would get an invalid value.
To handle that, if you are using TryPeek(), the queue will not zero out the values. This means that until that queue segment is completely full and a new one is generated, we’ll retain references to those buffers, leading to an interesting memory leak.
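Here is a minimal sketch of the pattern (illustrative only, not the actual RavenDB code). It uses ConcurrentQueue<T>, whose equivalent of the take operation is TryDequeue():

using System.Collections.Concurrent;

public record PendingWrite(long TransactionId, byte[] Buffer);

public class Flusher
{
    private readonly ConcurrentQueue<PendingWrite> _pending = new();

    public void Register(PendingWrite write) => _pending.Enqueue(write);

    public void ReleaseFlushed(long maxFlushedTransactionId)
    {
        // We peek to check the transaction ID before removing the item.
        // Once TryPeek() has been called on a segment, ConcurrentQueue<T>
        // preserves the dequeued slots instead of clearing them, so the
        // PendingWrite instances (and their buffers) stay reachable until
        // the whole segment is eventually discarded.
        while (_pending.TryPeek(out var head) &&
               head.TransactionId <= maxFlushedTransactionId)
        {
            _pending.TryDequeue(out _);
        }
    }
}

Under constant load the segments are recycled quickly, so the retention is invisible; once the writes stop, the last segment (and everything it still references) just sits there.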
RavenDB is a transactional database; we care deeply about ACID. The D in ACID stands for durability, which means that to acknowledge a transaction, we must write it to a persistent medium. Writing to disk is expensive, and writing to the disk while ensuring durability is even more expensive.
After seeing some weird performance numbers on a test machine, I decided to run an experiment to understand exactly how durable writes affect disk performance.
A few words about the term durable writes. Disks are slow, so we use buffering & caches to avoid going to the disk. But a write to a buffer isn’t durable. A failure could cause it to never hit a persistent medium. So we need to tell the disk in some way that we are willing to wait until it can ensure that this write is actually durable.
This is typically done using either fsync or O_DIRECT | O_DSYNC flags. So this is what we are testing in this post.
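Roughly, here is the difference at the code level. This is a minimal .NET sketch for illustration only (the benchmark below uses OS-level tooling, not my own code); FileOptions.WriteThrough and Flush(flushToDisk: true) are the closest managed analogues to the write-through flags and fsync:

var payload = new byte[256 * 1024];

// Buffered write: returns as soon as the data is in the OS cache.
// A crash before the OS flushes it can lose the data.
using (var buffered = new FileStream("data.bin", FileMode.Create, FileAccess.Write,
           FileShare.None, bufferSize: 1 << 20))
{
    buffered.Write(payload, 0, payload.Length);
}

// Durable write: ask the OS (and the disk) to persist before we consider it done.
using (var durable = new FileStream("data.bin", FileMode.Create, FileAccess.Write,
           FileShare.None, bufferSize: 1 << 20, FileOptions.WriteThrough))
{
    durable.Write(payload, 0, payload.Length);
    durable.Flush(flushToDisk: true); // roughly the equivalent of fsync()
}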
I wanted to test things out without any of my own code, so I pre-allocated a file and ran a series of plain write commands against it - normal (buffered) writes with different sizes (256 KB, 512 KB, etc.).
I got myself an i4i.xlarge instance on AWS and started running some tests. That machine has a local NVMe drive of about 858 GB, 32 GB of RAM, and 4 cores. Let’s see what kind of performance I can get out of it.
| Write size | Total writes | Buffered writes |
|------------|--------------|-----------------|
| 256 KB | 256 MB | 1.3 GB/s |
| 512 KB | 512 MB | 1.2 GB/s |
| 1 MB | 1 GB | 1.2 GB/s |
| 2 MB | 2 GB | 731 MB/s |
| 8 MB | 8 GB | 571 MB/s |
| 16 MB | 16 GB | 561 MB/s |
| 2 MB | 8 GB | 559 MB/s |
| 1 MB | 1 GB | 554 MB/s |
| 4 KB | 16 GB | 557 MB/s |
| 16 KB | 16 GB | 553 MB/s |
What you can see here is that writes are really fast when buffered. But once the total write size goes above 1 GB or so, we probably start having to actually write to the disk itself (which is NVMe, remember). Our top speed is about 550 MB/s at that point, regardless of the size of the buffers I'm passing to the write() syscall.
I’m writing here using cached I/O, which is something that as a database vendor, I don’t really care about. What happens when we run with direct & sync I/O, the way I would with a real database? Here are the numbers for the i4i.xlarge instance for durable writes.
| Write size | Total writes | Durable writes |
|------------|--------------|----------------|
| 256 KB | 256 MB | 1.3 GB/s |
| 256 KB | 1 GB | 1.1 GB/s |
| 16 MB | 16 GB | 584 MB/s |
| 64 KB | 16 GB | 394 MB/s |
| 32 KB | 16 GB | 237 MB/s |
| 16 KB | 16 GB | 126 MB/s |
In other words, when using direct I/O, the smaller the write, the more time it takes. Remember that we are talking about forcing the disk to write the data, and we need to wait for it to complete before moving to the next one.
For 16 KB writes, buffered writes achieve a throughput of 553 MB/s vs. 126 MB/s for durable writes. This makes sense, since those writes are cached, so the OS is probably sending big batches to the disk. The numbers we have here clearly show that bigger batches are better.
My next test was to see what would happen when I try to write things in parallel. In this test, we run 4 processes that write to the disk using direct I/O and measure their output.
I assume that I’m maxing out the throughput on the drive, so the total rate across all commands should be equivalent to the rate I would get from a single command.
To run this in parallel I’m using a really simple mechanism - just spawn processes that would do the same work. Here is the command template I’m using:
This would write to 4 different portions of the same file, but I also tested that on separate files. The idea is to generate a sufficient volume of writes to stress the disk drive.
| Write size | Total writes | Durable & parallel writes |
|------------|--------------|---------------------------|
| 16 MB | 8 GB | 650 MB/s |
| 16 KB | 64 GB | 252 MB/s |
I also decided to write some low-level C code to test out how this works with threads and a single program. You can find the code here. I basically spawn NUM_THREADS threads, and each will open a file using O_SYNC | O_DIRECT and write to the file WRITE_COUNT times with a buffer of size BUFFER_SIZE.
This code just opens a lot of files and tries to write to them using direct I/O with 8 KB buffers. In total, I’m writing 16 GB (128 MB x 128 threads) to the disk. I’m getting a rate of about 320 MB/sec when using this approach.
As before, increasing the buffer size seems to help here. I also tested a version where we write using buffered I/O and call fsync every now and then, but I got similar results.
The interim conclusion that I can draw from this experiment is that NVMes are pretty cool, but once you hit their limits you can really feel it. There is another aspect to consider though, I’m running this on a disk that is literally called ephemeral storage. I need to repeat those tests on real hardware to verify whether the cloud disk simply ignores the command to persist properly and always uses the cache.
That is supported by the fact that using direct & sync I/O on small data sizes didn't have a big impact (and I expected it would). Given that the point of direct I/O in this case is to force the disk to properly persist (so the write would be durable in the case of a crash), while at the same time an ephemeral disk is wiped if the host machine is restarted, that gives me good reason to believe that these numbers are because the hardware “lies” to me.
In fact, if I were in charge of those disks, lying about the durability of writes would be the first thing I would do. Those disks are local to the host machine, so we have two failure modes that we need to consider:
The VM crashed - in which case the disk is perfectly fine and “durable”.
The host crashed - in which case the disk is considered lost entirely.
Therefore, there is no point in trying to achieve durability, so we can’t trust those numbers.
The next step is to run it on a real machine. The economics of benchmarks on cloud instances are weird. For a one-off scenario, the cloud is a godsend. But if you want to run benchmarks on a regular basis, it is far more economical to just buy a physical machine. Within a month or two, you’ll already see a return on the money spent.
We got a machine in the office called Kaiju (a Japanese term for enormous monsters, think: Godzilla) that has:
32 cores
188 GB RAM
2 TB NVMe for the system disk
4 TB NVMe for the data disk
I ran the same commands on that machine as well and got really interesting results.
| Write size | Total writes | Buffered writes |
|------------|--------------|-----------------|
| 4 KB | 16 GB | 1.4 GB/s |
| 256 KB | 256 MB | 1.4 GB/s |
| 2 MB | 2 GB | 1.6 GB/s |
| 2 MB | 16 GB | 1.7 GB/s |
| 4 MB | 32 GB | 1.8 GB/s |
| 4 MB | 64 GB | 1.8 GB/s |
We are faster than the cloud instance, and we don’t have a drop-off point when we hit a certain size. We are also seeing higher performance when we throw bigger buffers at the system.
But when we test with small buffers, the performance is also great. That is amazing, but what about durable writes with direct I/O?
I tested the same scenario with both buffered and durable writes:
| Mode | Buffered | Durable |
|------|----------|---------|
| 1 MB buffers, 8 GB write | 1.6 GB/s | 1.0 GB/s |
| 2 MB buffers, 16 GB write | 1.7 GB/s | 1.7 GB/s |
Wow, that is an interesting result, because it means that when we use direct I/O with 1 MB buffers, we lose about 600 MB/sec compared to buffered I/O. Note that this is actually a pretty good result: 1 GB/sec is amazing.
And if you use big buffers, then the cost of direct I/O is basically gone. What about when we go the other way around and use smaller buffers?
| Mode | Buffered | Durable |
|------|----------|---------|
| 128 KB buffers, 8 GB write | 1.7 GB/s | 169 MB/s |
| 32 KB buffers, 2 GB write | 1.6 GB/s | 49.9 MB/s |
| Parallel: 8, 1 MB, 8 GB | 5.8 GB/s | 3.6 GB/s |
| Parallel: 8, 128 KB, 8 GB | 6.0 GB/s | 550 MB/s |
For buffered I/O - I’m getting simply dreamy numbers, pretty much regardless of what I do 🙂.
For durable writes, the situation is clear. The bigger the buffer we write, the better we perform, and we pay for small buffers. Look at the numbers for 128 KB in the durable column for both single-threaded and parallel scenarios.
169 MB/s in the single-threaded result, but with 8 parallel processes, we didn’t reach 1.3 GB/s (which is 169x8). Instead, we achieved less than half of our expected performance.
It looks like there is a fixed cost for making a direct I/O write to the disk, regardless of the amount of data that we write. When using 32 KB writes, we don't even break into the 200 MB/sec range. And with 8 KB writes, we are barely breaking into the 50 MB/sec range.
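To put some rough numbers on that fixed cost: 169 MB/sec with 128 KB writes is about 1,350 writes per second, and 49.9 MB/sec with 32 KB writes is about 1,600 writes per second. In both cases we land at roughly 0.6-0.75 ms per durable write, more or less independent of how much data each write carries.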
Those are some really interesting results because they show a very strong preference for bigger writes over smaller writes.
I also tried using the same C code as before. As a reminder, we use direct I/O to write to 128 files in batches of 8 KB, writing a total of 128 MB per file. All of that is done concurrently to really stress the system.
When running iotop in this environment, we get:
Total DISK READ:   0.00 B/s | Total DISK WRITE:   522.56 M/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 567.13 M/s
TID PRIO USER DISK READ DISK WRITE> COMMAND
142851 be/4 kaiju-1 0.00 B/s 4.09 M/s ./a.out
142901 be/4 kaiju-1 0.00 B/s 4.09 M/s ./a.out
142902 be/4 kaiju-1 0.00 B/s 4.09 M/s ./a.out
142903 be/4 kaiju-1 0.00 B/s 4.09 M/s ./a.out
142904 be/4 kaiju-1 0.00 B/s 4.09 M/s ./a.out
... redacted ...
So each thread is getting about 4.09 MB/sec for writes, but we total 522 MB/sec across all writes. I wondered what would happen if I limited it to fewer threads, so I tried with 16 concurrent threads, resulting in:
Total DISK READ:   0.00 B/s | Total DISK WRITE:   89.80 M/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 110.91 M/s
TID PRIO USER DISK READ DISK WRITE> COMMAND
142996 be/4 kaiju-1 0.00 B/s 5.65 M/s ./a.out
143004 be/4 kaiju-1 0.00 B/s 5.62 M/s ./a.out
142989 be/4 kaiju-1 0.00 B/s 5.62 M/s ./a.out
... redacted ...
Here we can see that each thread is getting (slightly) more throughput, but the overall system throughput is greatly reduced.
To give some context, with 128 threads running, the process wrote 16GB in 31 seconds, but with 16 threads, it took 181 seconds to write the same amount. In other words, there is a throughput issue here. I also tested this with various levels of concurrency:
| Concurrency (8 KB x 16K times = 128 MB each) | Throughput per thread | Time / MB written |
|----------------------------------------------|-----------------------|-------------------|
| 1 | 15.5 MB / sec | 8.23 seconds / 128 MB |
| 2 | 5.95 MB / sec | 18.14 seconds / 256 MB |
| 4 | 5.95 MB / sec | 20.75 seconds / 512 MB |
| 8 | 6.55 MB / sec | 20.59 seconds / 1024 MB |
| 16 | 5.70 MB / sec | 22.67 seconds / 2048 MB |
To give some context, here are two attempts to write 2GB to the disk:
| Concurrency | Write | Throughput | Total written | Total time |
|-------------|-------|------------|---------------|------------|
| 16 | 128 MB in 8 KB writes | 5.7 MB / sec | 2,048 MB | 22.67 sec |
| 8 | 256 MB in 16 KB writes | 12.6 MB / sec | 2,048 MB | 22.53 sec |
| 16 | 256 MB in 16 KB writes | 10.6 MB / sec | 4,096 MB | 23.92 sec |
In other words, we can see the impact of concurrent writes. There is absolutely some contention at the disk level when making direct I/O writes. The impact is related to the number of writes rather than the amount of data being written.
Bigger writes are far more efficient. And concurrent writes allow you to get more data overall but come with a horrendous latency impact for each individual thread.
The difference between the cloud and physical instances is really interesting, and I have to assume that this is because the cloud instance isn’t actually forcing the data to the physical disk (it doesn’t make sense that it would).
I decided to test that on an m6i.2xlarge instance with a 512 GB io2 disk with 16,000 IOPS.
The idea is that an io2 disk has to be durable, so it will probably have similar behavior to physical hardware.
| Disk | Buffer size (KB) | Writes | Durable | Parallel | Total (MB) | Rate (MB/s) |
|------|------------------|--------|---------|----------|------------|-------------|
| io2 | 256 | 1,024 | No | 1 | 256 | 1,638.40 |
| io2 | 2,048 | 1,024 | No | 1 | 2,048 | 1,331.20 |
| io2 | 4 | 4,194,304 | No | 1 | 16,384 | 1,228.80 |
| io2 | 256 | 1,024 | Yes | 1 | 256 | 144.00 |
| io2 | 256 | 4,096 | Yes | 1 | 1,024 | 146.00 |
| io2 | 64 | 8,192 | Yes | 1 | 512 | 50.20 |
| io2 | 32 | 8,192 | Yes | 1 | 256 | 26.90 |
| io2 | 8 | 8,192 | Yes | 1 | 64 | 7.10 |
| io2 | 1,024 | 8,192 | Yes | 1 | 8,192 | 502.00 |
| io2 | 1,024 | 2,048 | No | 8 | 2,048 | 1,909.00 |
| io2 | 1,024 | 2,048 | Yes | 8 | 2,048 | 1,832.00 |
| io2 | 32 | 8,192 | No | 8 | 256 | 3,526.00 |
| io2 | 32 | 8,192 | Yes | 8 | 256 | 150.90 |
| io2 | 8 | 8,192 | Yes | 8 | 64 | 37.10 |
In other words, we are seeing pretty much the same behavior as on the physical machine, unlike the ephemeral drive.
In conclusion, it looks like the limiting factor for direct I/O writes is the number of writes, not their size. There appears to be some benefit for concurrency in this case, but there is also some contention. The best option we got was with big writes.
Interestingly, big writes are a win, period. For example, 16 MB writes, direct I/O:
Single-threaded - 4.4 GB/sec
2 threads - 2.5 GB/sec X 2 - total 5.0 GB/sec
4 threads - 1.4 GB/sec x 4 - total 5.6 GB/sec
8 threads - ~590 MB/sec x 8 - total 4.6 GB/sec
Writing 16 KB, on the other hand:
8 threads - 11.8 MB/sec x 8 - total 93 MB/sec
4 threads - 12.6 MB/sec x 4 - total 50.4 MB/sec
2 threads - 12.3 MB/sec x 2 - total 24.6 MB/sec
1 thread - 23.4 MB/sec
This leads me to believe that there is a bottleneck somewhere in the stack, where we need to handle the durable write, but it isn’t related to the actual amount we write. In short, fewer and bigger writes are more effective, even with concurrency.
As a database developer, that leads to some interesting questions about design. It means that I want to find some way to batch more writes to the disk, especially for durable writes, because it matters so much.
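To make that design direction concrete, here is a simplified sketch of that kind of batching (a group-commit style writer; this is an illustration of the idea, not RavenDB's actual implementation): many pending requests are combined into one big write, a single durable flush covers all of them, and only then is everyone acknowledged.

using System.Collections.Concurrent;

public class BatchingJournalWriter
{
    private readonly BlockingCollection<(byte[] Data, TaskCompletionSource Done)> _pending = new();
    private readonly FileStream _journal = new("journal.bin", FileMode.Append, FileAccess.Write,
        FileShare.None, bufferSize: 1 << 20, FileOptions.WriteThrough);

    public Task WriteAsync(byte[] data)
    {
        var tcs = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending.Add((data, tcs));
        return tcs.Task; // completes once the data is durably on disk
    }

    public void WriteLoop()
    {
        while (true)
        {
            // Block for the first pending request, then drain whatever else is already waiting.
            var batch = new List<(byte[] Data, TaskCompletionSource Done)> { _pending.Take() };
            while (_pending.TryTake(out var more))
                batch.Add(more);

            foreach (var (data, _) in batch)
                _journal.Write(data, 0, data.Length);
            _journal.Flush(flushToDisk: true); // one durable flush pays the fixed cost once

            foreach (var (_, done) in batch)
                done.SetResult(); // acknowledge the whole batch together
        }
    }
}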
The task is to build a Work Breakdown Structure, where you have:
Projects
Major deliverables
Sub-deliverables
Work packages
The idea is to be able to track EstimatedHours and CompletedHours across the entire tree. For example, let’s say that I have the following:
Project: Bee Keeper Chronicle App
Major deliverable: App Design
Sub-deliverable: Wireframes all screens
Work Package: Login page wireframe
Users can add the EstimatedHours and CompletedHours at any level, and we want to be able to aggregate the data upward. So the Project: “Bee Keeper Chronicle App” should have a total estimated time and the number of hours that were worked on.
The question is how to model & track that in RavenDB. I think each item should be its own document, carrying its own hours and a reference to its parent.
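As a sketch (EstimatedHours, CompletedHours, and the Parent.Id / Parent.Type fields are what the index function below relies on; the rest of the names are illustrative):

public class ParentReference
{
    public string Id { get; set; }    // e.g. "SubDeliverables/31-A" (illustrative ID)
    public string Type { get; set; }  // the collection the parent belongs to
}

public class WorkPackage
{
    public string Name { get; set; }
    public double EstimatedHours { get; set; }
    public double CompletedHours { get; set; }
    public ParentReference Parent { get; set; } // null only for the top-level Project
}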
I used a Parent relationship, since that is the most flexible (it allows you to move each item to a completely different part of the tree easily). Now, we need to aggregate the values for all of those items using a Map-Reduce index.
Because of the similar structure, I created the following JS function:
function processWorkBreakdownHours(doc) {
    let hours = {
        EstimatedHours: doc.EstimatedHours,
        CompletedHours: doc.CompletedHours
    };
    let results = [Object.assign({ Scope: id(doc) }, hours)];

    let current = doc;
    while (current.Parent) {
        current = load(current.Parent.Id, current.Parent.Type);
        results.push(Object.assign({ Scope: id(current) }, hours));
    }

    return results;
}
This will scan up the hierarchy and emit the document's estimated and completed hours for its own scope and for every level above it. Now we just need to add this file as an Additional Source on a map-reduce index, call the function from the map of each relevant collection (Projects, Major deliverables, Sub-deliverables, and Work Packages), and have the reduce sum the hours per Scope.
The end result is automatic aggregation at all levels. Change one item, and it will propagate upward.
Considerations:
I’m using load() here, which creates a reference from the parent to the child. The idea is that if we move a Work Package from one Sub-deliverable to another (in the same or a different Major & Project), this index will automatically re-index what is required and get you the right result.
However, that also means that whenever the Major document changes, we’ll have to re-index everything below it (because it might have changed the Project). There are two ways to handle that.
Instead of using load(), we’ll use noTracking.load(), which tells RavenDB that when the referenced document changes, we should not re-index.
Provide the relevant scopes directly at the document level.
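For example (a sketch of the second option, not a prescribed schema), each document can carry the IDs of every level it belongs to, and the map function can then emit one result per scope without calling load() at all:

public class WorkPackage
{
    public string Name { get; set; }
    public double EstimatedHours { get; set; }
    public double CompletedHours { get; set; }

    // The IDs of this item and everything above it (sub-deliverable, major
    // deliverable, project). The index emits one entry per scope from this list,
    // so a change to a parent no longer forces re-indexing of the children.
    public List<string> Scopes { get; set; }
}

The trade-off is that moving an item now means updating the Scopes list on it and on everything beneath it, instead of letting the index walk the Parent chain with load().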
When building RavenDB, we occasionally have to deal with some ridiculous numbers in both size and scale. In one of our tests, we ran into an interesting problem. Here are the performance numbers of running a particular query 3 times.
First Run: 19,924 ms
Second Run: 3,181 ms
Third Run: 1,179 ms
Those are not good numbers, so we dug into this to try to figure out what is going on. Here is the query that we are running:
from index 'IntFloatNumbers-Lucene' where Int > 0
And the key here is that this index covers 400 million documents, all of which are actually greater than 0. So this is actually a pretty complex task for the database to handle, mostly because of the internals of how Lucene works.
Remember that we provide both the first page of the results as well as its total number. So we have to go through the entire result set to find out how many items we have. That is a lot of work.
But it turns out that most of the time here isn't actually spent processing the query, but dealing with the GC, as the GC logs from those runs made very clear. Conceptually, Lucene holds the terms for the index as a big sorted array of strings in managed memory. That isn't exactly how it works, mind, but the details don't really matter for this story. The key here is that we have an array with a sorted list of terms, and in this case, we have a lot of terms.
Those values are cached, so they aren’t actually allocated and thrown away each time we query. However, remember that the .NET GC uses a Mark & Sweep algorithm. Here is the core part of the Mark portion of the algorithm:
long _marker;

void Mark()
{
    var currentMarker = ++_marker;

    foreach (var root in GetRoots())
    {
        Mark(root);
    }

    void Mark(object o)
    {
        // already visited
        if (GetMarker(o) == currentMarker)
            return;
        SetMarker(o, currentMarker); // remember that we visited it

        foreach (var child in GetReferences(o))
        {
            Mark(child);
        }
    }
}
Basically, start from the roots (static variables, items on the stack, etc.), scan the reachable object graph, and mark all the objects in use. The code above is generic, of course (and basically pseudo-code), but let’s consider what the performance will be like when dealing with an array of 190M strings.
It has to scan the entire thing, which means it is proportional to the number of objects. And we do have quite a lot of those.
The problem was the number of managed objects, so we pulled all of those out. We moved the term storage to unmanaged memory, outside the purview of the GC. As a result, we now have the following Memory Usage & Object Count:
Number of objects for GC (under LuceneIndexPersistence): 168K (~6.64GB)
So the GC time is now in the range of 100ms, instead of several seconds. This change helps both reduce overall GC pause times and greatly reduce the amount of CPU spent on managing garbage.
Those are still big queries, but now we can focus on executing the query, rather than managing maintenance tasks. Incidentally, those sorts of issues are one of the key reasons why we built Corax, which can process queries directly on top of persistent structures, without needing to materialize anything from the disk.
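To illustrate the general technique of taking data out of the GC's reach (a minimal sketch of the idea, not the actual LuceneIndexPersistence or Corax code): pack the terms into a single unmanaged buffer and reference them by offset, so the GC sees a handful of objects instead of millions of strings.

using System.Runtime.InteropServices;
using System.Text;

public unsafe sealed class UnmanagedTermStore : IDisposable
{
    private readonly byte* _buffer;
    private readonly int _capacity;
    private int _used;
    // Value-type (offset, length) pairs: one managed List, not millions of strings.
    private readonly List<(int Offset, int Length)> _terms = new();

    public UnmanagedTermStore(int capacity)
    {
        _capacity = capacity;
        _buffer = (byte*)NativeMemory.Alloc((nuint)capacity);
    }

    public int Add(string term)
    {
        var bytes = Encoding.UTF8.GetBytes(term);
        if (_used + bytes.Length > _capacity)
            throw new InvalidOperationException("Term buffer is full");

        bytes.AsSpan().CopyTo(new Span<byte>(_buffer + _used, bytes.Length));
        _terms.Add((_used, bytes.Length));
        _used += bytes.Length;
        return _terms.Count - 1;
    }

    public ReadOnlySpan<byte> GetTerm(int index)
    {
        var (offset, length) = _terms[index];
        return new ReadOnlySpan<byte>(_buffer + offset, length);
    }

    public void Dispose() => NativeMemory.Free(_buffer);
}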
An issue was recently raised with a really scary title:
Intermittent Index corruption: VoronUnrecoverableErrorException.
Those are the kinds of issues that you know are going to be complex. Fixing such issues in the past was usually a Task Force effort and quite a challenge.
We asked for more information and started figuring out who would handle the issue (given the time of the year) when the user came back with:
After pressing the disk check issue with our hosting provider, we found out that one of the disks was reporting an error but according to our hosting, it was only because the manufacturer's guarantee expired, and not the actual disk failure. We swapped the disk anyway, and so far we are not seeing the issue.
I write a transactional database for a living, and the best example of why we want transactions is transferring money between accounts. It is ironic, therefore, that there is no such thing as transactions for money transfers in the real world.
If you care to know why, go back 200 years and consider how a bank would operate in an environment without instant communication. I would actually recommend doing that, it is a great case study in distributed system design. For example, did you know that the Templars used cryptography to send money almost a thousand years ago?
Recently I was reviewing my bank transactions and I found the following surprise. This screenshot is from yesterday (Dec 18), and it looks like a payment that I made is still “stuck in the tubes” two and a half weeks later.
I got in touch with the supplier in question to apologize for the delay. They didn’t understand what I was talking about. Here is what they see when they go to their bank, they got the money.
For fun, look at the number of different dates that you can see in their details.
Also, as of right now, my bank account still shows the money as pending approval (to be sent out from my bank).
I might want to recommend that they use a different database. Or maybe I should just convince the bank to approve the payment by the time of the next invoice and see if I can get a bit of that infinite money glitch.
RavenDB is a database, a transactional one. This means that we have to reach the disk and wait for it to complete persisting the data to stable storage before we can confirm a transaction commit. That represents a major challenge for ensuring high performance because disks are slow.
I’m talking about disks, which can be rate-limited cloud disks, HDD, SSDs, or even NVMe. From the perspective of the database, all of them are slow. RavenDB spends a lot of time and effort making the system run fast, even though the disk is slow.
An interesting problem we routinely encounter is that our test suite would literally cause disks to fail because we stress them beyond warranty limits. We actually keep a couple of those around, drives that have been stressed to the breaking point, because it lets us test unusual I/O patterns.
We recently ran into strange benchmark results, and during the investigation, we realized we are actually running on one of those burnt-out drives. Here is what the performance looks like when writing 100K documents as fast as we can (10 active threads):
As you can see, there is a huge variance in the results. To understand exactly why, we need to dig a bit deeper into how RavenDB handles I/O. You can observe this in the I/O Stats tab in the RavenDB Studio:
There are actually three separate (and concurrent) sets of I/O operations that RavenDB uses:
Blue - journal writes - unbuffered direct I/O - in the critical path for transaction performance because this is how RavenDB ensures that the D(urability) in ACID is maintained.
Green - flushes - where RavenDB writes the modified data to the data file (until the flush, the modifications are kept in scratch buffers).
Red - sync - forcing the data to reside in a persistent medium using fsync().
The writes to the journal (blue) are the most important ones for performance, since we must wait for them to complete successfully before we can acknowledge that the transaction was committed. The other two ensure that the data actually reached the file and that we have safely stored it.
It turns out that there is an interesting interaction between those three types. Both flushes (green) and syncs (red) can run concurrently with journal writes. But on bad disks, we may end up saturating the entire I/O bandwidth for the journal writes while we are flushing or syncing.
In other words, the background work will impact the system performance. That only happens when you reach the physical limits of the hardware, but it is actually quite common when running in the cloud.
To handle this scenario, RavenDB does a number of what I can only describe as shenanigans. Conceptually, here is how RavenDB works:
def txn_merger(self):
    while self._running:
        with self.open_tx() as tx:
            while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
                curOp = self._operations.take()
                if curOp is None:
                    break  # no more operations
                curOp.exec(tx)
            tx.commit()
            # here we notify the operations that we are done
            tx.notify_ops_completed()
The idea is that you submit the operation for the transaction merger, which can significantly improve the performance by merging multiple operations into a single disk write. The actual operations wait to be notified (which happens after the transaction successfully commits).
If you want to know more about this, I have a full blog post on the topic. There is a lot of code to handle all sorts of edge cases, but that is basically the story.
Notice that processing a transaction is actually composed of two steps. First, there is the execution of the transaction operations (which reside in the _operations queue), and then there is the actual commit(), where we write to the disk. It is the commit portion that takes a lot of time.
Here is what the timeline will look like in this model:
We execute the transaction, then wait for the disk. This means that we are unable to saturate either the disk or the CPU. That is a waste.
To address that, RavenDB supports async commits (sometimes called early lock release). The idea is that while we are committing the previous transaction, we execute the next one. The code for that is something like this:
def txn_merger(self):
    prev_txn = completed_txn()
    while self._running:
        executedOps = []
        with self.open_tx() as tx:
            while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
                curOp = self._operations.take()
                if curOp is None:
                    break  # no more operations
                executedOps.append(curOp)
                curOp.exec(tx)
                if prev_txn.completed:
                    break
            # verify success of previous commit
            prev_txn.end_commit()
            # only here we notify the operations that we are done
            prev_txn.notify_ops_completed()
            # start the commit in async manner
            prev_txn = tx.begin_commit()
The idea is that we start writing to the disk, and while that is happening, we are already processing the operations in the next transaction. In other words, this allows both writing to the disk and executing the transaction operations to happen concurrently. Here is what this looks like:
This change has a huge impact on overall performance. Especially because it can smooth out a slow disk by allowing us to process the operations in the transactions while waiting for the disk. I wrote about this as well in the past.
So far, so good, this is how RavenDB has behaved for about a decade or so. So what is the performance optimization?
The optimization is a small change to the logic that decides whether a transaction completes in a synchronous or an asynchronous manner. It used to do that based on whether there were more operations to process in the queue: when we completed a transaction and needed to decide whether to complete it asynchronously, we checked if there were additional operations in the queue (currentOperationsCount).
The change modifies the logic so that we complete in an async manner if we executed any operation. The idea is that if we are going to write to the disk anyway (since we have operations to commit), we always complete in an async manner, even if there are no more operations in the queue.
The result is that the next operation starts processing immediately, instead of waiting for the commit to complete and only then starting to process. It is such a small change, but it had a huge impact on the system performance.
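In pseudo-code terms (the names here are illustrative, not the actual RavenDB source), the decision changed roughly like this:

// Before: complete asynchronously only if more operations are already waiting in the queue.
static bool ShouldCompleteAsyncOld(int currentOperationsCount)
    => currentOperationsCount > 0;

// After: if this transaction executed any operation, we are going to hit the disk anyway,
// so always complete asynchronously and let the next batch start executing while the
// commit is still in flight.
static bool ShouldCompleteAsyncNew(int executedOperationsCount)
    => executedOperationsCount > 0;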
Here you can see the effect of this change when writing 100K docs with 10 threads. We tested it on both a good disk and a bad one, and the results are really interesting.
The bad disk chokes when we push a lot of data through it (gray line), and you can see it struggling to pick up. On the same disk, using the async version (yellow line), you can see it still struggles (because eventually, you need to hit the disk), but it is able to sustain much higher numbers and complete far more quickly (the yellow line ends before the gray one).
On the good disk, which is able to sustain the entire load, we are still seeing an improvement (Blue is the new version, Orange is the old one). We aren’t sure yet why the initial stage is slower (maybe just because this is the first test we ran), but even with the slower start, it was able to complete more quickly because its throughput is higher.
RavenDB Cloud has a whole bunch of new features that were quietly launched over the past few months. I discuss them in this post. It turns out that the team keeps on delivering new stuff, faster than I can write about it.
The following new auto-scaling feature is a really interesting one because it is pretty simple to understand and has some interesting implications for production.
You need to explicitly enable auto-scaling on your cluster; enabling it usually takes under a minute. Once it is enabled, you can click the Configure button to set your own policies.
The idea is very simple: we routinely measure the load on the system, and if we detect a high CPU threshold for a long time, we'll trigger scaling to the next tier (or maybe higher, see the Upscaling / Downscaling step options) to provide additional resources to the system. If there isn't enough load (as measured in CPU usage), we will downscale back to the lowest instance type.
Conceptually, this is a simple setup. You use a lot of CPU, and you get a bigger machine that has more resources to use, until it all balances out.
Now, let’s talk about the implications of this feature. To start with, it means you only pay based on your actual load, and you don’t need to over-provision for peak load.
The design of this feature and RavenDB in general means that we can make scale-up and scale-down changes without any interruption in service. This allows you to let auto-scaling manage the size of your instances.
You may have noticed that I'm using the PB line of products (PB10 … PB50) in the policy above. That stands for burstable instances, which consume CPU credits when in use. How this interacts with auto-scaling is really interesting.
As you use more CPU, you consume all the CPU credits, and your CPU usage becomes high. At this point, auto-scaling kicks in and moves you to a higher tier. That gives you both more baseline CPU credits and a higher CPU credits accrual rate.
Together with zero downtime upscaling and downscaling, this means you can benefit from the burstable instances' lower cost without having to worry about running out of resources.
Note that auto-scaling only applies to instances within the same family. So if you are running on burstable instances, you’ll get scaling from burstable instances, and if you are running on the P series (non-burstable), your auto-scaling will use P instances.
Note that we offer auto-scaling for development instances as well. However, a development instance contains only a single RavenDB instance, so auto-scaling will trigger, but the instance will be inaccessible for up to two minutes while it scales. That isn’t an issue for the production tier.
In RavenDB, we really care about performance. That means that our typical code does not follow idiomatic C# code. Instead, we make use of everything that the framework and the language give us to eke out that additional push for performance. Recently we ran into a bug that was quite puzzling. Here is a simple reproduction of the problem:
using System.Runtime.InteropServices;

var counts = new Dictionary<int, int>();
var totalKey = 10_000;

ref var total = ref CollectionsMarshal.GetValueRefOrAddDefault(
    counts, totalKey, out _);

for (int i = 0; i < 4; i++)
{
    var key = i % 32;
    ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(
        counts, key, out _);
    count++;
    total++;
}

Console.WriteLine(counts[totalKey]);
What would you expect this code to output? We are using two important features of C# here:
Value types (in this case, an int, but the real scenario was with a struct)
CollectionsMarshal.GetValueRefOrAddDefault()
The latter method is a way to avoid performing two lookups in the dictionary to get the value if it exists and then add or modify it.
If you run the code above, it will output the number 2.
That is not expected, but when I sat down and thought about it, it made sense.
We are keeping track of the reference to a value in the dictionary, and we are mutating the dictionary.
The documentation for the method very clearly explains that this is a Bad Idea. It is an easy mistake to make, but still a mistake. The challenge here is figuring out why this is happening. Can you give it a minute of thought and see if you can figure it out?
A dictionary is basically an array that you access using an index (computed via a hash function), that is all. So if we strip everything away, the code above can be seen as:
var buffer = new int[2];
ref var total = ref buffer[0];
We simply have a reference to the first element in the array, that’s what this does behind the scenes. And when we insert items into the dictionary, we may need to allocate a bigger backing array for it, so this becomes:
var buffer = new int[2];
ref var total = ref buffer[0];

var newBuffer = new int[4];
buffer.CopyTo(newBuffer, 0);
buffer = newBuffer;

total = 1;
var newTotal = buffer[0]; // still 0 - the write went to the old array
In other words, the total variable is pointing to the first element in the two-element array, but we allocated a new array (and copied all the values). That is the reason why the code above gives the wrong result. Makes perfect sense, and yet, was quite puzzling to figure out.
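For completeness, here is one way (a sketch, reusing the names from the repro above) to avoid the problem: don't hold a ref across any call that may add to the dictionary - acquire it after the potentially resizing call, or just index into the dictionary directly.

using System.Runtime.InteropServices;

var counts = new Dictionary<int, int>();
var totalKey = 10_000;

for (int i = 0; i < 4; i++)
{
    var key = i % 32;
    // This call may add an entry and resize the dictionary's backing array...
    ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, key, out _);
    count++;

    // ...so we only take the ref to the total afterwards, and we never keep it
    // across another call that could mutate the dictionary.
    ref var total = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, totalKey, out _);
    total++;
}

Console.WriteLine(counts[totalKey]); // prints 4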