Corax was released just under a year ago, and we are seeing more customers deploying that to production. During a call with a customer, we noticed the following detail:
Let me explain what we are seeing here. The two indexes are the same, operating on the same set of documents. The only difference between those indexes is the indexing engine.
What is really amazing here is that Corax is able to index in 3:21 minutes what Lucene takes 17:15 minutes to index. In other words, Corax is more than 5 times faster than Lucene in a real world scenario.
RavenDB has a hidden feature, enabled by default and not something that you usually need to be aware of. It has built-in support for caching. Consider the following code:
async Task<Dictionary<string,int>>HowMuchWorkToDo(string userId){
using var session = _documentStore.OpenAsyncSession();var results = await session.Query<Item>().GroupBy(x =>new{ x.Status, x.AssignedTo }).Where(g => g.Key.AssignedTo == userId && g.Key.Status !="Closed").Select(g =>new{
Status = g.Key.Status,
Count = g.Count()}).ToListAsync();return results.ToDictionary(x => x.Status, x => x.Count);}
What happens if I call it twice with the same user? The first time, RavenDB will send the query to the server, where it will be evaluated and executed. The server will also send an ETag header with the response. The client will remember the response and its ETag in its own memory.
The next time this is called on the same user, the client will again send a request to the server. This time, however, it will also inform the server that it has a previous response to this query, with the specified ETag. The server, when realizing the client has a cached response, will do a (very cheap) check to see if the cached response matches the current state of the server. If so, it can inform the client (using 304 Not Modified) that it can use its cache.
In this way, we benefit twice:
First, on the server side, we avoid the need to compute the actual query.
Second, on the network side, we aren’t sending a full response back, just a very small notification to use the cached version.
You’ll note, however, that there is still an issue. We have to go to the server to check. That means that we still pay the network costs. So far, this feature is completely transparent to the user. It works behind the scenes to optimize server query costs and network bandwidth costs.
The next stage is to involve the user. Enter the AggressiveCache() feature (see the full documentation here), which allows the user to specify an additional aspect. Now, when the client has the value in the cache, it will skip going to the server entirely and serve the request directly from the cache.
What about cache invalidation? Instead of having the client check on each request if things have changed, we invert the process. The client asks the server to notify it when things change, and until it gets notice from the server, it can serve responses completely from the local cache.
I really love this feature, that was the Good part, now let’s talk about the other pieces:
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
The bad part of caching is that this introduces more complexity to the system. Consider a system with two clients that are using the same database. An update from one of them may show up at different times in each. Cache invalidation will not happen instantly, and it is possible to get into situations where the server fails to notify the client about the update, meaning that we didn’t clear the cache.
We have a good set of solutions around all of those, I think. But it is important to understand that the problem space itself is a problem.
In particular, let’s talk about dealing with the following query:
var emps = session.Query<Employee>().Include(x => x.Department).Where(x => x.Location.City =="London").ToListAsync();
When an employee is changed on the server, it will send a notice to the client, which can evict the item from the cache, right? But what about when a department is changed?
For that matter, what happens if a new employee is added to London? How do we detect that we need to refresh this query?
There are solutions to those problems, but they are super complicated and have various failure modes that often require more computing power than actually running the query. For that reason, RavenDB uses a much simpler model. If the server notifies us about any change, we’ll mark the entire cache as suspect.
The next request will have to go to the server (again with an ETag, etc) to verify that the response hasn’t changed. Note that if the specific query results haven’t changed, we’ll get OK (304 Not Modified) from the server, and the client will use the cached response.
Conservatively aggressive approach
In other words, even when using aggressive caching, RavenDB still has to go to the server sometimes. What is the impact of this approach when you have a system under load?
We’ll still use aggressive caching, but you’ll see brief periods where we aren’t checking with the server (usually be able to cache for about a second or so), followed by queries to the server to check for any changes.
In most cases, this is what you want. We still benefit from the cache while reducing the number of remote calls by about 50%, and we don’t have to worry about missing updates. The downside is that, as application developers, we know that this particular document and query are independent, so we want to cache them until we get notice about that particular document being changed.
The default aggressive caching in RavenDB will not be of major help here, I’m afraid. But there are a few things you can do.
You can use Aggressive Caching in the NoTracking mode. In that mode, the client will not ask the server for notifications on changes, and will cache the responses in memory until they expire (clock expiration or size expiration only).
Another option is to take this feature higher than RavenDB directly, but still use its capabilities. Since we have a scenario where we know that we want to cache a specific set of documents and refresh the cache only when those documents are updated, let’s write it.
There are a few things to note about this code. We are holding live instances, so we ensure that the values we keep are immutable records. Otherwise, we may hand the same instance to two threads which can be… fun.
Note that document IDs in RavenDB are case insensitive, so we pass the right string comparer.
Finally, the magic happens in the constructor. We register for two important events. Whenever the connection status of the Changes() connection is modified, we clear the cache. This handles any lost updates scenarios that occurred while we were disconnected.
In practice, the subscription to events on that particular collection is where we ensure that after the server notification, we can evict the document from the cache so that the next request will load a fresh version.
Caching + Distributed Systems = 🤯🤯🤯
I’m afraid this isn’t an easy topic once you dive into the specifics and constraints we operate under. As I mentioned, I would love your feedback on the background cache refresh feature, or maybe you have better insight into other ways to address the topic.
RavenDB is a pretty old codebase, hitting 15+ years in production recently. In order to keep it alive & well, we make sure to follow the rule of always leaving the code in a better shape than we found it.
Today’s tale is about the StreamBitArray class, deep in the guts of Voron, RavenDB’s storage engine. The class itself isn’t really that interesting, it is just an implementation of a Bit Array that we have for a bitmap. We wrote it (based on Mono’s code, it looks like) very early in the history of RavenDB and have never really touched it since.
The last time anyone touched it was 5 years ago (fixing the namespace), 7 years ago we created an issue from a TODO comment, etc. Most of the code dates back to 2013, actually. And even then it was moved from a different branch, so we lost the really old history.
To be clear, that class did a full tour of duty. For over a decade, it has served us very well. We never found a reason to change it, never got a trace of it in the profiler, etc. As we chip away at various hurdles inside RavenDB, I ran into this class and really looked at it with modern sensibilities. I think that this makes a great test case for code refactoring from the old style to our modern one.
Here is what the class looks like:
Already, we can see several things that really bug me. That class is only used in one context, to manage the free pages bitmap for Voron. That means we create it whenever Voron frees a page. That can happen a lot, as you might imagine.
A single bitmap here covers 2048 pages, so when we create an instance of this class we also allocate an array with 64 ints. In other words, we need to allocate 312 bytes for each page we free. That isn’t fun, and it actually gets worse. Here is a typical example of using this class:
using (freeSpaceTree.Read(section, out Slice result)){
sba =!result.HasValue ?
newStreamBitArray():newStreamBitArray(result.CreateReader());}
sba.Set((int)(pageNumber % NumberOfPagesInSection),true);
using (sba.ToSlice(tx.Allocator, out Slice val))
freeSpaceTree.Add(section, val);
And inside the ToSlice() call, we have:
public ByteStringContext.InternalScopeToSlice(ByteStringContext context,ByteStringType type, out Slice str){var buffer =ToBuffer();var scope =context.From(buffer,0,buffer.Length,
type, out ByteString byteString);
str =newSlice(byteString);return scope;}
private unsafe byte[]ToBuffer(){var tmpBuffer =new byte[(_inner.Length +1)*sizeof (int)];
unsafe
{
fixed (int* src = _inner)
fixed (byte* dest = tmpBuffer){*(int*) dest =SetCount;Memory.Copy(dest + sizeof (int),(byte*) src,tmpBuffer.Length-1);}}return tmpBuffer;}
In other words, ToSlice() calls ToBuffer(), which allocates an array of bytes (288 bytes are allocated here), copies the data from the inner buffer to a new one (using fixed on the two arrays, which is a performance issue all in itself) and then calls a method to do the actual copy. Then in ToSlice() itself we allocate it again in native memory, which we then write to Voron, and then discard the whole thing.
In short, somehow it turns out that freeing a page in Voron costs us ~1KB of memory allocations. That sucks, I have to say. And the only reasoning I have for this code is that it is old.
This accepts a reader to a piece of memory and does a bunch of things. It calls a few methods, uses fixed on the array, etc., all to get the data from the reader to the class. That is horribly inefficient.
Let’s write it from scratch and see what we can do. The first thing to notice is that this is a very short-lived class, it is only used inside methods and never held for long. This usage pattern tells me that it is a good candidate to be made into a struct, and as long as we do that, we might as well fix the allocation of the array as well.
Note that I have a hard constraint, I cannot change the structure of the data on disk for backward compatibility reasons. So only in-memory changes are allowed.
Here is my first attempt at refactoring the code:
public unsafestructStreamBitArray{
private fixed uint _inner[64];
public int SetCount;
public StreamBitArray(){SetCount=0;Vector256<uint>.Zero.StoreUnsafe(ref _inner[0]);Vector256<uint>.Zero.StoreUnsafe(ref _inner[8]);Vector256<uint>.Zero.StoreUnsafe(ref _inner[16]);Vector256<uint>.Zero.StoreUnsafe(ref _inner[24]);Vector256<uint>.Zero.StoreUnsafe(ref _inner[32]);Vector256<uint>.Zero.StoreUnsafe(ref _inner[40]);Vector256<uint>.Zero.StoreUnsafe(ref _inner[48]);Vector256<uint>.Zero.StoreUnsafe(ref _inner[56]);}
public StreamBitArray(byte* ptr){
var ints =(uint*)ptr;SetCount=(int)*ints;
var a =Vector256.LoadUnsafe(ref ints[1]);
var b =Vector256.LoadUnsafe(ref ints[9]);
var c =Vector256.LoadUnsafe(ref ints[17]);
var d =Vector256.LoadUnsafe(ref ints[25]);
var e =Vector256.LoadUnsafe(ref ints[33]);
var f =Vector256.LoadUnsafe(ref ints[41]);
var g =Vector256.LoadUnsafe(ref ints[49]);
var h =Vector256.LoadUnsafe(ref ints[57]);
a.StoreUnsafe(ref _inner[0]);
b.StoreUnsafe(ref _inner[8]);
c.StoreUnsafe(ref _inner[16]);
d.StoreUnsafe(ref _inner[24]);
e.StoreUnsafe(ref _inner[32]);
f.StoreUnsafe(ref _inner[40]);
g.StoreUnsafe(ref _inner[48]);
h.StoreUnsafe(ref _inner[56]);}}
That looks like a lot of code, but let’s see what changes I brought to bear here.
Using a struct instead of a class saves us an allocation.
Using a fixed array means that we don’t have a separate allocation for the buffer.
Using [SkipLocalsInit] means that we ask the JIT not to zero the struct. We do that directly in the default constructor.
We are loading the data from the ptr in the second constructor directly.
The fact that this is a struct and using a fixed array means that we can create a new instance of this without any allocations, we just need 260 bytes of stack space (the 288 we previously allocated also included object headers).
Let’s look at the actual machine code that these two constructors generate. Looking at the default constructor, we have:
There is the function prolog and epilog, but the code of this method uses 4 256-bit instructions to zero the buffer. If we were to let the JIT handle this, it would use 128-bit instructions and a loop to do it. In this case, our way is better, because we know more than the JIT.
As for the constructor accepting an external pointer, here is what this translates into:
This code is exciting to me because we are also allowing instruction-level parallelism. We effectively allow the CPU to execute all the operations of reading and writing in parallel.
Next on the chopping block is this method:
publicintFirstSetBit(){for(int i =0; i < _inner.Length; i++){if(_inner[i]==0)continue;return i <<5|HighestBitSet(_inner[i]);}return-1;}privatestaticintHighestBitSet(int v){
v |= v >>1;// first round down to one less than a power of 2
v |= v >>2;
v |= v >>4;
v |= v >>8;
v |= v >>16;return MultiplyDeBruijnBitPosition[(uint)(v *0x07C4ACDDU)>>27];}
We are using vector instructions to scan 8 ints at a time, trying to find the first one that is set. Then we find the right int and locate the first set bit there. Here is what the assembly looks like:
In short, the code is simpler, shorter, and more explicit about what it is doing. The machine code that is running there is much tighter. And I don’t have allocations galore.
This particular optimization isn’t about showing better numbers in a specific scenario that I can point to. I don’t think we ever delete enough pages to actually see this in a profiler output in such an obvious way. The goal is to reduce allocations and give the GC less work to do, which has a global impact on the performance of the system.
During a performance evaluation internally, we ran into a strange situation. Our bulk insert performance using the node.js API was significantly worse than the performance of other clients. In particular, when we compared that to the C# version, we saw that the numbers were significantly worse than expected.
To be fair, this comparison is made between our C# client, which has been through the wringer in terms of optimization and attention to performance, and the Node.js client. The focus of the Node.js client was on correctness and usability.
It isn’t fair to expect the same performance from Node.js and C#, after all. However, that difference in performance was annoying enough to make us take a deeper look into what was going on.
Here is the relevant code:
const store =newDocumentStore('http://localhost:8080','bulk');
store.initialize();const bulk = store.bulkInsert();for(let i =0; i <100_000_000; i++){await bulk.store(newUser('user'+ i));}await bulk.finish();
As you can see, the Node.js numbers are respectable. Running at a rate of over 85,000 writes per second is nothing to sneeze at.
But I also ran the exact same test with the C# client, and I got annoyed. The C# client was able to hit close to 100,000 more writes per second than the Node.js client. And in both cases, the actual limit was on the client side, not on the server side.
For fun, I ran a few clients and hit 250,000 writes/second without really doing much. The last time we properly tested ingest performance for RavenDB we achieved 150,000 writes/second. So it certainly looks like we are performing significantly better.
Going back to the Node.js version, I wanted to know what exactly was the problem that we had there. Why are we so much slower than the C# version? It’s possible that this is just the limits of the node.js platform, but you gotta check to know.
Node.js has an --inspect flag that you can use, and Chrome has a built-in profiler (chrome://inspect) that can plug into that. Using the DevTools, you can get a performance profile of a Node.js process.
I did just that and go the following numbers:
That is… curious. Really curious, isn’t it?
Basically, none of my code appears here at all, most of the time is spent dealing with the async machinery. If you look at the code above, you can see that we are issuing an await for each document stored.
The idea with bulk insert is that under the covers, we split the writing to an in-memory buffer and the flushing of the buffer to the network. In the vast majority of cases, we’ll not do any async operations in the store() call. If the buffer is full, we’ll need to flush it to the network, and that may force us to do an actual await operation. In Node.js, awaiting an async function that doesn’t actually perform any async operation appears to be super expensive.
We threw around a bunch of ideas on how to resolve this issue. The problem is that Node.js has no equivalent to C#’s ValueTask. We also have a lot of existing code out there in the field that we must remain compatible with.
Our solution to this dilemma was to add another function that you can call, like so:
for(let i =0; i <100_000_000; i++){const user =newUser('user'+ i);const id ="users/"+ i;if(bulk.tryStoreSync(user, id)==false){await bulk.store(user, id);}}
The idea is that if you call tryStoreSync() we’ll try to do everything in memory, but it may not be possible (e.g. if we need to flush the buffer). In that case, you’ll need to call the async function store() explicitly.
Given that the usual reason for using the dedicated API for bulk insert is performance, this looks like a reasonable thing to ask. Especially when you can see the actual performance results. We are talking about over 55%(!!!) improvement in the performance of bulk insert.
It gets even better. That was just the mechanical fix to avoid generating a promise per operation. While we are addressing this performance issue, there are a few other low-hanging fruits that could improve the bulk insert performance in Node.js.
For example, it turns out that we pay a hefty cost to generate the metadata for all those documents (runtime reflection cost, mostly). We can generate it once and be done with it, like so:
const bulk = store.bulkInsert();const metadata ={"@collection":"Users","Raven-Node-Type":"User"};for(let i =0; i <100_000_000; i++){const user =newUser('user'+ i);const id ="users/"+ i;if(bulk.tryStoreSync(user, id, metadata)==false){await bulk.store(user, id, metadata);}}await bulk.finish();
And this code in particular gives us:
That is basically near enough to the C#’s speed that I don’t think we need to pay more attention to performance. Overall, that was time very well spent in making things go fast.
I got into an interesting discussion on LinkedIn about my previous post, talking about Code Rot. I was asked about Legacy Code defined as code without tests and how I reconcile code rot with having tests.
I started to reply there, but it really got out of hand and became its own post.
I read Working Effectively with Legacy Code for the first time in 2005 or thereabout, I think. It left a massive impression on me and on the industry at large. The book is one of the reasons I started rigorously writing tests for my code, it got me interested in mocking and eventually led me to writing Rhino Mocks.
It is ironic that the point of this post is that I disagree with this statement by Michael because of Rhino Mocks. Let’s start with numbers, last commit to the Rhino Mocks repository was about a decade ago. It has just under 1,000 tests and code coverage that ranges between 95% - 100%.
I can modify this codebase with confidence, knowing that I will not break stuff unintentionally. The design of the code is very explicitly meant to aid in testing and the entire project was developed with a Test First mindset.
I haven’t touched the codebase in a decade (and it has been close to 15 years since I really delved into it). The code itself was written in .NET 1.1 around the 2006 timeframe. It literally predates generics in .NET.
It compiles and runs all tests when I try to run it, which is great. But it is still very much a legacy codebase.
It is a legacy codebase because changing this code is a big undertaking. This code will not run on modern systems. We need to address issues related to dynamic code generation between .NET Framework and .NET.
That in turn requires a high level of expertise and knowledge. I’m fairly certain that given enough time and effort, it is possible to do so. The problem is that this will now require me to reconstitute my understanding of the code.
The tests are going to be invaluable for actually making those changes, but the core issue is that a lot of knowledge has been lost. It will be a Project just to get it back to a normative state.
This scenario is pretty interesting because I am actually looking back at my own project. Thinking about having to do the same to a similar project from someone else’s code is an even bigger challenge.
Legacy code, in this context, means that there is a huge amount of effort required to start moving the project along. Note that if we had kept the knowledge and information within the same codebase, the same process would be far cheaper and easier.
Legacy code isn’t about the state of the codebase in my eyes, it is about the state of the team maintaining it. The team, their knowledge, and expertise, are far more important than the code itself.
An orphaned codebase, one that has no one to take care of, is a legacy project even if it has tests. Conversely, a project with no tests but with an actively knowledgeable team operating on it is not.
Note that I absolutely agree that tests are crucial regardless. The distinction that I make between legacy projects and non-legacy projects is whether we can deliver a change to the system.
Reminder: A codebase that isn’t being actively maintained and has no tests is the worst thing of all. If you are in that situation, go read Working Effectively with Legacy Code, it will be a lifesaver.
I need a feature with an ideal cost of X (time, materials, effort, cost, etc). A project with no tests but people familiar with it will be able to deliver it at a cost of 2-3X. A legacy project will need 10X or more. The second feature may still require 2X from the maintained project, but only 5X from the legacy system. However, that initial cost to get things started is the killer.
In other words, what matters here is the inertia, the ability to actually deliver updates to the system.
RavenDB allows you to query your data freely and cheaply. It is one of those things that makes or breaks a database, after all.
After over a decade of working with Lucene as our backend indexing engine, we built Corax, a new querying & indexing engine that offers far better performance.
Building an indexing engine is a humongous task. It took us close to ten years from the first line of code to Corax actually shipping. But I’m really happy with the way it turned out.
Building a query engine is a big task, and we focused primarily on making the most common queries fast. The issue at hand is that RavenDB has many features, and we don’t have infinite time. So for the less common features, we typically implemented them as a straightforward port from whatever Lucene is doing.
One such feature is facets. Let’s say that I want to buy a jacket. There are way too many choices, so I can use a faceted query to help me narrow it down. Here is what this looks like in code:
from Products
where search(Description, "suit jacket")select facet(Brand),
facet(Price <200,
Price between 200 and 400,
Price between 400 and 800,
Price >800)
And here is what this looks like as a website:
I mentioned that we implemented some features as a straightforward port from Lucene, right?
We did that because RavenDB offers very rich querying semantics, and we couldn’t spend the time to craft every single bit upfront. The idea was that we would get Corax out the door and be faster in most common scenarios, and at least at parity with everything else.
It works for most scenarios, but not all of them. We recently got a query similar to the one above that was slower in Corax than in Lucene. That is usually good news since we have far more optimization opportunities in Corax. Lucene (and especially our usage of it) has already been through the wringer so many times that it is really hard to eke out any more meaningful performance gains. Corax’s architecture, on the other hand, gives us many more chances to do so.
In the case of facets, the way Lucene handles that is roughly similar to this:
defbrand_facet(matches: List[int]):
facet =dict()for term, docsForTerm in reader.terms("Brand"):
facet[term]= count_intersect(matches,docsForTerm)
Given the results of the query, run over all the terms for a particular field and intersect the documents for every term with the matches for the query. Lucene is able to do that efficiently because it materializes all its data into managed memory. That has costs associated with it:
The benefit of this approach is that many operations are simple, which is great. Corax, on the other hand, does not materialize all its data into managed memory. It uses persistent data structures on disk (leading to reduced memory usage and faster responses on the first query).
The advantage we have with Corax is that the architecture allows us to optimize a lot more deeply. In this case, however, it turned out to be unnecessary, as we are already keeping track of all the relevant information. We just needed to re-implement faceted search in a Corax-native manner.
You can see the changes here. But here is the summary. For a dataset with 10,000,000 records, with hundreds of brands to facet on, we get:
Yes, that isn’t a mistake. Corax is so fast here that you can barely observe it 🙂.
A customer called us about some pretty weird-looking numbers in their system:
You’ll note that the total number of entries in the index across all the nodes does not match. Notice that node C has 1 less entry than the rest of the system.
At the same time, all the indicators are green. As far as the administrator can tell, there is no issue, except for the number discrepancy. Why is it behaving in this manner?
Well, let’s zoom out a bit. What are we actually looking at here? We are looking at the state of a particular index in a single database within a cluster of machines. When examining the index, there is no apparent problem. Indexing is running properly, after all.
The actual problem was a replication issue, which prevented replication from proceeding to the third node. When looking at the index status, you can only see that the entry count is different.
When we zoom out and look at the state of the cluster, we can see this: