Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 3 min | 496 words

Tracking down a customer’s performance issue, we eventually tracked things down to a single document modification that would grind the entire server to a halt. The actual save was working fine, it was when indexing time came around that we saw the issues. The entire system would spike in terms of memory usage and disk I/O, but CPU utilization wasn’t too bad.

We finally tracked it down to a fairly big document. Let’s assume that we have the following document:
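The document itself isn't preserved in this extract; here is a minimal sketch of its likely shape (the field names are illustrative, not taken from the actual case):

{
    "Name": "...",
    "Description": "...",
    "Reviews": [
        { "user_id": "users/1", "rating": 4, "comment": "..." },
        { "user_id": "users/2", "rating": 5, "comment": "..." }
        // ... and well over a thousand more reviews
    ]
}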

Note that this can be big. As in, in the multiple megabyte range in some cases, with thousands of reviews. In the case we looked at, the document was over 5MB in size and had over 1,500 reviews.

That isn’t ideal, and RavenDB will issue a performance hint when dealing with such documents, but it is certainly workable.

The problem was with the index, which looked like this:
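The index is shown as a screenshot in the original post. Based on the description that follows, it likely looked roughly like this (a sketch with illustrative field names):

from book in docs.Books
from review in book.Reviews
select new
{
    book_id = book,
    user_id = review.user_id,
    rating = review.rating
}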

This index is also set up to store all the fields being indexed. Take a look at the index, and read it a few times. Can you see what the problem is?

This is a fanout index, which I’m not a fan of, but that is certainly something that we can live with. 1,500 index entries from a single document isn’t even in the top order of magnitude that we have seen. And yet this index will cause RavenDB to consume a lot of resources, even if we have just a single document to index.

What is going on here?

Here is the problematic part (a screenshot in the original post):

[screenshot highlighting the line in the index that indexes the entire book document]

Give it a moment to sink in, please.

We are indexing the entire document here, once for each of the reviews in the document. When RavenDB encounters a complex value as part of the indexing process, it will index it as a JSON value. There are some scenarios that call for that, but in this case, what this meant is that we would, for each of the reviews in the document:

  • Serialize the entire document to JSON
  • Store that in the index

5MB times 1,500 reviews gives us a single document costing us nearly 8GB in storage space alone. It will also allocate close to 100GB (!) of memory during the operation (it won’t hold 100GB at any one time, but it will allocate that much over its lifetime). Committing such an index to disk requires us to temporarily use about 22GB of RAM and forces us to do a single durable write that exceeds the 7GB mark. Then there is the background work to clean all of that up.

The customer probably meant to index book_id, but got this by mistake, and then we ended up with extreme resource utilization every time that document was modified. Removing this line meant that indexing the document went from ~8GB to 2MB. That is over three orders of magnitude of difference.

We are going to be adding some additional performance hints to make it clear that something is wrong in such a scenario. We had a few notices, but it was hard to figure out exactly what was going on there.

time to read 6 min | 1027 words

A RavenDB customer called us with an interesting issue. Every now and then, RavenDB would stop processing any and all requests. These pauses could last for as long as two to three minutes and occurred on a fairly random, if frequent, basis.

A team of anteaters was dispatched to look at the issue (the best bug hunters by far), but we couldn’t figure out what was going on. During these pauses, there was absolutely no activity on the machine. There was hardly any CPU utilization, no network or high I/O load, and RavenDB was not responding to requests, nor was it doing anything else. The logs just… stopped for that duration. That was something super strange.

We have seen similar pauses in the past, I’ll admit. Around 2014 / 2015 we had a spate of very similar issues, with RavenDB pausing for very long periods of time. Those issues were all because of the GC. At the time, RavenDB did a lot of allocations and it wasn’t uncommon to end up with the majority of the time spent on GC cleanup. The behavior back then, however, was very different. We could see high CPU utilization and all the metrics clearly pointed to the GC as the fault. In this case, absolutely nothing was going on.

Here is what such a pause looked like when we gathered the ETW metrics:

[ETW trace: a multi-minute gap with no CPU, I/O or network activity]

Curiouser and curiouser, as Alice said.

This was a big instance, with quite a bit of work going on, so we spent some time analyzing the process behavior. And absolutely nothing appeared to be wrong. We finally figured out that the root cause was the GC, as you can see here:

[profiler screenshot: the pause time attributed to the GC]

The problem is that the GC is doing absolutely nothing here. For that matter, we spend an inordinate amount of time making sure that the GC won’t have much to do. I mentioned 2014/2015 earlier; as a direct result of those issues, we completely re-architected RavenDB. The database uses a lot less managed memory in general and is far faster. So what the hell is going on here? And why weren’t we able to see those metrics before? It took a lot of time to find this issue, and the GC is one of the first things we check.

In order to explain the issue, I would like to refer you to the Book of the Runtime and its discussion of thread suspension. The .NET GC will eventually need to run a blocking collection; when that happens, it needs to ensure that the heap is in a known state. You can read the details in the book, but the short of it is that there are what are known as GC safe points. If the GC needs to run a blocking collection, all managed threads must be at a safe point. What happens if they aren’t, however? Well, the GC will let them run until they reach such a point. There is a whole machinery in place to make sure that this happens. I would also recommend reading the discussion here. And Konrad’s book is a great resource as well.

Coming back to the real issue, the GC cannot start until all the managed threads are at a safe point, so in order to suspend the threads, it will let them run to a safe point and suspend them there. What is a safe point? It is a bit complex, but the easiest way to think about it is that whenever there is a method call, the runtime ensures that the GC has stable information. The distance between method calls is typically short, so that is great: the GC is not likely to wait long for a thread to come to a safe point. And if there are loops that may take a while, the JIT will do the right thing to ensure that we won’t wait too long.

In this scenario, however, that was very much not the case. What is going on?

[stack trace: a thread stuck in a page fault, preventing it from reaching a GC safe point]

We got a page fault, which can happen anywhere, and until we return from the page fault, the thread cannot get to a GC safe point, so all the threads end up suspended, waiting for this page fault to complete.

And in this particular case, we had a single page fault, reading 16KB of data, that took close to two minutes to complete.

[I/O trace: a single 16KB read for the page fault taking close to two minutes]

So the actual fault is somewhere in storage, which is out of scope for RavenDB, but a single slow I/O had a cascading effect that paused the whole server. The investigation continues and I’ll probably have another post on the topic when I get the details.

For what it is worth, this happened in a “managed language”, but a similar scenario can happen when we are running native code. A page fault while holding the malloc lock would soon lead to the same situation (although I think that would be easier to figure out).

I wanted to see if I could reproduce the same issue on my side, but ran into a problem. We don’t know what caused the slow I/O, and there is no easy way to simulate such slow I/O on Windows. On the other hand, Linux has userfaultfd(), so I decided to use that.

userfaultfd() doesn’t have a managed API, so I wrote something that should suffice for my (very limited) needs. With that, I can write the following code:
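Here is a sketch of the repro, assuming a hypothetical UserFaultFd wrapper over the Linux userfaultfd(2) syscall; the wrapper maps a page, registers it for user-space fault handling, and reports faults without ever resolving them (the names Create, MapAndRegister and WaitForFault are illustrative, not from the original code):

static unsafe void TouchPage(long address)
{
    Console.WriteLine($"{address} about to access");
    byte value = *(byte*)address;        // traps on the page fault, which is never served
    Console.WriteLine($"Got {value}");   // never reached
}

static void Main()
{
    using var uffd = UserFaultFd.Create();
    long address = uffd.MapAndRegister(4096);

    new Thread(() => TouchPage(address)).Start();

    long faultAddress = uffd.WaitForFault();  // observe the fault, deliberately never resolve it
    Console.WriteLine($"Got page fault at {faultAddress}");

    Console.WriteLine("Calling GC...");
    GC.Collect();                  // must suspend all threads, but the faulted thread
                                   // can never reach a GC safe point
    Console.WriteLine("GC done");  // never printed, the process is hung
}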

If you run this with dotnet run -c Release on a Linux system, you’ll get the following output:

139826051907584 about to access
Got page fault at 139826051907584
Calling GC...

And that would be… it. The system is hung. This confirms the theory, and it is quite an interesting use of Linux features to debug a problem that happened on Windows.

time to read 3 min | 571 words

RavenDB 5.2 introduces a new concept for deploying indexes: rolling index deployment.

Typically, deploying an index to production on a loaded database is something that you do only with great trepidation. There are many horror stories about creating a new index and locking the entire system down for a long period of time.

RavenDB was specifically designed to address those concerns. Deploying a new index in production is something that you are expected to do. In fact, RavenDB will create indexes for you on the fly as needed, in production.

To be perfectly honest, the two aspects are very closely tied together. The fact that we expect to be able to create an index without disruption of service feeds into us being able to create indexes on the fly. And creating indexes on the fly ensures that we need to keep being able to create indexes without putting too much load on a running system.

RavenDB limits the amount of CPU time that a new index can consume and will control the amount of memory and I/O that is used by an index to prioritize user queries over background work.

In version 5.2, we have extended this behavior further. We now allow users to deploy their own code as part of RavenDB indexes, which makes it much harder to control what exactly is going on inside RavenDB during indexing. For example, you may choose to run something like Parallel.For(), which may use more CPU than RavenDB accounts for. The situation is a bit more complex in the real world, because we need to worry about other factors as well (memory, I/O, CPU and network come to mind).

Consider what happens if a user does something like this in an index:
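The snippet isn’t preserved in this extract; here is a sketch of the kind of code described, called from the index’s map function (illustrative only, the BurnResources name is hypothetical):

public static long BurnResources()
{
    var buffer = new byte[1024 * 1024 * 1024];    // 1GB of RAM, outside RavenDB's accounting
    Parallel.For(0, buffer.Length, i =>           // fans out over as many cores as it can get
    {
        buffer[i] = (byte)(i & 0xFF);
    });
    return buffer.LongLength;
}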

RavenDB has no way to control what is actually going on there, and this code will use 1GB of RAM and quite a bit of CPU (over multiple cores) without RavenDB being able to account for it. This is a somewhat contrived example, I’ll admit (I can’t think of any reason you would want to do this sort of thing in an index). It is far more common to want to do Machine Learning predictions in indexes now, which can have similar effects.

When pushing a large number of documents through such an index, such as the scenario of deploying a new index, that can put a lot of strain on the system. Enter: Rolling index deployment.

This is an index deployment mode where RavenDB will not immediately deploy the index to all the nodes. Instead, it will choose the least busy node and get it to run the index. At any time, only a single node in the cluster is going to run the index, and that node is going to be at the back of the line for any other work the cluster has for it. Once that node has completed the indexing, RavenDB will select the next node (and reassign work as needed).

The idea is that even if you deploy an index that has a negative impact on the system behavior, you have mitigated the potential impact to a single (hopefully unused) node.

The cost of that, of course, is that the indexes are now going to run in a serial fashion, one node at a time. That means that they will take longer to deploy to the entire system, of course, but the resource utilization is going to be far more stable.

time to read 10 min | 1949 words

For a database engine, controlling the amount of work that is being executed is a very important factor for the stability of the system. If we can’t control the work that is being done, it is quite easy to get to the point where we are overwhelmed.

Consider the case of a(n almost) pure computational task, which is completely CPU bound. In some cases, those tasks can easily be parallelized. Here is one such scenario:
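The snippet isn’t preserved in this extract; a sketch of the scenario (GetArraysToSort() is a hypothetical source of many large arrays):

int[][] arrayOfArrays = GetArraysToSort();
foreach (int[] array in arrayOfArrays)
    Array.Sort(array);    // each sort is pure CPU work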

This example seems pretty obvious, right? This is completely CPU bound (sorting), and leaving aside that the sort itself can be done in parallel, we have many arrays that we need to sort. As you can imagine, I would much rather have this task done as quickly as possible.

One way to make this happen is to parallelize the work. Here is a pretty simple way to do that:

Parallel.ForEach(arrayOfArrays, array => Array.Sort(array));

A single one-liner, and we are going to see much better performance from the system, right?

Yep, this particular scenario will run faster; depending on the sizes, it may be much faster. However, that single one-liner is a nasty surprise for the stability of the system as a whole. What’s the issue?

Under the covers, this is going to use the .NET thread pool to do the work. In other words, this is going to be added to the global workload on the system. What else is using the same thread pool? Well, pretty much everything. For example, processing requests in Kestrel is done in the thread pool, all the async machinery uses the thread pool under the covers as well as pretty much everything else.

What is the impact of adding heavily CPU-bound work to the thread pool, one may ask? Well, the thread pool will start executing it on its threads. This is heavy CPU work, so it will likely run for a while, displacing other work. If we consider a single instance of this code, there is a limit on the amount of concurrent work placed in the thread pool. However, if multiple instances of the code above run in parallel… we are going to be placing a lot of work on the thread pool. That is going to effectively starve the rest of the system. The thread pool will react by spawning more threads, but this is a slow process, and it is easy to get into a situation where all the available threads are busy, leaving nothing for the rest of the application to run on.

From the outside, it looks like 100% CPU usage, with the system appearing entirely frozen. That isn’t actually what is going on; we are simply tying up all the threads and can’t prioritize between request handling (important) and speeding up background work (less important). The other problem is the reverse scenario: you may be running the operation on an otherwise idle system, and the non-parallel code will utilize a single thread out of the many that are available.

In the context of RavenDB, we are talking here about indexing work. It turns out that there is a lot of work here that is purely computational. Analyzing and breaking apart text for full text search, sorting terms for more efficient access patterns, etc. The same problem above remains, how can we balance the indexing speed and the overall workload on the system?

Basically, we have three scenarios that we need to consider:

  1. Busy processing requests – background work should be done in whatever free time the system has (avoiding starvation), with as little impact as possible.
  2. Busy working on many background tasks – concurrent background tasks should not compete with one another and step on each other’s toes.
  3. Busy working on a single large background task – which should utilize as much of the available computing resources that we have.

For the second and third options, we need to take into account the fact that even if there isn’t any request processing work right now, that doesn’t matter if there is incoming work. In that case, the system should prioritize the request processing over the background work.

Another important factor that I haven’t mentioned is that it would be really good for us to be able to tell what work is taking the CPU time. If we are running a set of tasks on multiple threads, it would be great to be able to see what tasks they are running in a debugger / profiler.

This sounds very much like a class in operating systems; in fact, scheduling work is a core task of an operating system. The question we have here is how we can build a system that meets all the above requirements, given that we can’t actually schedule CPU work directly.

We cannot use the default thread pool, because there are too many existing users there that can interfere with what we want. For that matter, we don’t actually want to have a dynamic thread pool. We want to maximize the amount of work we do for computationally heavy workloads. Instead, for the entire process, we will define a set of threads to handle work offloading, like this:
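A sketch of what defining those workers could look like (the original code isn’t preserved in this extract, so the names here are illustrative):

private static readonly Thread[] _offloadWorkers =
    Enumerable.Range(0, Environment.ProcessorCount)
        .Select(i =>
        {
            var thread = new Thread(OffloadWorkerLoop)
            {
                IsBackground = true,
                Name = "Offload Worker #" + i,
                Priority = ThreadPriority.Lowest   // anything else that is ready to run wins
            };
            thread.Start();
            return thread;
        })
        .ToArray();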

This creates a set of threads, one for each CPU core on the machine. It is important to note that these threads run at the lowest priority: if there is anything else that is ready to run, it will get priority. In order to do some background work, such as indexing, we’ll use the following mechanism:
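A sketch of that mechanism (again with illustrative names; IndexingWork and indexName are assumptions):

var indexingThread = new Thread(IndexingWork)      // IndexingWork is the index's main loop
{
    IsBackground = true,
    Name = "Indexing of " + indexName,
    Priority = ThreadPriority.BelowNormal          // request processing runs at Normal
};
indexingThread.Start();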

Because indexing is a background operation in RavenDB, we do that in a background thread and we set the priority to below normal. Request processing threads run at normal priority, which means that we can rely on the operating system to run them first and, at the same time, the starvation prevention that already exists in the OS scheduler will run our indexing code even if there is extreme load on the system.

So far, so good, but what about those offload workers? We need a way to pass work from the indexing thread to the offload workers; this is done in the following manner:
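A sketch of the handoff, assuming the queue-of-queues structure described below (illustrative: a bounded global queue holding per-index work queues):

private static readonly BlockingCollection<ConcurrentQueue<int[]>> _globalWorkQueue =
    new BlockingCollection<ConcurrentQueue<int[]>>(boundedCapacity: 128);

// On the indexing thread: enqueue locally, advertise the queue globally,
// then drain the local queue ourselves without relying on the workers.
var localQueue = new ConcurrentQueue<int[]>();
foreach (var array in arrayOfArrays)
{
    localQueue.Enqueue(array);
    _globalWorkQueue.TryAdd(localQueue);    // best effort; a full global queue is fine
}
while (localQueue.TryDequeue(out var toSort))
    Array.Sort(toSort);
// (Waiting for items still being processed by the workers is elided here.)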

Note that the _globalWorkQueue is process wide. For now, I’m using the simple sort-an-array example because it makes things easier; the real code would need a bit more complexity, but that is not important for explaining the concept. The global queue contains queues for each task that needs to be run.

The index itself will have a queue of work that it needs to complete. Whenever it needs to start doing the work, it will add that to its own queue and try to publish to the global queue. Note that the size of the global queue is limited, so we won’t use too much memory there.

Once we have published the work we want to do, the indexing thread will start working on it itself, occasionally soliciting the aid of the other workers. Finally, once the indexing thread is done processing all the remaining work, we need to wait for any pending work that is currently being executed by the workers. That done, we can use the results.

The workers all run very similar code:
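A sketch of that loop, matching the types assumed above:

private static void OffloadWorkerLoop()
{
    // Take one index's queue at a time and drain it completely before moving on.
    foreach (var queue in _globalWorkQueue.GetConsumingEnumerable())
    {
        while (queue.TryDequeue(out var array))
            Array.Sort(array);
    }
}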

In other words, they pull a queue of tasks from the global task queue and work on that exclusively. Once they are done processing a single index queue to completion, the offload worker will try to pick another from the global queue, and so on.

The whole code is small and fairly simple, but there is a lot of behavior hidden here. Let me try to explore all of that now.

The indexing background work pushes all the work items to its local queue, and it will register the queue itself in the global queue for the offloading threads to process. The indexing queue may be registered multiple times, so multiple offloading threads can take part in this. The indexing code, however, does not rely on that and will also process its own queue. The idea is that if there are offloading threads available, they will help, but we do not rely on them.

The offloading thread, for its part, will grab an indexing thread’s queue and process all the items from that queue until it is done. For sorting arbitrary arrays, it doesn’t matter much, but for other workloads, we’ll likely get much better locality in terms of task execution by issuing the same operation over a large set of data.

The thread priority here is also important, mind. If there is nothing else to do, the OS will schedule the offloading threads and they will get to work. If there is a lot of other stuff happening, they will not be scheduled often. This is fine; we are merely being assisted by the offloading threads, they aren’t mandatory.

Let’s consider the previous scenarios in light of this architecture and its impact.

If there are a lot of requests, the OS is going to mostly schedule the request processing threads; the indexing threads will also run, but mostly when there is nothing else to do. The offload threads will get their chance, but mostly that will not have any major impact. That is fine, we want most of the processing power to go to request processing.

If there is a lot of work, on the other hand, the situation is different. Let’s say that there are 25 indexes running and 16 cores available on the machine. In this case, the indexes are going to compete with one another. The offloading threads are again not going to get much of a chance to run, because adding more threads in this context doesn’t accomplish anything; there is already competition between the indexing threads over the CPU resources. However, the offloading threads are going to be of some help. Indexing isn’t pure computation; there are enough cases where it needs to do I/O, in which case a CPU core is available. The offloading threads can then take advantage of the free cores (if there aren’t any other indexing threads that are ready to run instead).

It is in the case of just a single task running on a mostly idle system that this really shines. The index is going to submit all its work to the global queue, at which point the offloading threads will kick in and help it complete the task much sooner than it would be able to otherwise.

There are other things that we need to take into account in this system:

  • It is, by design, not fair. An index that is able to put its queue into the global queue first may have all the offloading threads busy working on its behalf. All the other indexes are on their own. That is fine; when that index completes all its work, the offloading threads will assist another index. We care about overall throughput, not fairness between indexes.
  • There is an issue with this design under certain loads. An offloading thread may be busy doing work on behalf of an index. That index has completed all of its other work and is now waiting for that last item to be processed. However, the offloading thread runs at a lower priority, so it will rarely get to run. I don’t think we’ll see a lot of cost here, to be honest; that requires the system to be at full capacity (rare, and admins usually hate that, so they prevent it) for a long stretch to actually be a real issue. If needed, we can play with thread priorities dynamically, but I would rather avoid it.

This isn’t something that we have implemented, rather something that we are currently playing with. I would love to hear your feedback on this design.

time to read 6 min | 1011 words

I want to talk about an aspect of RavenDB that is usually ignored: its pricing structure. The model we use for pricing RavenDB is pretty simple: we charge on a per-core basis, with two tiers depending on what features you want exactly. The details and the actual numbers can be found here. We recently had a potential customer contact us about doing a new project with RavenDB. They had a bunch of questions about technical details that pertain to their usage scenario. They also asked a lot of questions about licensing. That is pretty normal and something that we field on a daily basis, nothing to write a blog post about. So why am I writing this?

For licensing, we pointed to the buy page and that was it. We sometimes need to clarify what we mean exactly by “core”, and occasionally the customer will ask us about more specialized scenarios (such as massive deployments, running on multi-purpose hardware, etc.). The customer in question was quite surprised by the interaction. Their contact with us was one of the sales people, and they got all the information in the first couple of emails from us (we needed some clarification about what exactly their scenario was).

It turned out that this customer was considering using MongoDB for their solution, but they have had a very different interaction with them. I was interested enough in this to ask them for a timeline of the events.

It took three and a half months from initial contact until the customer actually had a number that they could work with.

 
  • Day 1: Logged in to MongoDB Atlas and used the chat feature to ask for pricing for running a MongoDB cluster in an on-premise configuration. Result: got an email address at MongoDB to discuss the issue.
  • Day 1: Emailed the Sales Development Representative to explain the project and its requirements, and asked for pricing information. Result: the representative suggested staying on Atlas in the cloud and proposed a phone call.
  • Day 3: A 30 minute call with the Sales Development Representative, who pushed hard to remain on Atlas in the cloud instead of an on-premise solution. After the customer insisted that only an on-premise solution was viable for the project, the representative asked for the current scenarios and projected growth, so MongoDB could generate a quote based on those numbers.
  • Day 6: After researching various growth scenarios, sent an email with the details to the representative.
  • Day 37: Enough time had passed with no reply from MongoDB; asked for the status of the inquiry and an ETA for the quote.
  • Day 65: After multiple emails, managed to set up another call with the same representative. Another 30 minute call. The representative could not locate the previously sent detailed scenario and pushed the Atlas cloud option yet again. The Sales Development Representative pulled a Corporate Account Executive into the email conversation and a new meeting was scheduled.
  • Day 72: Meeting with the Corporate Account Executive; explained the scenario again (third time). Was pressured to use the cloud option on Atlas. Set up a call with MongoDB's Solution Architect.
  • Day 80: Before the meeting with the Solution Architect, another meeting at the request of the Corporate Account Executive. They tried to push Atlas again, and the customer had to explain once more why this needs to be an on-premise solution.
  • Day 87: Meeting with the Solution Architect, presenting the proposed solution (again). Had to explain (yes, again) why this is an on-premise solution that cannot use Atlas. The Solution Architect sent a script to run against a test MongoDB instance to verify various numbers.
  • Day 94: Seven days after all the requested details were sent, requested the quote again, asking whether something was still missing and whether the quote could finally be produced.
  • Day 101: Got the quote! It was fairly hard to figure out from the quote what the final number would actually be, so asked a question.
  • Day 102: Got an answer explaining how to interpret the numbers that were sent.

To repeat: from having a scenario to having a quote (or even just a guesstimate of the pricing) took over 100 days! It also took five different meetings with various MongoDB representatives (and having to explain the same thing many times over) to get a number.

During that timeframe, the customer is not yet sure whether the technology stack they are considering is going to be viable for their scenario. They can’t really proceed (since the cost may be prohibitive) and are stuck in a place of uncertainty and doubt about their choices.

For reference, the time frame from a customer asking about RavenDB for the first time to when they have at least a fairly good idea about what kind of expense we are talking about is measured in hours. Admittedly, sometimes it would be 96 hours (because of weekends, usually), but the idea that it can take so long to actually get a dollar value to a customer is absolutely insane.

Quite aside from any technical differences between the products, the fact that you have all the information upfront with RavenDB turns out to be a major advantage. For that matter, for most scenarios, customers don’t have to talk to us, they can do everything on their own.

The best customer service, in my opinion, is to not have to go through that. For the majority of the cases, you can do everything on your own. Some customers would need more details and reach out, in which case the only sensible thing is to actually make it easy for them.

We have had cases where we closed the loop (got the details from the customer, sent a quote and the quote was accepted) before competitors even respond to the first email. This isn’t as technically exciting as a new algorithm or a major performance improvement, but it has a serious and real impact on the acceptance of the product.

Oh, and as a final factor, it turns out that we are the more cost efficient option, by a lot.

time to read 3 min | 509 words

NewsBlur (a personal news aggregator) suffered from a data breach / ransomware “attack”. I’m using the term “attack” here in quotes because this is the equivalent of having your car broken into after you left it with the engine running, keys inside, in a bad part of town.

As a result of the breach, users’ data was exposed, including personal RSS feeds, access tokens for social media, email addresses and other sundry items of various import. It looks like about 250GB of data was taken hostage, by the way.

The explanation about what exactly happened is really interesting, however.

NewsBlur moved their MongoDB instance from its own server to a container. Along the way, they accidentally (it looks like a Docker default configuration) opened the MongoDB port to the whole wide world. By default, MongoDB will only listen on localhost; in this case, I think that from the perspective of MongoDB, it was listening on the local port, and it was the Docker infrastructure that did the port forwarding and tied the public port to the instance. From that point on, it was just a matter of time. It apparently took two hours or so for some automated script to run into the welcome mat and jump in, wreak havoc and move on.
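To make that failure mode concrete, here is the difference in Docker terms (a sketch; we don’t know the exact command NewsBlur used):

docker run -d -p 27017:27017 mongo              # publishes the port on 0.0.0.0, i.e. the whole world
docker run -d -p 127.0.0.1:27017:27017 mongo    # binds the published port to loopback only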

I’m actually surprised that it took so long. In some cases, machines are attacked in under a minute from showing up on the public internet. I used the term “bad part of town” earlier, but it is more accurate to say that the entire internet is a hostile environment and should be treated as such.

That leads to the next problem: you should never assume that you are running in a safe environment. In the case above, we have NewsBlur assuming that they are running on a private network that only internal servers can access. About a year ago, Microsoft had a similar issue; they exposed an Elastic cluster that was supposed to be on an internal network only and lost 250 million customer support records.

In both cases, the problem was a lack of defense in depth. Once the attacker was able to connect to the system, it was game over. There are monitoring solutions that you can use, but in general, the idea is that you don’t trust your network. You authenticate and encrypt all the traffic, regardless of where you are running. The additional encryption cost is not usually meaningful for typical workloads (even for demanding ones), given that most CPUs have dedicated encryption instructions.

When using RavenDB, we have taken the steps to ensure that:

  • It is simple and easy to run in a secure mode, using X509 client certificates for authentication, with all network communications encrypted.
  • It is hard and complex to run without security.

If you run the RavenDB setup wizard, it takes under two minutes to end up with a secured solution, one that you can expose to the outside world and not worry about your data taking a walk.

time to read 4 min | 714 words

A RavenDB database can reside on multiple nodes in the cluster. RavenDB uses a multi-master protocol to handle writes. Any node holding the database will be able to accept writes. This is in contrast to other databases that use the leader/follower model, in which only a single instance is able to accept writes at any given point in time.

The node that accepted the write is responsible for disseminating the write to the rest of the cluster. This should work even if there are some breaks in communication, mind, which makes things more interesting.

Consider the case of a write to node A. Node A will accept the write and then replicate that as part of its normal operations to the other nodes in the cluster.

In other words, we’ll have:

  • A –> B
  • A –> C
  • A –> D
  • A –> E

In a distributed system, we need to be prepared for all sorts of strange failures. Consider the case where node A cannot talk to node C, but the other nodes can. In this scenario, we still expect node C to have the full data, even if node A cannot send the data to it directly.

The simple solution would be to simply have each node replicate the data it gets from any source to all its siblings. However, consider the cost involved:

  • A write to node A (a 1KB document) will result in 4 replications (4KB)
  • Node B will replicate to 4 nodes (including A, mind), so that is another 4KB.
  • Node C will replicate to 4 nodes, so that is another 4KB.
  • Node D will replicate to 4 nodes, so that is another 4KB.
  • Node E will replicate to 4 nodes, so that is another 4KB.

In other words, in a 5 nodes cluster, a single 1KB write will generate 20KB of network traffic, the vast majority of it totally unnecessary.

There are many gossip algorithms, and they are quite interesting, but they are usually not meant for a continuous stream of updates. They focus on robustness over efficiency.

RavenDB takes the following approach: when a node accepts a write from a client directly, it will send the new write to all its siblings immediately. However, if a node accepts a write from replication, the situation is different. We assume that the node that replicated the document to us will also replicate it to the other nodes in the cluster. As such, we’ll not initiate replication immediately. What we’ll do, instead, is let all the nodes that replicate to us know that we got the new document.

If we don’t have any writes on the node, we’ll check every 15 seconds whether we have documents that aren’t present on our siblings. Remember that the siblings proactively report to us which documents they currently have. There is no need to chat over the network about that.

In other words, during normal operations, what we’ll have is node A replicating the document to all the other nodes. They’ll each inform the other nodes that they have this document and nothing further needs to be done. However, in the case of a break between node A and node C, the other nodes will realize that they have a document that isn’t on node C, in which case they’ll complete the cycle and send it to node C, healing the gap in the network.

I’m using the term “tell the other nodes what documents we have”, but that isn’t what is actually going on. We use change vectors to track the replication state across the cluster. We don’t need to send each individual document write to the other side; instead, we can send a single change vector (a short string) that will tell the other side which documents we have in one shot. You can read more about change vectors here.

In short, the behavior on the part of the node is simple:

  • On document write to the node, replicate the document to all siblings immediately.
  • On document replication, notify all siblings about the new change vector.
  • Every 15 seconds, replicate to siblings the documents that they missed.
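In code form, the behavior amounts to something like this (a sketch with illustrative names, not RavenDB’s actual implementation):

void OnClientWrite(Document doc)
{
    Store(doc);
    foreach (var sibling in Siblings)
        sibling.SendDocument(doc);                       // replicate immediately
}

void OnReplicatedWrite(Document doc)
{
    Store(doc);
    foreach (var sibling in Siblings)
        sibling.NotifyChangeVector(LocalChangeVector()); // a short string, not the document
}

void EveryFifteenSeconds()
{
    foreach (var sibling in Siblings)
        foreach (var doc in DocumentsMissingOn(sibling)) // derived from the reported change vectors
            sibling.SendDocument(doc);                   // heal any gap in the network
}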

Just these rules allow us to have a sophisticated system in practice, because we’ll not have excessive writes over the network but we’ll bypass any errors in the network layer without issue.

time to read 5 min | 812 words

Earlier this week, we released RavenDB 5.2 to the world. This is an exciting release for a few reasons. We have a bunch of new features available and, as usual, the x.2 release is our LTS release.

RavenDB 5.2 is compatible with all 4.x and 5.x releases, you can simply update your server binaries and you’ll be up and running in no time. RavenDB 4.x clients can talk to RavenDB 5.2 server with no issue. Upgrading in a cluster (from 4.x or 5.x versions) can be done using rolling update mode and mixed version clusters are supported (some features will not be available unless a majority of the cluster is running on 5.2, though).

Let’s start by talking about the new features, they are more interesting, I’ll admit.

I’m going to be posting details about all those features, but I want to point out what is probably the most important aspect of this release, even beyond a feature like OLAP ETL: RavenDB 5.2 is an LTS release.

Long Term Support release

LTS stands for Long Term Support, we support such releases for an extended period of time and they are recommended for production deployments and long term projects.

Our previous LTS release, RavenDB 4.2, was released in May 2019 and is still fully supported. Standard support for RavenDB 4.2 will lapse in July 2022 (a year from now), we’ll offer extended support for users who want to use that version afterward.

We encourage all RavenDB users to migrate to RavenDB 5.2 as soon as they are able.

OLAP ETL

This new feature deserves its own post (which will show up next week), but I wanted to say a few words about it here. RavenDB is meant to serve as an application database, serving OLTP workloads. It has some features aimed at reporting, but that isn’t the primary focus.

For almost a decade, RavenDB has supported a native ETL process that will push data on the fly from RavenDB to a relational database. The idea is that you can push the data into your reporting solution and continue using that infrastructure.

Nowadays, people are working with much larger datasets and there is usually not a single reporting database to work with. There are data lakes (and presumably data seas and oceans, I guess) and the cloud has a much higher presence in the story. Furthermore, there is another interesting aspect for us. RavenDB is often deployed on the edge, but there is a strong desire to see what is going on across the entire system. That means pushing data from all the edge locations into the cloud and offering reports based on that.

To answer those needs, RavenDB 5.2 has the OLAP ETL feature. At its core, it is a simple concept: RavenDB allows you to define a script that will transform your data into a tabular format. So far, that is very much the same as the SQL ETL. The interesting bit happens afterward. Instead of pushing the data into a relational database, we’ll generate a set of Parquet files (a columnar data store) and push them to a cloud location.

In the cloud, you can use any data lake solution to process such files, issue reports, etc. For example, you can use Presto or AWS Athena to run queries over the uploaded files.

You can define the ETL process in a single location or across your entire fleet of databases on the edge; they’ll push the data to the cloud automatically and transparently. All the how, when, failure management, retries and other details are handled for you. RavenDB is also capable of integrating with custom solutions, such as generating a one-time token on each upload (no need to expose your credentials on the edge).

The end result is that you have a simple and native option to run large scale queries across your data, even if you are working in a widely distributed system. And even for those who run just a single cluster, you have a wider set of options on how to report and aggregate your data.

time to read 1 min | 88 words

I posted a few weeks ago about a performance regression in our metrics that we tracked down to the disk being exhausted.

We replaced the hard disk with a new one. Can you see what the results were?

[metrics graph: the regression gone after the disk replacement]

This post is here mostly because we were pretty sure that this was the problem, but couldn’t rule out that it was something else. Good to know that we were on track.
