Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 3 min | 439 words

RavenDB has been using the Raft protocol for years now. In fact, we have written three or four different implementations of Raft along the way. I implemented Raft using pure message passing, on top of async RPC, and on top of TCP. I did that using the actor model and direct parallel programming, as well as the usual spaghetti mode.

The Raft paper is beautiful in how it explains a non-trivial problem in a way that is easy to grok, but implementing it still requires dealing with a number of subtleties. I want to discuss some of the ways to implement it successfully. Note that I’m assuming that you are familiar with Raft, so I won’t explain the basics here.

A key problem with Raft implementations is that you have multiple concurrent things happening all at once, on different machines. And you always have the election timer waiting in the background. In order to deal with that, I divide the system into independent threads, each of which has its own task.

I’m going to talk specifically about the leader mode, which is the most complex aspect, usually. In this mode, we have:

  • Leader thread – responsible for determining the current progress in the cluster.
  • Follower thread – once per follower – responsible for communicating with a particular follower.

In addition, we may have values being appended to our log concurrently with all of the above. The key here is that each follower thread will communicate with its follower and push data to it. The overall structure for a follower thread looks like this:
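Here is a minimal sketch of what such a loop might look like. This is not RavenDB's actual code; the types and member names (FollowerProxy, LogEntry, the events) are illustrative assumptions.

    // Illustrative sketch only - not RavenDB's implementation.
    // One such thread runs per follower, for as long as we are the leader.
    private void FollowerThread(FollowerProxy follower)
    {
        // Ping at 1/3 of the election timeout so the follower never times out on us.
        var heartbeat = TimeSpan.FromTicks(_electionTimeout.Ticks / 3);

        while (_running)
        {
            // Either new log entries showed up, or we timed out and must send a heartbeat.
            bool hasNewEntries = _newEntries.Wait(heartbeat);
            if (hasNewEntries)
                _newEntries.Reset();

            LogEntry[] entries = hasNewEntries
                ? _log.GetEntriesAfter(follower.MatchIndex, maxCount: 50) // batch of up to 50 entries
                : Array.Empty<LogEntry>();                                // empty AppendEntries == ping

            var reply = follower.SendAppendEntries(_currentTerm, entries);
            if (reply.Success && entries.Length > 0)
                follower.MatchIndex = entries[entries.Length - 1].Index;

            // Signal the leader thread that something changed; it will figure out the rest.
            _stateChanged.Set();
        }
    }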

What is the idea? We have a dedicated thread that will communicate with the follower. It will either ping the follower with an empty AppendEntries (every 1/3 of the election timeout) or send a batch of up to 50 entries to bring the follower up to date. Note that there is nothing in this code about the machinery of Raft; that isn’t the responsibility of the follower thread. The leader, on the other hand, listens to the notifications from the follower threads, like so:
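Again, a hedged sketch rather than the real thing. The commit-index computation is the textbook "majority of match indexes" rule, and all the names are made up.

    // Illustrative sketch only - the real leader loop handles many more details.
    private void LeaderThread()
    {
        while (_running)
        {
            // Wait for any follower thread to report progress (or time out and re-check anyway).
            _stateChanged.Wait(_electionTimeout);
            _stateChanged.Reset();

            // An entry is committed once a majority of the cluster has it. The leader always
            // has every entry, so we need clusterSize / 2 followers to confirm it as well.
            // (Assumes at least one follower.)
            var matchIndexes = _followers
                .Select(f => f.MatchIndex)
                .OrderByDescending(i => i)
                .ToArray();
            long newCommitIndex = matchIndexes[_clusterSize / 2 - 1];

            if (newCommitIndex > _commitIndex)
            {
                _commitIndex = newCommitIndex;
                // Applying the committed entries is asynchronous - it can take a long
                // time and we need to keep the system in operation meanwhile.
                Task.Run(() => ApplyToStateMachine(newCommitIndex));
            }
        }
    }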

The idea is that each aspect of the system runs independently, and the only communication they have with each other is that one can signal another that it did some work. We can then compute whether that work changed the state of the system.

Note that the code here is merely a draft, missing many details. For example, we aren’t sending the last commit index on AppendEntries, and committing the log is an asynchronous operation, since it can take a long time and we need to keep the system in operation.

time to read 2 min | 363 words

A few days ago I posted about looking at GitHub projects for junior developer candidates. One of the things that is very common in such scenarios is seeing them use string concatenation for queries, which I hate. I went to a random candidate’s GitHub profile just now and found this gem:

The number of issues that I have with this code is legion.

  • Not closing the connection or disposing the command.
  • The if can be dropped entirely.
  • And, of course, the actual SQL INJECTION vulnerability in the code.
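For contrast, here is roughly what a safe version of that kind of lookup looks like. This is a hypothetical sketch, since the original snippet isn’t reproduced here; the table, columns, connection string and method name are invented for illustration.

    using Microsoft.Data.SqlClient; // or System.Data.SqlClient

    // Hypothetical safe version: parameters instead of string concatenation,
    // and using statements so the connection and command are disposed.
    public static string FindUserEmail(string connectionString, string userName)
    {
        using var connection = new SqlConnection(connectionString);
        using var command = new SqlCommand(
            "SELECT Email FROM Users WHERE Name = @name", connection);
        command.Parameters.AddWithValue("@name", userName);

        connection.Open();
        return command.ExecuteScalar() as string;
    }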

There is a reason that I have such a reaction to this type of code, even when looking at junior developer candidates. For them, this is acceptable, I guess. They are learning and focusing mostly on what is going on, not the myriad of taxes that you have to pay in order to get something to production. This was never meant to be production code (I hope, at least), and I’m not judging it on that level. But I have to very consciously remind myself of this fact whenever I run into code like this (and as I said, this is all too common).

The reason I have such a visceral reaction to this type of code is that I see it in production systems all too often. And that leads to nasty stuff like this:

And this code led to a 70GB data leak on Gab. The killer for me is that this code was written by someone with 23 years of experience.

I actually had to triple check what I was seeing when I read the code the first time, because I wasn’t sure that it was actually possible. I thought maybe it was some fancy processing done to avoid SQL injection, not that it was basically string interpolation.

Some bugs are things that you can excuse. A memory leak or a double free are things that will happen to anyone who writes in C, regardless of experience and how carefully they write. They are often subtle and easy to miss, happening in corner cases of error handling.

This sort of bug is a big box of red flags. It is also on fire.

time to read 2 min | 204 words

RavenDB Cloud is now offering HIPAA Compliant Accounts.

HIPAA stands for the Health Insurance Portability and Accountability Act, a set of rules and regulations that health care providers and their business associates need to comply with.

That means strictly limiting access to Protected Health Information (PHI) and Personally Identifiable Information (PII), as well as meeting audit and security requirements. In short, if you deal with medical information in the United States, this is something that you need to deal with. In the rest of the world, there are similar standards and requirements.

With HIPAA compliant accounts, RavenDB Cloud takes on itself a lot of the details around ensuring that your data is stored in a safe environment and in a manner that matches the HIPAA requirements. For example, the audit logs are maintained for a minimum of six years. In addition, there are further protections on accessing your cluster, and we enforce a set of rules to ensure that you don’t accidentally expose private data.

This feature ensures that you can easily run HIPAA compliant systems on top of RavenDB Cloud with a minimum of hassle.

time to read 5 min | 970 words

A customer reported that on their system, they suffered from frequent cluster elections in some cases. That is usually an indication that the system resources are hit in some manner. From experience, that usually means that the I/O on the machine is capped (very common in the cloud) or that there is some network issue.

The customer was able to rule these issues out. The latency to the storage was typically below a millisecond, and the maximum latency never exceeded 5 ms. The network monitoring showed that everything was fine as well. The CPU was hovering around 7%, and there was no obvious reason for the issue.

Looking at the logs, we saw very suspicious gaps in the servers’ activity, but with no good reason for them. Furthermore, the customer was able to narrow the issue down to a single scenario: resetting the indexes would cause the cluster state to become unstable. And it did so with very high frequency.

“That is flat out impossible”, I said. And I meant it. Indexing workload is something that we have a lot of experience managing, and in RavenDB 4.0 we made some major changes to our indexing structure to handle exactly this kind of scenario. In particular, this meant that indexing:

  • Runs in dedicated threads.
  • Is scoped to run outside certain cores, to leave the OS capacity to run other tasks.
  • Self-monitors and knows when to wind down to avoid impacting system performance.
  • Indexing threads run with lower priority.
  • The cluster state, on the other hand, runs with high priority.

The entire thing didn’t make sense. However… the customer did a great job in setting up an environment where they could show us: click on the reset button, and the cluster becomes unstable.  So it is impossible, but it happens.

We explored a lot of things around this issue. The machine is big and running multiple NUMA nodes, so maybe it was some interaction with that? It was highly unlikely, and eventually didn’t pan out, but that is one example of the things that we tried.

We set up a similar cluster on our end and gave it 10x more load than the customer did, on a set of machines that had a fraction of the customer’s power. The cluster and the system state remained absolutely stable.

I officially declared that we were in a state of perplexation.

When we ran the customer’s own scenario on our system, we saw something, but nothing like what we saw on their end. One of the things that we usually do when we investigate resource constraint issues is to give the machines under test a lot less capability. Less memory and slower disks, for example, make it much easier to surface many problems. But the worse we made the situation for the test cluster, the better the results became.

We changed things up. We gave the cluster machines with 128 GB of RAM and fast disks and tried it again. The situation immediately reproduced.

Cue facepalm sound here.

Why would giving more resources to the system cause instability in the cluster? Note that the other metrics also suffered, which made absolutely no sense.

We started digging deeper and we found the following index:
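The customer’s actual index isn’t reproduced here, but it was roughly of this shape. The collection, class and field names below are an illustrative reconstruction, not their code.

    // Illustrative reconstruction - a trivial map index over a single field.
    public class Items_ByState : AbstractIndexCreationTask<Item>
    {
        public Items_ByState()
        {
            Map = items => from item in items
                           select new
                           {
                               item.State // looks harmless... unless State is a huge array of objects
                           };
        }
    }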

It is about as simple an index as you can imagine, and it should cause absolutely no issues for RavenDB. So what was going on? We then looked at the documents…

I would expect the State field to be a simple enum property. But it is an array that describes the state changes in the system. This array also holds complex objects. The size of the array is about 450 items on average, and I saw it hit a max of 13,000 items.

That helped clarify things. During indexing, we have to process the State property, and because it is an array, we index each of the elements inside it. That means that for each document, we’ll index between 400 and 13,000 items for the State field. What is more, we have complex objects to index. RavenDB will index each of those as a JSON string, so effectively the indexing would generate a lot of strings. These strings are going to be held in memory until the end of the indexing batch. So far, so good, but the problem in this case was that there were enough resources to have a big batch of documents.

That means that we would generate over 64 million string objects in one of those batches.

Enter GC, stage left.

The GC will be invoked based on how many allocations you have (in this case, a lot) and how many live objects you have (in this case, also a lot, until the index batch is completed). However, because we ran the GC multiple times during the indexing batch, we had promoted significant numbers of objects to the next generation, and Gen1 or Gen2 collections are far more expensive.

Once we knew what the problem was, it was easy to find a fix: don’t index the State field. Given that the values being indexed were JSON strings, it was unlikely that the customer actually queried on them (later confirmed by talking to the customer).

On the RavenDB side, we added monitoring for the allocation frequency and will close the indexing batch early to prevent handing the GC too much work all at once.

The reason we failed to reproduce this on lower end machines was simple: RavenDB was already using enough memory that we closed the batch early, before we could gather enough objects to cause the GC to really work hard. When running on a big machine, it had time to get the ball rolling and hand the whole big mess to the GC for cleanup.

time to read 1 min | 172 words

I’m really happy to announce that RavenDB Cloud now offers shared instances. These are full production systems, with three separate nodes deployed in separate availability zones for maximum availability.

Shared instances are meant for users that have a light load but still want to ensure high availability at a low cost.

We currently offer PS10 (about $30 / month) and PS20 (about $50 / month) plans in both AWS and Azure.

Shared plans are just that: you are getting shared compute resources. You are still fully isolated from other customers and from other instances. However, because the hardware resources you are using are shared with other customers, you can expect lower capacity. In our tests, such instances handled light loads and typical hobbyist applications nicely, with no issues. You can also freely scale up from shared plans to basic or higher at any time.

Give it a try.

time to read 4 min | 628 words

I posted about the @refresh feature in RavenDB, explaining why it is useful and how it can work. Now, I want to discuss a possible extension to this feature. It might be easier to show than to explain, so let’s take a look at the following document:
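Since this is a proposed feature and the original sample isn’t reproduced here, here is a hedged sketch of what such a document might look like. The field names, dates, and script bodies are my own illustration, not a final design.

    {
        "Tenant": "tenants/1-A",
        "Amount": 1500,
        "Status": "Pending",
        "@metadata": {
            "@collection": "LeasePayments",
            "@refresh": [
                {
                    "Time": "2021-06-04T00:00:00.0000000Z",
                    "Script": "this.LateFee = 50;"
                },
                {
                    "Time": "2021-06-30T00:00:00.0000000Z",
                    "Script": "this.Status = 'PastDue';"
                }
            ]
        }
    }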

The idea is that in addition to the data inside the document, we also specify behaviors that will run at specified times. In this case, if the user is three days late in paying the rent, they’ll have a late fee tacked on. If enough time has passed, we’ll mark this payment as past due.

The basic idea is that in addition to just having a @refresh timer, you can also apply actions. And you may want to apply a set of actions, at different times. I think that the lease payment processing is a great example of the kind of use cases we envision for this feature. Note that when a payment is made, the code will need to clear the @refresh array, to avoid it being run on a completed payment.

The idea is that you can apply operations to documents at a future time, automatically. This is a way to enhance your documents with behaviors and policies with ease. You don’t need to set up your own code to execute this; you can simply let RavenDB handle it for you.

Some technical details:

  • RavenDB will take the time from the first item in the @refresh array. At the specified time, it will execute the script, passing it the document to be modified. The @refresh item we are executing will be removed from the array, and if there are additional items, the next one will be scheduled for execution.
  • Only the first element in the @refresh array is considered. So if the items aren’t sorted by date, the first one will be executed and the document persisted again. The next one (which was earlier than the first one) will then be ready for execution, so it will run on the next tick.
  • Once all the items in the @refresh array have been processed, RavenDB will remove the @refresh metadata property.
  • Modifications to the document because of the execution of @refresh scripts are going to be handled as normal writes. It is just that they are executed by RavenDB directly. In other words, features such as optimistic concurrency, revisions and conflicts are all going to apply normally.
  • If any of the scripts cause an error to be raised, the following will happen:
    • RavenDB will not process any future scripts for this document.
    • The full error information will be saved into the document with the @error property on the failing script.
    • An alert will be raised for the operations team to investigate.
  • The scripts can do anything that a patch script can do. In other words, you can put(), load(), del() documents in here.
  • We’ll also provide a debugger experience for this in the Studio, naturally.
  • Amusingly enough, the script is able to modify the document, which obviously include the @refresh metadata property. I’m sure you can imagine some interesting possibilities for this.

We also considered another option (look at the Script property):
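A purely hypothetical sketch of what that might look like; the exact shape of such a reference was never settled, and the document id and property name below are invented.

    {
        "@metadata": {
            "@refresh": [
                {
                    "Time": "2021-06-04T00:00:00.0000000Z",
                    "Script": "scripts/lease-payments#ApplyLateFee"
                }
            ]
        }
    }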

The idea is that instead of specifying the script to run inline, we can reference a property on a document. The advantage is that we can apply changes globally much more easily; we can fix a bug in the script once. The disadvantage is that you may be modifying a script for new values without accounting for the old documents that may be referencing it. I’m still of two minds about whether we should allow a script reference like this.

This is still an idea, but I would like to solicit your feedback on it, because I think that this can add quite a bit of power to RavenDB.

time to read 3 min | 415 words

I ran into this article, which talks about building a cache service in Go to handle millions of entries. Go ahead and read the article; there is also an associated project on GitHub.

I don’t get it. Rather, I don’t get the need here.

The authors seem to want a way to store a lot of data (for a given value of lots) that is accessible over REST. They need to be able to run 5,000 – 10,000 requests per second against it, and also be able to expire things.

I decided to take a look into what it would take to run this in RavenDB. It is pretty late here, so I was lazy. I ran wrk against our live-test instance, telling it to open 1,024 connections and repeatedly get the same document.

The live-test machine peaked at about 80% CPU while this was running. I should note that the live-test instance is pretty much the cheapest one that we could get away with, and it is far from me.

Ping time from my laptop to the live-test is around 230 – 250 ms. Right around the numbers that wrk is reporting. I’m using 1,024 connections here to compensate for the distance. What happens when I run this locally, without the huge distance?

So I can do more than 22,000 requests per second (on a 2016 era laptop, mind) with a max latency of 5.5 ms (which is what the original article asked for as the average time). Granted, I’m simplifying things here, because I’m checking a single document and not including writes. But 5,000 – 10,000 requests per second are small numbers for RavenDB. Very easily achievable.

RavenDB even has the @expires feature, which allows you to specify a time at which a document will automatically be removed.
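Here is a minimal sketch of what that looks like from the .NET client, assuming document expiration is enabled on the database. The store, entity class and document id are placeholders.

    // Sketch: mark a document so RavenDB deletes it automatically at the given time.
    using (var session = store.OpenSession())
    {
        var entry = new CacheEntry { Value = "cached payload" };
        session.Store(entry, "cache/users/1");

        session.Advanced.GetMetadataFor(entry)["@expires"] =
            DateTime.UtcNow.AddMinutes(30).ToString("o");

        session.SaveChanges();
    }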

The nice thing about using RavenDB for this sort of feature is that millions of objects and gigabytes of data are not something that is of particular concern for us. Raise that by an order of magnitude, and that is our standard benchmark. You’ll need to raise it by a few more orders of magnitude before we start taking things seriously.

time to read 5 min | 822 words

This post asked an interesting question: why are hash tables so prevalent for in-memory usage and (relatively) rare in databases? There are some good points in the post, as well as in the Hacker News thread.

Given that I just did a spike of a persistent hash table and have been working on database engines for the past decade, I thought that I might throw my own two cents into the ring.

A B+Tree is a profoundly simple concept. You can explain it in 30 minutes, and it makes sense. There are some tricky bits to a proper implementation, for sure, but they are more related to performance than correctness.

Hash tables sound simple, but the moment you have to handle collisions gracefully, you are going to run into real challenges. It is easy to get into nasty bugs with hash tables, the kind that silently corrupt your state without you realizing it.

For example, consider the following code:
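The code from the original post isn’t reproduced here, but the bug it showed looks roughly like this. This is an illustrative sketch (fixed size, no resizing, no tombstones), not the original code.

    // A deliberately buggy open-addressing (linear probing) hash table.
    class NaiveHashTable
    {
        private readonly string[] _keys = new string[8];
        private readonly string[] _values = new string[8];

        private int Slot(string key) => (key.GetHashCode() & 0x7fffffff) % _keys.Length;

        public void Put(string key, string value)
        {
            int i = Slot(key);
            while (_keys[i] != null && _keys[i] != key)
                i = (i + 1) % _keys.Length;   // collision: move to the next available slot
            _keys[i] = key;
            _values[i] = value;
        }

        public string Get(string key)
        {
            int i = Slot(key);
            while (_keys[i] != null)
            {
                if (_keys[i] == key)
                    return _values[i];
                i = (i + 1) % _keys.Length;
            }
            return null;                       // probe stopped at an empty slot
        }

        // BUG: clearing the slot breaks the probe chain for any key that was pushed
        // past this slot by a collision - no tombstone, no fix-ups of relocated entries.
        public void Remove(string key)
        {
            int i = Slot(key);
            while (_keys[i] != null && _keys[i] != key)
                i = (i + 1) % _keys.Length;
            _keys[i] = null;
            _values[i] = null;
        }
    }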

This is a hash table using linear probing. Collisions are handled by putting the entry in the next available slot. And in this case, we have a problem. We want to put “ghi” in position zero, but we can’t, because it is already full. We move it to the first available location. That is well understood and easy. But when we delete “def”, we remove the entry from the array and forget to do fixups for the relocated “ghi”; that value is now effectively gone from the table. This is the kind of bug that you need the moon to be in a certain position while a cat sneezes to figure out.

A B+Tree also maps very nicely to a persistent model, but it is entirely non-obvious how you go from the notion of a hash table in memory to one on disk. Extendible hashing exists, and has for a very long time (literally longer than I have been alive), but it is not very well known or commonly used. It is a beautiful algorithm, mind you. But just mapping the concept to a persistence model isn’t enough; typically, you also have a bunch of additional requirements for an on-disk data structure. In particular, concurrency in database systems is frequently tied closely to the structure of the tree (page level locks).

There is also the cost issue. When talking about disk based data access, we are rarely interested in the actual O(N) complexity; we are far more interested in the number of disk seeks involved. Using extendible hashing, you’ll typically get 1 – 2 disk seeks. If the directory is in memory, you have only one, which is great. But with a B+Tree, you can easily make sure that the top levels of the tree are also memory resident (similar to the extendible hash directory), which typically leads to a single disk access to read the data. So in many cases, you get roughly the same performance from either option.

Related to the cost issue, you also have to consider security risks. There have been a number of attacks against hash tables that relied on generating hash collisions. The typical in-memory fix is to randomize the hash to avoid this, but if you are persistent, you have to use the same hash function forever. That means that an attacker can very easily kill your database server by generating bad keys.

But these are all relatively minor concerns. The key issue is that a B+Tree is just so much more useful. A B+Tree allows me to:

  • Store / retrieve my data by key
  • Perform range queries
  • Index using a single column
  • Index using multiple columns (and then search based on full / partial key)
  • Iterate over the data in specified order

Hashes allow me to:

  • Store / retrieve my data by key

And that is pretty much it. So a B+Tree can do everything that hashes can, and so much more. They are typically as fast where it matters (disk reads) and more than sufficiently fast everywhere else.

Hashes are only good for that one particular scenario of doing a lookup by exact key. That is actually a lot more limited than you might think.

Finally, and quite importantly, you have to consider the fact that B+Trees have certain access patterns that they excel at. For example, inserting sorted data into a B+Tree is going to be a joy. Scanning the B+Tree in order is also trivial and highly performant.

With hashes? There isn’t an optimal access pattern for inserting data into a hash table. And while you can scan a hash at roughly the same cost as you would a B+Tree, you are going to get the data out of order. That makes it a lot less useful than it would appear upfront.

All of that said, hashes are still widely used in databases. But they tend to be used as specialty tools, deployed carefully and for very specific tasks. This isn’t the first thing that you’ll reach for; you need to justify its use.

time to read 8 min | 1528 words

A common question I field in many customer inquiries is how RavenDB compares to one relational database or another. Recently we got a whole spate of questions on RavenDB vs. PostgreSQL, and I thought that it might be easier to just post about it here rather than answer the question each time. Some of the differences are general, applying to all or most relational databases, but I’m also going to touch on specific features of PostgreSQL that matter for this comparison.

The aim of this post is to highlight the differences between RavenDB and PostgreSQL, not to be an in-depth comparison of the two.

PostgreSQL is a relational database, storing the data using tables, columns and rows. The tabular model is quite entrenched here, although PostgreSQL has the notion of JSON columns. 

RavenDB is a document database, which stores JSON documents natively. These documents are schema-free and allow arbitrarily complex structures.

The first key point that distinguishes these databases is the overall mindset. RavenDB is meant to be a database for OLTP systems (business applications) and has been designed explicitly for this. PostgreSQL tries to cover both OLTP and OLAP scenarios and tends to place a lot more emphasis on the role of the administrator and operations teams. For example, PostgreSQL requires VACUUM, statistics refreshes, manual index creation, etc. RavenDB, on the other hand, is designed to run in a fully automated fashion. There isn’t any action that an administrator needs to take (or schedule) to ensure that RavenDB will run properly.

RavenDB is also capable of configuring itself dynamically, adjusting to the real world load it sees based on feedback from the operational environment. For example, the more queries a particular index serves, the more resources it will be granted by RavenDB. Another example is how RavenDB processes queries in general. Its query analyzer will run through the incoming queries and figure out the best set of indexes needed to answer them. RavenDB will then go ahead and create these indexes on the fly. Such an action tends to be scary for users coming from relational databases, but RavenDB was designed upfront for exactly these scenarios. It is able to build the new indexes without adding too much load to the server and without taking any locks. Other tasks that are typically handled by the DBA, such as configuring the system, are handled dynamically by RavenDB based on actual operational behavior. RavenDB will also clean up superfluous indexes and reduce the resources available to indexes that aren’t in common use. All of that without a DBA to perform acts of arcane magic.
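For example, a dynamic query like the following will be answered by an index that RavenDB creates on the fly if no existing index can serve it. The Order model is illustrative, and the auto-index name in the comment is just the typical naming pattern.

    // Sketch: a dynamic query - RavenDB's query analyzer will create a matching
    // auto-index (something like Auto/Orders/ByCompany) if one doesn't already exist.
    using (var session = store.OpenSession())
    {
        var orders = session.Query<Order>()
            .Where(o => o.Company == "companies/1-A")
            .ToList();
    }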

Another major difference between the databases is the notion of schema. PostgreSQL requires you to define your schema upfront and adhere to it. The fact that you can use JSON at times to store data provides an important escape hatch, but while PostgreSQL allows most operations on JSON data (including indexing it), it is unable to collect statistics on such data, leading to slower queries. RavenDB uses a schema-free model: documents are grouped into collections (similar to tables, but without the constraint of having the same schema) but have no fixed schema. Two documents in the same collection can have distinct structures. Typical projects using JSON columns in PostgreSQL will tend to pull specific columns from the JSON into the table itself, to allow for better integration with the query engine. Nevertheless, PostgreSQL’s ability to handle both relational and document data gives it a lot of brownie points and enables a lot of sophisticated scenarios for users.

RavenDB, on the other hand, is a pure JSON database, which natively understands JSON. That means the query language is much nicer for queries that involve JSON and comparable for queries that don’t have a dynamic structure. In addition to being able to query the JSON data, RavenDB also allows you to run aggregations using Map/Reduce indexes. These are similar to materialized views in PostgreSQL, but unlike those, RavenDB updates the indexes automatically and incrementally. That means that you can query large scale aggregations in microseconds, regardless of data size.
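As a rough illustration, a Map/Reduce index in the .NET client looks something like this. The Order model and its fields are assumptions for the example, not a prescribed schema.

    // Sketch of a Map/Reduce index: totals per company, kept up to date incrementally.
    public class Orders_TotalByCompany : AbstractIndexCreationTask<Order, Orders_TotalByCompany.Result>
    {
        public class Result
        {
            public string Company { get; set; }
            public decimal Total { get; set; }
        }

        public Orders_TotalByCompany()
        {
            Map = orders => from order in orders
                            select new Result
                            {
                                Company = order.Company,
                                Total = order.Lines.Sum(l => l.PricePerUnit * l.Quantity)
                            };

            Reduce = results => from result in results
                                group result by result.Company into g
                                select new Result
                                {
                                    Company = g.Key,
                                    Total = g.Sum(x => x.Total)
                                };
        }
    }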

For complex queries that touch on relationships between pieces of data, we have very different behaviors. If the relations inside PostgreSQL are stored as columns and use foreign keys, it is going to be efficient to deal with them. However, if the data is dynamic or complex, you’ll want to put it in a JSON column. At this point, the cost of joining relations skyrockets for most data sets. RavenDB, on the other hand, allows you to follow relationships between documents naturally, at indexing or querying time. For more complex relationship work, RavenDB also has graph querying, which allows you to run complex queries on the shape of your data.

I mentioned before that RavenDB was designed explicitly for business applications, which means that it has a much better feature set around that use case. Consider the humble Customer page, which needs to show the customer’s details, recent orders (and their total), recent support calls, etc.

When querying PostgreSQL, you’ll need to make multiple queries to fetch this information. That means that you’ll have to deal with multiple network roundtrips, which in many cases are the most expensive part of actually querying the database. RavenDB, on the other hand, has the Lazy feature, which allows you to combine multiple separate queries into a single network roundtrip. This seemingly simple feature can have a massive impact on your overall performance.
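A sketch of what that looks like in the .NET client; the model classes, property names and ids are placeholders for the example.

    // Sketch: three separate lookups, sent to the server as a single roundtrip.
    using (var session = store.OpenSession())
    {
        var customer = session.Advanced.Lazily.Load<Customer>("customers/1-A");
        var recentOrders = session.Query<Order>()
            .Where(o => o.Customer == "customers/1-A")
            .Take(10)
            .Lazily();
        var supportCalls = session.Query<SupportCall>()
            .Where(c => c.Customer == "customers/1-A")
            .Take(10)
            .Lazily();

        // Executes all pending lazy operations in one request.
        session.Advanced.Eagerly.ExecuteAllPendingLazyOperations();

        var customerDetails = customer.Value;
        var orders = recentOrders.Value;
        var calls = supportCalls.Value;
    }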

A similar feature is includes. It is very common, when you load one piece of data, to also want related information. With RavenDB, you can indicate that to the database engine, which will send you all the results in one shot. With a relational database, you can use a join (with the impact on the shape of the results, the Cartesian product issue and possible performance impact) or issue multiple queries. A simple change, but a significant improvement over the alternative.
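Again a hedged sketch; the document classes and the Company property are assumptions for illustration.

    // Sketch: load an order and pull the related company document in the same response.
    using (var session = store.OpenSession())
    {
        var order = session.Include<Order>(o => o.Company)
            .Load("orders/1-A");

        // Already in the session thanks to the include - no extra server call.
        var company = session.Load<Company>(order.Company);
    }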

RavenDB is a distributed database by default, while PostgreSQL is a single node by default. There are features and options (log shipping, logical replication, etc.) that allow PostgreSQL to run as a cluster, but they tend to be non-trivial to set up, configure and maintain. With RavenDB, even if you are running a single node, you are actually running a cluster. And when you have multiple nodes, it is trivial to join them into a cluster, from which point on you can just manage everything as usual. Features such as multi-master replication, the ability to work while disconnected and widely distributed clusters are native parts of RavenDB and integrate seamlessly, while they tend to be of the “some assembly required” variety in PostgreSQL.

The two databases are very different from one another and tend to be used for separate purposes. RavenDB is meant to be the application database; it excels at being the backend of OLTP systems and focuses on that to the exclusion of all else. PostgreSQL tends to be more general, suitable for dynamic queries, reports and exploration as well as OLTP scenarios. It may not be a fair comparison, but I have literally built RavenDB specifically to be better than a relational database for the common needs of business applications, and ten years in, I think it still shows significant advantages in that area.

Finally, let’s talk about performance. RavenDB was designed based on failures in the relational model. I spent years as a database performance consultant, going from customer to customer fixing the same underlying issues. When RavenDB was designed, we took that into account. The paper OLTP – Through the Looking Glass, and What We Found There has some really interesting information, including the finding that about 30% of a relational database’s performance is spent on locking.

RavenDB uses MVCC, just like PostgreSQL. Unlike PostgreSQL, RavenDB doesn’t need to deal with transaction id wraparound, VACUUM costs, etc. Instead, we maintain MVCC not on the row level but on the page level. There are a lot fewer locks to manage and far less complexity internally. This means that read transactions are completely lock free and don’t have to do any coordination with write transactions. That has an impact on performance, and RavenDB can routinely achieve benchmark numbers on commodity hardware that are usually reserved for expensive benchmark machines.

One of our developers got a new machine recently and did some light benchmarking. Running in WSL (Ubuntu on Windows), RavenDB was able to exceed 115,000 writes / sec and 275,000 reads / sec.

And let’s be honest, we weren’t really trying hard here, but we still got nice numbers. A lot of that comes from designing our internals around a much simpler architecture and shape, and it shows. The nice thing is that these advantages are cumulative. RavenDB is fast, but you also gain the benefits of a protocol that lets you issue multiple queries in a single roundtrip, include additional results, and dynamically adjust to the operating environment.

It Just Works.
