Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:


+972 52-548-6969

Posts: 7,145 | Comments: 50,118

Privacy Policy Terms
filter by tags archive
time to read 4 min | 663 words

This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.

GP3 disks are about 20% cheaper than their GP2 equivalent. What is more, they come with a guarantee level of performance even before you purchase additional IOPS. Consider the following two disks:

  Size IOPS MB/S Price
GP2 512GB 1,536 250 51.2 USD
GP3 512GB 3,000 125 40.9 USD
GP3 512GB 4,075 250 51.2 USD

In other words, for the same disk, we can get a much better baseline performance at a cheaper price. What isn’t there not to like?

The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower, even if they are able to run more IOPS, their latency is higher.

A really key observation from us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breath a huge sigh of relief.

RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to account for the I/O being the slowest thing around and have already mostly coded around that. An additional millisecond in disk latency doesn’t matter that much in the grand scheme of things.

However… the fact that this doesn’t provide I/O burst is a huge plus for us. RavenDB can easily deal with slow I/O, what it find it very hard to deal with is an environment that very rapidly change its operational characteristics.

Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.

Until all the I/O credits are gone and we are forced into a screeching halt. That means, for example, that we read from the network at a rate of 250MB/sec, but we are unable to write to the disk at this level. There is a negative balance of 125MB/sec that needs to be stored some where. We can buffer that in memory, of course, but that only work for so long. That means that we have to put a huge break all of a sudden, which the rest of the eco system isn’t happy with. For example, the other side that is sending us data at 250MB /sec, they are likely not going to be able to respond in time to the shift is our behavior. It is very likely that the network connection would congest and break in this case.

All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.

As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost to write to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a true traffic jam

The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.

time to read 3 min | 548 words

Voron is RavenDB’s storage engine. It is how we store data, keep transactions and in generally get a lot of our abilities. In this post, I want to talk about the way RavenDB manages disk space internally.  Voron uses a single data file to do its work, the data file is divided into 8KB pages, like so:

Voron uses eager disk allocations to reserve disk space from the operating system. Each time the space inside the file runs out, Voron will double the size of the file. That last until the file size reaches 2GB, after which RavenDB will grow by 1GB at a time. This behavior ensures that Voron gives the underlying file system enough information to provide the database with a continuous range of disk space. In other words, we grab disk space in large chunks to avoid fragmentation of the data file.  

What happens when you delete data, however? Voron mark the free space in its free list and will use that space before it will allocate more disk space from the operating system.

Why aren’t we releasing the disk space back to the operating system? The simplest reason is that it isn’t an just the data at the end of the file that is freed. In fact, like in the image above, free and busy segments are interwoven in the file. We can’t just truncate the file.

Internal references inside of RavenDB make use of the position data inside the file, so just moving the data won’t help. Instead, you have to compact the data. That forces us to re-write the entire database layout from scratch and fixes those references.  That is an offline operation, however.

For the most part, it doesn’t actually matter. RavenDB will use the internal free space as needed, so it isn’t like it is actually lost.

One feature that we are considering for version 6.0 of RavenDB is hole punching. That means that we’ll make use of advanced file system API to free the disk space allocated to RavenDB even mid file.

On Linux, that means using FALLOC_FL_PUNCH_HOLE. On Windows, that means using FSCTL_SET_ZERO_DATA.

That will have the advantage of freeing disk space back to the operating system without needing user intervention. In particular, that is going to make it so a user that delete data to free disk space see the free space reflected in the OS metrics.

There are problems with this approach, however. First, the size of the file remains the same, which leads to interesting questions. Consider:


Second, this defeats the purpose of wanting to optimize disk allocations. If we free disk space in this manner, when we get it back, it may no longer be continuous on the disk. That said, it is not that big a problem in the days of SSDs and NVMes as it was at the time of the rotational hard disk.

Then, you may get into a very bad situation in which Voron tries to use disk space that it had allocated mid file (but was already freed) but it can’t, because the disk is full. Right now, this is simply an impossible error, with hole punching, we need to consider how to deal with this.

time to read 2 min | 370 words

RavenDB is written in C# and .NET, unlike most of the database engines out there. The other databases are mostly written in C, C++ and sometimes Java.

I credit the fact that I wrote RavenDB in C# as a major part of the reason I was able to drive it forward to the point it is today. That wasn’t easy and there are a number of issues that we had to struggle with as a result of that decision. And, of course, all the other databases at the playground look at RavenDB strangely for being written in C#.

In RavenDB 4.0, we have made a lot of architectural changes. One of them was to replace some of the core functionality of RavenDB with a C library to handle core operations. Here is what this looks like:

However, there is still a lot to be done in this regard and that is just a small core.

Due to the COVID restrictions, I found myself with some time on my hands and decided that I can spare a few weekends to re-write RavenDB from scratch in C. I considered using Rust, but that seemed like to be over the top.

The results of that can be seen here. I kept meticulous records of the process of building this, which I may end up publishing at some point. Here is an example of how the code looks like:

The end result is that I was able to take all the knowledge of building and running RavenDB for so long and create a compatible system in not that much code. When reading the code, you’ll note methods like defer() and ensure(). I’m using compiler extensions and some macro magic to get a much nicer language support for RAII. That is pretty awesome to do in C, even if I say so myself and has dramatically reduced the cognitive load of writing with manual memory management.

An, of course, following my naming convention, Gavran is Raven in Croatian.

I’ll probably take some time to finish the actual integration, but I have very high hopes for the future of Gavran and its capabilities. I’m currently running benchmarks, you can expect them by May 35th.

time to read 7 min | 1398 words

A couple of days, a fire started in OVH’s datacenter. You can read more about this here:

They use slightly different terminology, but translating that to the AWS terminology, an entire "region” is down, with SGB1-4 being “availability zones” in the region.

For reference, there are some videos from the location that make me very sad. This is what it looked like under active fire:


I’m going to assume that this is a total loss of everything that was in there.

RavenDB Cloud isn’t offering any services in any OVH data centers, but this is a good time to go over the Disaster Recovery Plan for RavenDB and RavenDB Cloud. It is worth noting that the entire data center has been hit, with the equivalent to an entire AWS region going down.

I’m not sure that this is a fair comparison, it doesn’t looks like that SBG 1-4 are exactly the same thing as AWS availability, but it is close enough to draw parallels.

So far, at least, there have been no cases where Amazon has lost an entire region. There were occasions were a whole availability zone was down, but never a complete region. The way Amazon is handling Availability Zones seems to most paranoid, with each availability zone distanced “many kilometers” from each other in the same region. Contrast that with the four SGB that all went down. For Azure, on the other hand, they explicitly call out the fact that availability zones may not provide sufficient cover for DR scenarios. Google Cloud Platform also provides no information on the matter. For that matter, we also have direct criticism on the topic from AWS.

Yesterday, on the other hand, Oracle Cloud had a DNS configuration error that took effectively the entire cloud down.  The good news is that this is just inability to access the cloud, not actual loss of a region, as was the case on OVH. However, when doing Disaster Recovery Planning, having the the entire cloud drop off the face of the earth is also something that you have to consider.

With that background out of the way, let’s consider the impact of losing two availability zones inside AWS, losing a entire region in Azure or GCP or even losing an entire cloud platform. What would be the impact on a RavenDB cluster running in that scenario?

RavenDB is designed to be resilient. Using RavenDB Cloud, we make sure that each of the nodes in the cluster is running on a separate availability zone. If we lose two zones in a region, there is still a single RavenDB instance that can operate. Note that in this case, we won’t have a quorum. That means that certain operations won’t be available (you won’t be able to create new databases, for example) but read and write operations will work normally and your application will fail over silently to the remaining server. When the remaining servers recover, RavenDB will update them with all the missing data that was modified while they were down.

The situation with OVH is actually worse than that. In this case, a datacenter is gone. In other words, these nodes aren’t coming back. RavenDB will allow you to perform emergency operations to remove the missing nodes and rebuilt the cluster from the single surviving node.

What about the scenario where the entire region is gone? In this case, if there are no more servers for RavenDB to run on, it is going to be down. That is the time to enact the Disaster Recovery Plan. In this case, it means deploying a new RavenDB Cluster to a new region and restoring from backup.

RavenDB Cloud ensures full daily backups as well as hourly incremental backups for all databases, so the amount of data loss will be minimal. That said, where are the backups located?

By default, RavenDB stores the backups in S3, in the same region as the cluster itself. Amazon S3 has the best durability guarantees in the market. This is beyond just the number of nines that they provide in terms of data durability. A standard S3 object is going to be residing in three separate availability zones. As mentioned, for AWS, we have guarantees about distance between those availability zones that we haven’t seen from other cloud providers. For that reason, when your cluster reside in AWS, we’ll store the backups on S3 in the same region. For Azure and GCP, on the other hand, we’ll also use AWS S3 for backup storage. For a whole host of reasons, we select a nearby region. So a cluster on Azure US East would store its backups on AWS S3 on US-East-1, for example. And a cluster on Azure in the Netherlands will store its backups on AWS S3 on the Frankfurt region. In addition to all other safeguards, the backups are encrypted, naturally.

The cloud has been around for 15 years already (amazing, I know) and so far, AWS has had a great history with not suffering catastrophic failures like the one that OVH has run into. Then again, until last week, you could say the same about OVH, but with 20+ years of operations. Part of the Disaster Recovery Process is knowing what risks are you willing to accept. And the purpose of this post is to expand on what we do and how we plan to react to such scenarios, be they ever so unlikely.

RavenDB actually has a number of features that are applicable for handling these sorts of scenarios. They aren’t enabled in the cloud by default, but they are important to discuss for users who need to have business continuity.

  • Multi-region or Multi-cloud clusters are available. You can setup RavenDB across multiple disparate location in several manners, but the end result is that you can ensure that you have no single point of failure, while still using RavenDB to its fullest potential. This is commonly used in large applications that are naturally geo distributed, but it also serve as a way to not put all your eggs in a single basket.
  • In addition to the default backup strategy (same AWS region on AWS or nearby AWS region for Azure or GCP), you can setup backups to additional regions.

One of the key aspects of business continuity is the issue of the speed in which you can go back to normal operations. If you are running a large database, just the time to restore from backup can be a significant amount of time. If you have a database (or databases) whose backup are in the hundreds of GB range, just the time it takes to get the backups can be measures in many hours, leaving aside the restore costs.

For that reason, RavenDB also support the notion of an offsite observer. That can be a single isolated node or a whole cluster. We can take advantage of the fact that the observer is not in active use and under provision it, in that case, when we need to switch over, we can quickly allocate additional resources to it to ramp it up to production grade. For example, let’s assume that we have a cluster running in Azure Northern Europe region, composed of 3 nodes with 8 cores each. We also have setup an observer cluster in Azure Norway East region. Instead of having to allocate a cluster of 3 nodes with 8 cores each, we can allocate a much smaller size, say 2 cores only (paying less than a third of the original cluster cost as a premium). In the case of disaster, we can respond quickly and within a matter of minutes, the Norway East cluster will be expanded so each of the nodes will have 8 cores and can handle full production traffic.

Naturally, RavenDB is likely to be only a part of your strategy. There is a lot more to ensuring that your users won’t notice that something is happening while your datacenter is on fire, but at least as it relates to your database, you know that you are in good hands.

time to read 2 min | 363 words

A few days ago I posted about looking at GitHub projects for junior developer candidates. One of the things that is very common in such scenario is to see them use string concatenation for queries, I hate that. I just reached to a random candidate GitHub profile right now and found this gem:

The number of issues that I have with this code is legion.

  • Not closing the connection or disposing the command.
  • The if can be dropped entirely.
  • And, of course, the actual SQL INJECTION vulnerability in the code.

There is a reason that I have such a reaction for this type of code, even when looking at junior developer candidates. For them, this is acceptable, I guess. They are learning and focusing mostly on what is going on, not the myriad of taxes that you have to pay in order to get something to production. This is never meant to be production code (I hope, at least). I’m not judging this on that level. But I have to very consciously remind myself of this fact whenever I run into code like this (and as I said, this is all too common).

The reason I have such a visceral reaction to this type of code is that I see it in production systems all too often. And that leads to nasty stuff like this:

And this code led to a 70GB data leak on Gab. The killer for me that this code was written by someone with 23 years of experience.

I actually had to triple check what I was seeing when I read the code the first time, because I wasn’t sure that this is actually possible. I thought maybe this is some fancy processing done to avoid SQL injection, not that this is basically string interpolation.

Some bugs are things that you can excuse. A memory leak or a double free are things that will happen to anyone who is writing in C, regardless of experience and how careful they write. They are often subtle and easy to miss, happening in corner cases of error handling.

This sort of bug is a big box of red flags. It is also on fire.

time to read 2 min | 398 words

I run into an interesting scenario the other day. As part of the sign in process, the application will generate a random token that will be used to authenticate the following user requests. Basically, an auth cookie. Here is the code that was used to generate it:

This is a pretty nice mechanism. We use cryptographically secured random bytes as the token, so no one can guess what this will be.

There was a serious issue with this implementation, however. The problem was that on application startup, two different components would race to complete the signup. As you can see from the code, this is meant to be done in such as a way that two such calls within the space of one hour will return the same result. In most cases, this is exactly what happens. In some cases, however, we got into a situation where the two calls would race each other. Both would load the document from the database at the same time, get a(n obviously) different security token and write it out, then one of them would return the “wrong” security token.

At that point, it meant that we got an authentication attempt that was successful, but gave us the wrong token back. The first proposed solution was to handle that using a cluster wide transaction in RavenDB. That would allow us to ensure that in the case of racing operations, we’ll fail one of the transactions and then have to repeat it.

Another way to resolve this issue without the need for distributed transactions is to make the sign up operation idempotent. Concurrent calls at the same time will end up with the same result, like so:

In this case, we generate the security token using the current time, rounded to an hour basis. We use Argon2i (a password hashing algorithm) to generate the required security token from the user’s own hashed password, the current time and some pepper to make it impossible for outsiders to guess what the security token is even if they know what the password is. By making the output predictable, we make the rest of the system easier.

Note that the code above is till not completely okay. If the two request can with millisecond difference from one another, but on different hours, we have the same problem. I’ll leave the problem of fixing that to you, dear reader.

time to read 9 min | 1604 words

imageI was talking to a developer recently and had a really interesting discussion around the notion of consistency. For simplicity’s sake, let’s imagine that we are talking about a game and we need to deal with awarding achievements.

The story in question begins with a seemingly innocent business requirement:

We want to announce a unique achievement in the game. The first player to kill 10,000 rabbits will get a unique class-appropriate item.

Conceptually, that means that we need to write the following code:

The code is simple, easy and trivial. I wish we could end the post at this point, but as you can imagine, the situation is a bit more complex.

Consider the typical architecture of a game, we have the game world, which is composed of multiple servers, often located in different parts of the world. In this case, we can assume that we have three data centers, in separate continents.


Notice that we have a world first scenario here? That means that we need to synchronize the state across all of them in some manner, including when there are issues in the network, failures, etc.

For that matter, when we are talking about an achievement for a character, we can at least be sure that there is a single chain of events that we can follow, but what happens if we were to apply this achievement to a guild? In this case, multiple players may be competing to complete this achievement. If we allow to to also run on different servers (and regions), the situation now become quite complex.

Are there are technical ways to resolve this? Of course there are.

We can use distributed transactions to handle this, at first glance. However, that introduce an utterly unworkable spanner in the works (See what I did here Smile). Games care a lot about performance and latency, forcing us to run a globally distributed transaction on every rabbit kill is not a possible solution. There are stiff performance costs associated with it. For that matter, in a partial failure scenario, you still want to allow players to kill rabbits, even if you lost some servers in another continent. That lead to several additional scenarios that you need to cover:

  • What happened if a player killed rabbits but wasn’t able to record that using the world scope state? Do we also store that in a character / server level state? What happens if we already awarded the achievement and then find out that there are delayed updates in the mix?
  • What happens when two players kill their 10,000th rabbit at times T1 and T2 (where T2 > T1), but this happens near enough to one another that the write about the 2nd player is committed first?
    • Note that this can happen on the same server, on the same region or across different regions.
  • What happens when two players kill their 10,000th rabbit at the same instant?
    • Note that this can happen on the same server, on the same region or across different regions.
    • Note that this can actually happen at different times, but clock divergence will say it happened at the same time
  • What happens when a player kill their 10,000 rabbit, but we had a network hiccup and couldn’t record the action in time?

In fact, there are a multitude of issues that we have to deal with, if we accept the scenario as is. And the impact on the entire system because of the unstated requirements of a single achievement are huge.

That is likely not what the intended result is. We can do some minimal changes to the system and get pretty much what we want at a much reduced cost.  We start by giving up the implicit assumption that we have to award the achievement for the 10,000th within the same tick as the actual kill.  If we give up this requirement, it means that we have a far more relaxed environment to deal with.

We can say that we’ll process the achievements at the end of the next hour, for example. That gives us enough time to get updates from the rest of the system, settle the dust and avoid millisecond level decision making processes. In almost all businesses, there is no such thing as a race condition. The best example of that is “the check is in the mail”. The payment date for a check is not when you cashed it but when you posted it. My uncle used to go to the post office at 4:50 PM on a Friday to post checks. They would get the right time stamp, but would sit in the post office over the weekend before actually being delivered. That gave his 3 – 5 extra days before the check was cashed.

In the case of our game, giving us the time for things to settle make sense. It also make sense from a marketing and community perspective. World wide announcements don’t just happen, they can be scheduled for a particular time frame to maximize impact. Delaying the announcement of such an achievement make it a lot easier from a technical perspective but also give us more business level benefits to work with.

In many situation, we jump to assumptions about the kind of requirements that we have to meet, but usually there is a lot more flexibility. And discovering where you can get some slack can be of tremendous value.

Possible reaction from the business in this scenario can be:

  • We don’t actually care if this is exact. We want to give the achievement when this happens but it is okay if there is a little fudge factor and the “wrong” person won if there is a tiny amount of time difference.
  • It is actually fine if two players get the achievement when it happens at the “same” time.
  • Oh, I didn’t realize that this is so hard. Can we make it a server-wide achievement, then? Would that be easier?

Any of those responses will translate to a great simplification of what we have to deliver. In the first case, we can track the state of bunny killing without the use of distributed transactions and only apply a distributed transaction when we get to the 10,000th rabbit. That is going to be a rare event, so it is fine to get to pay a little extra there.

For that matter, a good question is to ask is how important the operation is. What happened when we have a failure in the distributed transaction when we record the 10,000th rabbit? I’m not talking about someone else getting there first, I”m talking bout a network hiccup or a faulty wire that cause the operation to fail. Do we retry? Do we output an error (and if so, to where?) or just ignore this? Do we need to have a secondary mechanism for checking for errors in the process?

The answer is that it depends, you can get into effectively infinite loop of trying to solve ever more unlikely scenarios. The question is what is the impact on the system at large and is it worth the cost?

For a game achievement, the answer is probably no (but then again, I’m not a gamer). People usually draw the line at money as the place where they’ll do their utmost to avoid issues.  This is a strange scenario, because money specifically has so many ways that you can run compensating actions as part of the normal workflow that you shouldn’t bend yourself into a pretzel to avoid that.

Let’s say that we are offering three unique mounts in our game, selling them for 9.99$. We obviously have to deal with race conditions here. There are only three mounts, but there are more users who want it and are willing to pay for the privilege. We are dealing with money here, so we can assume that we want to be careful about that, the question is, how careful?

We have three mounts, but more users. The payment process itself takes at least a few seconds, so how do you deal with it?

  • Throw the purchase attempts into a queue.
  • Pull the first three offers from the queue and attempt to charge them.
  • If charging failed, we mark the purchase attempt as errored and move on to the next item in the queue.
  • Once we successfully processed three purchases, we mark all the other purchase attempts as failed.

Simple, right?

What happens when a charge took longer than expected and a request timed out, but the charge actually happened? This can happen if you have SMS authentication in the process. So the card company will send a message to the card holder and they have to send an SMS back to approve the transaction. That can take long enough that the transaction will time out. But the user did authorize the transaction, funds were transferred, but you already sold the unique mount to someone else, with worse credit card security features.

What happens then? You can refund the money, provide another unique mount, etc.

The key here is that even for something that may consider critical, the number of failure modes is too high. Trying to handle them and ensuring a consistent world is usually too expensive. It is far better to have the hooks in place to handle failures and apply compensations. They are far rarer and much easier to deal with in isolation.

When you start seeing that something happens on a frequent basis, that is when you want to figure out a way to automate that failure mode.

time to read 3 min | 418 words

We are hiring again (this time for Junior C# Dev positions in Israel). That means that I go through CVs (quite a few, actually).  I like going over the resumes directly, to get a feel for not just a particular candidate but what is, for lack of a better term, the state of the market.

This time, I noticed a much higher percentage of resumes with a GitHub repository link. Anytime that I see such a link, I go and look at what they have there. That is often really interesting. Then again, you run into things like this:


On the one hand, this is non production code, it is obviously a teaching project, which is awesome. On the other hand, I find such code painful to look at.

In the past, I would rate highly anyone that would show a GitHub account in the CV, since I could expect to see some of their projects there, usually unique ones. This time? I’m seeing a lot of basically homework assignments, and those aren’t really that interesting to review or look at. Especially since a lot of the candidates apparently had the same courses, so I saw the same 5 projects repeated over and over again.

In other words, just a GitHub account with some repositories are no longer that interesting or meaningful.

Another thing that I noticed was that a lot of those candidates had profiles with profile pictures like:


A small tip, if you expect people to visit your profile (and I assume you do, since you provided the link in the resume), it is worth it to put a (professional) picture of yourself there. The profiler readme on GitHub is also surprising attractive when looking at a candidate.

Another tip, if you see a position for a C# Junior Developer, it is acceptable to apply if you don’t have all the requirements, or if you exceed them. But if you are trying to find a new job as a lawyer specializing in family law, maybe don’t try to apply to a tech company.

And yes, I’m using this post as a want to vent while going over so many CVs.

Most CVs are dry, but one candidate just got bumped to the next stage based solely on the fact that in they had a “Making awesome pancakes” in the CV, which made me laugh.

time to read 5 min | 801 words

We got an interesting modeling question from a customer: “What is the optimal manner to represent time sensitive information in RavenDB?”

The initial thought was that they would use revisions and they asked about querying those. The issue is that this isn’t really the purpose for revisions, they are great if you want to see what the state of the system at a particular point in time, but not so good if your business logic has meaning over time.

The best scenario for temporal data that I’m aware of is payroll. You have a lot of data that make sense only in the context of the time if was relevant for. For example, consider an employee that is hired at a given salary level, then given raises over time. The data in this case is divided into several layers.

I’m going to use paper documents as the model here, because it makes it much easier to consider the modeling implications than when talking about JSON or class structures.

On the most basic model, we have the Payslip document, which represent the amount (work, deductions, taxes, etc) that was paid to an employee at a particular point in time. This is similar to this:


Once created, such a document will not change. It represent an action that happened in the past and is immutable. From this you can figure out taxes, overall payments, etc.

The Payslip is computed from the Timesheet document, which is similar to this one:


A Timehseet document is updated during the Payroll Period whenever an employee signs in or out. At the end of the Payroll Period, a manager will sign off on the Timesheet and approve it for payment. At this point, all the relevant business rules are run and the final Payslip is generated for each employee. Once the Timesheet is signed off and paid, it is no longer mutable and will not change. This make sense, since it represent something that already happened.

In some cases, you’ll have new information, such as an employee that worked, but didn’t report their hours. They will need to do so in a new Timesheet and a new Payslip will be generated.

Using the real world analogy, the Timesheet document is stored at the head office, and you cannot go and update that once it was submitted.

So far, we haven’t seen anything related to things that change over time. In fact, the fact that we have separate documents for Payslips and Timesheets means that we can ignore a lot of the complexity that you’ll usually have to deal with in temporal databases.

We can’t completely get away from it, however. We need to consider the employee’s Contract, however. Usually when we think about the employment contract we think about something like this:


The contract specify details such as the hourly rate, overtime payment, vacation time, etc. In payroll systems, contracts are actually more complex than that, because you have to take into account that they change.

For example, consider the following scenario:

  • 1996 – Hired as mailroom clerk – 4.75$ / hour
  • 1998 – Promoted to junior clerk – 5.25$ / hour
  • 1999 – Promoted to clerk – 5.40$ / hour
  • 2002 – Promoted to senior clerk – 6.20$ / hour

How do you handle something like that, in terms of modeling?

The answer depends quite heavily on how your business logic handles this. One way to handle this is to create revisions. Using the real world logic, we are talking about signing a new contract and expiring the old one. But in reality, that isn’t how things are done. You’ll usually just update the payment terms.

How does this looks like in terms of the data modeling when using RavenDB,however? Well, there are two options. We can represent the data as simple values, like so:


When the data changes, we update those values (which generates revisions for the old data). However, that isn’t usually ideal, because business logic usually want to access the past values. For example, your contract may change mid payroll period, so your hourly rate is different depending if the hours worked on Monday or Thursday.

In this case, you’ll want to represent the values changing in the model directly, like so:


In most cases, this is the best option for modeling data where the temporal aspects of the data needs to be directly exposed to the business logic.

time to read 5 min | 925 words

Unless I get good feedback / questions on the other posts in the series, this is likely to be the last post on the topic. I was trying to show what kind of system and constraints you have to deal with if you wanted to build a social media platform without breaking the bank.

I talked about the expected numbers that we have for the system, and then set out to explain each part of it independently. Along the way, I was pretty careful not to mention any one particular technological solution. We are going to need:

  • Caching
  • Object storage (S3 compatible API)
  • Content Delivery Network
  • Key/value store
  • Queuing and worker infrastructure

Note that the whole thing is generic and there are very little constraints on the architecture. That is by design, because if your architecture can hit the lowest common denominator, you have a lot more freedom. Instead of tying yourself to a particular provider, you have a lot more freedom. For that matter, you can likely set things up so you can have multiple disparate providers without too much of a hassle.

My goal with this system was to be able to accept 2,500 posts per second and to handle reads of 250,000 per second. This sounds like a lot, but a most of the load is meant to be handled by CDN and the infrastructure, not the core servers. Caching in a social network is somewhat problematic, since you’ll have a lot of the work is obviously personalized. That said, there is still quite a lot that can be cached, especially the more popular posts and threads.

If we’ll assume that only about 10% of the reading load hits our servers, that is 25,000 reads per second. If we have just 25 servers for handling this (assuming five each in five separate data centers) we can accept the load at 1,000 requests per second. On the one hand, that is a lot, but on the other hand…. most of the cost is supposed to be about authorization, minor logic, etc. We can also at this point add more application servers and scale linearly.

Just to give some indication of costs, a dedicated server with 8 cores & 32 GB disk will cost 100$ a month, and there is no charge for traffic. Assuming that I’m running 25 of these, that will cost me 2,500 USD a month. I can safely double or triple that amount without much trouble, I think.

Having to deal with 1,000 requests per server is something that requires paying attention to what you are doing, but it isn’t really that hard, to be frank. RavenDB can handle more than a million queries a second, for example.

One thing that I didn’t touch on, however, which is quite important, is the notion of whales. In this case, a whale is a user that has a lot of followers. Let’s take Mr. Beat as an example, he has 15 million followers and is a prolific poster. In our current implementation, we’ll need to add to the timeline of all his followers every time that he posts something. Mrs. Bold, on the other hand, has 12 million followers. At one time Mr. Beat and Mrs. Bold got into a post fight. This looks like this:

  1. Mr. Beat: I think that Mrs. Bold has a Broccoli’s bandana.
  2. Mrs. Bold: @mrBeat How dare you, you sniveling knave
  3. Mr. Beat: @boldMr2 I dare, you green teeth monster
  4. Mrs. Bold: @mrBeat You are a yellow belly deer
  5. Mr. Beat: @boldMr2 Your momma is a dear

This incredibly witty post exchange happened during a three minute span. Let’s consider what this will do, given the architecture that we outlined so far:

  • Post #1 – written to 15 million timelines.
  • Post #2 - 5 – written to the timelines of everyone that follows both of them (mention), let’s call that 10 million.

That is 55 million timeline writes to process within the span of a few minutes. If other whales also join in (and they might) the number of writes we’ll have to process will sky rocket.

Instead, we are going to take advantage of the fact that only a small number of accounts are actually followed by many people. We’ll place the limit at 10,000 followers. At which point, we’ll no longer process writes for such accounts. Instead, we’ll place the burden at the client’s side. The code for showing the timeline will then become something like this:

In other words, we record the high profile users in the system and instead of doing the work for them on write, we’ll do that on read. The benefit of doing it in this manner is that the high profile users tiimeline reads will have very high cache utilization.

Given that the number of high profile people you’ll follow are naturally limited, that can save quite a lot of work.

The code above can be improved, of course, there are usually a lot of difference in the timeline posts, so we may have a high profile user that is off for a day or two, they shouldn’t show up in the current timeline and can be removed entirely. You need to do a bit more work around the time frames as well, which means that timeline should also allow us to query itself by most recent post id, but that is also not too hard to implement.

And with that, we are at the end. I think that I covered quite a few edge cases and interesting details, and hopefully that was interesting for you to read.

As usual, I really appreciate any and all feedback.


No future posts left, oh my!


  1. Building a phone book (3):
    02 Apr 2021 - Part III
  2. Building a social media platform without going bankrupt (10):
    05 Feb 2021 - Part X–Optimizing for whales
  3. Webinar recording (12):
    15 Jan 2021 - Filtered Replication in RavenDB
  4. Production postmortem (30):
    07 Jan 2021 - The file system limitation
  5. Open Source & Money (2):
    19 Nov 2020 - Part II
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats