Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 3 min | 407 words

I asked why this code is broken, and now is the time to dig into this. The issue is in this block of code. Take a look at that for a moment, if you please:
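Here is a reconstruction of the shape of that code (the original was shown as an image; the file list and the upload helper are illustrative):

var tasks = new List&lt;Task&gt;();
foreach (var file in files)
{
    var task = Task.Factory.StartNew(async () =>
    {
        await using var stream = File.OpenRead(file);
        await UploadFileAsync(stream); // hypothetical upload helper
    });
    tasks.Add(task);
}
await Task.WhenAll(tasks);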

The code is trying to gather the async uploads of all the files, and then it will await them. This code compiles and runs successfully, but it will not do what you expect it to do. Let’s break it up a bit to understand what is going on:

We are passing the variable task to the list of tasks we have. We just extract a variable, nothing much going on here. Let’s explore further: what is the type of task? We know it must be a subtype of Task, since that is what the tasks collection accepts. It turns out that this isn’t that simple:
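A minimal repro makes the type visible:

Task&lt;Task&gt; task = Task.Factory.StartNew(async () => { });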

What is that Task&lt;Task&gt; thingie? Well, let’s look at the signature of Task.Factory.StartNew(), shall we?

public Task<TResult> StartNew<TResult>(Func<TResult> function);

Just from the signature, we can tell what is going on. StartNew() accepts a function that returns a value and will then return a task for the eventual result of this function. However, the function we are actually passing to the StartNew() doesn’t produce a value. Except that it does…

Let’s explore that thing for a bit:

var func = async () => { };

What is the type of func in this case?

Func<Task> func = async() => {};

The idea is that when the compiler sees the async keyword, it transforms the function into one that returns a Task. Combining those two features means that our original code merely registers the start of the asynchronous process and returns as soon as it has started. Basically, we’ll only wait for the actual opening of the file, not for the network work that has to happen here.

The right way to express what we want here is:

Task.Run(async () => {});

The signature for this is:

public static Task Run(Func&lt;Task&gt; function);

You can see here that we pass a function that returns a task, but Task.Run() doesn’t wrap the result in another Task instance. The task that is returned will only complete once the full work has been done.
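Putting it all together, a sketch of the corrected gathering code (same illustrative helper as above):

var tasks = new List&lt;Task&gt;();
foreach (var file in files)
{
    tasks.Add(Task.Run(async () =>
    {
        await using var stream = File.OpenRead(file);
        await UploadFileAsync(stream); // hypothetical upload helper
    }));
}
await Task.WhenAll(tasks); // now completes only after all the uploads have finished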

It is an interesting pitfall in the API, and it can be quite hard to figure out exactly what is going on, mostly because there are several different things happening all at once.

time to read 3 min | 543 words

We ran into a strange situation deep in the guts of RavenDB. A cluster command (the backbone of how RavenDB coordinates action in a distributed cluster) failed because of an allocation failure. That is something that we are ready for, since RavenDB is a robust system that handles such memory allocation failures. The problem was that this was a persistent allocation failure. Looking at the actual error explained what was going on. We allocate memory in units that are powers of two, and we had an allocation request that would overflow a 32-bit integer.

Let me reiterate that: we have a single cluster command that would need more memory than a 32-bit integer can express. A cluster command isn’t strictly limited, but a 1MB cluster command is huge, as far as we are concerned. Seeing something that exceeds the GB mark was horrifying. The actual issue was somewhere completely different: there was a bug that caused quadratic growth in the size of a database record. This post isn’t about that problem, it is about the fix.

We believe in defense in depth for such issues. So aside from fixing the actual cause of this problem, the question was how we can prevent similar issues in the future. We decided to place a reasonable size limit on cluster commands, and we chose 128MB as the limit (far higher than any expected value, mind). We chose that value because it is big enough to be outside anyone's actual usage, yet small enough that we can increase it if we ever need to. That means this needs to be a configuration value, so the user can modify it in place if needed. The idea is that we’ll stop the generation of a command of this size before it hits the actual cluster and poisons it.

Which brings me to this piece of code, which was the reason for this blog post:
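Reconstructed as a sketch (the original was shown as an image; names are illustrative):

throw new InvalidOperationException(
    $"Attempted to create a cluster command of size {commandSize:N0} bytes, which is too large");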

This is where we actually throw the error when we find a command that is too big (the check is done by the caller, and isn’t important here).

Looking at the code, it does what is needed, but it is missing a couple of really important features:

  • We mention the size of the command, but not the actual size limit.
  • We don’t mention that this isn’t a hard-coded limit.

The fix here would be to include both those details in the message. The idea is that the user will not only be informed about what the problem is, but also be made aware of how they can fix it themselves. No need to contact support (and if support is called, we can tell right away what is going on).
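Along these lines (a sketch; the configuration key name here is made up for illustration):

throw new InvalidOperationException(
    $"Attempted to create a cluster command of size {commandSize:N0} bytes, " +
    $"which exceeds the configured limit of {maxSizeInBytes:N0} bytes. " +
    "You can change the limit using the 'Cluster.MaxCommandSizeInMb' configuration option.");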

This idea, the notion that we should be quite explicit not only about what the problem is but also about how to fix it, is very important to the overall design of RavenDB. It allows us to produce software that is self-supporting: instead of ErrorCode: 413, you get the full details as well as how you can fix it.

Admittedly, I fully expect to never ever hear about this issue again in my lifetime. But in case I’m wrong, we’ll be in a much better position to respond to it.

time to read 3 min | 502 words

I got a great question about using RavenDB from Serverless applications:

DocumentStore should be created as a singleton. For Serverless applications, there are no long-running instances that would be able to satisfy this condition. DocumentStore is listed as being "heavy weight", which would likely cause issues every time a new instance is created.

RavenDB’s documentation explicitly calls out that the DocumentStore should be a singleton in your application:

We recommend that your Document Store implement the Singleton Pattern as demonstrated in the example code below. Creating more than one Document Store may be resource intensive, and one instance is sufficient for most use cases.

But the point of Serverless systems is that there is no such thing as a long-running instance. As the question points out, that is likely going to cause problems, no? On the one hand we have RavenDB’s DocumentStore, which is optimized for long-running processes, and on the other hand we have Serverless systems, which focus on minimal invocations. Is this really a problem?

The answer is that there is no real contradiction between those two desires, because while the Serverless model is all about a single function invocation, the actual manner in which it runs means that there exists a backing process that is reused between invocations. Taking AWS Lambda as an example, you can define a function that will be invoked for SQS (Simple Queuing Service), the signature for the function will look something like this:

async Task HandleSQSEvent(SQSEvent sqsEvent, ILambdaContext context);

The Serverless infrastructure will invoke this function for messages arriving on the SQS queue. Depending on its settings, the load and defined policies, the Serverless infrastructure may invoke many parallel instances of this function.

What is important about Serverless infrastructure is that a single function instance will be reused to process multiple events. It is the Serverless infrastructure's responsibility to decide how many instances it will spawn, but it will usually not spawn a separate instance per message in the queue. It will let an instance handle the messages and spawn more as they are needed to handle the ingest load. I’m using SQS as an example here, but the same applies for handling HTTP requests, S3 events, etc.

Note that this is relevant for AWS Lambda, Azure Functions, GCP Cloud Functions, etc. A single instance is reused across multiple invocations. This ensures far more efficient processing (you avoid startup costs) and can make use of caching patterns and common optimizations.

When it comes to RavenDB usage, the same thing applies. We need to make sure that we aren’t creating a separate DocumentStore for each invocation, but rather one per instance. Here is a simplified example of how you can do this:
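Here is a sketch of what that might look like for an AWS Lambda handler in C# (the URL, database name and processing logic are placeholders):

using System;
using System.Threading.Tasks;
using Amazon.Lambda.Core;
using Amazon.Lambda.SQSEvents;
using Raven.Client.Documents;

public class OrdersFunction
{
    // Created once per Lambda instance (at cold start) and reused across invocations
    private static readonly Lazy&lt;IDocumentStore&gt; Store = new(() =>
    {
        var store = new DocumentStore
        {
            Urls = new[] { "https://orders.example.ravendb.cloud" },
            Database = "Orders"
        };
        return store.Initialize();
    });

    public async Task HandleSQSEvent(SQSEvent sqsEvent, ILambdaContext context)
    {
        using var session = Store.Value.OpenAsyncSession();
        // ... process the messages in sqsEvent ...
        await session.SaveChangesAsync();
    }
}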

We define the DocumentStore when we initialize the instance, then we reuse the same DocumentStore for each invocation of the lambda (the handler code).

We can now satisfy both RavenDB’s desire to use a singleton DocumentStore for best performance and the Serverless programming model that abstracts how we actually run the code, without really needing to think about it.

time to read 6 min | 1134 words

When you have a distributed system, one of the key issues that you have to deal with is the notion of data ownership. The problem is that it can be a pretty hard issue to explain properly, given the required background material. I recently came up with an explanation for the issue that I think is both clear and shows the problem in a way that doesn’t require a lot of prerequisites.

First, let’s consider two types of distributed systems:

  • A distributed system that is strongly consistent – such a system requires coordination between at least a majority of the nodes in the system to do anything meaningful. Such systems are inherently limited in their ability to scale out, since the number of nodes that you need for a consensus will become unrealistic quite quickly.
  • A distributed system that is eventually consistent – such a system allows individual components to perform operations on their own, which will be replicated to the other nodes in due time. Such systems are easier to scale, but there is no true global state here.

A strongly consistent system with ten nodes requires each operation to reach at least 6 members before it can proceed. With 100 nodes, you’ll need 51 members to act, etc. There is a limit to how many nodes you can add to such a system before it becomes untenable. The advantage here, of course, is that you have a globally consistent state. An eventually consistent system has no such limit and can grow without bound. The downside is that it is possible for different parts of the system to make decisions that each make sense individually, but not when taken together. The classic example is the notion of a unique username: a new username that is added in two distinct portions of the system can be stored by each, but we’ll later find out that we have a duplicate.

A strongly consistent system will prevent that, of course, but it has its limitations. A common way to handle that is to split the strongly consistent system in some manner. For example, we may have 100 servers, but we split them into 20 groups of 5 servers each. Now each username belongs to one of those groups. We can now have our cake and eat it too: we have 100 servers in the system, but we can make strongly consistent operations with a majority of 3 nodes out of 5 for each username. That is great, unless you need to do a strongly consistent operation on two usernames that belong to different groups.

I mentioned that distributed systems can be tough, right? And that you may need some background to understand how to solve that.

Instead of trying to treat all the data in the same manner, we can define data ownership rules. Let’s consider a real world example: we have a company with three branches, in London, New York City and Brisbane. The company needs to issue invoices to customers, and it has a requirement that the invoice numbers be consecutive. I used World Clock Planner to pull the intersection of availability of those offices, which you can see below:

(Image: the overlapping business hours of the London, New York and Brisbane offices.)

Given the requirement for consecutive numbers, what do we know?

Each time we need to generate a new invoice number, an office will need to coordinate with at least one other office (a 2 out of 3 majority). For London, that is easy; there are swaths of time where London and New York business hours overlap.

For Brisbane, not so much. Maybe if someone is staying late in the New York office, but Brisbane will not be able to issue any invoices on Friday past 11 AM. I think you’ll agree that being able to issue an invoice on Friday at noon is not an unreasonable requirement.

The problem here is that we are dealing with a requirement that we cannot fulfill. We cannot issue globally consecutive numbers for invoices with this sort of setup.

I’m using business hours for availability here, but the exact same problem occurs if we are using servers located around the world. If we have to have a consensus, then the cost of getting it will escalate significantly as the system becomes more distributed.

What can we do, then? We can change the requirement. There are two ways to do so. The first is to assign a range of numbers to each office, which they are able to allocate without needing to coordinate with anyone else. The second is to declare that the invoice numbers are local to their office and use the following scheme:

  • LDN-2921
  • NYC-1023
  • BNE-3483

This is making the notion of data ownership explicit. Each office owns its set of invoice numbers and can generate them completely independently. Branch offices may get an invoice from another office, but it is clear that it is not something that they can generate.
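For illustration, here is a minimal sketch of generating office-local invoice numbers (the office codes follow the example above; real persistence is omitted):

using System.Threading;

public class InvoiceNumberGenerator
{
    private readonly string _officeCode; // "LDN", "NYC" or "BNE"
    private long _lastNumber;            // would be durably stored at the office in practice

    public InvoiceNumberGenerator(string officeCode) => _officeCode = officeCode;

    // Generating a number requires no coordination with any other office
    public string Next() => $"{_officeCode}-{Interlocked.Increment(ref _lastNumber)}";
}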

In a distributed system, defining the data ownership rules can drastically simplify the overall architecture and complexity that you have to deal with.

As a simple example, assume that I need a particular shirt from a store. The branch that I’m currently at doesn’t have it in stock. They are able to look up inventory in other stores and direct me to them. However, they aren’t able to reserve that shirt for me.

The ownership of the shirt lies with another branch, so changing the data in the local database (even if it is quickly reflected in the other store) isn’t sufficient. Consider the following sequence of events:

  1. Branch A is “reserving” the shirt for me on Branch B’s inventory
  2. At Branch B, the shirt is being sold at the same time

What do you think will be the outcome of that? And how much time and headache do you think you’ll need to resolve this sort of distributed race condition?

On the other hand, a phone call to the other store and a request to hold the shirt until I arrive is a perfect solution to the issue, isn’t it? If we are talking in terms of data ownership, we aren’t trying to modify the other store’s inventory directly. Instead, we are calling them and asking them to hold the shirt. The data ownership is respected (and if I can’t get hold of them, it is clear that there was no reservation).

Note that in the real world it is often easier to just ignore such race conditions since they are rare and “sorry” is usually good enough, but if we are talking about building a distributed system architecture, race conditions are something that happens yesterday, today and tomorrow, but not necessarily in that order.

Dealing with them properly can be a huge hassle or a negligible cost, depending on how you set up your system. I find that proper data ownership rules can be a huge help here.

time to read 2 min | 367 words

Yesterday I gave a webinar about Database Security in a Hostile World and I got a really interesting question at the end:

If your RavenDB and app are on the same server, would you recommend using certificates?

The premise of the talk was that you should always run in a secured mode, even if you are running on a supposedly secured network. That is still very much relevant if you are running on a single server.

Consider the following deployment model:

(Image: the application and RavenDB deployed side by side on the same machine. Icons from https://www.flaticon.com/)

As you can see, both the application and the server are running on the same machine. We can usually assume that there is no possibility for an attacker not running on the same machine to eavesdrop on the communication. Can we skip encryption and authentication in this case?

The answer is no, even for this scenario. That may be a viable model if you are running on your development machine, with nothing really valuable in terms of data, but it isn’t a good model everywhere else.

Why is that? The answer is quite simple; there are two issues that you have to deal with:

  • At some future time, the firewall rules will be relaxed (by an admin debugging something, not realizing what they are doing) and the “local” server may become publicly exposed.
  • Even if you are listening only on 127.0.0.1, without authentication you are exposed to anything local that can be tricked into contacting you. That is a real attack.

In short, for proper security, assume that even if you are running on the local network, with no outside access, you are still in a hostile environment. The key reason is that things change. In two years, as the system grows, you’ll want to split the database & application onto separate servers. How certain are you that the person doing this split (assume that you are no longer involved) will do the Right Thing, rather than make the minimum configuration changes needed to get it to “work”?

Given that the whole design of RavenDB’s security was to make it easy to do the right thing, we should apply it globally.
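For example, even a same-machine deployment can use a client certificate with the RavenDB .NET client; a minimal sketch (the URL, path and secret handling are placeholders):

using System;
using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

var store = new DocumentStore
{
    Urls = new[] { "https://localhost:443" },
    Database = "Orders",
    // The client certificate authenticates the application to the server,
    // even though both run on the same machine
    Certificate = new X509Certificate2("client-cert.pfx",
        Environment.GetEnvironmentVariable("CERT_PASSWORD"))
};
store.Initialize();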

time to read 1 min | 175 words

During code review, I ran into the following code (simplified):
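A reconstruction of the shape of that code (the original was shown as an image; names are illustrative):

public class EnumerableWrapper&lt;T&gt;
{
    private readonly IEnumerable&lt;T&gt; _inner;

    public EnumerableWrapper(IEnumerable&lt;T&gt; inner) => _inner = inner;

    // Every access delegates to LINQ's ElementAt()
    public T this[int index] => _inner.ElementAt(index);
}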

We have a wrapper object for IEnumerable that allows access to it by index.

I don’t like this code, and suggested this version, instead:
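Something along these lines (again a sketch; the point is the materialization):

private List&lt;T&gt; _materialized;

public T this[int index]
{
    get
    {
        // Materialize the enumerable once; every subsequent lookup is O(1)
        _materialized ??= _inner.ToList();
        return _materialized[index];
    }
}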

The problem is that if you look at the code of ElementAt(), it is already doing that, so why the duplicate work? It is specialized to make things fast for similar scenarios, so why do I write the same code again?

Because I’m familiar with the usage scenario. A really common usage pattern for the wrapper object is something like this:
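Assuming the wrapper exposes a count, the pattern is essentially:

for (int i = 0; i &lt; wrapper.Count; i++)
{
    Process(wrapper[i]); // with ElementAt(), each access scans the enumerable from the start
}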

The difference between what ElementAt() does and what I do in the second version of the code is that my version will materialize the values. If you are calling that in a for loop, you’ll pay the cost of iterating over all the items only once.

On the other hand, since the instance we pass to ElementAt() isn’t one that has an indexer, we’ll have to scan through the enumerator multiple times. A for loop with this implementation is a quadratic accident waiting to happen.

time to read 5 min | 828 words

My first project as a professional software developer was to build a scheduling system for a chain of dental clinics. That was a huge project (multiple years) and it was really quite interesting. Looking back, I did a lot of really cool technical things there. I also learned quite a lot from that project, the danger of complexity being one of the chief lessons.

Consider a dental clinic, where we have the following schedule for a dentist:

  • Monday – 10:00 – 15:00
  • Wednesday – 09:00 – 13:00
  • Thursday – 11:30 – 16:30

The task is to be able to schedule an appointment for the dentist given those rules.

In addition to the schedule of the dentist, we also have actual Appointments, which look like this:
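For example, a minimal model of an appointment might be (field names are illustrative):

public class Appointment
{
    public string DentistId;
    public string PatientId;
    public DateTime Start;
    public TimeSpan Duration;
}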

Assume that you have quite a few of those, and you want to schedule a new appointment for a patient. How would you do that? I’m a database guy, so let’s see if I can specify the task as a query.

We need a dentist that has availability of a particular length (different tasks have different schedules) and particular qualifications. However, there is no such thing as availability in our model. We have just:

  • Dentist
  • Schedule
  • Appointment

The complexity here is that we need to search for something that isn’t there.

I actually found some of my posts on this topic, from 2006. This isn’t a simple problem, and the solution is usually to generate the missing data and query on that. My old posts on the topic actually generate an in-memory table and operate on that, which is great for small datasets, but will fail in interesting ways for real world datasets.

For what it’s worth, RavenDB allows you to generate the missing data during the indexing process, so at least the queries are fast, but the indexing process is now compute-intensive and a change in the dentist schedule can result in a lot of work.

All of that is because of two issues:

  • We are trying to query for the data that isn’t there.
  • The information is never used as queried.

These two points are strongly related to one another. Consider how you would schedule a dentist appointment. You first need to find the rough time frame that you need (“come back in six months”) and then you need to match it to your schedule (“I can’t on Monday, I got the kids”, etc).

There is a better way to handle that, by filling in the missing pieces. Instead of trying to compute the schedule of a dentist from the specification that we have, go the other way around. Generate the schedule based on the template you have. The result should be something like this:
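Something like the following shape, one document per dentist per day (a sketch; field names are illustrative and match the query below):

{
    "Dentist": "dentists/1-A",
    "At": "2022-05-09",
    "MaximumDuration": "02:30",
    "Appointments": [
        { "Patient": "patients/8-B", "Start": "10:00", "Duration": "00:30" },
        { "Patient": "patients/3-C", "Start": "12:30", "Duration": "01:00" }
    ]
}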

In other words, based on the schedule provided, we’ll generate an entry per day for the dentist. That entry will contain the appointments for the day as well as the maximum duration for an available appointment. That means that on query time, we can do something as simple as:

from Schedules
where Dentist = $dentistId
  and At between $start and $end
  and MaximumDuration >= $reqDuration

And that gives us the relevant times at which we can schedule the appointment. This is cheap to do, easy to work with, and it actually matches the common ways that users will use the system.

This has a bunch of other advantages that are not immediately apparent but end up being quite important. Working with time sucks. The schedule above is a nice guideline, but it isn’t a really useful one when you need to run actual computations. Why is that? Well, it doesn’t account for vacation days. If there is a public holiday on Wednesday, the dentist isn’t working, but that is an implied assumption in the schedule.

For that matter, you now need to figure out which calendar to use. A Christian and a Jewish dentist are going to have different holiday calendars. Trying to push that into a query is going to be quite annoying, if not outright impossible. Putting that on the generator simplifies things, because you can “unroll” the schedule, apply the holiday calendar you want, and then not think about it.

Other factors, such as vacation days, reserved time for emergencies and similar issues, are a lot easier to manage in a concrete form. Another important aspect is that the schedule changes; for any decent size clinic, it changes all the time. You may have the dentist requesting to close a couple of hours early on Monday because of a dance recital and to add more hours on Thursday. If the schedule is generated, this is a simple matter of manual adjustment. If we have just the schedule template, on the other hand… that becomes a lot more complex.

In short, the best way to handle this is to take the template schedule, generate it to a concrete schedule and operate from that point on.

time to read 3 min | 447 words

I mentioned in a previous post that nonce reuse is a serious problem. It is enough to reuse a nonce even once and you can have catastrophic results on your hands. The problem occurs, interestingly enough, when we are able to capture two separate messages generated with the same key & nonce. If we capture the same message twice, that is not an issue (XORing the two ciphertexts will produce all zeroes, which tells us nothing new).

The question is whether there is something that can be done about this. The answer is yes: we can create a construction that is safe from nonce reuse. This is called SIV mode (Synthetic IV).

The way to do that is to make the nonce itself depend on the value that we are encrypting. Take a look at the following code:
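The original code isn’t reproduced here, so here is a sketch of the idea in C# (an illustration of the technique, not a vetted construction; for production, use an audited mode such as AES-GCM-SIV):

using System.Security.Cryptography;

static (byte[] Nonce, byte[] Ciphertext, byte[] Tag) EncryptSiv(
    byte[] key, byte[] randomNonce, byte[] plaintext)
{
    // Derive the nonce we actually use from the plain text, keyed with the caller's nonce
    using var hmac = new HMACSHA256(randomNonce);
    byte[] syntheticNonce = hmac.ComputeHash(plaintext)[..12]; // AES-GCM uses a 96-bit nonce

    byte[] ciphertext = new byte[plaintext.Length];
    byte[] tag = new byte[16];
    using var aes = new AesGcm(key, 16);
    aes.Encrypt(syntheticNonce, plaintext, ciphertext, tag);

    // The synthetic nonce travels with the message like a regular nonce,
    // which is why nothing changes on the decryption side
    return (syntheticNonce, ciphertext, tag);
}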

The idea is that we get a nonce, as usual, but instead of relying on the nonce, we’ll compute a hash of the plain text using the nonce as a key. Aside from that, the rest of the system behaves as usual. In particular, there are no changes to the decryption.

Let’s see what the results are, shall we?

With a separate nonce per use:

attack at dawn! 838E1CE1A64D97E114237DE161A544DA5030FC5ECAB1C20D34AF838634D1C591AE208FC0AEE706690669E9F56F45C1
attack at dusk! EEA7DE8A51A06FE6CA9374CDDEC1053249F8B1F0BF1995A0EEEE7D6EBF68868ECAE7CBEFD6EE23017480ACD494D634

Now, what happens when we use the same nonce?

attack at dusk! 0442EFA977919327C92B47C7F6A0CD617AE4FD3138DF07D45994EBC2C4B709ACDE1130422924B7206354D03569FDAA
attack at dawn! 324A996C22F7FFDE62596C0E9EE37D7EE1F89569A10A1188BA4A03EE7B8C47DF347A20D1B73EB4523D3511F2F46FF2

As you can see, the values are completely different, even though we used the same key and nonce and the plain text is mostly the same.

Because we are generating the actual nonce we use from the hash of the input, reusing the nonce with different data will result in a wildly different nonce being actually used. We are safe from the catastrophe that is nonce reuse.

With SIV mode, we are paying with an additional hashing pass over the plain text data. In return, we have the following guarantees:

  • If the nonce isn’t reused, we have nothing to worry about.
  • If the nonce is reused, an attacker will not be able to learn anything about the content of the message.
  • However, an attacker may be able to detect that the same message is being sent, if the nonce is reused.

Given the cost of nonce reuse without SIV, it is gratifying to see that the worst case scenario for SIV with nonce reuse is that an adversary can detect duplicate messages.

I’m not sure how up to date this is, but this report shows that SIV adds about 1.5 – 2.2 cycles per byte to the overall cost of encryption. Note that this is for actual viable implementations, instead of what I’m doing here, which is, by design, not a good idea.

time to read 2 min | 334 words

In the previous post, I showed some code that compared two MAC values (binary buffers) and I mentioned that the manner in which I did that was bad.

Here is the code in question:
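It isn’t reproduced here; as a sketch of its shape in C# (names illustrative, with Eql() defined below):

if (!Eql(expectedMac, actualMac))
    throw new CryptographicException("Invalid MAC");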

When you are looking at code that is used in a cryptographic context, you should be aware that any call that compares buffers (or strings) must not short circuit. What do I mean by that? Let’s look at the implementation of those two functions:
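Here are illustrative C# equivalents of the two comparison strategies (sketches of the behavior described, not the original implementations; in .NET, CryptographicOperations.FixedTimeEquals() plays the role of the second one):

static bool Eql(byte[] a, byte[] b)
{
    if (a.Length != b.Length)
        return false;
    for (int i = 0; i &lt; a.Length; i++)
    {
        if (a[i] != b[i])
            return false; // short circuits at the first mismatch
    }
    return true;
}

static bool TimingSafeEql(byte[] a, byte[] b)
{
    if (a.Length != b.Length)
        return false;
    int acc = 0;
    for (int i = 0; i &lt; a.Length; i++)
        acc |= a[i] ^ b[i]; // always scans both buffers in full
    return acc == 0;
}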

Those two functions do the same thing, but in a very different manner. The issue with eql() is that it will stop at the first mismatched byte, while timingSafeEql() will always scan through both buffers and only then return the result.

Why do we need that?

Well, the issue is that if I can time the duration of a call like that (and you can, even over the network), I’ll be able to test various values until I match whatever secret value the code is comparing against. In this case, I don’t believe that the use of eql() is an actual problem. We used it on the output of an HMAC operation vs. the expected value. The caller has no way to control the HMAC computation and already knows what we are comparing against. I can’t think of any scenario where that would be a problem. However, I’m not a cryptographer, and any buffer comparison in crypto related code should use a constant time method.

For that matter, side channels are a huge worry in cryptography. AES, for example, is nearly impossible to implement safely in pure software at this point, because table-based implementations are vulnerable to timing side channels, so hardware support is required. Other side channels include watching caches, power signatures and more. I don’t actually have much to say about this, except that when working with cryptography, even something as simple as multiplication is suspect, because it may not complete in constant time. As a good example of the problem, see this page.
