Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database


time to read 1 min | 85 words

Every now and then I need to do some work with text, and the Enron data set is one of the best-known corpora.

I have ended up writing the parsing code for it so many times that it isn’t even funny. Therefore, I decided to make my life easier and just post it somewhere that I can refer back to.

This code simply unpacks the Enron dataset into a .NET object, from where you can start processing the text in interesting ways.


time to read 3 min | 422 words

In the previous post, I discussed how I can (fairly naively) solve the phone book problem. In this post, I want to take this a bit further. Whenever a developer hears the terms sorted or ordered, I expect them to think about trees. Indeed, trees are the very first thing that I would expect to come to mind.

Assuming that they aren’t versed in persistent data structures, they are likely going to look at in-memory trees and map the idea directly to a file. I decided to take the same approach. For algorithms, I find Python to be the best language for me to grok, mostly because it looks so much like pseudocode. Searching for AVLTree Python got me to this implementation. One issue with AVL trees is their code size. Even in Python, the implementation is about 200 lines of code. Here is the structure of a node, which we’ll need to store on the disk.

image

You can see the full code here. It is about twice as long as the previous implementation (around 300 lines of code). I’m not going to cover it in depth, mostly because this is a plain AVL tree. The only unusual thing here is that I’m writing the nodes to the file instead of holding them in memory.

For example, I have to have some way to find the root node. I do that by writing its position to the end of the file after each write. That means that there are some limits to what I’m doing, but nothing too bad.

I don’t have a way to recover disk space, and updates to the data will use new space, not the old one. That is because we have to take into account that the size of the data may change.
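The full code is not reproduced here, so the following is only a minimal sketch of the idea described above: nodes are appended to the file, and the root position is written as the last bytes of the file after each write. The node layout (`NODE_HEADER`), the 8-byte trailer and all function names are assumptions made for illustration, not the layout the linked implementation uses.

```python
import os
import struct
import tempfile

# Hypothetical node header: key length, value length, left child offset,
# right child offset, height. An offset of -1 means "no child".
NODE_HEADER = struct.Struct("<HHqqB")
# Hypothetical trailer: the offset of the root node, stored in the last 8 bytes.
ROOT_TRAILER = struct.Struct("<q")


def append_node(f, key, value, left=-1, right=-1, height=1):
    """Append a node at the end of the file and return its offset."""
    f.seek(0, os.SEEK_END)
    offset = f.tell()
    f.write(NODE_HEADER.pack(len(key), len(value), left, right, height))
    f.write(key + value)
    return offset


def write_root(f, root_offset):
    """Record the root position as the final bytes of the file.

    Older trailers are left behind as dead space, in line with the
    "no way to recover disk space" limitation.
    """
    f.seek(0, os.SEEK_END)
    f.write(ROOT_TRAILER.pack(root_offset))


def read_root(f):
    """Find the root by reading the trailer at the end of the file."""
    f.seek(-ROOT_TRAILER.size, os.SEEK_END)
    (root_offset,) = ROOT_TRAILER.unpack(f.read(ROOT_TRAILER.size))
    return root_offset


def read_node(f, offset):
    """Read back the node stored at the given offset."""
    f.seek(offset)
    klen, vlen, left, right, height = NODE_HEADER.unpack(f.read(NODE_HEADER.size))
    return f.read(klen), f.read(vlen), left, right, height


def demo():
    with tempfile.TemporaryFile() as f:
        child = append_node(f, b"alice", b"555-0100")
        write_root(f, child)
        # A later write appends a new root that points at the older node.
        root = append_node(f, b"bob", b"555-0101", left=child, height=2)
        write_root(f, root)
        return read_node(f, read_root(f))


key, value, left_child, right_child, height = demo()
```

An update works the same way: the changed node is appended as a new node and the trailer is rewritten, which is why the old space is never reused.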

This implementation is also going to be quite wasteful in terms of the disk seeks, given that it is an AVL Tree with a branching factor of 2. One of the reasons that binary search trees aren’t really used with persistent data structures is that the cost of seeking to another location in the file is enormous. B+Tree solves the problem by having a much higher branching factor.
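To put rough numbers on the branching factor argument, here is a small sketch that estimates how many levels a lookup has to walk, treating each level as roughly one seek in the worst case. The fan-out of 100 is an assumed, illustrative value and not a figure from the post.

```python
def min_levels(record_count, branching_factor):
    """Smallest number of levels whose capacity reaches record_count."""
    levels, capacity = 0, 1
    while capacity < record_count:
        capacity *= branching_factor
        levels += 1
    return levels


records = 10_000_000
binary_levels = min_levels(records, 2)   # best case for a balanced binary tree
wide_levels = min_levels(records, 100)   # hypothetical B+Tree fan-out of 100
print(binary_levels, wide_levels)
```

For 10 million records this is the difference between a couple of dozen levels and a handful, and an AVL tree is allowed to be somewhat deeper than the perfectly balanced case used here.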

A proper B+Tree, however, is probably going to take about 1,500 lines of code to implement, I think. So if you want to read that code, go ahead and peek into Voron.

time to read 5 min | 891 words

A couple of weeks ago I asked you to rate an interview question that we had sent to candidates. The task is to build a persistent phone book, where we care about the amount of I/O traffic used more than anything else.

The scenario we presented to the candidates was:

The rules are that you can’t maintain any state in the class itself and that the code should be prepared to handle a lot of records. Of particular interest is the IterateOrderedByName() call, which allows you to do an ordered iteration (by name) from a given name. That pretty much forces us to store the data in a sorted format.

Note that we don’t explicitly state that in the requirements for the task. We expect the candidates to understand that this is a requirement, given the requirements for the operations. The most naïve option to complete this challenge is to write a CSV file, with each entry on its own line, and you are done. However, that wouldn’t allow us to meet the other requirements.

In other words, reading the whole file to memory, adding the item, sorting the whole thing and writing it back again is a no go. As a note, this is a task that we give to junior developers. We expect zero to very little experience.

After going over dozens of such answers, I can tell you that this task does its primary job, which is to filter people out. That is an explicit goal, by the way. We have had over 200 applicants to this position and the time it takes to go through that many CVs, interviews, etc. is horrendous. I found that this question filters enough people to make it far more manageable. And given the answers I got to my previous post, this is absolutely the kind of task that I would consider suitable for a junior developer. I remember solving similar problems in high school.

I like this problem because it is a good way to see if a person is able to do several things at once:

  • Can take a non-trivial problem and break it into manageable pieces.
  • Can figure out several pitfalls along the way and handle them.
  • Can recognize what the impact of the various requirements is on the solution.
  • Can apply knowledge from one field to another. In this case, in-memory data structures vs. persistent files.
  • Can understand that data and representations are different things.

I have to admit that I was quite surprised by the results. Pretty much no one chose to use a binary format; they all went with a textual format. This makes the problem harder. Also, there were a number of bytes vs. chars errors (that isn’t an issue for a junior position, though).

My first attempt is going to be a bit brute force. Instead of trying to implement any fancy data structures, I’m going to simply write the data out to the file in an append only manner. At the end of the file, however, I’ll keep a sorted array of the positions of the items in the file. Here is what this looks like:

image

To do a search in this format, I can use the sorted positions at the end of the file. Whenever I add a new record, I add it at the end of the records (overwriting the sorted positions section) and then write the new sorted positions at the end. The overhead per write is the size of the sorted array, basically. The bigger the file, the more we’ll spend writing the array. For example, assuming each record is around 64 bytes, when we get to 10 million records, we’ll have a data size of 610 MB, but the metadata will be around 38 MB (and we’ll need to write all of it each time we modify a record).
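The figures above can be checked with a few lines of arithmetic. The 4-byte position size is an inference from the 38 MB figure rather than something the text states, and MB is taken to mean 1024 × 1024 bytes.

```python
RECORD_COUNT = 10_000_000
RECORD_SIZE = 64     # bytes per record, from the text
POSITION_SIZE = 4    # bytes per position; inferred from the 38 MB figure

MIB = 1024 * 1024
data_mb = RECORD_COUNT * RECORD_SIZE / MIB
index_mb = RECORD_COUNT * POSITION_SIZE / MIB
print(f"data: {data_mb:.0f} MB, sorted positions: {index_mb:.0f} MB")
```

With 8-byte positions the array would be twice that size, which is the part that has to be rewritten on every modification.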

The advantage, however, is that the code is simple and there are several ways that we can optimize it without too much trouble. Here is what this looks like in code. There are some tests attached that you can run, and the entire code runs to roughly 150 lines.

That isn’t a great solution, I’ll admit, but it is a straightforward one. It seems like the natural consequence of moving from appending each record to the file to having sorted access. There are some nice things there, nevertheless. You can “batch” several inserts together and close them by calling Flush() once, instead of on each record, for example.
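The linked code is not reproduced here. As a rough illustration of the layout only, here is a minimal Python sketch: records are appended, and the sorted positions array is rewritten at the end of the file on every add. The record-count trailer, the length-prefixed strings and all function names are assumptions for this sketch. It also loads the whole positions array into memory on each call, which is simpler than searching it in place on disk.

```python
import io
import struct

COUNT = struct.Struct("<I")  # trailer: number of records (assumed layout)
POS = struct.Struct("<I")    # one 4-byte file offset per record
LEN = struct.Struct("<H")    # length prefix for each string


def _read_positions(f):
    """Load the sorted positions array stored at the end of the file."""
    size = f.seek(0, io.SEEK_END)
    if size == 0:
        return [], 0
    f.seek(size - COUNT.size)
    (count,) = COUNT.unpack(f.read(COUNT.size))
    index_start = size - COUNT.size - count * POS.size
    f.seek(index_start)
    raw = f.read(count * POS.size)
    return [p for (p,) in POS.iter_unpack(raw)], index_start


def _read_record(f, position):
    """Read the (name, phone) pair stored at the given offset."""
    f.seek(position)
    fields = []
    for _ in range(2):  # name, then phone
        (length,) = LEN.unpack(f.read(LEN.size))
        fields.append(f.read(length).decode("utf-8"))
    return tuple(fields)


def _lower_bound(f, positions, name):
    """Binary search over the positions, reading names from the file."""
    lo, hi = 0, len(positions)
    while lo < hi:
        mid = (lo + hi) // 2
        if _read_record(f, positions[mid])[0] < name:
            lo = mid + 1
        else:
            hi = mid
    return lo


def add(f, name, phone):
    positions, index_start = _read_positions(f)
    slot = _lower_bound(f, positions, name)
    # The new record overwrites the old index, which is rewritten after it.
    positions.insert(slot, index_start)
    f.seek(index_start)
    for text in (name, phone):
        data = text.encode("utf-8")
        f.write(LEN.pack(len(data)) + data)
    f.write(b"".join(POS.pack(p) for p in positions))
    f.write(COUNT.pack(len(positions)))


def iterate_ordered_by_name(f, from_name=""):
    positions, _ = _read_positions(f)
    for position in positions[_lower_bound(f, positions, from_name):]:
        yield _read_record(f, position)


book = io.BytesIO()  # stands in for a file opened in "r+b" mode
for entry in [("Oren", "555-0100"), ("Arava", "555-0101"), ("Phoebe", "555-0102")]:
    add(book, *entry)
results = list(iterate_ordered_by_name(book, "B"))
print(results)
```

Since the new record and the larger index always take more room than the old index did, the file never needs to be truncated. Batching several inserts before rewriting the index would amortize the rewrite cost.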

Given that the biggest issue here is the size of the sorted positions, we can seek to reduce it in several ways:

  • We can compress the data on write. You can see the changes required for that here.
  • We can create multiple sorted positions and merge them occasionally.

For that matter, we can store the sorted positions in another file entirely, which will simplify things like supporting efficient in order inserts.

Things that a junior developer might not do here? I’m using BinaryReader and BinaryWriter. That simplifies things for me, since I can ignore the cost of textual parsing. That said, I’m not using them in any interesting way and the concepts translate 1:1 to a textual interface, with a bit more work.

Can you do better than this implementation?

time to read 7 min | 1398 words

A couple of days ago, a fire started in OVH’s datacenter. You can read more about this here:

They use slightly different terminology, but translating that to the AWS terminology, an entire “region” is down, with SBG1-4 being “availability zones” in the region.

For reference, there are some videos from the location that make me very sad. This is what it looked like under active fire:

https://i.imgur.com/Sbt0IoR.jpg

I’m going to assume that this is a total loss of everything that was in there.

RavenDB Cloud isn’t offering any services in any OVH data centers, but this is a good time to go over the Disaster Recovery Plan for RavenDB and RavenDB Cloud. It is worth noting that the entire data center has been hit, the equivalent of an entire AWS region going down.

I’m not sure that this is a fair comparison. It doesn’t look like SBG 1-4 are exactly the same thing as AWS availability zones, but it is close enough to draw parallels.

So far, at least, there have been no cases where Amazon has lost an entire region. There were occasions where a whole availability zone was down, but never a complete region. The way Amazon is handling availability zones seems to be the most paranoid, with each availability zone distanced “many kilometers” from the others in the same region. Contrast that with the four SBG datacenters that all went down. For Azure, on the other hand, they explicitly call out the fact that availability zones may not provide sufficient cover for DR scenarios. Google Cloud Platform also provides no information on the matter. For that matter, we also have direct criticism on the topic from AWS.

Yesterday, on the other hand, Oracle Cloud had a DNS configuration error that effectively took the entire cloud down. The good news is that this is just an inability to access the cloud, not the actual loss of a region, as was the case with OVH. However, when doing Disaster Recovery Planning, having the entire cloud drop off the face of the earth is also something that you have to consider.

With that background out of the way, let’s consider the impact of losing two availability zones inside AWS, losing an entire region in Azure or GCP or even losing an entire cloud platform. What would be the impact on a RavenDB cluster running in that scenario?

RavenDB is designed to be resilient. Using RavenDB Cloud, we make sure that each of the nodes in the cluster is running in a separate availability zone. If we lose two zones in a region, there is still a single RavenDB instance that can operate. Note that in this case, we won’t have a quorum. That means that certain operations won’t be available (you won’t be able to create new databases, for example) but read and write operations will work normally and your application will fail over silently to the remaining server. When the other servers recover, RavenDB will update them with all the data that was modified while they were down.
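The quorum remark can be made concrete with a tiny sketch. It assumes a simple majority rule for cluster-wide decisions, and `has_quorum` is a hypothetical helper, not a RavenDB API.

```python
def has_quorum(cluster_size, nodes_up):
    """True when a strict majority of the cluster is reachable (assumed rule)."""
    return nodes_up >= cluster_size // 2 + 1


one_zone_lost = has_quorum(3, 2)   # two of three nodes still up
two_zones_lost = has_quorum(3, 1)  # a single surviving node
print(one_zone_lost, two_zones_lost)
```

With three nodes, losing one zone keeps a majority, while losing two leaves a single node that can still serve reads and writes but cannot make cluster-wide decisions.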

The situation with OVH is actually worse than that. In this case, a datacenter is gone. In other words, these nodes aren’t coming back. RavenDB will allow you to perform emergency operations to remove the missing nodes and rebuild the cluster from the single surviving node.

What about the scenario where the entire region is gone? In this case, if there are no more servers for RavenDB to run on, it is going to be down. That is the time to enact the Disaster Recovery Plan. In this case, it means deploying a new RavenDB Cluster to a new region and restoring from backup.

RavenDB Cloud ensures full daily backups as well as hourly incremental backups for all databases, so the amount of data loss will be minimal. That said, where are the backups located?

By default, RavenDB stores the backups in S3, in the same region as the cluster itself. Amazon S3 has the best durability guarantees in the market. This is beyond just the number of nines that they provide in terms of data durability. A standard S3 object is going to be residing in three separate availability zones. As mentioned, for AWS, we have guarantees about distance between those availability zones that we haven’t seen from other cloud providers. For that reason, when your cluster resides in AWS, we’ll store the backups on S3 in the same region. For Azure and GCP, on the other hand, we’ll also use AWS S3 for backup storage. For a whole host of reasons, we select a nearby region. So a cluster on Azure US East would store its backups on AWS S3 in US-East-1, for example. And a cluster on Azure in the Netherlands will store its backups on AWS S3 in the Frankfurt region. In addition to all other safeguards, the backups are encrypted, naturally.

The cloud has been around for 15 years already (amazing, I know) and so far, AWS has had a great history of not suffering catastrophic failures like the one that OVH has run into. Then again, until last week, you could say the same about OVH, but with 20+ years of operations. Part of the Disaster Recovery Process is knowing what risks you are willing to accept. And the purpose of this post is to expand on what we do and how we plan to react to such scenarios, be they ever so unlikely.

RavenDB actually has a number of features that are applicable for handling these sorts of scenarios. They aren’t enabled in the cloud by default, but they are important to discuss for users who need to have business continuity.

  • Multi-region or multi-cloud clusters are available. You can set up RavenDB across multiple disparate locations in several manners, but the end result is that you can ensure that you have no single point of failure, while still using RavenDB to its fullest potential. This is commonly used in large applications that are naturally geo-distributed, but it also serves as a way to not put all your eggs in a single basket.
  • In addition to the default backup strategy (same AWS region on AWS or nearby AWS region for Azure or GCP), you can set up backups to additional regions.

One of the key aspects of business continuity is the speed with which you can go back to normal operations. If you are running a large database, just restoring from backup can take a significant amount of time. If you have a database (or databases) whose backups are in the hundreds of GB range, just the time it takes to get the backups can be measured in many hours, leaving aside the restore costs.

For that reason, RavenDB also supports the notion of an offsite observer. That can be a single isolated node or a whole cluster. We can take advantage of the fact that the observer is not in active use and under-provision it. When we need to switch over, we can quickly allocate additional resources to ramp it up to production grade. For example, let’s assume that we have a cluster running in the Azure Northern Europe region, composed of 3 nodes with 8 cores each. We have also set up an observer cluster in the Azure Norway East region. Instead of having to allocate a cluster of 3 nodes with 8 cores each, we can allocate a much smaller size, say 2 cores only (paying less than a third of the original cluster cost as a premium). In the case of disaster, we can respond quickly and within a matter of minutes, the Norway East cluster will be expanded so each of the nodes will have 8 cores and can handle full production traffic.

Naturally, RavenDB is likely to be only a part of your strategy. There is a lot more to ensuring that your users won’t notice that something is happening while your datacenter is on fire, but at least as it relates to your database, you know that you are in good hands.

time to read 1 min | 200 words

I ask candidates to answer the following question. Sometimes at home, sometimes during an interview with a whiteboard.

You need to create an executable that would manage a phone book. The commands you need to support are:

  • phone-book.exe /path/to/file add [name] [phone]
  • phone-book.exe /path/to/file list [skip], [limit]

The output of the list operation must be the phone book records in lexical order. You may not sort the data during the list operation, however. All such work must be done in the add operation.

You may keep any state you’d like in the file system, but there are separate invocations of the program for each step. This program needs to support adding 10 million records.

Feel free to constrain the problem in any other way that would make it easier for you to implement. We’ll rate the solution on how much it costs in terms of I/O.

A reminder, we are a database company, this sort of question is incredibly relevant to the things that we do daily.

I give this question to candidates with no experience, fresh graduates,  etc. How would you rate its difficulty?

time to read 2 min | 295 words

I’m talking a lot about candidates and the hiring process we go through right now. I thought it would only be fair to share a story about an interview task that I failed.

That was close to 20 years ago, and I was looking for my first job. Absolutely no professional experience and painfully aware of that. I did have a few years of working on Open Source projects, so I was confident that I had a good way to show my abilities.

The question was simple: write the code to turn the contents of this table into a hierarchical XML file:

image 

In other words, they wanted:

To answer the question, I was given pen and paper, by the way. That made my implementation choices quite hard, since I had to write it all in longhand. I tried to reproduce this from memory, and it looks like this:

This is notepad code, and I wrote it using a modern API. At the time, I was using ADO.Net and the XmlDocument. The idea is the same, however, and it will spare you going through a mass of really uninteresting details.

I got so many challenges to this answer, though. I relied on null being sorted first on SQL Server and then on the fact that a parent must exist before its children. Aside from these assumptions, which I feel are fairly safe to make, I couldn’t figure out what the big deal was.

Eventually it turned out that the interviewers were trying to guide me toward a recursive solution. It never even occurred to me, since I was doing that with a single query and a recursive solution would need many such queries.
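The original table and code did not survive on this page, so here is a minimal Python sketch of the single-pass approach described above. The `(id, parent_id, name)` columns, the sample rows and the element names are all hypothetical. The sketch leans on the same two assumptions: rows arrive ordered by parent with NULLs first, and every parent appears before its children.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows of (id, parent_id, name), in the order that a query
# sorted by parent_id with NULLs first would return them.
rows = [
    (1, None, "Root"),
    (2, 1, "Child A"),
    (3, 1, "Child B"),
    (4, 2, "Grandchild"),
]


def build_tree(rows):
    """Build the hierarchy in one pass over rows sorted by parent."""
    document = ET.Element("items")
    nodes = {}
    for item_id, parent_id, name in rows:
        # Relies on every parent appearing before its children.
        parent = document if parent_id is None else nodes[parent_id]
        nodes[item_id] = ET.SubElement(parent, "item", id=str(item_id), name=name)
    return document


tree = build_tree(rows)
print(ET.tostring(tree, encoding="unicode"))
```

A dictionary lookup replaces the recursion: one query and one pass over the rows, instead of one query per level of the hierarchy.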

time to read 4 min | 666 words

I recently got my hands on the Raspberry PI 400 (the one that comes in a keyboard form). That is an amazing idea and it makes the Raspberry a lot more approachable for consumer cases.

At any rate, one of my first actions was to put RavenDB on it and see how well it performs. You can see the results in the image below.

image

In this case, we are running 1,500 queries per second on the system. It has 4 GB of RAM and the database we are using has 450 GB (!) worth of data. I actually just took the nearest external disk I had available and plugged that into the PI. This is a generic hard disk and I can get a maximum of about 30 MB / sec from it.

This is important because my queries are covering more data than can fit in memory. Each query asks for a random (different) document, so there is little chance for optimizations by having a hot working set. We are going to see some I/O to the (pretty poor) disk impacting the outcome. Here are the results:

image

You can see that for 95% of the queries, we got a result in under 125 milliseconds and that for 99% of the requests, RavenDB on a Raspberry PI is able to answer in about half a second. And even with some of the requests having to hit the disk, the maximum time to wait for a request is just above a second. All of that when we are facing 1,500 queries per second, which is respectable even for big applications running on much more massive hardware.

Of particular interest to me is the state of the server when we are running this benchmark. You can see that both in terms of CPU utilization and in the number of queries processed, we are nearly absolutely flat. There aren’t any hiccups in the load, there hasn’t been a GC pause that stopped the world and the system just runs at top speed for as long as we’ll let it. In this case, the benchmark lasted over 5 minutes, so more than enough time to run through all the usual suspects.

Note also the number of documents involved here. We are looking at 882 million documents. And we are requesting close to half a million of them. I ran the benchmark long enough to ensure that we cover more documents than can fit into memory, so we are seeing I/O work here (on a fairly poor disk, I might add, but that is what I had available at the moment).

The actual size on disk is a bit of a cheat, since I’m using document compression here to pack the data more tightly. The actual data size, without RavenDB’s data compression, is around 750 GB. That also helps a lot with the amount of I/O we have to deal with, but it increases the CPU consumption. Given the difference in relative costs, that is a trade-off that is paying dividends in spades.

I also decided to see what we get when we are running a query that touches just a small part of the documents. Instead of working through nearly half a million, I chose to run it on about 100,000 documents. That is small enough that it should mostly all fit in memory. It also represents a far more likely scenario, to be frank.

image

And here we can see that we answer all requests, at 1,500 queries per second on a Raspberry PI, in under 150 ms, with the 99.999th (!!) percentile at about 50 milliseconds.

And that makes me very happy, because it shows the result of all the work we put into optimizing RavenDB.

time to read 3 min | 563 words

Following a phone screen, we typically ask candidates to complete some coding tasks. The idea is that we want to see their code and asking a candidate to program during an interview… does not go well. I had a candidate some years ago that was provided with a machine, IDE and internet connection and walked out after failing for 30 minutes to reverse a string. Given that his CV said that he has 8 years of experience, I consider myself very lucky.

Back to the candidate that prompted this post. He sent us answers to the coding tasks. In Node.JS and C++. Okay, weird flex, but I can manage. I don’t actually care what language a candidate knows, especially for the junior positions.

Given that we are hiring for junior positions, we’ll usually get solutions that bend the question restrictions. For example, they would do a linear scan of a file even when they were asked not to. For the most part, we can ignore those details and focus on what the candidate is showing us. Sometimes we ask them to fix a particular issue, but usually we’ll just get them to the interview and ask them about their code there.

I like asking candidates about their code, because I presume that they spent some time thinking about it and can discuss the topic in some detail. At one memorable interview I had a candidate tell me: “I didn’t write this code, I have no idea what is going on here”. I triple checked that this was indeed the code they had sent and followed up by sending the candidate home, sans offer. We can usually talk with the candidate about what drove them to certain decisions, what impact a particular constraint would have on their code, etc.

In this case, however, the code was bad enough that I called it. I sent the candidate a notification about the issues we found in their code, detailing the 20+ critical failures that we found in the space of a few minutes of looking at it.

The breaking point for me was that the tasks did not actually work. In fact, they couldn’t work. I’m not sure if they compiled, I didn’t check, but they certainly were never even eyeballed.

For example, we asked the candidate to build a server that would translate messages to Morse code and cause the server speaker to beep in Morse code. Nothing particularly fancy, I think. But we got a particular implementation for that. For example, here is the relevant code that plays the Morse code:

image

The Node.js version that I’m using doesn’t come with the relevant machine learning model to make that actually happen, I’m afraid.

The real killer for me was this part:

You might want to read this code a few times.

They pass a variable to a function, set it to a new value and expect to see that new value outside. Basically, they wanted to use an out parameter here, which isn’t valid in JavaScript.

That is the kind of fairly fundamental issue in understanding the flow of code in a program. And that is something that would never have worked.
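The candidate's code is not reproduced on this page. The same mistake can be illustrated in Python, which behaves like JavaScript here: assigning to a parameter only rebinds the local name and never reaches the caller's variable. The function names below are hypothetical.

```python
def try_parse(text, result):
    # Rebinding the parameter only changes the local name `result`.
    result = int(text)
    return True


unchanged = None
ok = try_parse("42", unchanged)
print(ok, unchanged)  # the caller's variable was never touched


def parse(text):
    """Return the parsed value instead of writing through a parameter."""
    return int(text)


parsed = parse("42")
print(parsed)
```

Returning the value (or mutating an object that the caller passed in) is the way to get a result out in either language.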

I’m okay with getting suboptimal solutions. I’m not fine with code that was never actually looked at.

time to read 1 min | 171 words

I recently got an email from a customer. It was a very strange interaction. The email basically said:

I wanted to let you know that I recently had to setup a new server for an existing application of mine. I had to find an old version of RavenDB and I was able to get it from the site.

This is the first time in quite some time (years) that I had to touch this. I thought you would want to know that.

I do want to know that. We spend an inordinate amount of time trying to make sure that Things Work. The problem with that approach is that if we do things properly, you won’t even know that there was a challenge here that we overcame.

Our usual interaction with users is when they run into some kind of a problem. Hearing about the quiet mode, where RavenDB just worked and no one paid attention to it for a few years, is a breath of fresh air for me and the team in general.

time to read 2 min | 363 words

A few days ago I posted about looking at GitHub projects for junior developer candidates. One of the things that is very common in such scenarios is to see them use string concatenation for queries. I hate that. I just reached for a random candidate’s GitHub profile right now and found this gem:

The number of issues that I have with this code is legion.

  • Not closing the connection or disposing the command.
  • The if can be dropped entirely.
  • And, of course, the actual SQL INJECTION vulnerability in the code.
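The candidate's code is not shown on this page. As an illustration of the last point, here is a small self-contained sketch that uses Python's built-in sqlite3 module, with a hypothetical `users` table. It contrasts a concatenated query with a parameterized one, and also closes the connection and cursors.

```python
import sqlite3
from contextlib import closing


def find_user_unsafe(conn, name):
    # Vulnerable: the input becomes part of the SQL text itself.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    with closing(conn.cursor()) as cursor:
        return cursor.execute(query).fetchall()


def find_user(conn, name):
    # Safe: the value is bound as a parameter, separate from the SQL text.
    with closing(conn.cursor()) as cursor:
        return cursor.execute(
            "SELECT id, name FROM users WHERE name = ?", (name,)
        ).fetchall()


with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    malicious = "nobody' OR '1'='1"
    leaked = find_user_unsafe(conn, malicious)  # the injected OR matches every row
    safe = find_user(conn, malicious)           # treated as a literal name

print(len(leaked), len(safe))
```

The parameterized version is no longer than the concatenated one, which is part of why the concatenated form is so hard to excuse.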

There is a reason that I have such a reaction to this type of code, even when looking at junior developer candidates. For them, this is acceptable, I guess. They are learning and focusing mostly on what is going on, not the myriad of taxes that you have to pay in order to get something to production. This is never meant to be production code (I hope, at least). I’m not judging this on that level. But I have to very consciously remind myself of this fact whenever I run into code like this (and as I said, this is all too common).

The reason I have such a visceral reaction to this type of code is that I see it in production systems all too often. And that leads to nasty stuff like this:

And this code led to a 70GB data leak on Gab. The killer for me is that this code was written by someone with 23 years of experience.

I actually had to triple check what I was seeing when I read the code the first time, because I wasn’t sure that this is actually possible. I thought maybe this is some fancy processing done to avoid SQL injection, not that this is basically string interpolation.

Some bugs are things that you can excuse. A memory leak or a double free are things that will happen to anyone who is writing in C, regardless of experience and how carefully they write. They are often subtle and easy to miss, happening in corner cases of error handling.

This sort of bug is a big box of red flags. It is also on fire.
