Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,145 | Comments: 50,118

Privacy Policy Terms
filter by tags archive
time to read 4 min | 663 words

This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.

GP3 disks are about 20% cheaper than their GP2 equivalent. What is more, they come with a guarantee level of performance even before you purchase additional IOPS. Consider the following two disks:

  Size IOPS MB/S Price
GP2 512GB 1,536 250 51.2 USD
GP3 512GB 3,000 125 40.9 USD
GP3 512GB 4,075 250 51.2 USD

In other words, for the same disk, we can get a much better baseline performance at a cheaper price. What isn’t there not to like?

The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower, even if they are able to run more IOPS, their latency is higher.

A really key observation from us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breath a huge sigh of relief.

RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to account for the I/O being the slowest thing around and have already mostly coded around that. An additional millisecond in disk latency doesn’t matter that much in the grand scheme of things.

However… the fact that this doesn’t provide I/O burst is a huge plus for us. RavenDB can easily deal with slow I/O, what it find it very hard to deal with is an environment that very rapidly change its operational characteristics.

Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.

Until all the I/O credits are gone and we are forced into a screeching halt. That means, for example, that we read from the network at a rate of 250MB/sec, but we are unable to write to the disk at this level. There is a negative balance of 125MB/sec that needs to be stored some where. We can buffer that in memory, of course, but that only work for so long. That means that we have to put a huge break all of a sudden, which the rest of the eco system isn’t happy with. For example, the other side that is sending us data at 250MB /sec, they are likely not going to be able to respond in time to the shift is our behavior. It is very likely that the network connection would congest and break in this case.

All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.

As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost to write to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a true traffic jam

The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.

time to read 2 min | 222 words

I’m doing some cloud work, and I am working based off the official documentation, trying to automate the creation of an AWS Lambda. In order to allow me to quickly iterate, I’m basically creating the entire thing from scratch each time.

I have the following code:

  • aws iam create-role --role-name $AWS_ROLE --assume-role-policy-document file://trust-policy.json
  • aws iam attach-role-policy --role-name $AWS_ROLE  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  • aws lambda create-function --function-name $FUNC_NAME --zip-file fileb://lambda.zip --handler lambda_function.lambda_handler  --runtime python3.8 --role $ARN_ROLE

So far, so good, and exactly like it shows in the docs. But if you’ll run it as a script, it will fail with:

An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.

If I re-run the exact same command, however, it works properly.

There is this interesting command, which indicates that roles are using eventual consistency:

aws iam wait role-exists --role-name $AWS_ROLE

Except… that this doesn’t work. It looks like there is some additional delay between creating the role, validating that it was created and when it is actually available for Lambda to be used.

After looking around and feeling like a fool, I added a sleep for 10 seconds to the script, and the problem went away.

I’m posting this for posterity sake and in the hope that someone can tell me that there is a better way. For now, I think I need a shower.

time to read 2 min | 337 words

In the past two posts in the series, I talked about ways to store phone book records in a file. During the candidates review process, I noticed that many candidates failed to make their lives significantly easier by placing limits on themselves.

For example:

  • Using variable length records.
  • Using a single file.
  • Choosing simple algorithm to do the whole task.

If we force fixed length records, either directly or via record splitting (if each record is 64 bytes, a record that is bigger than that would reside in some multiple of the record size), the task become much easier. I’ve mostly ignored that in my code so far because I’m using binary offsets, but it can really make the code a lot simpler.

Using a single file lead to complications, because you have to do internal space management (where do the records live, where is the metadata?). It also make it much harder to recover used space in many cases.

The last one is probably the most interesting limitation, and not something that I would expect a junior developer to figure out. The use of a single option is typically limiting you to whatever a particular algorithm is providing, but you can extend on that significantly.

Let’s see another approach to building a persistent phone book. I’m going to effectively build an LSM here. You can see the code here.

I called it a pretty horrible LSM (Log Structure Merge), but all the relevant pieces are there. It is just horribly inefficient. The key problem, by the way, is around the number of times it will open a file handle. That can be really slow on Windows and end up being a real issue for any significant size.

There are also probably a lot of other bugs there, but also enough to figure out how this is actually built.

And with this post, I can can say that I explicitly scratched this itch.

A fun task to take this further, by the way, is to try to implement a persistent trie for the phone book.

time to read 3 min | 422 words

In the previous post, I discussed how I can (fairly naively) solve the phone book problem. In this post, I want to take this a bit further. Whenever a developer hears the terms sorted or ordered, I expect them to think about tree. Indeed, trees are the very first thing that I would expect to pop to mind.

Assuming that they aren’t versed with persistent data structures, they are likely going to look at in memory trees and map the idea directly to a file. I decided to take the same approach. For algorithms, I find Python to be the best language for me to grok. Mostly because it looks so much like pseudo code. Searching for AVLTree Python got me to this implementation. One issue with AVLTrees is their code size. Even in Python, the code is about 200 lines of code. Here is the structure of a node, which we’ll need to store on the disk.

image

You can see the full code here. It is about twice as long as the previous implementation (around 300 lines of code). I’m not going to cover that in depth, mostly because this is an AVL tree, with the only funny thing here is that I’m using that I’m writing the nodes to the file, not holding them in memory.

For example, I have to have some way to find the root node. I do that by writing its position to the end of the file after each write. That means that there are some limits to what I’m doing, but nothing too bad.

I don’t have a way to recover disk space, and updates to the data will use new space, not the old one. That is because we have to take into account that the size of the data may change.

This implementation is also going to be quite wasteful in terms of the disk seeks, given that it is an AVL Tree with a branching factor of 2. One of the reasons that binary search trees aren’t really used with persistent data structures is that the cost of seeking to another location in the file is enormous. B+Tree solves the problem by having a much higher branching factor.

A proper B+Tree, however, is probably going to take about 1,500 lines of code to implement, I think. So if you want to read that code, go ahead and peek into Voron Smile.

time to read 5 min | 891 words

A couple of weeks ago I asked you to rate an interview question that we had sent to candidates. The idea is to build a persistent phone book, with the idea that we care about the amount of I/O traffic that is used more than anything else.

The scenario we presented to the candidates was:

The rules are that you can’t maintain any state in the class itself and that the code should be prepared to handled a lot of records. Of particular interest is the IterateOrderedByName() call, which allows you to do an ordered iteration (by name) from a given name. That pretty much forces us to store the data in a sorted format.

Note that we don’t explicitly state that in the requirements for the task, we expect the candidates to understand that this is a requirement given the requirements for the operations. The most naïve option to complete this challenge is to write a CSV file. Each entry in its own line and you are done. However, that wouldn’t allow us to meet the other requirements.

In other words, reading the whole file to memory, adding the item, sorting the whole thing and writing it back again is a no go.  As a note, this is a task that we give to Junior developers. We expect zero to very little experience.

After going over dozens such answers, I can tell you that this task does its primary job, which is to filter people out. That is an explicit goal, by the way. We have had over 200 applicants to this position and the time it takes to go through the that many CVs, interviews, etc is horrendous. I found that this question filters enough people to make it far more manageable. And given the answers I got to my previous post, this is absolutely the kind of task that I would consider a junior developer suitable for. I remember solving similar problems in high school.

I like this problem because it is a good way to see if a person is able to do several things at once:

  • Can take a non trivial problem and break it to manageable pieces.
  • Can figure out several pitfalls along the way and handle them.
  • Can recognize what the impact of the various requirements are for the solution.
  • Can apply knowledge from one field to another. In this case, in memory data structures vs. persistent files.
  • Understand that data and representations are different things.

I have to admit that I was quite surprised by the results. Pretty much no one chose to use binary format, they all went to the textual format. This makes the problem harder. Also, there were a number of bytes vs. chars errors (that isn’t an issue for a junior position, though).

My first attempt is going to be a bit brute force. Instead of trying to implement any fancy data structures, I’m going to simply write the data out to the file in an append only manner. At the end of the file, however, I’ll keep a sorted array of the positions of the items in the file. Here is what this looks like:

image

To do a search in this format, I can use the sorted positions at the end of the file. Whenever I add a new record, I add it at the end of the records (overwriting the sorted positions sections and then writing the new sorted positions at the end). The overhead per write is the size of the sorted array, basically. The bigger the file, the more we’ll spend writing to the array. For example, assuming each record is around 64 bytes, when we get to 10 million records, we’ll have data size of 610MB, but the metadata we’ll have will be around 38 MB (and we’ll need to write all of it each time we modify a record).

The advantages, however, is that the code is simple and there are several ways that we can optimize the code without too much trouble. Here is what this looks like in code, there are some tests attached that you can run and the entire code runs at roughly 150 lines of code.

That isn’t a great solution, I’ll admit, but it is a straightforward one. It seems like it would be the natural consequence of moving from appending each record to the file to having a sorted access. There are some nice things there, nevertheless. You can “batch” several inserts together and close them by calling Flush() once, instead on each record, for example.

Given that the biggest issue here is the size of the sorted positions, we can seek to reduce it in several ways:

  • We can compress the data on write, you can see the changes required for that here.
  • We can create multiple sorted positions and merge them occasionally.

For that matter, we can store the sorted positions in another file entirely, which will simplify things like supporting efficient in order inserts.

Things that a junior developer might not do here? I’m using BinaryReader and BinaryWriter. That simplify things for me, since I can ignore the cost of textual parsing. That said, I’m not using them in any interesting way and the concepts translates 1:1 to textual interface, with a bit more work.

Can you do better than this implementation?

time to read 1 min | 200 words

imageI ask candidates to answer the following question. Sometimes at home, sometimes during an interview with a whiteboard.

You need to create an executable that would manage a phone book. The commands you need to support are:

  • phone-book.exe /path/to/file add [name] [phone]
  • phone-book.exe /path/to/file list [skip], [limit]

The output of the list operation must be the phone book records in lexical order. You may not sort the data during the list operation, however. All such work must be done in the add operation.

You may keep any state you’ll like in the file system, but there are separate invocations of the program for each step.  This program need to support adding 10 million records.

Feel free to constrain the problem in any other way that would make it easier for you to implement it. We’ll rate the solution on how much it cost in terms of I/O.

A reminder, we are a database company, this sort of question is incredibly relevant to the things that we do daily.

I give this question to candidates with no experience, fresh graduates,  etc. How would you rate its difficulty?

time to read 2 min | 295 words

I’m talking a lot about candidates and the hiring process we go through right now. I thought it would only be fair to share a story about an interview task that I failed.

That was close to 20 years ago, and I was looking for my first job. Absolutely no professional experience and painfully aware of that. I did have a few years of working on Open Source projects, so I was confident that I had a good way to show my abilities.

The question was simple, write the code to turn the contents of this table into a hierarchical XML file:

image 

In other words, they wanted:

To answer the question, I was given pen and paper, by the way. That made my implementation choices quite hard, since I had to write it all in long hand. I tried to reproduce this from memory, and it looks like this:

This is notepad code, and I wrote it using modern API. At the time, I was using ADO.Net and the XmlDocument. The idea is the same, however, and it will spare you going through a mass of really uninteresting details.

I got so many challenges to this answer,though. I relied on null being sorted first on SQL Server and then on the fact that a parent must exist before its children. Aside from these assumptions, which I feel are fairly safe to make, I couldn’t figure out what the big deal was.

Eventually it turned out that the interviewers were trying to guide me toward a recursive solution. It never even occurred to me, since I was doing that with a single query and a recursive solution would need many such queries.

time to read 3 min | 563 words

Following a phone screen, we typically ask candidates to complete some coding tasks. The idea is that we want to see their code and asking a candidate to program during an interview… does not go well. I had a candidate some years ago that was provided with a machine, IDE and internet connection and walked out after failing for 30 minutes to reverse a string. Given that his CV said that he has 8 years of experience, I consider myself very lucky.

Back to the candidate that prompt this post. He sent us answers to the coding tasks. In Node.JS and C++. Okay, weird flex, but I can manage. I don’t actually care what language a candidate knows, especially for the junior positions.

Given that we are hiring for junior positions, we’ll usually get solutions that bend the question restrictions. For example, they would do a linear scan of a file even when they were asked not to. For the most part, we can ignore those details and focus on what the candidate is showing us. Sometimes we ask them to fix a particular issue, but usually we’ll just get them to the interview and ask them about their code there.

I like asking candidates about their code, because I presume that they spent some time thinking about it and can discuss the topic in some detail. At one memorable interview I had a candidate tell me: “I didn’t write this code, I have no idea what is going on here”. I had triple checked that this is indeed the code they sent and followed up by sending the candidate home, sans offer. We can usually talk with the candidate about what drove them to certain decisions, what impact a particular constraint would be on their code, etc.

In this case, however, the code was bad enough that I called it. I sent the candidate a notification about the issues we found in their code, detailing the 20+ critical failures that we found in the space of a few minutes of looking at it.

The breaking point for me was that the tasks did not actually work. In fact, they couldn’t work. I’m not sure if they compiled, I didn’t check, but they certain were never even eyeballed.

For example, we asked the candidate to build a server that would translate messages to Morse code and cause the server speaker to beep in Morse code. Nothing particularly fancy, I think. But we got a particular implementation for that. For example, here is the relevant code that plays the Morse code:

image

The Node.js version that I’m using doesn’t come with the relevant machine learning model to make that actually happen, I’m afraid.

The real killer for me was this part:

You might want to read this code a few times.

They pass a variable to a function, set it to a new value and expect to see that new value outside. Basically, they wanted to use an out parameter here, which isn’t valid in JavaScript.

That is the kind of fairly fundamental issue in understanding the flow of code in a program. And that is something that would never have worked.

I’m okay with getting sub optimal solutions, I’m not fine with it never have been actually looked at.

time to read 6 min | 1093 words

I mentioned that we are currently hiring for a junior dev position and we have been absolutely swamped with candidates. Leaving aside the divorce lawyer that tried to apply to the position and the several accountants (I don’t really get it either) we typically get people with very little experience.

In fact, this position is explicitly open to people with no experience whatsoever. Given that most junior positions require a minimum of two years, I think that got us a lot of candidates.

The fact that we don’t require prior experience doesn’t meant that we don’t have prerequisites, of course. We are a database company and the fundamentals are important to us. A typical task in RavenDB involves a lot of taxes, from ACID compliance, distributed computing, strict performance requirements, visibility into the actions of the database, readability of the code, etc.

I talked before about the cost of a bad hire, and in the nearly a decade that passed since I wrote that post, I hasn’t changed my mind. I would rather end up with no one than hire someone that isn’t a match for our needs.

Our interview process is composed of a phone call, a few coding questions and then an in person interview. At this point, given that I have been doing that for over a decade, I think that I interviewed well over 5,000 people. A job interview stresses some people out a lot. Yesterday I had to listen to a candidate speak so fast that I could barely understand the words and I had to stop a candidate and tell them that they are currently in the 95% percentile of people I spoke to, so they wouldn’t freeze because of a flubbed question.

I twitted(anonymously) about the ups and down of the process and seem to have created quite a lot of noise. A typical phone call for a potential candidate takes about 15 – 30 minutes and is mostly there to serve as an explicit filter. If they don’t meet the minimum requirements that we have, there is no point in wasting either of our time.

One of the questions that I ask is: Build a phone book application that stores the data in memory and outputs the records in lexical order. This can stump some people, so we have an extra question to help. Instead of trying to output the data in lexical order, how would you ensure that you don’t have a duplicate phone number in such a system? Scanning through the entire list of records each time is obviously not the way to go. If they still can’t think of a way to do that the next hint is to think about O(1) and what data structure would fit this requirement. On the Twitter thread, quite a few people were up in arms about that.

Building a phone book is the kind of task that I remember doing in high school programming class as a teenager. Admittedly, that was in Pascal, but I just checked six different computer science degrees and for all of them, data structures was a compulsory course. Moreover, things like “what is the complexity of this operation” are things that we do multiple times a day here. We are building a database here, so operations on data is literally our bread and butter. But even for “normal” operations, that is crucial. A common example, we need to display some information to the user about their database. The information actually come from two sources internally. One is the database list which contains various metadata and one is the active database instance, which can give us the running stats such as the number of requests for this database in the past minute.

Let’s take a look at this code:

The complexity of this code is O(N^2). In other words, for ten databases, it would cost us a hundred. But for 250 databases it would cost 62,500 and for 500 it would be 250,000. Almost the same code, but without the needless cost:

This is neither theoretical nor rare, and it is part of every programming curriculum that you’ll find. Further more, I’m not even asking about the theoretical properties of various algorithms, or asking a candidate to compute the complexity of a particular piece of code. I’m presenting a fairly simple and common task (make sure that a piece of data in unique) and asking how to perform that efficiently. 

From a lot of the reactions, it seems that plenty of people believe that data structures aren’t part of the fundamentals and shouldn’t be something that is deeply embedded in the mindset of developers. To me, that is like saying that a carpenter shouldn’t be aware of the difference between a nail or a screw.

Rob has an interesting thread on the topic, which I wanted to address specifically:

It does not matter what language, actually. In JavaScript, you’ll not use a “new Hashtable()”, you’ll use an object, sure, but that is a meaningless detail. In fact, arrays implemented as hashes in JavaScript maintain most of their important properties, actually. O(1) access time by index being the key. If you want to go deeper, than in practice, the JS runtime will usually use an actual array if it detects that this is how you use the variable.

And I’m sorry, that is actually really important for many reasons. The underlying structure of our data has huge impact on what you can do with that data and how you can operate on it. That is why you have ArrayBuffer and friends in JavaScript now, because it matters, a lot.

And to the people whose 30 years experience never included ever needing to know those details, I have two things to say:

  • Either your code is full of O(N^2) traps (which is sadly common) or you know that, because lists vs. hash is not something that you can really get away with.
  • In this company, implementing basic data structures is literally part of day to day tasks. Over the years, we needed to get customize versions of arrays, lists, dictionaries, trees and many more. This isn’t pie in the sky stuff, that is Monday morning.
time to read 1 min | 188 words

Just had to trouble shoot a really strange problem in a production system. The situation is pretty simple. At one point in time, a machine had an issue and shortly afterward became utterly unable to reach the outside world. I was luckily able to SSH into the machine, but any outgoing connection would fail.

Given that TCP worked (I was using it or SSH), how can that be? I can create a connection to the system and have bidirectional communication, but not initiate any calls to the outside world.

Strange doesn’t begin to cover how weird this is. The reason for the failure, by the way, was that the disk was absolutely full. The reason for the network outage was a mystery.

It took a while to figure out that the error was that the disk was full, which caused an error is systemd-resolved, which failed to setup DNS properly. I was also unable to make connections via IP address, which was strange, but the problem went away after the disk was cleared.

I have to say, network failure due to disk errors was not how I expected to start the day.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Building a phone book (3):
    02 Apr 2021 - Part III
  2. Building a social media platform without going bankrupt (10):
    05 Feb 2021 - Part X–Optimizing for whales
  3. Webinar recording (12):
    15 Jan 2021 - Filtered Replication in RavenDB
  4. Production postmortem (30):
    07 Jan 2021 - The file system limitation
  5. Open Source & Money (2):
    19 Nov 2020 - Part II
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats