Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 10 min | 1813 words

This is a sordid tale of chance and mystery and the nasty tricks that Murphy can play on you.

A few customers reported an error similar to the following one:

Invalid checksum for page 1040, data file Raven.voron might be corrupted, expected hash to be 0 but was 16099259854332889469

A single such case might be disk corruption, but multiple customers reporting it is an indication of a much bigger problem. That was a trigger for a STOP SHIP reaction. We consider data safety a paramount goal of RavenDB (part of the reason why I’m doing this Production Postmortem series), and we put some of our most experienced people on it.

The problem was, we couldn’t find it. Having access to the corrupted databases showed that the problem occurred at random. We use Voron in many different capacities (indexing, document storage, configuration store, distributed log, etc) and these incidents happened across the board. That narrowed the problem down to Voron itself, rather than to bad usage of Voron. This reduced the problem space considerably, but not enough for us to be able to tell what was going on.

Given that we didn’t have a lead, we started by recognizing what the issue was and added additional guards against it. In fact, the error itself was a guard we added, validating that the data on disk is the same data that we have written to it. The error above indicates that there has been a corruption in the data because the expected checksum doesn’t match the actual checksum of the data. This gives us an early warning system for data errors and prevents us from proceeding on erroneous data. We added this primarily because we were worried about physical disk corruption of data, but it turns out that this is also a great early warning system for when we mess up.
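As a rough illustration of the kind of guard involved, here is a minimal C# sketch. The page layout and hash function (FNV-1a) are stand-ins, not Voron’s actual on-disk format or checksum algorithm:

```csharp
using System;
using System.IO;

public static class PageValidator
{
    public static void ValidatePage(long pageNumber, ReadOnlySpan<byte> pageData, ulong expectedChecksum)
    {
        ulong actual = ComputeChecksum(pageData);
        if (actual != expectedChecksum)
            throw new InvalidDataException(
                $"Invalid checksum for page {pageNumber}, data file might be corrupted, " +
                $"expected hash to be {expectedChecksum} but was {actual}");
    }

    private static ulong ComputeChecksum(ReadOnlySpan<byte> data)
    {
        // FNV-1a, used here only as a placeholder for the real checksum algorithm.
        ulong hash = 14695981039346656037UL;
        foreach (byte b in data)
        {
            hash ^= b;
            hash *= 1099511628211UL;
        }
        return hash;
    }
}
```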

The additional guards were primarily additional checks for the safety of the data at various locations along the pipeline. Given that we couldn’t reproduce the issue ourselves, and none of the customers affected were able to reproduce it, we had no idea where to go from there. Therefore, we had one team that kept on trying different steps to reproduce the issue and another team that added additional safety measures for the system, to catch any such issue as early as possible.

The additional safety measures went into the codebase for testing, but we still didn’t have any luck in figuring out what was going on. We went from trying to reproduce this by running various scenarios to analyzing the code and trying to figure out what was going on. Everything pointed to it being completely impossible for this to happen, obviously.

We got a big break when the repro team managed to reproduce this error when running a set of heavy tests on 32 bits machines. That was really strange, because none of the failures we had seen to date had happened on 32 bits.

It turns out that this was a really lucky break, because the problem wasn’t related to 32 bits at all. What was going on there is that under 32 bits, we run in a heavily constrained address space, which under load can cause us to fail to allocate memory. If this happens at certain locations, it is considered to be a catastrophic error and requires us to close the database and restart it to recover. So far, this is a pretty standard, expected and desired reaction. However, it looked like sometimes this caused an issue. This also tied into some observations from customers about the state of the system when this happened (low memory warnings, etc).

The very first thing we did was to test the same scenario on the codebase with the new checks added. So far, the repro team worked on top of the version that failed at the customers’ sites, to prevent any other code change from masking the problem. With the new checks, we were able to confirm that they actually triggered and caught the situation early. That was a great confirmation, but we still didn’t know what was going on. Luckily, we were able to add more and more checks to the system and run the scenario. The idea was to trip over a guard rail as early as possible, to allow us to inspect what actually caused it.

Even with a reproducible scenario, that was quite hard. We didn’t have a reliable method of triggering it; we had to run the same set of operations for a while and hope the scenario would reproduce. That took quite a bit of time and effort. Eventually, we figured out the root cause of the issue.

In order to explain that, I need to give you a refresher on how Voron handles I/O and persists data.

Voron uses an MVCC model, in which any change to the data is actually done on a scratch buffer. This allows us to have snapshot isolation at very little cost and gives us a drastically simplified model for working with Voron. Other important factors include the need to be transactional, which means that we have to make durable writes to disk. In order to avoid doing random writes, we use a Write Ahead Journal. For these reasons, I/O inside Voron is basically composed of the following operations:

  • Scratch (MEM) – copy-on-write data for pages that are going to be changed in the transaction. Usually purely in memory. This is how we maintain the Isolation and Atomicity aspects of ACID.
  • Journal (WAL) – sequential, unbuffered writes that include all the modifications in the transaction. This is how we maintain the Atomicity and Durability aspects of ACID.
  • Flush (MMAP) – copies data from the scratch buffers to the data file, which allows us to reuse space in the scratch file.
  • Sync (FSYNC) – ensures that the data from a previous flush is stored on durable media, allowing us to delete old journal files.

In Voron 3.5, we had journal writes (which happen on each transaction commit) on one side of the I/O behavior and flush & sync on the other side. In Voron 4.0, we split this even further, so journal writes, data flushes and file syncs are all separate operations that can each happen on its own.

Transactions are written to the journal file one at a time, until it reaches a certain size (usually about 256MB), at which point we’ll create a new journal file. Flush will move data from the scratch buffers to the data file and sync will ensure that the data that was moved to the data file is durably stored on disk, at which point you can safely delete the old journals.
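The rule that makes deleting a journal safe can be sketched roughly like this. The names and structures are illustrative, not Voron’s actual code:

```csharp
public sealed class JournalFile
{
    // Highest transaction id written to this journal file.
    public long LastTransactionId;
}

public sealed class FlushSyncState
{
    // Highest transaction id whose pages the flush has copied to the data file.
    public long LastFlushedTransactionId;

    // Highest transaction id whose flushed pages a sync has made durable on disk.
    public long LastSyncedTransactionId;

    // A journal may only be deleted once everything in it has been flushed to the
    // data file AND that flush has been synced. The bug described below broke this
    // rule when a sync raced with a newer flush.
    public bool CanDelete(JournalFile journal) =>
        journal.LastTransactionId <= LastFlushedTransactionId &&
        journal.LastTransactionId <= LastSyncedTransactionId;
}
```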

In order to trigger this bug, you needed to have the following sequence of events:

  • Have enough transactions happen quickly enough that the flush / sync operations are lagging by more than a single file behind the transaction rate.
  • Have a transaction start a new journal file while the flush operation was in progress.
  • Have, concurrently, the sync operation complete an operation that includes that last journal file. Sync can take a lot of time.
  • Have another flush operation go on while the sync is in progress, which will move the flush target to the new journal file.
  • Have the sync operation complete, having only synced some of the changes that came from that journal, but because the new flush (which we didn’t sync yet) already moved on from that journal, mistakenly conclude that this journal file is completely done and delete it.

All of these steps are just the setup for the actual problem, mind you.

At this point, we are primed to hit this issue, but we haven’t actually experienced it yet. What has happened is that the persistent state (on disk) of the database is now suspect: if a crash happens, we will be missing the oldest journal that still has transactions that haven’t been properly persisted to the data file.

Once you have set the system up properly, you still aren’t done, in terms of reproducing this issue. We now have a race: the next flush / sync cycle is going to fix things. So you need the database to restart within a very short period of time.

For additional complexity, the series of steps above will cause a problem, but even if you crash in just the right location, there are still some mitigating circumstances. In many cases, you are modifying the same set of pages in multiple transactions, and if the transactions that were lost because of the early deletion of the journal file had pages that were modified in future transactions, those transactions will fill in the missing details and there will be no issue. That was one of the things that made it so hard to figure out what was going on. We needed a very specific set of timings between three separate threads (journal, flush, sync) to create the hole, then another race to restart the database at that point, before Voron fixes itself in the next cycle, all happening just at the stage where Voron moves between journal files (typically every 256MB of compressed transactions, so not very often at all) and with just the right mix of writes to different pages in transactions that span multiple journal files.

These are some pretty crazy requirements for reproducing such an issue, but as the saying goes: One in a million is next Tuesday.

What made this bug even nastier was that we hadn’t caught it earlier. We take the consistency guarantees of Voron pretty seriously and we most certainly have code to check if we are missing transactions during recovery. However, we had a bug in this case. Because obviously there can’t be a transaction prior to Tx #1, we don’t check for a missing transaction at that point. At least, that was the intention of the code. What was actually executing was a check for missing transactions on every transaction except for the first transaction of the first journal file during recovery. So instead of skipping the check just for Tx #1, we skipped it for the first transaction of every recovery.
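In pseudo-C# terms, the difference between the intended check and the one that actually shipped looks roughly like this (illustrative only, not the actual recovery code):

```csharp
public static class RecoveryChecks
{
    // What we intended: only Tx #1 has no predecessor, so only it is exempt
    // from the "is there a missing transaction before this one?" check.
    public static bool ShouldCheckForMissingTransaction_Intended(long txId) =>
        txId != 1;

    // What actually ran: the first transaction of the first journal file in *every*
    // recovery was exempt, which is exactly the transaction this bug makes disappear.
    public static bool ShouldCheckForMissingTransaction_Actual(bool isFirstTxInFirstJournalOfRecovery) =>
        isFirstTxInFirstJournalOfRecovery == false;
}
```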

Of course, this is the exact state that we have caused in this bug.

Sigh.

We added all the relevant checks, tightened the guard rails a few more times to ensure that a repeat of this issue will be caught very early and provided a lot more information in case of an error.

Then we fixed the actual problems and subjected the database to what in humans would be called enhanced interrogation techniques. Hammers were involved, as well as an irate developer with a penchant for pulling the power cord at various stages just to see what would happen.

We have released the fix in RavenDB 4.1.4 stable release and we encourage all users to upgrade as soon as possible.

time to read 6 min | 1059 words

I talk a lot about the hiring process that we go through, but there is also the other side of that: when people leave us. Hibernating Rhinos has been around for about a decade; in that time it grew from a single guy operation to a company that crossed the bridge from small to medium business a couple of years ago.

When I founded the company, I had a pretty good idea of what I wanted to have. Actually, I had a very clear idea of what I didn’t want to have. The things that I didn’t want to carry over to my own company. For example, being on call 24/7, working hours that exceed the usual norms, or being under constant pressure. By and large, looking back at our history and where we are today, I think that we did a pretty good job at upholding these values.

But that isn’t the topic of this post. I wanted to talk about people leaving the company. Given the time that we have been in business, we actually have very little turnover. Oh, we had people come and go, and I had to fire people who weren’t pulling their weight. But those were almost always people who were at the company for a short while (typically under a year).

In the past six months, we had two people leave that were with us for three and seven years (about three months apart from one another). That is a very different kind of separation. When I was told that they intended to leave, I was both sad and happy. I was sad because I hated to lose good people, and I was happy because they were going to very good places.

After getting over my surprise, I sat down and planned for their departure. Israel has a one-month notice requirement, so we had the time to do things properly. I was careful to check (very gently) whether this was a reversible decision, and once I confirmed that they had made the decision, I carried on with that.

My explicit goals for that time were:

  • Make sure that they are leaving on good terms and great spirits.
  • Ensure proper handoff of their current tasks.
  • Provide guidance about current and past tasks.
  • Map areas of responsibility and make sure that they are covered after they are gone.

The last three, I believe, are pretty common goals when people are leaving, but the most important piece was the first one. What does this mean?

I wrote each of them a recommendation letter. Note that they both already had accepted positions elsewhere at that time, so it wasn’t something they needed. It is something that they might be able to make use of in the future, and it was something that I wanted to do, formally, as an appreciation for their work and skills.

As an aside, I have an open invitation to my team. I’ll provide both recommendation letters and serve as a reference in any job search they have, while they are working for us. I sometimes get CVs from candidates that explicitly note: “sensitive, current employer isn’t aware”. I don’t want to be the kind of place that you have to hide from.

We also threw each of them a going away party, with the entire company stopping everything and going somewhere to celebrate.

I did that for several reasons. First, each of them, in very different ways, contributed significantly to RavenDB. It was a joy to work with them, I don’t see any reason why it shouldn’t be a joy to see them go. I can certainly say that not saying goodbye properly would have created a bad taste for the entire thing, and that is something that I don’t want.

Second, and a bit more cold-minded, I want to leave the door open for them to come back again. After so much time in the company, the amount of knowledge that they have in their heads would be a shame to lose for good. But even if they never come back, that is still a net benefit, because…

Third, there is the saying about “if you love someone, let them go…”. I think that a really good way to make people want to leave is to make it hard to do so. By making the separation easy and cordial,  the people who stay know that they don’t need to fear or worry about things if they want to see what else is available for them.

The last few statements came out a bit colder than I intended them to be, but I can’t really think of a way to phrase the intention that wouldn’t sound like that. I don’t like that these people left, and I would much rather have them stay. But I started out from the assumption that they are going to leave, and the goal is to make the best out of that.

I was careful to not apply any pressure on them to stay regardless. In fact, in one case, I upfront apologized to the person on the way out, saying: “I want you to know that I’m not pressuring you to stay not because I want you to go, but because I respect your decision to leave and don’t want to make it awkward”.

Fourth, and coming back to what I want to have as a value for the company, is the idea that I wouldn’t mind at all being a place that people retire from. In fact, I decidedly want that to be the case. And we do a lot of work to ensure that we are the kind of place that you can be at for long periods of time (investing in our people, working on cool stuff, ensuring that grunt work is shared and minimized, etc). However, I would also take great pride in being the place that serves as a launching pad for people’s careers.

In closing, people are going to leave. If it is because of something that you can control, that should be a warning sign and something that you should look at to see if you can do better. If it is out of your hands, you should accept it as given and make the best of it.

I was very sad to see them go, and I wish them all the best in their future endeavors.

time to read 1 min | 92 words

Last week we pushed an update to our public demo site. It is intended to walk you through using RavenDB, show code samples and provide detailed guidance on using RavenDB from your application.

Here is an example screen shot:

[Image: screenshot of the demo site]

We spent a lot of time and effort on it, and I would appreciate you taking a peek and providing feedback on how useful that is for you to learn RavenDB and how to use it.

time to read 7 min | 1259 words

After my previous two posts on the topic, it is now time to discuss how we make money from Open Source Software. Before I start, I want to clarify what I’m talking about:

  • RavenDB is a document database.
  • It is about a decade old.
  • The server is released under the AGPL / commercial license.
    • We offer free community / developer licenses without any AGPL hindrance.
  • The RavenDB client APIs are licensed under the MIT license.
  • RavenDB (the product) is created by Hibernating Rhinos (the company).

I created RavenDB because I couldn’t not do it. It was an idea that had to get out of my head. I looked up the details, and toward the end of 2008 I started to work on it as a side project. At the time I was involved in five or six active open source projects, had just gotten my NHibernate Profiler product to stable ground and had been turning the idea of getting deeper into databases in my head for a while. So I sat down and wrote some code.

I was just doing some code doodling, and it turned into deep design discussions, and at some point I was actually starting to actively look for help building the user interface for a “done” project. That was in late Feb 2010. Somehow, throwing some code at the compiler had become a journey that lasted over a year, in which I worked 16+ hour days on this project.

Around Mar 2010 I knew that I had a problem. Continuing as I did before, just writing a lot of code and trying to create an OSS project out of it, would eat up all my time (and money). The alternatives were actually making money from RavenDB or stopping work on it completely. And I didn’t want to stop working on it.

I decided that I had to make an effort to actually make a product out of this project. And that meant that I had to sit down and plan how I would actually make money from it. I firmly believe that “build it, and they will come” is a nice slogan, but it doesn’t replace planning, strategy and (at least some) luck.

  • I already knew that I couldn’t sustain the project as a labor of love, and donations are not a sustainable way (or indeed, a way) to make money.
  • Sponsorship seemed like it would be unlikely unless I got one of my clients to start using RavenDB and then have them pay me to maintain it. That seemed… unethical, so wasn’t an option.
  • Services / consulting was something that I was already doing quite heavily, and was quite successful at it. But this is a labor intensive way of making money and it would compete directly with the time that it would take to build RavenDB itself.
  • Support is a model I really don’t like, because it puts me in a conflict of interest. I take pride in what I do, and I wanted to make something that would be easy to use and not require support.
  • Open Core / N versions back – are models that I don’t like. The open core model often leaves out critical functionality (such as security) and the N versions back mean that you give the users you most want to have the best experience (since that would encourage them to give you money) the worst experience (here are all our bugs that we fixed but won’t give to you yet).

That left us with dual licensing as a way to make money. I chose the AGPL because it was an OSI approved license that isn’t friendly for commercial use, leading most users who want to use it to purchase a commercial license.

So far, this is fairly standard, I believe.

I decided that RavenDB is going to be OSS, but from most other aspects, I’m going to treat it as a commercial product. It had a paid team working on it from the moment it stopped being a proof of concept. It meant that we intentionally set out to make our money on the license. This, in turn, had a lot of implications. Support is defined as a Cost Center in Hibernating Rhinos. In other words, one of the things that we routinely do in Hibernating Rhinos is look at how we can reduce support.

One way of doing that, of course, is to not have support, or to staff the support team with students or the cheapest off shore option available. Instead, our support staff consists of dedicated support engineers and the core team that builds RavenDB. This serves several goals. First, it means that when you raise a support issue with us, you get someone who knows what they are doing. Second, it means that the core team is directly exposed to (and affected by) the support issues that are raised. I have structured things in this manner explicitly because having an insight into actual deployment and customer behavior means that the team is directly aware of the impact of their work. For example, writing an error message that will explain some issue to the user matters, because it reduces the time an engineer spends on the phone troubleshooting (not fun) and increases the amount of time they can sling code around (fun).

We had a major update between versions 3.5 and 4.0, taking almost 3 years to finish. The end result was vastly improved performance, the ability to run on multiple platforms and a whole host of other cool stuff. But the driving force behind it all? We had to make a significant change to our architecture in order for us to reduce the support burden. It worked, and the need for support went down by over 80%.

Treating RavenDB as a commercial product from the get go, even though it had an OSS license, meant that we focused on a lot of the stuff that is mostly boring. Anything from docs, setup and smoothing out all the bumps in the road, etc. The AGPL was there as a way to have your cake and eat it too. Be an OSS project with all the benefits that this entails. Confidence from our users about what we do, entry to the marketplace, getting patches from users and many more. Just having the ability to directly talk to our community with the code in front of all of us has been invaluable.

At the same time, we sell licenses to RavenDB, which is how we make money. The idea is that we provide value above and beyond whatever it is our license cost, and we can do that because we are very upfront and obvious in how we get paid.

We have a few users who have chosen to go with the AGPL version and skip paying us. I would obviously rather get paid, but I have laid out the rules of the game when I started playing and that is certainly within the rules. I believe that we’ll meet these users as customers in the future, it isn’t really that different from the community edition which we offer freely. In both cases, we aren’t getting paid, but it expands our reach, which will usually get us more customers in the long run.

We have been doing this for a decade and Hibernating Rhinos currently has about 30 people working full time on it, so it is certainly working so far!

time to read 5 min | 866 words

A large portion of my day to day tasks is to review code. I’m writing this post barely two weeks into the new year, and we already had over 150 PRs going into RavenDB alone.

As a result, I’ve gotten sensitive to certain issues. For example, the following is a suggestion made for fixing an issue in this method declaration:

[Image: the C method declaration in question]

This is a piece of code (in C) that is meant to handle some low level details for RavenDB. We use the CLR coding conventions for C#, but for C, we have chosen to use a different convention, using snake_case for methods, arguments and variables and SHOUTING_CASE for constants / defines. When reading through the code, I marked this violation of the naming convention for a fix.

This may seem minor, but it is probably annoying for the author of the code. They are interested in comments about the code functionality and utility. Why spend any time on something that doesn’t really have any impact? Both forms of the parameter name are just as readable to me, after all.

Before I get to this part, I want to show another piece of code. Or, to be rather more exact, two pieces of code. One of the reasons that we are using C code is that we can abstract large parts of the underlying platform inside the native code. That means that we have certain parts of the code that are written twice. Once for Windows and once for Linux.

Here is some code for Windows:

[Image: the Windows implementation]

And here is the same code for Linux:

[Image: the Linux implementation]

You can see that this is pretty much the same thing, just calling the different APIs for each platform. One thing to notice here is that part of this method’s task is to ensure that the file that we open is at least as big as the initially requested size.

In Windows, to increase or decrease the file size you call SetFilePointer() followed by SetEndOfFile(). On Linux, you have fallocate() and ftruncate()*. This is reflected in the code. The Windows code has a single method to do this work, while the Linux code has two methods, rvn_allocate_file_space() and rvn_truncate_file(), the latter of which isn’t shown here.

* Actually, you might have fallocate(). Some file systems do not support it, and you need to use another workaround.

One of my code review comments was that this needs to be fixed, that we should have a _resize_file() method for Linux that would simply call the appropriate method based on the file size. But the question is, why?

These are two separate implementations for two separate operating systems. We are already creating the higher level abstraction level with operations that hide many system details. Why do I want to have a passthrough method just because the Windows code has this method?

The answer here, as in the case above with the parameter name, is the same. Consistency.

This is most obvious in the naming convention, but it is the same reasoning I had when I wanted to have the same method structure for both Linux and Windows.

Consistency is key to being able to slog through a lot of code. It is how I (and the rest of the team) can go through thousands of lines of code per week and understand what is going on. Because when we look at a piece of code, it follows certain conventions and structure. Reading the code is easy because we can ignore a lot of the cruft around it and focus on what is going on.

In the case of the Windows / Linux methods, I first read one method and then the next, making sure that we are doing the same thing on all platforms. The different behavior (resize vs. allocate) was very obvious to me, which meant that I had to stop, go and look at each method’s implementation to figure out whether there was any meaningful difference between them. That was a chore, and it would only become worse over time as we add additional functionality, so anything that isn’t different because it has to be different should match.
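To show the shape I was after without quoting the native code, here is the same idea as a C# sketch. The names are made up and the real code is C using the platform APIs mentioned above; this only illustrates the structural argument:

```csharp
using System.IO;

public static class WindowsFileOps
{
    // Windows has one natural "resize" operation (SetFilePointer + SetEndOfFile),
    // so the platform code exposes a single method.
    public static void ResizeFile(FileStream file, long size) => file.SetLength(size);
}

public static class PosixFileOps
{
    // Linux exposes two primitives: one to grow the file and one to shrink it.
    public static void AllocateFileSpace(FileStream file, long size) => file.SetLength(size);
    public static void TruncateFile(FileStream file, long size) => file.SetLength(size);

    // The passthrough requested in the review. It adds nothing functionally, but it
    // gives both platforms the same shape, so reading them side by side is trivial.
    public static void ResizeFile(FileStream file, long size)
    {
        if (size >= file.Length)
            AllocateFileSpace(file, size);
        else
            TruncateFile(file, size);
    }
}
```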

In general, I like code reviews where I can scan through the changes and not see the code, but its purpose. That happens when there isn’t anything there that I really have to think about and the intent is clear.

When writing in C#, we have decades (literally) of experience and organizational inertia that push us in the right direction. When pushing C code into the repository, I started to actually pay attention to those details explicitly, because I suddenly needed to.

This is apparent in code reviews, but it isn’t just the case of me trying to make my own tasks easier. Code is read a lot more often than it is written, and focusing on making the code itself boring will pay off, because what the code is doing is what should be interesting in the long run.

time to read 6 min | 1079 words

I’m certain that the archangel responsible for statistical grouping has a strange fondness for jokes. If that isn’t the case, I can’t really explain how we routinely get clear clustering of bug reports on specific issues. We noticed that several years ago, when we suddenly started to get a lot of reports of a particular issue. There was no recent release and no reason for all of these correlated errors to happen all at once. But they did. Luckily, the critical mass of reports gave us more than enough information to figure out what the bug was and fix it. When we checked, the bug had been there, dormant, for 20 months.

The current issue we are handling is a spate of reports about hard errors in RavenDB. The kind of errors that use the term invalid data, in particular. These kinds of errors bring development to a screeching halt, as we direct all available resources to figure out what is going on. We didn’t have a reproduction, but we did have a few cases of people reporting it, and the stack traces we got told us that the problem wasn’t isolated to a single location but seemed to be fairly widespread. This is a Good News / Bad News kind of situation. It is good because it means that the problem is in a low level component; it is bad because it is going to be hard to narrow it down exactly.

The common theme around this issue was that it popped up when the system was already in a bad state. We were able to see that when we ran out of disk space or when the memory on the system was very low. The common thread was that in both cases these are expected and handled scenarios. What is more, these are the kind of errors that we want to keep on shouldering through.

In general, the error strategy we use in most of RavenDB is fairly simple. Abort the transaction and give the user an error. If the error is considered catastrophic (I/O failure when writing to disk, for example), we also force a restart of the database (which does full recovery). As it turned out, there were three completely separate issues that we found when investigating this issue.

First, and the most critical one was cleanup code that wasn’t aware of the state of the system. For example, consider:
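The offending pattern looked roughly like the following. This is a reconstructed sketch with made-up names around the FlushIndex() and WriteData() mentioned below, not the actual RavenDB code:

```csharp
using System;

public sealed class Transaction : IDisposable
{
    public void Commit() { /* ... */ }
    public void Dispose() { /* disposing an aborted transaction is the only legal operation */ }
}

public class IndexFlusher
{
    public void FlushIndex()
    {
        using (var tx = BeginWriteTransaction())
        {
            try
            {
                WriteData(tx);          // runs out of disk space, aborts the tx and throws...
                tx.Commit();
            }
            catch
            {
                // ...and the cleanup then reaches back into the aborted transaction.
                ReleaseIndexResources(tx);
                throw;
            }
        }   // the Dispose here may touch the transaction as well, compounding the problem
    }

    private Transaction BeginWriteTransaction() => new Transaction();
    private void WriteData(Transaction tx) { /* ... */ }
    private void ReleaseIndexResources(Transaction tx) { /* ... */ }
}
```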

Let’s assume that we have an error during the WriteData. For example, we run out of space on disk. We abort the transaction and throw an exception. The FlushIndex() method, however, will go into the catch clause and try to clean up its resources. When doing so, it will access the transaction (that was just aborted) and try to use it, still.

The problem is that at this point the state of the transaction is known to be bad; the only thing you are allowed to do is to dispose it, you cannot access anything on it. The code in question, however, is meant to be used on non-transactional resources and requires compensating actions to revert to the previous state.

For that matter, the code above has another problem. Do you see the using statement here? It seems like we have quite a bit of code that does stuff in the Dispose method. And if that stuff also touches the transaction, you may get what is called in the profession: “hilarity ensued”.

The problem was that our transaction code was too trusting and assumed that the calling code would not call it if it was in an invalid state. Indeed, the cases where we did such a thing were rare, but when they happened…

The second problem was when we had code like this:
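Roughly, the shape was this (again a reconstructed sketch, reusing the Transaction stand-in from above, not the real code):

```csharp
using System;
using System.Collections.Generic;

public abstract class Command
{
    public Exception Error;                     // reported back to the caller
    public abstract void Execute(Transaction tx);
}

public static class CommandProcessor
{
    public static void ExecuteCommands(IEnumerable<Command> commands, Transaction tx)
    {
        foreach (var command in commands)
        {
            try
            {
                command.Execute(tx);
            }
            catch (Exception e)
            {
                // Meant to capture *expected* errors so the caller can be told what
                // failed, but it also swallows unexpected ones, leaving things in a
                // funny state...
                command.Error = e;
            }
        }

        // ...and then we keep going and commit anyway.
        tx.Commit();
    }
}
```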

Except that the real code is a lot more complex and it isn’t as easy to see what is going on.

What you see here is that we run commands, and as we process them, we may get (expected) exceptions, which we should report to the caller. The problem is that we mixed expected exceptions with unexpected ones. And these unexpected exceptions could leave us in… a funny state. At which point we would continue executing future commands and even commit the transaction. As you can imagine, that isn’t a good place to be at.

We have gone over our error handling and injected faults and errors at every layer of the stack that we could conceive of. We have been able to fix a lot of issues, most of which have probably never been triggered, but it was a sobering moment to look at some of those changes. The most likely cause for the errors was a change that was made (by me, as it turns out) over two years ago. And in all that time, we never saw hide nor hair of it. Until suddenly we got several simultaneous cases.

The third and final problem, by the way, was related to not crashing. By default, any severe enough error should cause us to shut down the database and restart it. In the process, we re-apply the journal and ensure that we are in a consistent state. The problem is that we don’t want to do that for certain expected errors. And the issue with staying up was that while Voron (at the storage layer) handled the error properly, the higher level code did not. At that point we had a divergence of the in memory state vs. the on disk state. That usually led to either an NRE that would remain until the server was restarted or really scary messages that would typically go away when we recovered the database and reloaded the in memory state from the on disk state.

In short, handling errors in the error handling code is tough.

The primary changes we made were on the transaction side: we made it validate its own state when called, so code that erroneously tries to use a transaction that has already failed will error early. We have also added additional validation of operations at several key points. That will allow us to catch errors much more quickly and pinpoint exactly where things are breaking apart sooner.
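In terms of the Transaction stand-in from the sketches above, the fix amounts to a guard like this (again, illustrative only):

```csharp
using System;

public sealed class GuardedTransaction : IDisposable
{
    private bool _failed;

    public void MarkFailed() => _failed = true;

    public void Commit()
    {
        EnsureValidState();
        // ... actual commit work ...
    }

    public void Dispose()
    {
        // Disposing is always allowed, even for a failed transaction.
    }

    // Called at the start of every operation except Dispose, so code that
    // erroneously uses an already-failed transaction errors out early, at the call site.
    private void EnsureValidState()
    {
        if (_failed)
            throw new InvalidOperationException(
                "The transaction has already failed and can only be disposed.");
    }
}
```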

The current state is that we pushed these changes to our production system and are running the usual battery of tests to prove that the changes are valid. We’ll also be adding more faults into the process to ensure that we exercise our error handling a lot more often.

time to read 5 min | 944 words

We test RavenDB as thoroughly as we can. Beyond just tests, longevity runs, load tests and in general beating it with a stick and seeing if it bleats, we also routinely push early builds of RavenDB to our own production systems to see how they behave under real load.

This has been invaluable in catching several hard to detect bugs. A few days after we pushed a new build to production, we started getting errors about running out of disk space. The admin looked, and it seemed to have fixed itself. Then the issue repeated itself.

It took a bit of time to figure out what was going on. A recent change caused RavenDB to hang on to what are supposed to be transient files until an idle moment, where an idle moment is defined as 5 seconds without any writes. Our production systems don’t have an idle moment; here is a typical load on our production system:

[Image: request rate graph for the production system]

We range between lows in the high 30s of requests per second to peaks of 200+ requests per second.
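A rough sketch of the "cleanup on idle" behavior described above (illustrative only; the names are not the actual ones):

```csharp
using System;

public sealed class TransientFileCleaner
{
    private DateTime _lastWriteTime = DateTime.UtcNow;

    public void OnWrite() => _lastWriteTime = DateTime.UtcNow;

    public void OnIdleCheck()
    {
        // "Idle" means five seconds with no writes. At a constant 30-200 requests
        // per second this condition is simply never true, so the transient files
        // are never released and eventually the disk fills up.
        if (DateTime.UtcNow - _lastWriteTime >= TimeSpan.FromSeconds(5))
            CleanupTransientFiles();
    }

    private void CleanupTransientFiles() { /* delete files that are no longer needed */ }
}
```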

So we never have an idle moment to actually clean up those files, so they gather up. Eventually, we run out of disk space. At this point, several interesting things happen. Here is what our topology looks like:

[Image: the cluster topology]

We have three nodes in the cluster, and currently, node B is the leader of the cluster as a whole. Most of our operations relate to a single database, and for that database, we have the following topology:

[Image: the database group topology]

Note that in RavenDB, the cluster as a whole and a particular database may have different topologies. In particular, the primary node for the database and the leader of the cluster can be, and often are, different. In this case, Node A is the primary for the database and Node B is the leader of the cluster.

Under this scenario, we have Node A accepting all the writes for this database, and then replicating the data to the other nodes. Here is what this looks like for node C.

[Image: request load on Node C, which only handles replicated writes]

In other words, Node A is very busy handling requests and writes. Node C is just handling replicated writes. Just for reference, here is what Node A’s CPU looks like, the one that is busy:

[Image: CPU usage on Node A]

In other words, it is busy handling requests. But it isn’t actually busy. We have a lot of spare capacity at hand.

This makes sense; our benchmark scenarios start at tens of thousands of requests per second. A measly few hundred per second isn’t actually a meaningful load. The servers in question, by the way, are t2.large instances with 2 cores and 8GB of RAM.

So we have a bug in releasing files when we aren’t “idle”. The files just pile up and eventually we run out of disk space. Here is what this looks like in the studio:

[Image: the out of disk space error in the studio]

And clicking on details gives us:

[Image: the error details]

So this looks bad, but let’s see how this actually turns out to work in practice.

We have a failure on one node, which causes the database to be unloaded. RavenDB is meant to run in an environment where failure is likely and is built to handle that. Both clients and the cluster as a whole know that when such things happen, either to the whole node or to one particular database on it, we should respond accordingly.

The clients automatically fail over to a secondary node. In the case of the database in question, we can see that if Node A fails, the next in line would be Node C. The cluster will also detect that and re-arrange the responsibilities of the nodes to indicate that one of them failed.

The node on which the database has failed will attempt to recover from the error. At this point, the database is idle, in the sense that it doesn’t process any requests, and will be able to clean up all the files and delete them. In other words, the database goes down, restarts and recovers.

Because of the different write patterns on the different nodes, we’ll run into the out of disk space error at different times. We have been running in this mode for a few days now. The actual problem, by the way, has been identified and resolved. We just aren’t under any kind of pressure to push the fix to production.

Even under constant load, the only way we can detect this problem is through its secondary effects: the disk space on the machines, which is being monitored. None of our production systems have reported any failures and monitoring is great across the board. This is a great testament to the manner in which expecting failure and preparing for it at all levels of the stack really pays off.

My production database is routinely running out of space? No worries, I’ll fix that on the regular schedule, nothing is actually externally visible.

time to read 9 min | 1610 words

“If a tree falls in a forest and no one is around to hear it, does it make a sound?” is a well known philosophical question. The technological equivalent of this is this story. We got a report that RavenDB was failing in the field. But the details around the failure were critical.

The failure happened on the field, literally. This is a system that is running an industrial robot using a custom ARM board. The failure would only happen on the robot in the field; it would not reproduce in the user’s test environment or on our own systems. Initially, that was all the information that we had: “This particular robot works fine for a while, but as soon as there is a break, RavenDB dies and needs to be restarted”. That was the first time I ran into a system that would crash when it went idle, instead of dying under load, I have to say.

My recommendation that they just keep the robot busy at all times was shot down, but for a while, we were in the dark. It didn’t help that this was literally a custom ARM machine that we had no access to. We finally managed to figure out that the crash was some variant of SIGSEGV or SIGABRT. That was concerning. The ARM machine in question is running on 32 bits, and the worry was that our 32 bits code was somehow doing an out of bounds read. This was a crash in production, so we allocated a couple of people to investigate and try to figure out what was going on.

We started by doing a review of all our 32 bits memory management code and in parallel attempted to reproduce this issue on a Raspberry Pi (the nearest machine we had to what was actually going on). We got a lucky break when someone did manage to kill the RavenDB process in our own lab somehow. The exit code was 139 (Segmentation fault), but we weren’t sure what was actually going on. We were trying all sorts of stuff on the machine, seeing what would cause this. We basically fed it all sorts of data that we had lying around and saw if it would choke on that. One particular data export would sometimes cause a crash. Sometimes. I really really hate this word. That meant that we were stuck with trying to figure out something by repeatedly trying and relying on the law of averages.

It took several more days, but we figured out that a certain sequence of operations would reliably cause a crash within 5 – 30 minutes. As you can imagine, this made debugging pretty hard. The same sequence of operations on Intel machines, either 32 bits or 64 bits, worked without issue, regardless of how many times we repeated them.

We followed several false trails with our investigation into RavenDB’s memory management’s code in 32 bits. We had a few cases where we thought that we had something, but nothing popped up. We have instrumented the code and verified that everything seemed kosher, and it certainly did, but the system still crashed on occasion.

RavenDB usually relies on mmap() to access the data on disk, but on 32 bits, we couldn’t do that. With an addressable memory of just 2 GB, we cannot map the whole file to memory if it is too large. Because of that, we map portions of the file to memory as needed for each transaction. That led us to suspect that we were somehow unmapping memory while it was still in use or something like that. But we have gone through the code with a fine tooth comb and got nothing. We used strace to try to help point out what is going on and we could see that there were no surprise calls to unmap() that shouldn’t be there.
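In .NET terms, the 32 bits approach looks roughly like this (a simplified sketch, not the actual Voron code; the file name and sizes are only for illustration):

```csharp
using System.IO.MemoryMappedFiles;

// Instead of mapping the whole data file (impossible with ~2GB of address space),
// map only the range that the current transaction needs and release it afterwards.
using var dataFile = MemoryMappedFile.CreateFromFile("Raven.voron");
using var view = dataFile.CreateViewAccessor(offset: 0, size: 64 * 1024);
// ... read / write pages through the view for the duration of the transaction ...
```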

What was really nasty was the fact that when we failed with SIGSEGV, the error was always on an address just past the area of memory that we mapped. This led us to suspect that we had an out of bounds write, and led to a chase for that rogue pointer operation. We instrumented our code ever more heavily, but weren’t able to find any such operation. All our reads and writes were in bounds, and that was incredibly frustrating. RavenDB is a CoreCLR application. As such, debugging it on an ARM device is… challenging. We tried lldb and gdb. Both allow unmanaged debugging, but even with lldb, we couldn’t debug managed code or even just pull the managed stack properly from ARM. Eventually we found this extension which allows SSH debugging on the Raspberry Pi from a Windows machine.

That helped, and we finally figured out where in our managed code the error happened. This always happened during a copy of memory from a document write to a scratch buffer in a memory mapped file. The entire thing was wrapped in boundary checks and everything was good.

We went back to the drawing board and attempted to set it on fire, because it was no good for us. Once we put the fire out, we looked at what remained and had a Eureka! moment. One of the differences between ARM and x86/x64 machines is in how they treat alignment. On x64/x86, alignment is pretty much a non-issue for most operations. On ARM, however, an unaligned operation will cause a CPU fault. That led us to suspect that the SIGABRT error we got was indeed an alignment issue. Most of our code is already aligned in memory because while it isn’t mandatory on x64/x86, it can still get better perf in certain cases, but it is certainly possible that we missed it.

We discovered a horrifying problem:

[Image: the offending CopyBlock call]

We were using the CopyBlock method, and obviously that was the issue, right? We wrote a small test program that simulated what we were doing and used unaligned CopyBlock and it just worked. But maybe our situation is different?
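For reference, the two APIs in question, wrapped in a trivial helper (this requires compiling with unsafe code enabled; the wrapper itself is just for illustration):

```csharp
using System.Runtime.CompilerServices;

public static class Copier
{
    // Assumes the source and destination pointers are suitably aligned (cpblk semantics).
    public static unsafe void Copy(byte* dest, byte* src, uint len) =>
        Unsafe.CopyBlock(dest, src, len);

    // Makes no alignment assumptions, at a potential cost in performance.
    public static unsafe void CopyNoAlignmentAssumption(byte* dest, byte* src, uint len) =>
        Unsafe.CopyBlockUnaligned(dest, src, len);
}
```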

Using CopyBlockUnaligned on x86 led to a 40% performance drop (we call this method a lot), and initially it looked like it fixed the problem on ARM. Except that on the third or fourth attempt to reproduce the problem, we ran into our good old SIGSEGV again, so that wasn’t it. This time we went to the drawing board and broke it.

During this time, we managed to capture the error inside the debugger several times; here is what it looked like:

[Image: the crash captured in lldb, showing the ARM assembly]

Reading ARM assembly is not something that I’m used to doing, so I looked at the manual, and it looks like this instruction is meant to store multiple registers in descending order and… no clue beyond that. It didn’t make any sort of sense to us.

At this point, we were several weeks and four or five people into this investigation (we consider such issues serious). We had instrumented our code to the point where it barely ran, we could manage to reproduce the error in a relatively short time and we were fairly convinced that we were doing things properly. Going over the kernel code for memory mapping and unmapping several times, stracing, debugging, everything. We were stumped. But we also had enough data at this point to be able to paint a fairly clear picture of what was going on. So we opened an issue for the CoreCLR about this, suspecting that the issue was in the implementation of this CopyBlockUnaligned.

We got a strange response, though: “This assembly code doesn’t make any sense”. I did mention that I have no idea about ARM assembly, right?  We tried reproducing the same thing in gdb, instead of lldb and got the following assembly code:

[Image: the same crash captured in gdb, with more readable assembly]

This looked a lot more readable, to be sure. And it was extremely suspicious. Let me explain why:

[Image: the faulting instruction]

The faulting instruction is: ldr r3, [r0, #0]

What this says is basically, read a word from the address pointed to by r0 (with 0 offset) into r3.

Now, r0 in this case has this value: 0x523b3ffd. Note the last three characters, ffd.

We are running this on a 32 bits machine, so a word is 4 bytes in size. 0xFFD + 4 = 0x1001, which is past the 0x1000 page boundary.

In other words, we had a read beyond the current page boundary. In most cases, the next page is mapped, so everything goes smoothly. In some cases, the next page is not mapped, so you are going to get an access violation trying to read a byte from the next page.
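The same arithmetic as a check (assuming 4KB pages, which is what the addresses above imply):

```csharp
public static class PageBoundary
{
    // A read crosses into the next page if its first and last bytes land on different pages.
    public static bool ReadCrossesPage(ulong address, ulong readSize, ulong pageSize = 0x1000)
        => (address / pageSize) != ((address + readSize - 1) / pageSize);
}

// PageBoundary.ReadCrossesPage(0x523b3ffd, 4) == true: the 4-byte read at ...ffd touches
// the next page, which may not be mapped, and when it isn't, that is our SIGSEGV.
```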

The fix for this is here:

[Image: the one character fix in the CoreCLR]

This is literally a single character change. And probably the worst lines-of-code to time-invested ratio that I have ever seen for any bug. Actually, there wasn’t even any code change in RavenDB’s codebase, so that is 0 lines of code / 4 people x 4 weeks.

The good thing is that at least we have proven that the 32 bits memory code is rock solid, and we have a much better understanding of how to resolve the next issue.

time to read 6 min | 1154 words

I ran across this Twitter thread and I’m in… awe, I want to say, but in a really horrifying manner.

The thread started with DHH calling out this job posting:

[Image: the job posting, with the 60 hours per week requirement highlighted]

I’ve marked the important piece. You can read the twitter thread for the gory details. Based on the job posting and the interaction of the CEO on twitter, I’m going to assume that this job pays six to nine times more than the average developer can make, because otherwise I can’t really figure out why anyone would work in such a place.

I have had two (or three) separate burnouts before I was thirty. I was single, no kids, and I worked crazy hours. I was productive and I burned out. Mid-2009 was a fun time for me: I was physically nauseous every time I sat in front of a computer and seriously considered a career shift to construction. A large number of life decisions were made as a result of that.

And while I admit that being young and discovering things is its own reward, this isn’t really something new. Workplace productivity is not an unexplored subject and I would expect anyone in either an HR or management position (and certainly a CEO in a company that is busy hiring people) to be at least peripherally aware of it. You might sometimes need to work crazy hours; I fully understand crunch time and “OMG, production is DOWN”. But if you do that, you need to understand that this is done with eyes open. Such scenarios should be rare and you should be aware that any temporary increase in productivity will need to be paid off by reduced productivity down the line.

Again, this isn’t new. You can go back to Ford in the early years of the last century to read more about how to increase productivity. So asking for 60 hours per week on an ongoing basis is pretty… crazy.

Let’s assume that this is a five days a week position, shall we? Based on the job posting and the CEO’s behavior on Twitter, I assume that it isn’t the case, but humor me.

This means, Monday to Friday, you start working at 9 AM and finish at 9 PM. If you have kids, this means never reading them a bedtime story, not being able to see them perform, never coming to a PTA meeting. If you have a spouse, it means a relationship that is mostly around texts, because you ain’t going to see them. If you don’t have a significant other, good luck finding one with the time allotted. But 12 hour days are probably not going to cut it. So let’s say that you work only 10 hours a day, but we’ll include Sunday as well, to make up for the “lost time”. You still log in at 9 AM, but now you get to leave at 7 PM.

Oh, and you better not be sick, or have to drive your mom to the airport, or need to visit the DMV. Nobody got any time for that. On a more serious note: this kind of environment will make you sick. It can take a lot of time to recover from that, both mentally and physically.

I can’t imagine anyone who would be signing up for something like that. Actually, I can, quite easily. There are plenty of professions where this is normal. To pick examples off the top of my head, lawyers, nurses and doctors all work crazy hours, or at least, that is the impression I have. Let’s check, shall we? The following results are pretty much the first result of googling the question.

  • Lawyers, for example, can expect to work 60+ hours a week on average.
  • Nurses, on the other hand, are considered full time if they work about 36 – 40 hours a week.
  • Doctors, vary wildly, with 40 - 80 hours, depending on the type of specialty and where they are in their career.

The key here, for both lawyers and doctors, is that typically after a pretty harsh initial period, you can expect to gain a lot for your work. In other words, there is a high probability that there is going to be a good return on this investment.  For this job posting, again based solely on the text and the CEO’s behavior, I’m assuming there is no such upside.

Whenever we sent a job acceptance letter, I used to put in a statement about Hibernating Rhinos being a place where you didn’t have to leave the office at 9 PM. Since then we grew a bit and got a few people who are night owls, so they come to the office later. That somewhat spoils this statement, but I can still state that no one works crazy hours.

By the way, that is not because some of the devs haven’t tried. I’m very familiar with the excitement of being almost there. At the cusp of figuring out this bug or completing that feature. It sometimes makes sense to keep going and complete just this one task and get there. And this is fine, if you pace yourself. But with some people, I had to tell them that if they kept staying in the office so much, I was going to start charging them rent. That seemed to do the trick.

When I founded Hibernating Rhinos, I wanted to create a place that I would like, which would give me interesting things to work on. In the decade that Hibernating Rhinos has been around, I don’t believe that we ever had a crunch time that didn’t directly relate to a customer problem (and these tend to be rare). Whenever I had to make the choice, slipping the release date has always been the better option in my eyes, rather than sacrificing quality or keeping people at their desk to try to get more done.

This includes the last three years, in which our team basically rebuilt a distributed database from the ground up and got a minimum of 10x performance improvement across the board. This was without anyone being expected to put in sunrise to sundown shifts. Given that I think that this is about creating a marketing/sales platform, I’m… not impressed by that. In fact, I’m pretty sure that if they start out planning to squeeze their own people dry from the get go, they have the same level of cluelessness in other aspects of the business. On the other hand, assuming that they actually manage to get a viable product out, someone is going to have a field day threading through all the holes that tired, listless and demotivated developers have left in the system.

I think that I can summarize all of this post in a single word: BEWARE!

time to read 4 min | 660 words

When we start building a feature, we often have a pretty good idea of what we want to have and how to get there. And then we actually start building it and we often end up with something that is quite different (and usually much better). It has gotten to the point where we aren’t even trying to do hard specs and detailed design at anything beyond the exploratory levels. For example, in the design of RavenDB 4.0, there was not even a mention of RQL. That ended up being a very late addition to the codebase, but it improved RavenDB significantly. On the other hand, the low level mechanisms of zero copy documents from Voron all the way to the network were designed up front, but only at a fairly high level.

In this post, I want to talk about query parameters in RavenDB. Actually, let me be more specific: we have query parameters, but what we don’t have (or rather, didn’t have, because that will be merged in by the time you read this post) is the ability to run parameterized queries from the studio. We always meant to have that capability, but we ran out of time with the 4.0 release. As we are gearing up to the 4.1 release, we are clearing the major-minor issues off the table. (Major in terms of impact, minor in terms of the amount of work required.) The query parameters in the studio are one such example. Here is what this looks like:

[Image: a query using parameters in the studio]

My first thought was to just build something like this:

[Image: a grid for manually defining query arguments]

Give the user the ability to define arguments and be done with it. The task was assigned to one of our developers and I expected to get a PR in a short while.

This particular developer has a tendency to consider not just the task at hand but also other aspects of the problem. He didn’t want the user to have to manually specify each argument, since that has poor ergonomics. Instead, he wanted the studio to figure it out on its own and help the user. So the first thing he did was detect the arguments (regex: “\$\w+”) and present them in the grid. Then there was the issue of how to deal with edits, etc. Then he ran into another problem: types. Query parameters can be more than just strings, they can be any JSON data type.

Here is what he came up with:

[Image: query parameters defined inline, right above the query text]

Instead of having to define the query parameters in a separate location, just put them right in. Having the parameters grid involves pointing and clicking with the mouse, entering possibly complex values (such as long arrays) and in general much more work than just having them right above the query.

Note that this is a studio-only feature; queries from the client API already have ways to specify arguments properly. So the next question is how we are going to handle passing the arguments to the server. Remember, this is only in the studio, so we can take quite a few shortcuts. In this case, we’ll simply snip the entire first section of the query text (which contains the query parameters). We can do that by going from the start of the query to the first from or declare keyword. We do some basic pre-processing to turn “$name = …“ into “results.$name = …“ and then just execute this code in the browser, giving us a JS object with all the parameters, which we can then send to the server.
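Roughly, the preprocessing looks like this, sketched here in C# for readability. The real code runs as JavaScript in the browser and the names here are made up:

```csharp
using System.Text.RegularExpressions;

public static class QueryParameterPreprocessor
{
    // Everything before the first "from" or "declare" keyword is the parameter section.
    public static string ExtractParameterSection(string queryText)
    {
        var match = Regex.Match(queryText, @"\b(from|declare)\b", RegexOptions.IgnoreCase);
        return match.Success ? queryText.Substring(0, match.Index) : string.Empty;
    }

    // Turn "$name = ..." into "results.$name = ..." so that evaluating the section
    // (in the browser) fills a results object with all the parameter values.
    public static string ToParameterAssignments(string parameterSection)
    {
        return Regex.Replace(parameterSection, @"\$\w+(?=\s*=)", m => "results." + m.Value);
    }
}
```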

The next stage is to make this discoverable, by detecting parameters whose value is not provided and giving the user a quick fix to add them.
