Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

If you stay in the office any longer, I’ll start charging you rent

time to read 6 min | 1154 words

I ran across this Twitter thread and I’m in… awe, I want to say, but in a really horrifying manner.

The thread started with DHH calling out this job posting:

image

I’ve marked the important piece. You can read the Twitter thread for the gory details. Based on the job posting and the interaction of the CEO on Twitter, I’m going to assume that this job pays six to nine times more than the average developer can make, because otherwise I can’t really figure out why anyone would work in such a place.

I have had two (or three) separate burnouts before I was thirty. I was single, no kids, and I worked crazy hours. I was productive and I burned out. Mid 2009 was a fun time for me, I was physically nauseous every time I sat in front of a computer  and seriously considered a career shift to construction. A large number of life decisions were made as a result of that.

And while I admit that being young and discovering things is its own reward, this isn’t really something new. Workplace productivity is not an unexplored subject and I would expect anyone in either an HR or a management position (and certainly a CEO in a company that is busy hiring people) to be at least peripherally aware of it. You might sometimes need to work crazy hours, I fully understand crunch time and “OMG, production is DOWN”. But if you do that, you need to understand that this is done with eyes open. Such scenarios should be rare and you should be aware that any temporary increase in productivity will need to be paid off by reduced productivity down the line.

Again, this isn’t new. You can go to Ford in the early years of the last century to read more about how to increase productivity. So asking for 60 hours per week on an ongoing basis is pretty… crazy.

Let’s assume that this is a five days a week position, shall we? Based on the job posting and the CEO’s behavior on Twitter, I assume that it isn’t the case, but humor me.

This means, Monday to Friday, you start working at 9 AM and finish at 9 PM. If you have kids, this means never reading them a bed time story, not being able to see them perform, never coming to a PTA meeting. If you have a spouse, it means a relationship that is mostly around texts, because you ain’t going to see them. If you don’t have a significant other, good luck finding one with the time allotted. But 12 hour days are probably not going to cut it. So let’s say that you work only 10 hours a day, but we’ll include Sunday as well, to make up for the “lost time”. You still log in at 9 AM, but now you get to leave at 7 PM.

Oh, and you better not be sick, or have to drive your mom to the airport or need to visit the DMV. Nobody got any time for that. On a more serious note, this kind of environment will make you sick. It can take a lot of time to recover from that, both mentally and physically.

I can’t imagine anyone who would be signing up for something like that. Actually, I can, quite easily. There are plenty of professions where this is normal. To pick examples off the top of my head, lawyers, nurses and doctors all work crazy hours, or at least, that is the impression I have. Let’s check, shall we? The following results are pretty much the first result of googling the question.

  • Lawyers, for example, can expect to work 60+ hours a week on average.
  • Nurses, on the other hand, are considered full time if they work about 36 – 40 hours a week.
  • Doctors vary wildly, with 40 - 80 hours, depending on the type of specialty and where they are in their career.

The key here, for both lawyers and doctors, is that typically after a pretty harsh initial period, you can expect to gain a lot for your work. In other words, there is a high probability that there is going to be a good return on this investment.  For this job posting, again based solely on the text and the CEO’s behavior, I’m assuming there is no such upside.

Whenever we sent a job acceptance letter, I used to put in a statement about Hibernating Rhinos being a place where you didn’t have to leave the office at 9 PM. Since then we grew a bit and got a few people who are night owls, so they come later to the office. That somewhat spoils this statement, but I can still state that no one works crazy hours.

By the way, that is not because some of the devs haven’t tried. I’m very familiar with the excitement of being almost there. At the cusp of figuring out this bug or completing that feature. It sometimes makes sense to keep going, complete just this one task and get there. And this is fine, if you pace yourself. But with some people, I had to tell them that if they keep staying so much in the office, I’m going to start charging them rent. That seemed to do the trick.

When I founded Hibernating Rhinos, I wanted to create a place that I would like and that would give me interesting things to work on. In the decade that Hibernating Rhinos has been around, I don’t believe that we ever had a crunch time that didn’t directly relate to a customer problem (and these tend to be rare). Whenever I had to make the choice, slipping the release date has always been the better option in my eyes rather than sacrificing quality or keeping people at their desk to try to get more done.

This includes the last three years in which our team basically rebuilt a distributed database from the ground up and got a minimum of 10x performance improvement across the board. This was without anyone expected to put in sunrise to sundown shifts. Given that I think that the job posting is about creating a marketing/sales platform, I’m… not impressed by that. In fact, I’m pretty sure that if they start out with planning to squeeze their own people dry from the get go they have the same level of cluelessness in other aspects of the business. On the other hand, assuming that they actually manage to get a viable product out, someone is going to have a field day threading through all the holes that tired, listless and demotivated developers have left in the system.

I think that I can summarize all of this post in a single word: BEWARE!

The iterative design process: Query parameters example

time to read 4 min | 660 words

When we start building a feature, we often have a pretty good idea of what we want to have and how to get there. And then we actually start building it and we often end up with something that is quite different (and usually much better). It has gotten to the point where we aren’t even trying to do hard specs and detailed design at anything beyond the exploratory levels. For example, in the design of RavenDB 4.0, there was not even a mention of RQL. That ended up being a very late addition to the codebase, but it improved RavenDB significantly. On the other hand, the low level mechanisms of zero copy documents from Voron all the way to the network were designed up front, but only at a fairly high level.

In this post, I want to talk about query parameters in RavenDB. Actually, let me be more specific, we have query parameters, but what we don’t have (or rather, didn’t have, because that will be merged in by the time you read this post) is the ability to run parameterized queries from the studio. We always meant to have that capability, but we ran out of time with the 4.0 release. As we are gearing up to the 4.1 release, we are cleaning the table of the major-minor issues. (Major in terms of impact, minor in terms of amount of work required). The query parameters in the studio is one such example. Here is what this looks like:

image

My first thought was to just build something like this:

image

Give the user the ability to define arguments and be done with it. The task was assigned to one of our developers and I expected to get a PR in a short while.

This particular developer has a tendency to consider not just the task at hand but also other aspects of the problem. He didn’t want the user to have to manually specify each argument, since that has poor ergonomics. Instead, he wanted the studio to figure it out on its own and help the user. So the first thing he did was detect the arguments (regex: “\$\w+”) and present them in the grid. Then there was the issue of how to deal with edits, etc. Then he ran into another problem, types. Query parameters can be more than just strings, they can be any JSON data type.
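To make the detection step concrete, here is a minimal sketch in TypeScript of pulling the parameter names out of the query text with that regex. The detectParameters name and the de-duplication are my assumptions for illustration, not the studio's actual code.

```typescript
// Hypothetical helper (not the studio's code): list the distinct
// $parameters referenced in a query, in order of first appearance.
function detectParameters(queryText: string): string[] {
    // Same pattern as mentioned in the text: a "$" followed by word characters.
    const matches: string[] = queryText.match(/\$\w+/g) || [];
    const unique: string[] = [];
    for (const name of matches) {
        if (unique.indexOf(name) === -1) {
            unique.push(name);
        }
    }
    return unique;
}

const detected = detectParameters(
    "from Users where Name = $name and Age > $age and Alias = $name"
);
console.log(detected);
```

Note that a plain regex will also pick up a "$word" that appears inside a string literal in the query, which is probably acceptable for a suggestion grid but worth keeping in mind.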

Here is what he came up with:

image

Instead of having to define the query parameters in a separate location, just put them right in. Having the parameters grid involves pointing and clicking with the mouse, entering possibly complex values (such as long arrays) and in general much more work than just having them right above the query.

Note that this is a studio only feature, queries from the client API already have ways to specify arguments properly. So the next question is how we are going to handle passing the arguments to the server. Remember, this is only on the studio, so we can take quite a few shortcuts. In this case, we’ll simply snip the entire first section of the query text (which contains the query parameters). We can do that by going from the start of the query to the first from or declare keywords. We do a basic pre-processing to turn “$name = …” into “results.$name = …” and then just execute this code in the browser, giving us a JS object with all the parameters that we can then send to the server.
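The steps above can be sketched roughly as follows. This is a simplified illustration under my own assumptions (the extractParameters and ParsedQuery names, and the exact regexes, are hypothetical), not the code that ships in the studio. In particular, it naively treats the first from or declare it sees as the start of the query, even if that word happens to appear inside a parameter value.

```typescript
// Hypothetical sketch: names and structure are assumptions, not the studio's code.
interface ParsedQuery {
    parameters: { [name: string]: unknown };
    query: string;
}

function extractParameters(queryText: string): ParsedQuery {
    // The query proper starts at the first `from` or `declare` keyword.
    const start = queryText.search(/\b(from|declare)\b/i);
    if (start <= 0) {
        // No keyword found, or nothing precedes it: no parameters section.
        return { parameters: {}, query: queryText };
    }
    const header = queryText.substring(0, start);
    const query = queryText.substring(start);

    // Rewrite "$name = ..." at the start of a line into "results.$name = ...".
    const script = header.replace(
        /^(\s*)(\$\w+)\s*=/gm,
        (_match: string, indent: string, name: string) =>
            indent + "results." + name + " ="
    );

    // Evaluate the rewritten section as plain JavaScript to collect the values.
    const results: { [name: string]: unknown } = {};
    new Function("results", script)(results);
    return { parameters: results, query };
}

const parsed = extractParameters(
    '$name = "Oren"\n$ids = [1, 2, 3]\nfrom Users where Name = $name'
);
console.log(parsed.parameters, parsed.query);
```

Because the parameters section is evaluated as JavaScript, the values can be any JSON data type (numbers, arrays, nested objects) without the studio needing a type picker. Running user typed text through new Function is only reasonable here because the code runs in the user's own browser.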

The next stage is to make this discoverable, by detecting parameters whose value is not provided and giving the user a quick fix to add them.

Product Release Postmortem: Things You Should Never Do, Part II

time to read 18 min | 3409 words

This post is the text version of a presentation I gave a few weeks ago. The title is a reference to this classic post by Joel.

In 2015, I decided that we needed to reboot RavenDB. I did that with the full understanding that this is going to be a huge task, including knowing that it will be bigger than what I can project, even if I take this line of thinking into account.

RavenDB 1.0 was written a decade ago. It was written because it didn’t leave me alone and I wanted to get it out of my head. At the time, I was focused more on getting it out the door (and my head) and was taking shortcuts in the implementation. That allowed me to cut down dramatically on the amount of work that is involved in it. At the same time, this put some constraints on the implementation and architecture. The most obvious one was the reliance on Esent, which tied us to Windows. C# as the implementation language, to a lesser extent, also had the same issue until .NET Core. (Yes, I’m aware of Mono, I have no idea how people managed to run anything beyond hello world on it. We tried porting RavenDB to Mono multiple times, and I still bear the scars.)

I went back and looked at our release notes, in literally every major release, we have spent a significant amount of time and effort on “performance optimizations”. In January of 2015 we had a few sprints that were dedicated to just this issue. We went down to assembly code in some cases, analyzed our hotspots and optimized things in a very serious manner. We got some amazing performance improvements, reducing the runtime by orders of magnitude in some cases. But it still felt like we were hitting a limit. What is more, experience from customers in production showed us that there were a number of cases where we ran into problematic behavior. This mostly happened on large / complex projects. And nearly all those issues were related in one way or another to memory and the GC.

Our indexing, for example, would be reading data from disk into memory. That was meant to save disk I/O during indexing, and included pretty smart prefetching and monitoring behavior. It also had the side effect of loading documents (which can be large) into managed memory and holding on to them long enough to push them into Gen1 and Gen2. Then they would be indexed and need to go away. But given that they were pushed to a higher generation… that meant a more expensive collection cycle.

RavenDB was created before the pervasive use of fast disks, and it turns out that in some cases, reading the data from disk was actually faster than parsing it using JSON.Net. In other words, our “I/O bound” process of reading documents was actually dominated by the time it took to parse the JSON text. That does not include the costs of actually cleaning up this memory. Complex JSON documents can have a lot of objects, and the cost of GC rises with the number of objects that are being tracked. These were pretty fundamental problems, which I didn’t think we could fix in a piecemeal fashion.

That time also coincided with a peak in the number of support incidents that we got. Unlike many other open source projects, we treat support as a cost center, not a revenue center. In other words, we don’t want to have more support, that isn’t how we want to make money. Being a database, we were frequently at the heart of things and our customers and users are very sensitive to any issue that might arise. I’m painting somewhat of a bleak picture, I’m aware. It wasn’t nearly that bad from the point of view of any particular customer. But on aggregate, from our point of view, it felt like a nasty game of whack a mole. As soon as we provided a solution to one customer’s issue, another would pop up, somewhat related but just different enough to not be fixed by the previous change. These weren’t regressions, mind. These were just a lot of places where the changing times violated some of our core assumptions.

Toward the end of 2015, I sat down and really thought about what we needed and were missing. This was the situation as I saw it.

image

There was also the issue that we have learned a lot over the years. We built Voron (our storage engine) from the ground up, we had a lot of experience running in production and we knew what kind of tasks our customers were using us for. I kept thinking that I wished I had a time machine and could do things over properly. Given that my time machine is still in the shop, I decided that we had two options:

  1. Minor fixes along the way – slowly improving our behavior as we stride toward the desired architecture and usage.
  2. Break it all – essentially start from scratch, with a new architecture and write it the way we want it to be written.

The obvious choice was to do this slowly. The problem was that I really couldn’t think of a good way to actually achieve that. The kind of changes we wanted to make started from replacing the most fundamental structure we had, how we represent JSON in our document database and got more complex from there. We wanted to change how we store data on disk, how we index data, how we … literally every single feature that we had was going to be transformed in some way.

We also had additional issues. The Windows only limitation was really hurting us and we really wanted to get a good Linux story going. The support burden was also at the very top of my mind as we considered what to do. In the end, we came up with the following decisions:

  • We don’t require backward compatibility. Either on the server side or client side.
    • That was the hardest decision, but it meant that we could actually tackle some of the biggest issues freely and without constraint.
    • That meant that we wanted to keep the same feeling, but be able to make changes to corners of the API that atrophied.
  • Support cost and simplified operations as a primary concern.
    • This meant that, at the design level, we took into account debugging considerations.
  • Order of magnitude performance improvement across the board.
    • Otherwise, it isn’t worth the effort.
  • Cross platform from the get go.

That was in Sep 2015. I sat down and wrote a design document that outlined the new architectural approach, spiked a few things and then we were off to the races. I blogged all about the process extensively, so I’m not going to repeat that.

We decided to use DNX (which became .NET Core) at a very early stage. Initially, I don’t believe that we even had a debugger, and most of our builds had to be triggered from the command line. I guess that if you are going to make a risky decision, you might as well make a few others…

I’ll say that I made a lot of preparation to fail up front. Part of the reason we went with DNX was that we knew that worst case scenario, we could spend a few days and get it working on the full .NET framework if we had to. I took this step with a lot of backward glances to make sure that we won’t get lost.

Alongside our experience in supporting RavenDB, we also ran a UX study and combed all the incident reports we generate from support calls. The idea was to take as much time as necessary to get things as right as we could. The studio change between 3.5 and 4.0 is massive, and was driven by getting a talented professional to design each part of the UI, guided by real world UX study and analysis. We kept asking “where does it hurt?” and whenever we found a cause of pain we worked to alleviate it.

Some of our guiding principles during that phase of the project were:

  • Cross platform from the get go.
    • We couldn’t afford to port it midway through. Too complex and prone to failure.
  • OWN the stack.
    • We don’t want to use any components that we don’t have good visibility into and the ability to work with.
    • In particular, anything that is a core competence should be owned and built by us. For our scenario, that means primarily the storage engine.
  • Build for performance.
    • I wasn’t kidding about requiring x10 performance improvement. We had one or two devs at all times running benchmarks and fixing the performance of every completed feature.
  • Build for operations.
    • Each and every design decision should be considered in light of its operational behavior.
    • In particular, we excised any feature that relied on hard to figure out technology or integration (I’m looking at you, Windows Auth).
    • This included changing the design of the software so a core dump would make it easier to figure out what is going on. We also explicitly opened up a lot of the internal behavior as debug endpoints and plugged them into the studio so operators will have greater visibility. As an aside, that was very helpful in figuring out our performance bottlenecks and we worked to improve that part of the project as we strived for ever faster performance.
  • Reducing the support burden as a major goal.
    • A lot of the previous points tie into this. But this is also where we combed over any issue that had a “user misconfigured / misused” cause and built alerts directly into RavenDB to give the user early warnings about common issues.
  • We defined a set of common scenarios. Reading / writing documents, for example, and then we spent months on designing the whole system so it will work to make these fast, seamless and easy.

A good example of that is how we store documents in RavenDB now. We have our own binary format that allows us to avoid parsing the document when reading from disk, plays nicely with memory mapped files (which is how Voron, our storage engine, works) and can effectively allow us to hand a pointer to a memory mapped buffer and start working with that as a JSON document without:

  • Allocating any managed memory
  • Parsing JSON
  • Requiring caching / pre-fetching, etc.
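As a toy illustration of why such a format helps, here is a sketch in TypeScript. To be clear, the layout below is made up for this example and is not RavenDB's actual binary format: a field count, then a table of offsets, then the values themselves. The point is only that reading one field means following an offset into the buffer, with no parsing of the rest of the document. The buildDocument and readField names are hypothetical.

```typescript
// Toy layout (an assumption for illustration, NOT RavenDB's real format):
//   bytes 0..3        : uint32 field count (little endian)
//   next count*4 bytes: uint32 offset of each field's value
//   rest              : float64 values, 8 bytes each
function buildDocument(values: number[]): ArrayBuffer {
    const headerSize = 4 + values.length * 4;
    const buffer = new ArrayBuffer(headerSize + values.length * 8);
    const view = new DataView(buffer);
    view.setUint32(0, values.length, true);
    values.forEach((value, i) => {
        const offset = headerSize + i * 8;
        view.setUint32(4 + i * 4, offset, true); // offset table entry
        view.setFloat64(offset, value, true);    // the value itself
    });
    return buffer;
}

// Read a single field by following its offset; other fields are never touched.
function readField(buffer: ArrayBuffer, index: number): number {
    const view = new DataView(buffer);
    const count = view.getUint32(0, true);
    if (index < 0 || index >= count) {
        throw new RangeError("no such field: " + index);
    }
    const offset = view.getUint32(4 + index * 4, true);
    return view.getFloat64(offset, true);
}

const doc = buildDocument([1.5, -2, 42]);
console.log(readField(doc, 2));
```

With a text format, getting at the third field would require scanning and parsing everything before it. Here the cost of a read does not depend on the size of the document, and the buffer could just as well be a view over a memory mapped file.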

We spent a lot of time thinking about what we want to do, and then we looked into how the operating system expects us to behave. The idea is that if we play to the operating system’s expectation, we can reap a lot of benefits from the OS’ own behavior. This is how RavenDB handles loading data to memory. We let the operating system handle it and just make sure that our own behavior is both predictable and applicable to the OS’ optimizations.

I mentioned that GC was the bane of our existence, right? We moved a lot of the memory management in RavenDB to unmanaged code and handle that explicitly. That gave us the advantage that we know a lot more about how we should expect to use the memory and can spend the time to make this highly optimized.

On the debugger side of things, we made some changes to the design of RavenDB with the intent to make it easier to debug and analyze core dumps. For example, most of the long running threads are named, so it is easy to figure out who they belong to (and not just what they are currently doing). For that matter, long running tasks are using synchronous mode, specifically because it means that we can drop in the debugger / core dump and look at their state. This is much harder to do with async methods. You might have noticed that I mentioned core dumps a few times, right? These are essential to figuring out what is going on with your software on production systems. We learned a lot about production debugging over the years and with RavenDB 4.0 we took steps to make things easier. For example, many data structures in RavenDB have an extra field called tag that is there specifically to provide debugging information about the value if we are looking at the value in the debugger.

An obvious question for this project was whether we should stay on .NET or move to an unmanaged language. I considered this seriously, with Rust, C/C++ and Go being the top contenders as the implementation language. I decided to stay with .NET for several reasons. Productivity was right there at the top. We already had a team that was well versed in .NET, and while that isn’t a blocker, it was a consideration. The tooling around .NET is leagues ahead of anything else that I have seen. That includes both write time (where Rider / ReSharper rules) and debug time (I found nothing remotely close to Visual Studio for debugging non trivial code easily). The cross platform angle, which was the most serious issue for us, was resolved with .NET Core.

Rust wasn’t mature enough at the time (2015) and even today I think that a language that prides itself in being hard to learn isn’t a good choice. C++ was a strong contender, but the slow compilation times were an issue. The tooling is similar, but inferior in many respects. Cross platform C++ is possible, and modern C++ is very different from what I remember. However, it comes with a very high degree of complexity and would take a lot of time to master again properly. C (distinct from C++) is a much simpler language. It still has the compilation speed issues, but the language is much simpler. I think that if it had a defer mechanism builtin it would be a much nicer language. Go was ruled out because if I’m going to be writing everything from scratch, I might as well go all the way down to C’s level and not stop with something that still has GC pauses.

The choice of C# and CoreCLR has been vindicated. The project team and the community at large put a large emphasis on performance and we keep getting more and better ways to handle low level details while still being able to use higher level concepts when needed. And the tooling… dear God. I routinely work with other platforms, testing things out, but there is nothing that comes close to the toolsets that are available for C#.

An interesting wrinkle with the 4.0 release was that we started it before the 3.5 release was even out. For a while, we had a small team working on the foundations of 4.0 while the rest of us were busy hammering in the last details of RavenDB 3.5.

As soon as we had the bare minimum to go (basically, it compiled and could save a single document and even regurgitate it back up again) we started heavy parallelization of the work. We had a team working on indexes while another was dealing with (even at this early stage) performance and another working on the user interface. In that time frame, we hired a few more people and could really see the benefits of all of the separate teams working in concert. One of the priorities of this method was to get to a demoable state. In fact, at some point we had over 30% of the people working on either the UI directly or UI related infrastructure.

One of the things we kept hearing back is that the UI and the insight it provided into what is going on inside the database were crucial for our users. It also helped us a lot as we developed RavenDB to get to play with it directly and see things in an easy manner. The UI has been at once one of the most trivial of changes and the most profound. On the one hand, we didn’t really make any significant architectural changes in the UI. On the other hand, we re-wrote most of it with the aid of a UX study and a real professional at the helm. That gave us a lot of visible polish that underscores the amount of work that happened in the engine.

How did all of that turn out?

  • I initially thought it would last a year to 15 months, with an expected due date of Dec 2016. That was with a team size of about 25 people. Work started in Sep 2015.
  • As it turns out, RavenDB accumulated a lot of features in the years it spent in production. We had to evaluate each of them, see how it would fit into our architecture and get it ported. That took a lot of time. Especially because in many cases we took the time to change the approach we had for the feature completely.
  • By Mid 2016 I already changed the schedule to Jun 2017.
  • Close to the end of 2016 we released RavenDB 3.5. This freed up some people to work on the 4.0 release, but also meant we had higher than usual support calls while customers integrated the new release.
  • The actual 4.0 release happened in Feb 2018. So just about 30 months from the start or about double the time I expected it to take.
  • We had to cut some features out to make the 4.0 release, all of them are back in the 4.1 release, scheduled for next month.
    • This means that to get back to the same place took us 3 years. But we now have a lot of extra features.
    • Most of the missing features were pretty minor, though, and rarely used.

What did all of that gain us?

  • Performance: Single node. Over 100,000 writes / sec and over 1,000,000 reads / sec in our benchmarks.
    • Real world users report a performance boost of x20 to x52.
  • Support call duration dropped from days / weeks to about 2 – 4 hours.
  • Cross platform on Windows, Linux, ARM and MacOSX.
    • We are now deployed to production on Raspberry PIs, because we are the fastest real database on that kind of hardware.

We were over a year overdue, and even with the deadline being extended several times we had to cut some features to actually make the cut for release. The general acceptance of the new release by the community has been a roaring success. We exceeded our own goals for the project, even if we took a lot longer than expected to get there.

Now, for some additional thoughts. We didn’t really re-write the whole thing from scratch. Instead, we had a lot of code that we could at least partially reuse. The storage engine was ported, not re-written, for example. However, we changed architectures in a pretty significant way. For example, the format and manner of working with JSON changed entirely between these two releases. We are a JSON document database. As you can imagine, we pretty much had to modify everything as a result of that.

We didn’t design the whole thing from the start. We had a rough outline and we let things roll from there. As a result of the new architecture and expectation, by the time we hit a particular feature we were able to utilize what we already learned about how to work with the new architecture to improve things. We also weren’t afraid of changing things multiple times. Authentication had several major design changes midway through, and it ended up so much simpler than what we had before. Even pretty late in the game, we still made significant changes. The RQL support, having a SQL like querying language, came about in the last 20% of the project.

That was a huge change, and I got a lot of “here comes the crazy train again” feedback. This is probably one of the reasons we delayed by another few months. But it was worth it by far. Basically, because we were able to give up on backward compatibility, we were able to move quickly and change stuff as we wished. We knew that we wouldn’t have another change like that for another decade, so we tried to get the big changes done.

In retrospect, I think it worked quite well. I’m really proud of how RavenDB 4.0 turned out.

I want to see the QA process that catches this bug!

time to read 2 min | 344 words

When we get bug reports from the field, we routinely also do a small assessment to figure out why we missed the issue in our own internal tests and runway to production.

We just got a bug report like that. RavenDB is not usable at all on a Raspberry PI because of an error about Non ASCII usage.

This is strange. To start with we test on Raspberry Pi. To be rather more exact, we test on the same hardware and software combination that the user was running on.  And what is this Non ASCII stuff? We don’t have any such thing in our code.

As we investigated, we figured out that the root cause was that we were trying to pass a Non ASCII value to the headers of the request. That didn’t make sense, the only things we write to the request in this case is well defined values, such as numbers and constant strings. All of which should be in ASCII. What was going on?

After a while, the mystery cleared. In order to reproduce this bug, you needed to have the following preconditions:

  • A file hashed to a negative Int64 value.
  • A system whose culture setting was set to sv-SE (Swedish).
  • Run on Linux.

This is detailed in this issue. On Linux (and not on Windows), when using Swedish culture, negative numbers are written as “−1” and not “-1”.

Those of you with sharp eyes will have noticed that this is U+2212 (minus sign), and not U+002D (hyphen minus). On Linux, for Unicode knows what reason, this is used as the negative mark. I would complain, but my native language has „.

Anyway, the fix was to force the usage of invariant when converting the Int64 to a string for the header, which is pretty obvious. We are also exploring how to fix this in a more global manner.
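To show the shape of the problem, here is a small sketch in TypeScript rather than the C# that RavenDB itself uses. The isAscii and toHeaderValue helpers are hypothetical names of mine. In JavaScript, Number.prototype.toString is culture invariant (it always emits U+002D for the sign), which plays the same role as passing CultureInfo.InvariantCulture to ToString in .NET. The Swedish style value is written out by hand here so the example does not depend on which locale data is installed.

```typescript
// Header values are expected to be plain ASCII.
function isAscii(value: string): boolean {
    return /^[\x00-\x7F]*$/.test(value);
}

// Culture invariant formatting: toString() always uses U+002D for the sign,
// regardless of the machine's locale settings.
function toHeaderValue(value: number): string {
    return value.toString(10);
}

// A stand-in for the file hash; a real Int64 would not fit a JS number exactly.
const hash = -1234567890123;

// What a culture aware formatter may produce under sv-SE: U+2212 as the sign.
const swedishStyle = "\u2212" + Math.abs(hash).toString(10);

console.log(isAscii(toHeaderValue(hash))); // the invariant form is safe for a header
console.log(isAscii(swedishStyle));        // the culture specific form is not
```

The two strings look nearly identical on screen, which is part of what makes this kind of bug hard to spot from an error message alone.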

But I keep coming back to the set of preconditions that is required. Sometimes I wonder why we miss a bug. In this case, I can only say that I would have been surprised if we had found it.

Times are hard

time to read 2 min | 277 words

One of the things RavenDB does is allow you to define a backup task that will be executed on a given schedule (such as every Saturday at midnight). However, as it turns out, specifying the right time is actually a pretty hard thing to do. The problem is what to do when you have multiple time zones involved:

  • UTC
  • The server local time
  • The operator’s local time
  • The business hours of the application using the database

In some cases, you might have a server in Germany being managed from Japan with users primarily from South Africa. There are at least four different options for when Saturday’s midnight is, and the one sure thing is that it will happen when you least want it to.

Because of that, RavenDB takes the simple position that the time that it cares about is the server's own time. An operator is free to define it as they wish, but only the server local time is relevant. But we still need to make the operator’s job easier, and we do it using the following method:

image

The operator can specify the time specification using CRON syntax (which should be common to most admins). We translate the CRON syntax to a human readable string, but we also provide the next backup date with the server’s time (when it will actually run), the operator’s local time (which as you can see is a bit different from the server) and the duration. The latter is actually really important because it gives the operator an intuitive understanding of when the backup is going to run next.
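Here is a rough sketch of the three values shown to the operator, in TypeScript. It is deliberately simplified and full of assumptions: instead of parsing CRON it hardcodes the “every Saturday at midnight” schedule from the example above, it uses fixed UTC offsets (so it ignores daylight saving changes), and the describeNextBackup and wallClock names are hypothetical.

```typescript
interface NextBackup {
    serverTime: string;   // wall clock on the server, when it will actually run
    operatorTime: string; // the same instant on the operator's wall clock
    inMinutes: number;    // how long until the backup runs
}

const MINUTE = 60 * 1000;
const DAY = 24 * 60 * MINUTE;

// Format an instant as wall clock time at a fixed UTC offset.
function wallClock(instantMs: number, offsetMinutes: number): string {
    const shifted = new Date(instantMs + offsetMinutes * MINUTE);
    return shifted.toISOString().substring(0, 16).replace("T", " ");
}

// Hardcoded schedule: the next Saturday 00:00 in the server's local time.
function describeNextBackup(
    nowMs: number,
    serverOffsetMinutes: number,
    operatorOffsetMinutes: number
): NextBackup {
    // Shift into the server's wall clock so the UTC getters read server local fields.
    const serverNow = new Date(nowMs + serverOffsetMinutes * MINUTE);
    const midnightToday = Date.UTC(
        serverNow.getUTCFullYear(),
        serverNow.getUTCMonth(),
        serverNow.getUTCDate()
    );
    // Days until Saturday (day 6); if today is Saturday, wait a full week.
    const daysAhead = ((6 - serverNow.getUTCDay() + 7) % 7) || 7;
    // Convert the server wall clock target back into a real instant.
    const targetInstant = midnightToday + daysAhead * DAY - serverOffsetMinutes * MINUTE;
    return {
        serverTime: wallClock(targetInstant, serverOffsetMinutes),
        operatorTime: wallClock(targetInstant, operatorOffsetMinutes),
        inMinutes: Math.round((targetInstant - nowMs) / MINUTE),
    };
}

// Server at UTC+2, operator at UTC+9, asked on Wednesday 2018-06-06 10:30 UTC.
const next = describeNextBackup(Date.UTC(2018, 5, 6, 10, 30), 120, 540);
console.log(next);
```

For this input, midnight on the server falls at 07:00 on the operator's clock, so the two columns differ while the remaining duration is the same no matter who is looking. That is why the duration is the easiest of the three to reason about.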

Migrating data from RavenDB 3.5 to 4.0

time to read 2 min | 325 words

One of the first steps you’ll have when migrating RavenDB from 3.5 to 4.0 is to actually get your data into 4.0. There are a few ways of doing that.

You can create a new database in 4.0 from a 3.5 database directory. You can click on the chevron on the New database button to access it:

image

This will give you the following screen, where you can point to the existing database directory (the RavenDB 3.5 server must be offline for this) and the Raven.StorageExporter tool that comes with the 3.5 distribution. RavenDB 4.0 will then create your database and import all the data from the existing db to the new one.

image

This works great if you are doing this as a one-time operation, but in many cases, the migration process is a long one. You’ll start by migrating your code, and it will take one or two iterations to complete the full process.

In order to handle that scenario, you’ll create a new database on 4.0 normally, then go to Settings > Import and select importing from another database. In this mode, the 3.5 server is online and running. You’ll provide the details of the server and database and then click on Migrate Database, as you can see in the picture.

image

This will import all the data from the existing database to the new database. This can be an ongoing process. Once this is done, you can migrate your application code to use RavenDB 4.0 and at deployment time, you’ll run this again.

Each time you run this migration, it will get only the updated data from the source server; it doesn’t have to read it all from scratch.
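
The general idea behind "only the updated data" can be sketched as follows. How RavenDB actually tracks this is not described here, so everything in this Python example (the change numbers, the class names, the in-memory source) is an assumption used purely to illustrate incremental import.

```python
from dataclasses import dataclass, field


@dataclass
class SourceServer:
    """Hypothetical source: every write gets the next change number."""
    changes: list = field(default_factory=list)   # (change_number, doc_id, body)

    def write(self, doc_id: str, body: dict) -> None:
        self.changes.append((len(self.changes) + 1, doc_id, body))

    def changes_after(self, marker: int) -> list:
        return [c for c in self.changes if c[0] > marker]


@dataclass
class Migration:
    """Remembers how far the previous run got, so reruns fetch only new changes."""
    last_marker: int = 0
    target: dict = field(default_factory=dict)

    def run(self, source: SourceServer) -> int:
        batch = source.changes_after(self.last_marker)
        for number, doc_id, body in batch:
            self.target[doc_id] = body
            self.last_marker = number
        return len(batch)
```

Running the migration twice in a row transfers nothing the second time; after a document changes on the source, the next run picks up just that change.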

Production Test Run: The self flagellating server

time to read 2 min | 354 words

image

Sometimes you see the impossible. In one of our scenarios, we saw a cluster that had such a bad case of split brain that it came near to fracturing the very boundaries of space & time.

In a three node cluster, we have one node that looked to be fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster and in fact, they were showing signs that they never were in the cluster.

What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn’t actually a healthy situation; in fact, this node was very rapidly switching between leader and follower mode.

It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS names (a.oren.development.run, b.oren.development.run, c.oren.development.run) and they were set up to point to the three machines. However, we had previously used the same domain names to run a cluster on the first machine only. Because of the way DNS updates, whenever the machine at a.oren.development.run would try to connect to b.oren.development.run it would actually connect to itself.

At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it disconnected from itself. After a timeout, it would become leader again, and the cycle would continue.

Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.

This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).

We fixed things so we now recognize if we are talking to ourselves and error properly.
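
One way such a check could look, sketched in Python. This is not RavenDB's implementation; the per-process instance id, the `hello` exchange and the `resolve` callback (standing in for DNS plus its cache) are all assumptions made for the example.

```python
import uuid


class SelfConnectionError(Exception):
    """Raised when a node discovers that its peer is actually itself."""


class Node:
    def __init__(self, name: str):
        self.name = name
        self.instance_id = uuid.uuid4().hex   # unique per running process

    def hello(self) -> str:
        """What this node answers when a peer connects and asks who it is."""
        return self.instance_id

    def connect(self, address: str, resolve) -> "Node":
        """`resolve` stands in for DNS: it maps an address to whatever answers."""
        peer = resolve(address)
        if peer.hello() == self.instance_id:
            raise SelfConnectionError(
                f"{self.name}: {address} resolved back to this node"
            )
        return peer
```

With a stale mapping that still points b.oren.development.run at the first machine, `connect` fails immediately with a clear error instead of letting the node argue with itself about who the leader is.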

Production Test Run: When your software is configured by a monkey

time to read 3 min | 457 words

image

System configuration is important, and the more complex your software is, the more knobs you usually have to deal with. That is complex enough as it is, because sometimes these configurations are interdependent. But it becomes a lot more interesting when we are talking about a distributed environment.

In particular, one of the oddest scenarios that we had to deal with in the production test run was when we got the different members in the cluster to be configured differently from each other, including operational details such as endpoints, security and timeouts.

This can happen for real when you make a modification on a single server, because you are trying to fix something, and it works, and you forget to deploy it to all the others. Because people drop the ball, or because you have different people working on different things at the same time.

We classified such errors into three broad categories:

  • Local state which is fine to be different on different machines. For example, if each node has a different base directory or runs under a different user, we don’t really care about that.
  • Distributed state which breaks horribly if misconfigured. For example, if we use the wrong certificate trust chains on different machines. This is something we don’t really care about, because things will break in a very visible fashion when this happens, which is quite obvious and will allow quick resolution.
  • Distributed state which breaks horribly and silently down the line if misconfigured.

The last category was really hard to figure out and quite nasty. One such setting is the timeout for cluster consensus. In one of the nodes, this was set to 300 ms and on another, it was set to 1 minute. We derive a lot of behavior from this value. A server will heartbeat every 1/3 of this value, for example, and will consider a node down if it didn’t get a heartbeat from it within this timeout.

This kind of issue meant that when the nodes are idle, one of them would ping the others every 20 seconds, while they would expect a ping every 300 milliseconds. However, when they escalated things to check explicitly with the server, it replied that everything was fine, leading to the whole cluster being confused about what is going on.

To make things more interesting, if there is activity in the cluster, we don’t wait for the timeout, so this issue shows up only in idle periods.

We tightened things so we enforce the requirement that such values be the same across the cluster by explicitly validating this, which can save a lot of time down the road.
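
The arithmetic behind the mismatch, and the kind of validation described, can be sketched in Python. The function names and the shape of the configuration (a mapping from node name to timeout) are assumptions for illustration, not RavenDB's actual API.

```python
from datetime import timedelta


def heartbeat_interval(consensus_timeout: timedelta) -> timedelta:
    """A node sends a heartbeat every third of its configured timeout."""
    return consensus_timeout / 3


def validate_cluster_timeouts(timeouts: dict) -> None:
    """Reject a cluster whose nodes disagree on the consensus timeout."""
    distinct = set(timeouts.values())
    if len(distinct) > 1:
        details = ", ".join(
            f"{node}={value}" for node, value in sorted(timeouts.items())
        )
        raise ValueError(f"cluster timeout mismatch: {details}")
```

A node configured with 1 minute heartbeats every 20 seconds, which is far longer than the 300 ms that the other nodes are willing to wait, so an idle cluster keeps suspecting a perfectly healthy node. Failing fast on the mismatch makes the problem visible at configuration time instead.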

Production Test Run: Too much of a good thing isn’t so good for you

time to read 2 min | 316 words

image

Not all of our testing happened in a production setting. One of our test clusters was simply running a pretty simple loop of writes, reads and queries on all the nodes in the cluster while intentionally destabilizing the system.

After about a week of this we learned that this worked, that there were no memory leaks or increased resource usage, and also that the size of the data on disk was about three orders of magnitude too large.

Investigating this we discovered that the test process introduced conflicts because it wrote the same set of documents to each of the nodes, repeatedly. We are resolving this automatically but are also keeping the conflicted copies around so users can figure out what happened to their system. In this particular scenario, we had a lot of conflicted revisions, and it was hard initially to figure out what took that space.

In our production system, we also discovered that we log too much. One of the interesting feedback items we were looking for in this production test run was what kind of information we can get from the logs, making sure that the details there are actionable. A part of that was to see if we could troubleshoot something simply using the logs, and add missing details if there was anything that we couldn’t figure out from them.

We also discovered that under load, we would log a lot. In particular, we had logs detailing every indexed document and replicated item. These are almost never useful, but they generated a lot of noise when we lowered the log settings. So that went away as well. We are very focused on log usability; it should be possible to understand what is going on and why without drowning in minutiae.

Production Test Run: The worst is yet to come

time to read 4 min | 676 words

image

Before stamping RavenDB with the RTM marker, we decided that we wanted to push it to our production systems. That is something that we have been doing for quite a while, obviously, dogfooding our own infrastructure. But this time was different. While before we had a pretty simple deployment and stable pace, this time we decided to mix things up.

In other words, we decided to go ahead with the IT version of the stooges, for our production systems. In particular, that means this blog, the internal systems that run our business, all our websites, external services that are exposed to customers, etc. As I’m writing this, one of the nodes in our cluster has run out of disk space; it has been doing that since last week. Another node has been torn down and rebuilt at least twice during this run.

We also had a few rounds of “if it compiles, it fits production”. In other words, we basically read this guy’s twitter stream and did what he said. This resulted in an infinite loop in production on two nodes, and that issue was handled by someone who didn’t know what the problem was, wasn’t part of the change that caused it, was able to figure it out, and then had to work around it with no code changes.

We also had two different teams upgrade their (interdependent) systems at the same time, which included both upgrading the software and adding new features. I also had two guys with the ability to manage machines, and a whole brigade of people who were uploading things to production. That meant that we had a distinct lack of knowledge across the board, so the people managing the machines weren’t always aware of what the system was experiencing and the people deploying software weren’t aware of the actual state of the system. At some points I’m pretty sure that we had two concurrent (and opposing) rolling upgrades to the database servers.

No, I didn’t spike my coffee with anything but extra sugar. This mess of a production deployment was quite carefully planned. I’ll admit that I wanted to do that a few months earlier, but it looks like my shipment of additional time was delayed in the mail, so we do what we can.

We need to support this software for a minimum of five years, likely longer. That means that we really need to see where all the potholes are and patch them as best we can. This means that we need to test it in bad situations. And there is only so much that a chaos monkey can do. I don’t want to see what happens when the network fails. That is easy enough to simulate and certainly something that we are thinking about. But being able to diagnose a live production system with an infinite loop because of bad error handling, and recover from it? That is the kind of stuff that I want to know that we can do in order to properly support things in production.

And while we had a few glitches, for the most part I don’t think that any of them was really observed externally. The reason for that is the reliability mechanisms in RavenDB 4.0: we need just a single server to remain functional, for the most part, which meant that we could just run without issue even if most of the cluster was flat out broken for an extended period of time.

We got a lot of really interesting results from this experience, and I’ll be posting about some of them in the near future. I don’t think that I recommend doing that for any customers, but the problem is that we have seen systems that are managed about as poorly, and we want to be able to survive in such (hostile) environments and also be able to support customers that have partial or even misleading ideas about what their own systems look like and how they behave.
