Before stamping RavenDB with the RTM marker, we decided that we wanted to push it to our production systems. That is something that we have been doing for quite a while, obviously, dogfooding our own infrastructure. But this time was different. While before we had a pretty simple deployment and stable pace, this time we decided to mix things up.
In other words, we decided to go ahead with the IT version of the stooges, for our production systems. In particular, that means this blog, the internal systems that run our business, all our websites, external services that are exposed to customers, etc. As I’m writing this, one of the nodes in our cluster has run out of disk space, it has been doing that since last week. Another node has been torn down and rebuilt at least twice during this run.
We also did a few times of “it is compiles, it fits production”. In other words, we basically read this guy’s twitter stream and did what he said. This resulted in an infinite loop in production on two nodes and that issue was handled by someone who didn’t know what the problem was, wasn’t part of the change that cause it and was able to figure it out, and then had to workaround it with no code changes.
We also had two different things upgrade their (interdependent) systems at the same time, which included both upgrading the software and adding new features. I also had two guys with the ability to manage machines, and a whole brigade of people who were uploading things to production. That meant that we had distinct lack of knowledge across the board, so the people managing the machines weren’t always aware that the system was experiencing and the people deploying software weren’t aware of the actual state of the system. At some points I’m pretty sure that we had two concurrent (and opposing) rolling upgrades to the database servers.
No, I didn’t spike my coffee with anything but extra sugar. This mess of a production deployment was quite carefully planned. I’ll admit that I wanted to do that a few months earlier, but it looks like my shipment of additional time was delayed in the mail, so we do what we can.
We need to support this software for a minimum of five years, likely longer, that means that we really need to see where all the potholes are and patch them as best we can. This means that we need to test it on bad situations. And there is only so much that a chaos monkey can do. I don’t want to see what happens when the network failed. That is quite easily enough to simulate and certainly something that we are thinking about. But being able to diagnose a live production system with a infinite loop because of bad error handling and recover that. That is the kind of stuff that I want to know that we can do in order to properly support things in production.
And while we had a few glitches, but for the most part, I don’t think that any one that was really observed externally. The reason for that is the reliability mechanisms in RavenDB 4.0, we need just a single server to remain functional, for the most part, which meant that we can just run without issue even if most of the cluster was flat out broken for an extended period of time.
We got a lot of really interesting results for this experience, I’ll be posting about some of them in the near future. I don’t think that I recommend doing that for any customers, but the problem is that we have seen systems that are managed about as poorly, and we want to be able to survive in such (hostile) environment and also be able to support customers that have partial or even misleading ideas about what their own systems look like and behave.