To the right you can see one of our servers, which has been running a longevity test in a production environment for the past two months. We currently have a few teams doing nasty stuff to the network and hardware to see how hostile they can make the environment and how RavenDB behaves under these conditions. For the past few days, if I have to go to the bathroom I need to watch out for random network cables strewn all over the place as we create ad hoc networks and break them.
We have been able to test some really interesting scenarios this way and uncover some issues. I might post about a few of these in the future; some of them are interesting. Another team has been busy seeing what kind of effects you can get when you abuse the network at the firewall level, doing everything from random packet drops to reducing quality of service to almost nothing, and checking whether we recover properly.
One of the bugs that we uncovered in this manner was an issue that would happen during disposal of a connection that timed out. We would wait for the TCP close in a synchronous fashion, which turned an error that had already been handled into a freeze for the server.
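To make the bug class concrete, here is a minimal sketch of the pattern. This is Python rather than RavenDB's actual code, and the `Connection` wrapper and method names are hypothetical; the point is the difference between a disposal path that blocks waiting for the peer's TCP close and one that bounds the wait:

```python
import socket


class Connection:
    """Hypothetical wrapper illustrating the disposal bug, not RavenDB's code."""

    def __init__(self, sock: socket.socket):
        self._sock = sock

    def close_blocking(self):
        # Buggy pattern: wait synchronously for the peer to close its side.
        # If the peer already timed out and went away without sending FIN,
        # recv() can block indefinitely and freeze the disposing thread.
        self._sock.shutdown(socket.SHUT_WR)
        while self._sock.recv(4096):  # drain until the peer's FIN (b'')
            pass
        self._sock.close()

    def close_nonblocking(self):
        # Safer pattern: bound the wait and never let disposal raise or hang.
        self._sock.settimeout(0.5)
        try:
            self._sock.shutdown(socket.SHUT_WR)
            while self._sock.recv(4096):
                pass
        except OSError:
            # The underlying error was already handled upstream;
            # during disposal we just want to release the resource.
            pass
        finally:
            self._sock.close()
```

The key design point is the `finally`: whatever the peer does, disposal always completes, and an error that was already surfaced elsewhere is not allowed to escalate into a hang.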
Yet another team is working on finishing the documentation and smoothing out the setup experience. We care very deeply about the "5 minutes out of the gate" experience, and it takes a lot of work to ensure that it wouldn't take a lot of work to set up RavenDB properly (and securely).
We are making steady progress and the list of stuff that we are working on grows smaller every day.
We are now in the last portion, running longevity and failure tests and basically taking the measure of the system. One of the things that I'm really happy about is that we are actively abusing our production system, to the point where if there were a Computer Protective Services, we would probably have CPS take the machines away. But the overall system is running just fine. For example, this blog has been running on RavenDB 4.0, and the sum total of all issues after the upgrade was not handling the document id change. The cluster going down, individual machines going crazy or being taken down, network barriers, etc. It just works.