To the right you can see one of our servers, which has been running a longevity test in a production environment for the past two months. We currently have a few teams doing nasty stuff to the network and hardware to see how hostile they can make the environment and how RavenDB behaves under these conditions. For the past few days, if I have to go to the bathroom I need to watch out for random network cables strewn all over the place as we create ad hoc networks and break them.
We have been able to test some really interesting scenarios this way and uncover some issues. I might post about a few of these in the future; some of them are interesting. Another team has been busy seeing what kind of effects you can get when you abuse the network at the firewall level, doing everything from random packet drops to reducing quality of service to almost nothing, and checking whether we recover properly.
One of the bugs that we uncovered in this manner was an issue that would happen during disposal of a connection that timed out. We would wait for the TCP close in a synchronous fashion, which turned an error that had already been handled into a freeze for the server.
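To make the bug class concrete, here is a minimal sketch of the pattern. This is Python rather than RavenDB's actual code, and the `Connection` wrapper and method names are hypothetical; the point is the difference between a disposal path that blocks waiting for the peer's TCP close and one that bounds the wait:

```python
import socket


class Connection:
    """Hypothetical wrapper illustrating the disposal bug, not RavenDB's code."""

    def __init__(self, sock: socket.socket):
        self._sock = sock

    def close_blocking(self):
        # Buggy pattern: wait synchronously for the peer to close its side.
        # If the peer already timed out and went away without sending FIN,
        # recv() can block indefinitely and freeze the disposing thread.
        self._sock.shutdown(socket.SHUT_WR)
        while self._sock.recv(4096):  # drain until the peer's FIN (b'')
            pass
        self._sock.close()

    def close_nonblocking(self):
        # Safer pattern: bound the wait and never let disposal raise or hang.
        self._sock.settimeout(0.5)
        try:
            self._sock.shutdown(socket.SHUT_WR)
            while self._sock.recv(4096):
                pass
        except OSError:
            # The underlying error was already handled upstream;
            # during disposal we just want to release the resource.
            pass
        finally:
            self._sock.close()
```

The key design point is the `finally`: whatever the peer does, disposal always completes, and an error that was already surfaced elsewhere is not allowed to escalate into a hang.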
Yet another team is working on finishing the documentation and smoothing out the setup experience. We care very deeply about the "5 minutes out of the gate" experience, and it takes a lot of work to ensure that it wouldn't take a lot of work to set up RavenDB properly (and securely).
We are making steady progress and the list of stuff that we are working on grows smaller every day.
We are now in the last portion, running longevity and failure tests and basically taking the measure of the system. One of the things that I'm really happy about is that we are actively abusing our production system, to the point where if there were a Computer Protective Services, we would probably have CPS take the machines away. But the overall system is running just fine. For example, this blog has been running on RavenDB 4.0, and the sum total of all issues after the upgrade was not handling the document id change. The cluster going down, individual machines going crazy or being taken down, network barriers, etc. It just works.