Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Production Test Run: The worst is yet to come

time to read 4 min | 676 words

Before stamping RavenDB with the RTM marker, we decided that we wanted to push it to our production systems. That is something that we have been doing for quite a while, obviously, dogfooding our own infrastructure. But this time was different. While before we had a pretty simple deployment and stable pace, this time we decided to mix things up.

In other words, we decided to go ahead with the IT version of the stooges, for our production systems. In particular, that means this blog, the internal systems that run our business, all our websites, external services that are exposed to customers, etc. As I’m writing this, one of the nodes in our cluster has run out of disk space, it has been doing that since last week. Another node has been torn down and rebuilt at least twice during this run.

We also did a few rounds of “if it compiles, it fits production”. In other words, we basically read this guy’s twitter stream and did what he said. This resulted in an infinite loop in production on two nodes, and that issue was handled by someone who didn’t know what the problem was, wasn’t part of the change that caused it, was able to figure it out, and then had to work around it with no code changes.

We also had two different teams upgrade their (interdependent) systems at the same time, which included both upgrading the software and adding new features. I also had two guys with the ability to manage machines, and a whole brigade of people who were uploading things to production. That meant that we had a distinct lack of knowledge across the board, so the people managing the machines weren’t always aware of what the system was experiencing, and the people deploying software weren’t aware of the actual state of the system. At some points I’m pretty sure that we had two concurrent (and opposing) rolling upgrades to the database servers.

No, I didn’t spike my coffee with anything but extra sugar. This mess of a production deployment was quite carefully planned. I’ll admit that I wanted to do that a few months earlier, but it looks like my shipment of additional time was delayed in the mail, so we do what we can.

We need to support this software for a minimum of five years, likely longer, so we really need to see where all the potholes are and patch them as best we can. This means that we need to test it in bad situations. And there is only so much that a chaos monkey can do. I don’t just want to see what happens when the network fails; that is easy enough to simulate and certainly something that we are thinking about. But being able to diagnose a live production system stuck in an infinite loop because of bad error handling, and recovering from that? That is the kind of stuff that I want to know we can do in order to properly support things in production.

And while we had a few glitches, for the most part I don’t think any of them were really observable externally. The reason for that is the reliability mechanisms in RavenDB 4.0: for the most part, we need just a single server to remain functional, which meant that we could run without issue even when most of the cluster was flat out broken for an extended period of time.

We got a lot of really interesting results from this experience, and I’ll be posting about some of them in the near future. I don’t think I would recommend this to any customers, but the problem is that we have seen systems that are managed about as poorly, and we want to be able to survive in such (hostile) environments and also be able to support customers that have partial or even misleading ideas about what their own systems look like and how they behave.

Reminder: Early bird pricing for RavenDB workshops about to close

time to read 1 min | 98 words

The early bird pricing for the Q1 RavenDB workshops is about to end, hurry up and register. We have workshops in Tel Aviv, San Francisco and New York in this round, and we are working on Q2 workshops in Europe and South America now.

In the workshop, we will dive deeply into RavenDB 4.0, and all the new and exciting things it can do for you. This is meant for developers and their operations teams who want to know RavenDB better.

The state of RavenDB

time to read 3 min | 449 words

A couple of weeks ago we were done with the development of RavenDB; all that remains now is to actually get it out the door, and we take that very seriously.

To the right you can see one of our servers, which has been running a longevity test for the past two months in a production environment. We currently have a few teams doing nasty stuff to the network and hardware to see how hostile they can make the environment and how RavenDB behaves under these conditions. For the past few days, if I have to go to the bathroom I need to watch out for random network cables strewn all over the place as we create ad hoc networks and break them.

We have been able to test some really interesting scenarios this way and uncover some issues. I might post about a few of these in the future, since some of them are interesting. Another team has been busy seeing what kind of effects you can get when you abuse the network at the firewall level, doing everything from random packet drops to reducing quality of service to almost nothing, and seeing if we recover properly.

One of the bugs that we uncovered in this manner was an issue that would happen during disposal of a connection that timed out. We would wait for the TCP close in a synchronous fashion, which turned an error that had already been handled into a freeze of the server.
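The general pattern for avoiding that class of freeze is to never let the disposing thread block on network teardown. Here is a minimal Python sketch of the idea (purely illustrative, not RavenDB’s actual code): push the potentially blocking close onto a background thread and give up after a timeout.

```python
import threading
import time

def dispose_connection(close_fn, timeout=1.0):
    """Run a potentially blocking close on a worker thread and give up
    after a timeout, instead of waiting synchronously for the TCP close."""
    done = threading.Event()

    def worker():
        try:
            close_fn()          # may block if the peer never responds
        except OSError:
            pass                # the error was already handled upstream
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout)   # True if the close finished in time

# A close that hangs, simulating a peer that never acknowledges the FIN:
print(dispose_connection(lambda: time.sleep(5), timeout=0.1))  # False
print(dispose_connection(lambda: None))                        # True
```

The caller gets a bounded wait either way, so a misbehaving connection can delay cleanup but can no longer freeze the server.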

Yet another team is working on finishing the documentation and smoothing the setup experience. We care very deeply about the “5 minutes out of the gate” experience, and it takes a lot of work to ensure that it doesn’t take a lot of work to set up RavenDB properly (and securely).

We are making steady progress and the list of stuff that we are working on grows smaller every day.

We are now in the last portion: running longevity and failure tests and basically taking the measure of the system. One of the things that I’m really happy about is that we are actively abusing our production system, to the point where, if there was a Computer Protective Services, we would probably have CPS take the servers away. But the overall system is running just fine. For example, this blog has been running on RavenDB 4.0, and the sum total of all issues there after the upgrade was not handling the document id change. The cluster going down, individual machines going crazy or being taken down, network barriers, etc. It just works.

The five requirements for the design of all major RavenDB features

time to read 3 min | 534 words

We started some (minor) design work for the next set of features for RavenDB (as we discussed in the roadmap) and a few interesting things came out of that. In particular, the concept of the five pillars any major feature needs to stand on.

By major I mean something that impacts the persistent state of the system as a whole. For example, attachments, CmpXchg, revisions and conflicts are quite obviously major in this sense, while a query is local and transient.

Here they are, in no order of importance:

  • Client API
  • Cluster
  • Backup
  • Studio
  • Disaster Recovery

The client API is how a feature is exposed to clients, obviously. This can be explicit, as in the case of attachments, or more subtle, like the CmpXchg usage, which can be either the low level calls or using it directly from RQL.

The cluster is how a particular feature operates in the cluster. In the case of attachments, it means that attachments flow across the network as part of the replication behavior between nodes. For CmpXchg, it means that the values are stored directly in the cluster state machine and are managed by the Raft cluster. The actual way it works doesn’t matter; what matters is that the implications of the feature in a distributed environment have been thought through.
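To make the compare-exchange part concrete, here is a toy model in Python (names like `put_if_matches` are invented for the example, not RavenDB’s API) of values living in a cluster state machine. In a real cluster, each such operation would first be agreed on via Raft, so every node applies the same sequence of commands and the winner of a race is decided once, cluster-wide.

```python
class ClusterStateMachine:
    """Toy model of cluster-wide compare-exchange values.
    Each key carries a monotonically increasing index; a write succeeds
    only if the caller's expected index matches the stored one."""

    def __init__(self):
        self._values = {}  # key -> (index, value)

    def put_if_matches(self, key, value, expected_index):
        index, _ = self._values.get(key, (0, None))
        if index != expected_index:
            return (False, index)          # someone else won the race
        self._values[key] = (index + 1, value)
        return (True, index + 1)

sm = ClusterStateMachine()
print(sm.put_if_matches("users/ayende", "node-a", 0))  # (True, 1)
print(sm.put_if_matches("users/ayende", "node-b", 0))  # (False, 1): stale index
```

The second write fails because its expected index is stale, which is exactly the guarantee that makes such values safe to use for cluster-wide uniqueness.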

Backup is subtle. It is easy to implement a feature and forget that we actually need to support backup and restore until very late in the game. RavenDB has a few backup strategies (full snapshot or regular), and this also includes migrating data from another instance, long term behavior, etc. The feature needs to work across all of them.

The studio refers to how we are actually going to expose a feature to the user in the studio. A good example of where we failed is the CmpXchg values, which are currently not exposed in the studio (there are endpoints for that, but we haven’t gotten around to it). We are feeling the lack, and it is on the fast track for the next minor release. If a feature isn’t in the studio, how do we expect a user to discover, manage or work with it?

Finally, we have disaster recovery. We take data integrity very seriously, and one of the things we do is make sure that even in the case of disk failure or some other data corruption, we can still get the data out. This is done by laying out the data on disk in such a way that there are multiple ways to access it. First, by reading the data normally and assuming a valid structure; this is what we usually do. Second, by reading one byte at a time and still being able to reconstruct the data, even if some parts of it have been corrupted. This requires us to plan in advance how we store the data for a feature, so we can support recovery.
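The “one byte at a time” idea can be sketched in a few lines of Python. The marker and record layout below are invented for the illustration (the real on-disk format is far more involved): every record carries a recognizable marker, so a scanner can skip corrupted regions and salvage whatever still parses.

```python
import struct

MAGIC = b"\xDE\xAD\xBE\xEF"  # hypothetical per-record marker

def write_records(records):
    out = bytearray()
    for payload in records:
        out += MAGIC + struct.pack("<I", len(payload)) + payload
    return bytes(out)

def recover(blob):
    """Scan byte by byte for markers, salvaging every record that
    still parses, even when parts of the file are corrupted."""
    found, i = [], 0
    while True:
        i = blob.find(MAGIC, i)
        if i < 0 or i + 8 > len(blob):
            return found
        (length,) = struct.unpack_from("<I", blob, i + 4)
        start = i + 8
        if start + length <= len(blob):
            found.append(blob[start:start + length])
        i += 1  # keep scanning even if this record looked bogus

blob = bytearray(write_records([b"doc-1", b"doc-2", b"doc-3"]))
blob[4] = 0xFF  # corrupt the length field of the first record
print(recover(bytes(blob)))  # [b'doc-2', b'doc-3']
```

A normal read would trust the first length field and fail; the byte-level scan loses only the damaged record.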

There is other stuff as well, everything from monitoring to debugging to performance, but usually it isn’t so important at the design phase of a feature.

This blog is now running RavenDB 4.0

time to read 2 min | 276 words

image

We are currently busy converting our entire infrastructure to the new version. Somehow we gathered quite a bit of stuff internally, so that is taking some time. However, this blog is now running on RavenDB 4.0. It feels faster, but we haven’t done proper measurements yet, primarily because this blog was written to be a sample app for RavenDB about seven years ago, and the code shows its age.

We’ll also be working on upgrading the blog to a more modern system. In particular, we want to turn it into a sample app of how to properly deploy a RavenDB 4.0 application for the modern world. This means that besides actually talking to a highly available cluster, the blog itself is going to be distributed and highly available. The idea is that it would be nice not to take anything down while we are updating stuff; at the same time, the blog is small enough that it is possible to talk about its high availability features without drowning in details.

The real work on that is going to start next week, and we would appreciate any feedback on what you are interested in seeing. I’ll probably make that into a series of posts, detailing how to take an existing RavenDB application and move it to RavenDB 4.0, adding all the nice touches along the way, ending up with a distributed and highly available system that can be deployed to production and survive all the nasty things going on there.

So please let me know what you’d like us to cover.

Invisible race conditions: The async query

time to read 1 min | 195 words

This issue was reported to the mailing list with a really scary error: UseAfterFree detected! Attempt to return memory from previous generation, Reset has already been called and the memory reused!

I initially read it as an error that was raised from the server, which raised all sorts of flags and caused us to immediately try to track down what was going on.

Here is the code that would reproduce this:


And a key part of that is that this is not happening on the server, but on the client. You now have all the information required to see what the error is.

Can you figure it out?

The problem is that this method returns a Task, but it isn’t an async method. In other words, we return a task that is still running from ToListAsync, but because we aren’t awaiting it, the session’s dispose is going to run, and by the time the server request completes and is ready to actually do something with the data it got, we are already disposed and we get this error.

The solution? Turn this into an async method and await on the ToListAsync() before disposing the session.
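The same class of bug is easy to model outside of C#. Here is a Python asyncio sketch (the `Session` class is a made-up stand-in, not the RavenDB client) showing why returning an unawaited task lets the session’s dispose run before the request completes:

```python
import asyncio

class Session:
    """A made-up stand-in for a client session that owns resources."""
    def __init__(self):
        self.disposed = False

    async def to_list_async(self):
        await asyncio.sleep(0.01)  # simulate the server round trip
        if self.disposed:
            raise RuntimeError("UseAfterFree: session already disposed")
        return [1, 2, 3]

    def dispose(self):
        self.disposed = True

def broken_query():
    # Bug: returns the still-running task without awaiting it,
    # so dispose() runs before the "server" replies.
    session = Session()
    task = asyncio.ensure_future(session.to_list_async())
    session.dispose()
    return task

async def fixed_query():
    # Fix: make the method async and await before disposing.
    session = Session()
    try:
        return await session.to_list_async()
    finally:
        session.dispose()

async def main():
    try:
        await broken_query()
    except RuntimeError as e:
        print("broken:", e)
    print("fixed:", await fixed_query())

asyncio.run(main())
```

The fix is the same in both languages: the awaiting has to happen before the scope that owns the session is torn down.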

RavenDB 4.0 is ready!

time to read 2 min | 372 words

I was told that putting more than a single exclamation mark in the title is in bad taste, but it was really hard to refrain.

Today is my birthday, and we are celebrating with no remaining issues for RavenDB 4.0.

We just made the final commits for RavenDB 4.0. This means that it is (almost) done. As we speak the release train is already picking up speed with the bits currently being churned on the build server and on their way to be publicly available.

And yet there is this “almost”. What does it mean? We don’t have anything left to do in 4.0, but the release process we have is not as simple as just pushing a build through the build server and sending it out to the world.

We are now going into the final proof stage. For the next week, the entire company is going to be focused primarily on trying to see if we can break RavenDB in interesting ways. We are also rolling the new RavenDB bits out to all our production systems, doing a much larger scale test of all the features.

We decided to make these bits available to users as well, to give you direct access to the final product before the actual release. The RavenDB website is already updated with a new coat of paint, which I’m quite fond of.

As it turns out, releasing the project is a bit of a chore, so we are also working furiously on the docs, but it will take some time to complete. We pushed a lot of the updates to the online docs already, but there is still a lot to be done, which is another reason why we are holding off on the RTM label.

That said, this is it; the only work going on right now is docs, testing and making sure that all the bits are glued together. Please download the new bits and give them a try, we would dearly appreciate any and all feedback.

We’ll have a full blog post with all the details in a week, when the official release will happen, in the meantime, we’re off to celebrate.

Queries++ in RavenDB: Spatial searches

time to read 2 min | 221 words

Spatial queries are fun, when you look at them from the outside. Not so fun when you are working to implement them, but that is probably not your concern.

RavenDB has had support for spatial queries for many years now, but the RavenDB 4.0 release has touched on that as well, and now you can query spatial data with much greater ease. Here is a small sample of how this works:

This query is doing a polygon search for all the employees located inside that polygon. You can visualize the query on the map, we have 4 employees (in yellow) in the viewport and two of them are included within the specified polygon (in blue).

image

And here is what this looks like in the studio:

image

In this case, you can see how we support automatic indexing of spatial data. You can also define your own spatial indexes if you need greater control, but it is as easy as pie to just go ahead and start running it.
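Under the hood, a polygon search ultimately comes down to a point-in-polygon test per candidate location. A common way to do that is ray casting; here is a minimal Python sketch of the idea (illustrative only, not RavenDB’s actual spatial implementation):

```python
def point_in_polygon(lng, lat, polygon):
    """Ray-casting test: cast a horizontal ray from the point and count
    how many polygon edges it crosses; an odd count means 'inside'."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # the edge spans the ray's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(1, 1, square))  # True
print(point_in_polygon(5, 1, square))  # False
```

A spatial index exists precisely so the database doesn’t have to run this test against every document, only against the candidates the index narrows down.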

From code, this is just as easy:

I’m not sure why, but when looking at the results, this just feels like magic.

Queries++ in RavenDB: I suggest you can do better

time to read 3 min | 555 words

Sometimes there is no escaping it, and you must accept some user’s input into your system. That sad state of affairs is problematic, because all users have one common tendency: they are going to enter weird stuff into your system. And when stuff doesn’t work, you’ll need to make it work, even when it makes no sense.

Let us take a simple example with this user, shall we?

image

Assuming that you have a search form for users, and a call goes to the call center, where you need to have the helpdesk staff search for that particular user.

Here are the queries that they tried:

  • Stefanie
  • Stephenie
  • Stefanee
  • Stepfanie
  • Stephany

Now, you may think poorly of the helpdesk staff (likely outsourced and non-native speakers), but the same thing can happen for Yasmin, Jasmyne and Jasmyn (I literally took the first two examples I found on Twitter, to show that this is a real issue).

How do you handle something like this? Well, if you are Google, you do this:

image

What happens if you aren’t Google? Well, since we are talking about RavenDB here, you are going to run a suggestion query, like so:

from index 'Users/Search' select suggest(Name, "Stefanie")

This requires us to have defined the “Users/Search” index and to mark the Name field as viable for suggestions. This tends to be quite compute intensive during indexing, but it allows us to make a suggestion query on the field, which will give us this result:

image

What is going on here? During indexing, RavenDB is going to generate a list of permutations of the data that is being indexed. Then, when you run a suggestion query, we can compare the user’s input to the data that has been indexed and suggest possible alternatives to what the user actually entered. This isn’t a generic selection; it is based on what you actually have in your system.
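Conceptually, this is the classic “did you mean” approach: compare the input against indexed terms by edit distance. RavenDB pre-computes permutations at indexing time, trading index size for query speed; this Python sketch just scores terms on the fly to show the core idea:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(term, indexed_names, max_distance=3):
    """Suggest indexed names closest to the user's input."""
    scored = [(levenshtein(term.lower(), n.lower()), n) for n in indexed_names]
    return [n for d, n in sorted(scored) if d <= max_distance]

names = ["Stephanie", "Stefan", "Jasmine", "Yasmin"]
print(suggest("Stefanie", names))  # ['Stefan', 'Stephanie']
```

Because the candidates come from the indexed data itself, the suggestions are always names that actually exist in the system.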

A more serious case is the international scene. When you have a user such as “André Sørina”:

image

How do you search for them? On my keyboard, I don’t know how to type these marks (diacritics, I had to search for that term). If someone tried to tell me this name over the phone, I would be completely lost. It’s a good thing that we have a good solution for that:

from index 'Users/Search' select suggest(Name, "andre")

Which will give us:

image

And now we can search for that, and find the user very easily.
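The matching itself typically relies on folding diacritics away before comparing. Here is a minimal Python sketch of that folding (Unicode NFD decomposition plus stripping combining marks; a real search analyzer uses a much fuller folding table):

```python
import unicodedata

def fold_diacritics(text):
    """Strip combining marks after NFD decomposition, so 'André Sørina'
    can be matched by plain ASCII input. 'ø' doesn't decompose into a
    base letter plus mark, so it needs an explicit mapping."""
    text = text.replace("ø", "o").replace("Ø", "O")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_diacritics("André Sørina").lower())  # andre sorina
```

Once both the indexed term and the query are folded the same way, “andre” finds “André” with a plain comparison.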

This is a feature that we have had since 2010, but it got a serious face lift and was made easier to use in RavenDB 4.0.

The RavenDB 4.0 Workshops are now open

time to read 1 min | 188 words

I’m going to be giving several workshops about RavenDB 4.0 in the next few months. You can see the full details here, but the gist of it is that these are full day workshops with yours truly, aimed at taking you from knowing absolutely nothing about RavenDB to building complex systems on top of it.

We’ll cover how to deploy a cluster, model your application data, query it effectively, and in general anything you need to know about RavenDB 4.0. I’m also going to do deep dives into several fascinating topics, such as high availability, dynamic configuration and the kinds of queries that are enabled by the new version.

We are running the workshops in San Francisco, New York and Tel Aviv in the first quarter of 2018. We’ll announce more workshops, in more locations, in a couple of months.

This is a good chance to use your 2017 training budget before the clock runs out. For the next 30 days, we are offering a 25% discount for early birds.
