Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969




Reducing the cost of support with anticipatory errors

time to read 2 min | 262 words

I speak a lot about good error handling and our perspective on support. We consider support to be a cost center (as in, we don’t want or expect to make money from support); I spoke about this at more length here. Today I ran into what is probably the best example of exactly what this means in a long while.

A user got an error:

[image: the error dialog]

The error is confusing, because they are able to access this machine and URL. The actual issue, if you open the Show Details pane, is here:

Connection test failed: An exception was thrown while trying to connect to http://zzzzzzzz.com:8081 : System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (0x80004005): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 143.29.63.128:53329

And that is good enough to explain to me what is going on. RavenDB uses HTTP for most things, but it uses TCP connections to handle performance-critical stuff, such as clustering, replication, etc. In this case, the server is listening on port 53329 for TCP connections, but since this is a public-facing instance, the port is not accessible from the outside world.

This issue generated a support call, but a better message, explaining that we could hit the HTTP endpoint but not the TCP endpoint, would have let the user know exactly what the issue was and solve the problem on their own, much faster.
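To make this concrete, here is a sketch of what such an anticipatory check could look like. This is not RavenDB's actual code; the method and names are illustrative. The idea is to probe the HTTP and TCP endpoints separately, so the error can say which one failed:

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

static async Task<string> DescribeConnectivity(string host, int httpPort, int tcpPort)
{
    // First, check the HTTP endpoint
    using (var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) })
    {
        try { await http.GetAsync($"http://{host}:{httpPort}/"); }
        catch { return $"Cannot reach {host}:{httpPort} over HTTP at all."; }
    }
    // HTTP works; now check the dedicated TCP port
    try
    {
        using (var tcp = new TcpClient())
            await tcp.ConnectAsync(host, tcpPort);
    }
    catch (SocketException)
    {
        return $"The HTTP endpoint {host}:{httpPort} is reachable, but the TCP endpoint " +
               $"{host}:{tcpPort} is not. If this is a public-facing instance, check that " +
               "the TCP port is open to the outside world.";
    }
    return "Connection test passed.";
}
```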

Queries++ in RavenDB: Facets of information

time to read 3 min | 536 words


RavenDB has a lot of functionality that is available just underneath the surface. In addition to just finding documents, you can use RavenDB to find a lot more about what is going on in your database. This series of posts is aimed at exposing some of the more fun things that you can do with RavenDB that you are probably not aware of.

One of those things is the idea of not just querying for information, but also querying for the facets of the results. This can be very useful if you are likely to search for something that would return a lot of results and you want to quickly narrow these down without having the user do a lot of trial and error. This is one of those cases where it is much easier to explain what is going on with a picture.

Imagine that you are searching for a phone on eBay. You might have a good idea of what you are looking for. I just did that, and it gave me over 300 thousand results. The problem is that if I actually want to buy one of them, I’m not going to scroll through however many pages of product listings. I need a way to quickly narrow down the selection, and facets allow me to do that, as you can see in the image. Each of these is a facet, and I can filter things out so only the stuff that I’m interested in will be shown, allowing me to quickly make a decision (and purchase).

Using the sample dataset in RavenDB, we’ll explore how we can run faceted searches in RavenDB. First, we’ll define the “Products/Search” index:
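The index itself was shown as a screenshot in the original post; a minimal sketch of what it might look like, assuming the Northwind sample dataset's Product fields, is:

```csharp
using System.Linq;
using Raven.Client.Documents.Indexes;

public class Products_Search : AbstractIndexCreationTask<Product>
{
    public Products_Search()
    {
        // Index the fields we want to facet on
        Map = products => from product in products
                          select new
                          {
                              product.Category,
                              product.Supplier,
                              product.PricePerUnit
                          };
    }
}
```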

Using this index, we can now ask RavenDB to give us the facets from this dataset, like so:
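The query in the screenshot below could be expressed from the client roughly along these lines (a sketch, assuming the 4.0 client's facet API; the price ranges are illustrative):

```csharp
var facets = session.Query<Product, Products_Search>()
    .AggregateBy(f => f.ByField(x => x.Category))
    .AndAggregateBy(f => f.ByField(x => x.Supplier))
    .AndAggregateBy(f => f
        .ByRanges(
            x => x.PricePerUnit < 10,
            x => x.PricePerUnit >= 10 && x.PricePerUnit < 25,
            x => x.PricePerUnit >= 25 && x.PricePerUnit < 50,
            x => x.PricePerUnit >= 50)
        .WithDisplayName("Prices"))
    .Execute();
// facets maps each facet name to a FacetResult with per-value match counts
```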

[image: the facets query]

This will give us the following results:

[image: the facet results]

And we can inspect each of them in turn:

[image: the Category facet]  [image: the Supplier facet]

These are easy, because they give us the count of matching products for each category and supplier. Of more interest to us is the Prices facet.

[image: the Prices facet]

And here we can see how we sliced and diced the results. We can narrow things further with the user’s choices, of course. Let’s check out this query:
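Narrowing by the user's choices is just a matter of putting a Where clause in front of the same facets, roughly like so (again a sketch of the query in the screenshot below; the filter values are illustrative):

```csharp
var narrowed = session.Query<Product, Products_Search>()
    .Where(x => x.Category == "categories/1-A" && x.PricePerUnit < 50)
    .AggregateBy(f => f
        .ByRanges(
            x => x.PricePerUnit < 10,
            x => x.PricePerUnit >= 10 && x.PricePerUnit < 25,
            x => x.PricePerUnit >= 25 && x.PricePerUnit < 50)
        .WithDisplayName("Prices"))
    .Execute();
```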

[image: the narrowed query]

Which gives us the following Prices facet:

[image: the narrowed Prices facet]

This means that you can, in very short order, produce really cool search behavior for your users.

Production postmortem: data corruption, a view from INSIDE the sausage

time to read 10 min | 1883 words

In terms of severity, there are very few things that we take more seriously than data integrity. In fact, the only thing that pops to mind as higher priority is security issues. A user reported an error when using a pre-release 4.0 database that certainly looked like data corruption, so we were very concerned when we got the report, and quite happy about the actual error. If this sounds strange, let me explain.

Storage bugs are nasty. I suggest reading this article to understand how tricky these can be. The article talks about memory allocators (even though it calls them storage) but the same rules apply. And the most important rule from this article?

WHEN A MEMORY DAMAGE BUG IS OBSERVED, IT TAKES PRIORITY OVER ALL OTHER BUG FIXES, ENHANCEMENTS, OR ANY OTHER DEVELOPMENT ACTIVITY.  ALL DEVELOPMENT CEASES UNTIL IT IS FOUND.

You can read the article for the full reasoning why, but basically it is about being able to reproduce and fix the bug, not making it “go away” with a hammer approach. We do the same with data corruption. One of our developers stops doing anything else and investigates just that, as a top priority issue. Because we take this so seriously, we have built several layers of defense in depth into RavenDB.

All the data is signed, and we compare hashes when reading from disk to validate that it hasn’t been modified. This also helps us catch an enormous amount of problems with storage devices and react to them early. There are other checks that are run to verify the integrity of the system, from debug asserts to walking the structure of the data and verifying its correctness.
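The mechanics are roughly these (a simplified sketch, not our actual on-disk format): write a hash next to each block, recompute it on read and compare:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static byte[] WriteBlock(byte[] data)
{
    using (var sha = SHA256.Create())
        return sha.ComputeHash(data).Concat(data).ToArray(); // 32-byte hash + payload
}

static byte[] ReadBlock(byte[] stored)
{
    var expected = stored.Take(32).ToArray();
    var data = stored.Skip(32).ToArray();
    using (var sha = SHA256.Create())
        if (!expected.SequenceEqual(sha.ComputeHash(data)))
            throw new InvalidDataException("Hash mismatch: bad block on disk");
    return data;
}
```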

In this case, analysis of the data the user provided showed that we were failing the hash validation, which should usually only happen if there is a physical file corruption. While we were rooting for that (since this would mean no issues with our code), we also looked into the error in detail. What we found was that we were somehow starting to read a document from the middle, instead of the beginning. Somehow we managed to mess up the document offset and that caused us to think that the document was corrupted.

At this point, we had a confirmed data corruption issue, since obviously we shouldn’t lose track of where we put the documents. We pulled another developer into this, to try to reproduce the behavior independently while checking if we could salvage the user’s data from the corrupted files. This deserves some explanation. We don’t assume that our software is perfect, so we took steps in advance. Hashing the data and validating it is one such step, but another is building, upfront, the recovery tools for when the inevitable happens. That meant that the way we lay out the data on disk was designed, upfront and deliberately, to allow us to recover the data in the case of corruption.

Admittedly, I was mostly thinking about corruption of the data as a result of physical failure, but the way we lay out the data on disk also protects us from errors in record keeping such as this one. This meant that we were able to extract the data and recover everything for the user.

At this time, we had a few people trying to analyze the issue and attempting to reproduce it. The problem with trying to figure out this sort of issue from the resulting file is that by the time you have found the error, it is too late; the data is already corrupted, and you have been operating in a silently bad state for a while, until it finally got to the point where this became visible.

[animated GIF: developer head-banging the keyboard]

We had the first break in the investigation when we managed to reproduce this issue locally on a new database. That was great, because it allowed us to rule out some possible issues related to upgrading from an earlier version, which was one of the directions we looked at. The bad part was that this was reproduced mostly by the developer in question repeatedly hitting the keyboard with his head in frustration. So we didn’t have a known way to reproduce this.

Yes, I know that animated GIFs are annoying; so was this bug, and I need a way to share the pain. At one point we got something that could reliably generate an error: it was the 213th write to the system. It didn’t matter what write, but the 213th write would always produce an error. There is nothing magical about 213, by the way; I remember this value because we tried so very hard to figure out what was magical about it.

At this point we had four or five developers working on this (we needed a lot of heads banging on keyboards to reproduce this error). The code was analyzed over and over. We found a few places where we could have detected the data corruption earlier, because it violated invariants that we didn’t check for. That was the first real break we had, because it allowed us to catch the error earlier, which led to less head banging before the problem could be reproduced. The problem was that we always caught it too late. We kept going backward in the code, each time really excited that we were going to figure out what was going on there, and realizing that the invariants this code relied on were already broken.

Because these are invariants, we didn’t check them; they couldn’t possibly be broken. That sounds bad, because obviously you need to validate your input and output, right? Allow me to demonstrate a sample of a very simple storage system:
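(The original snippet was an embedded image; here is a reconstruction, with illustrative names.)

```csharp
using System.Collections.Generic;

public class User { public string Name; public string Email; }

public class SimpleStorage
{
    private readonly Dictionary<long, User> _documents = new Dictionary<long, User>();
    private readonly Dictionary<string, long> _byName = new Dictionary<string, long>();
    private long _nextId;

    public long Put(User user)
    {
        var id = _nextId++;
        _documents[id] = user;
        _byName[user.Name] = id; // secondary index: name -> document id
        return id;
    }

    public void Remove(long id)
    {
        _documents.Remove(id); // note: _byName is not touched here
    }

    public string GetUserEmail(string name)
    {
        _documents.TryGetValue(_byName[name], out var user);
        return user.Email; // NullReferenceException if the document was removed
    }
}
```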

 

There isn’t anything wrong with the code here at first glance, but look at the Remove method, and now at this code that uses the storage:
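(Again a reconstruction of the snippet from the post:)

```csharp
var storage = new SimpleStorage();
var id = storage.Put(new User { Name = "oren", Email = "ayende@ayende.com" });

// The caller removes the document directly, bypassing whatever was
// supposed to clean up the name index first:
storage.Remove(id);

// Much later, and far away from the real bug:
var email = storage.GetUserEmail("oren"); // throws NullReferenceException
```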

The problem we have here is not with the code in the Remove or GetUserEmail methods. Instead, the problem is that the caller did something it wasn’t supposed to, and we proceeded on the assumption that everything was okay.

The end result is that the _byName index contained a reference to a deleted document, and calling GetUserEmail will throw a null reference exception. The user-visible problem is the exception, but the problem was actually caused much earlier. The invariant that we were violating could have been caught in the Remove method, though, if we did something like this:
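(A sketch of the kind of check we mean: validate the invariant at the point where it can first be broken, so the failure points at the real cause.)

```csharp
public void Remove(long id)
{
    if (_documents.TryGetValue(id, out var user) && _byName.ContainsKey(user.Name))
        throw new InvalidOperationException(
            "Invariant violated: removing a document that the name index still references");
    _documents.Remove(id);
}
```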

These sorts of changes allowed us to get earlier and earlier to the original location where the problem first occurred. Eventually we were able to figure out that a particular pattern of writes would put the internal index inside RavenDB into a funny state. In particular, here is how this looks from the inside.

[image: the internal tree structure]

What you see here is the internal structure of the tree inside RavenDB used to map between document etags and their location on disk. In this case, we managed to get into a case where we would be deleting the last item from a page that is the leftmost page in a tree that has 3 or more levels, whose parent is the rightmost page in the grandparent and is less than 25% full, while the sibling to its left is completely full.

In this case, during the rebalancing operation, we were forgetting to reset the downward references and ended up messing up the sort order of the tree. That worked fine, most of the time, but it would slowly poison our behavior, as we made binary searches on data that was supposed to be sorted but wasn’t.

Timeline (note: despite the title, this is pre-release software and this is not a production system; the timeline reflects this):

  • T-9 days, first notice of this issue in the mailing list. Database size exceeds 400GB. Back and forth with the user on figuring out exactly what is going on, validating that the issue is indeed corruption, and getting the data.
  • T-6 days, we start detailed analysis of the data in parallel to verifying that we can recover the data.
  • T-5 days, user has the data back and can resume working normally, investigation proceeds.
  • T-4 days, we have managed to reproduce this on our own system, no idea how yet.
  • T-3 days, head banging on keyboards, adding invariants validations and checks everywhere we can think of.
  • T-2 days, managed to trap the root cause of the issue, tests added, investigation code pruned for inclusion in the product, for earlier detection of faults.
  • Issue fixed
  • T – this blog post is written. :)
  • T + 3 days, code for detecting this error and automatically resolving this is added to the next release.

For reference, here is the fix:

[image: the fix]

The last change in the area in question happened two years ago, by yours truly, so this is a pretty stable part of the code.

In retrospect, there are a few really good things that we learned from this.

  • In a real world situation, we were able to use the recovery tools we built and get the user back up in a short amount of time. We also found several issues with the recovery tool itself, mostly the fact that its default logging format was verbose, which on a 400GB database means an enormous amount of logs that slowed down the process.
  • No data was lost, and these kinds of issues wouldn’t be able to cross a machine boundary so a second replica would have been able to proceed.
  • Early error detection was able to find the issue; the investment in hashing and validating the data paid off handsomely here. More work was done around making the code more paranoid, not for the things that it is supposed to be responsible for, but to ensure that other pieces of the code are not violating invariants.
  • The use of internal debug and visualization tools (such as the one above, showing the structure of the internal low level tree) was really helpful with resolving the issue.
  • We focused too much on the actual error that we got from the system (the hash check that failed). One of the things we should have done is verify the integrity of the whole database at the start, which would have led us to figure out what the problem was much earlier. Instead, we suspected the wrong root cause all the way to the end. We assumed that the issue was caused by modifications to the size of the documents, increasing and decreasing them in a particular pattern to cause a specific fragmentation issue that was the root cause of the failure. It wasn’t, but we were misled about it for a while, because that was the way we were eventually able to reproduce this. It turned out that the pattern of writes (to which documents) was critical here, not the size of the documents.

Overall, we spent a lot of time figuring out what the problem was, and the fix was two lines of code. I wrote this post independently of this investigation, but it hit the nail straight on.

Time handling and user experience, Backup Scheduling in RavenDB as a case study

time to read 3 min | 597 words

Time handling in software sucks. At one of the RavenDB conferences a few years ago, we had a fantastic talk, for over an hour, about just that. It sucks because what a computer thinks of as time and what we think of as time are two very different things. This usually applies to applications, since that is where you typically work with dates & times in front of users, but we had an interesting case with backup scheduling inside RavenDB.

RavenDB allows you to schedule full and incremental backups, and it uses the CRON format to set things up. This makes things very easy to set up and highly configurable.
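For example (a sketch, assuming the 4.0 client's backup configuration; the CRON fields are minute, hour, day of month, month, day of week):

```csharp
var config = new PeriodicBackupConfiguration
{
    FullBackupFrequency = "0 2 * * 5",          // full backup every Friday at 02:00
    IncrementalBackupFrequency = "*/30 * * * *" // incremental backup every 30 minutes
};
```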

It is also very confusing. Consider the following initial user interface:

[image: the CRON schedule field]

It’s functional, does exactly what it needs to do and allows the administrator complete freedom. It is also pretty much opaque, requiring the admin to either know the CRON format by heart (possible, but not something that we want to rely on) or find a webpage to translate it.

The next thing we did was to avoid the extra step and let you know explicitly what this definition meant.

[image: the schedule with a textual description]

This is much better, but we can still do better. Instead of just an abstract description, let us tell the user when the next backup is going to happen. If you run backups each Friday, you probably want to change that to before or after Black Friday, for example. So date & time matter.

[image: the schedule with the next backup time]

This led us to the next issue: what time? In particular, backups are done on the server’s local time, on the assumption that most of the time this is what the administrator expects. This makes it easier to do things like scheduling backups in the off hours. We thought about always doing this in UTC, but that would require you to always do date math in your head.

That does lead to the issue of what to do when the admin’s clock and the server clock are out of sync. This is how it looks in that case.


[image: server time vs. user time]

We let the user know that the backup will run in the local server time and when that will happen in the user’s time.

We also provide on-the-fly translation from the CRON format to a human-readable form.

[image: CRON translated to a human-readable description]

And finally, to make sure that we cover all the bases, in addition to giving you the time specification, the server time and local time, we also give you the duration until the next backup.

[image: the duration until the next backup]

I think that this covers pretty much every scenario that I can think of.

Except getting the administrator to do practice restores to ensure that they are familiar with how to do this. :)

Update: Here is how the field looks when empty:

[image: the empty schedule field]

API Design: The lack of a method was intentional forethought

time to read 4 min | 639 words

One of the benefits of having a product in the market for a decade is that you gain some experience in how people are using it. This leads to interesting design decisions over time. Some of them are obvious, such as the setup process for RavenDB. Some aren’t, such as the surface of the session. It is kept small and focused on CRUD operations to make it easy to understand and use in the common cases.

And sometimes, the design is in the fact that the code isn’t there at all. Case in point, the notion of connection strings in RavenDB 4.0. This feature was removed in its entirety in this release, and users are expected to provide the connection parameters to the document store on their own. How they do that is not something that we concern ourselves with. A large part of the reasoning behind this decision was our use of X509 certificates for authentication. In many environments there are strict rules about the usage and deployment of certificates, and having a connection string facility would force us to always chase the latest ones. For that matter, where you store the connection string is also a problem. We have seen configuration stored in app.config, environment variables, JSON configuration, DI configuration and more. And each time, we were expected to support this new method of getting the connection string. By not having any such mechanism, we are able to circumvent the problem entirely.

This sounds like a copout, but it isn’t. Consider this thread in the RavenDB mailing list. It talks about how to set up RavenDB 4.0 in Azure in a secure manner. Just reading the title of the thread made me cringe, thinking that this was going to be a question that would take a long time to answer (setup time, mostly). But that isn’t it at all. Instead, this is a walkthrough showing how you can set things up properly in an environment where you cannot load a certificate from a file and need to do that directly from the Azure certificate store.

This is quite important, since this is one of the things that I keep having to explain to team members. We want a very clear demarcation between the kinds of things that we support and the kinds we don’t. Mostly because I’m not willing to do a half-assed job of supporting things. So saying something like “Oh, we’ll just support a file path and we’ll let the user do the rest for more complex stuff” is not going to fly with this design philosophy.

If we do something, a user reasonably expects us to do a complete job of it, and puts the entire onus of responsibility on us. On the other hand, if we don’t do something, there is usually no expectation that we’ll handle it. There is also the issue that in many cases, solving the general problem is nearly impossible, while solving a particular user scenario is trivial. So letting them have full responsibility works much better. At a minimum, they don’t need to circumvent the things we do for the stuff that we do support, but can start from clear ground.

Coming back to the certificate example: if we had a Certificate property and a CertificatePath property, allowing easy setup for a common scenario, then it would be easy down the line to just assume that CertificatePath is set whenever we have a certificate, and suddenly a user that doesn’t use a certificate from a file would need to be aware of this and handle the issue. If there is no such property, the behavior is always going to be correct.
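In other words, the surface we ended up with looks roughly like this (a sketch; the URL and file name are illustrative):

```csharp
using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

var store = new DocumentStore
{
    Urls = new[] { "https://my.ravendb.server" },
    Database = "Northwind",
    // One property, one responsibility: how you load the certificate
    // (file, OS certificate store, Azure, ...) is entirely up to you.
    Certificate = new X509Certificate2("client.pfx", "password")
};
store.Initialize();
```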

Production postmortem: The random high CPU

time to read 2 min | 253 words

A customer complained that every now and then RavenDB hits 100% CPU and stays there. They were kind enough to provide a minidump, and I started the investigation.

I loaded the minidump into WinDbg and started debugging. The first thing you do with high CPU is run the “!runaway” command, which sorts the threads by how busy they are:

[image: !runaway output]

I switched to the first thread (39) and asked for its stack. I highlighted the interesting parts:

[image: the stack trace of thread 39]

This is enough to form a strong suspicion of what is going on. I checked some of the other high-CPU threads and my suspicion was confirmed, but even this single stack trace is enough.

Pretty much whenever you see a thread burning CPU inside the Dictionary class, it means that you are accessing it concurrently. This is unsafe and may lead to strange effects, one of them being an infinite loop.

In this case, several threads were caught in this infinite loop. The stack trace also told us where in RavenDB we were doing this, and from there we could confirm that indeed, there is a rare set of circumstances that can cause a timer to fire fast enough that the previous timer callback didn’t have a chance to complete, and both of these callbacks will modify the same dictionary, causing the issue.
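A minimal repro of this failure mode looks something like the following (illustrative, not the actual RavenDB code): a timer whose period is shorter than its callback, so overlapping callbacks mutate the same dictionary concurrently:

```csharp
using System.Collections.Generic;
using System.Threading;

var stats = new Dictionary<int, int>();
var timer = new Timer(_ =>
{
    for (var i = 0; i < 100_000; i++)
        stats[i % 100] = i; // unsynchronized writes from overlapping callbacks
}, null, dueTime: 0, period: 1); // fires again before the previous callback is done

// The fix: use ConcurrentDictionary, or skip a tick while the previous callback
// is still running (Monitor.TryEnter, an Interlocked flag, etc.).
```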

If you have a finalizer, watch your ctor

time to read 2 min | 289 words

We recently got into a very bad state with our build server. We had a few dozen failing tests in an area of the code that hadn’t changed in months. That was annoying, to say the least. I tried to run this on my machine to see what was going on, and got into an even worse state: a process crash.

The good thing was that this is managed code, so you get stack traces, and we were able to figure out that the issue was with the SqlClient assembly. It looks like a recent change meant that loading it by reflection will give you the netstandard 2.0 release, which is basically just a filler that always throws PlatformNotSupportedException. That is perfectly fine; we’ll change the way we do things to load the direct reference.

What killed us was this:

[image: the offending code]

This will kill the process. How can that be?

Well, DbConnection inherits from Component, and Component has a finalizer. Care to guess what will happen when that is called?

[image: Component's finalizer]

So here is what will happen: the ctor throws, so no one can have an instance of this class. But the instance is already registered for finalization, and the GC will call the finalizer. The finalizer will call the Dispose method, which will end up throwing, and an exception during finalization is fatal to the entire process. Bye!
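Here is a minimal sketch of that failure mode (not the actual code from the codebase in question):

```csharp
using System;
using System.IO;

class FragileConnection : IDisposable
{
    private readonly Stream _stream;

    public FragileConnection()
    {
        throw new PlatformNotSupportedException(); // _stream is never assigned
    }

    ~FragileConnection() => Dispose(); // still runs on the half-built instance

    public void Dispose()
    {
        _stream.Flush(); // NullReferenceException on the finalizer thread: fatal
        GC.SuppressFinalize(this);
    }
}
```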

The general idea is that when you are working with a throwing ctor, you either make sure to call GC.SuppressFinalize or you clean up in such a way that you can safely finalize things.
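Applied to the sketch above, that could look like this:

```csharp
public FragileConnection(string path)
{
    try
    {
        _stream = File.OpenRead(path); // may throw
    }
    catch
    {
        // The ctor failed; make sure the finalizer never sees the broken object
        GC.SuppressFinalize(this);
        throw;
    }
}
```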

Carefully performing invalid operations to get the wrong error and the right result

time to read 2 min | 285 words

The RavenDB setup process happens in the browser, and the last part involves restarting the server and then redirecting you to the new server. Along the way, you have also specified a certificate to use, as well as other configuration options.

We got a bug report about this when an admin configured us with a self-signed certificate. While the server restarts, the browser pings it periodically, waiting for it to come back up with the new configuration. That can be a problem when using self-signed certificates, because the browser will reject them as untrusted.

From the point of view of the client side running in the browser, there is no way to tell the difference between a server that is down and a server that is using a self-signed certificate. But we wanted the nice feature of showing you when you can move to the new URL. So how can you do that?

Remember that RavenDB has auto-detection for invalid HTTP access when using HTTPS? And that this error is raised at the HTTP level?

That means that we can carefully construct an HTTP request such as “http://my.ravendb.server:443” and check the result. If the server is up, the request will fail with a bad request error, and that means that we can distinguish between the server being down and the server being up (but maybe with a bad cert).
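The actual check runs in the browser, but the logic, sketched in C# (inside an async method; the URL is the post's example), is roughly:

```csharp
using (var client = new HttpClient())
{
    try
    {
        await client.GetAsync("http://my.ravendb.server:443");
        // Any HTTP response, even 400 Bad Request, means the server is up;
        // checking the certificate is the next step.
    }
    catch (HttpRequestException)
    {
        // No response at all: the server is still restarting, keep polling.
    }
}
```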

In fact, once we know the server is up, we can check if the certificate is valid, and show something about it.

This is convoluted and requires us to do things at several places at once, at very different levels of the stack. But it is quite amazing to see; it just works!

You are doing it REALLY wrong, the shortest code review ever

time to read 2 min | 394 words

We got a question from a customer that included the following code:

[image: the customer's code]
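(The snippet was an image; presumably it was something along these lines, with guessed names:)

```csharp
public List<User> GetUsers(string name)
{
    lock (_session)
    {
        return _session.Query<User>()
            .Where(u => u.Name == name)
            .ToList();
    }
}
```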

This was quite unrelated to the question being asked, but it was enough to trigger all sorts of alarms. Just based on these two lines, and the fact that we can assume that _session is a RavenDB session, we can tell quite a lot about the codebase.

To start with, it is broken, probably badly so.

Let us try to break down why. The fact that you are locking on the session means that you access it concurrently. And that is utterly forbidden. The session is not meant to be used concurrently, and Strange Stuff will happen if you try to.

Now, the session doesn’t have thread affinity, which means that if you are calling it serially, you can certainly move it between threads. In fact, with async sessions, that happens quite often. So one would expect that locking on the session would provide sufficient protection for the session from concurrent use.

However, nothing could be further from the truth. Look at the output of this method. It returns User instances, which are used outside the lock. If two threads call this method concurrently, they will each run in turn, taking and releasing the lock as required. Which is great, but it is very likely that they will get the same User instance back (remember, the session implements the Identity Map pattern and ensures that within the scope of the session, the same document is always represented by the same object instance). That is when things start going boom, boom, BOOM!

Even if you are locking on the session, you aren’t locking on your entities, and that means that you have concurrent access to them. I haven’t actually reviewed the entity in question, but I’m 99.999999% certain that no attempt was made to make them safe for multi-threading, and 100% certain that if there was such an attempt, it would fail horribly in some manner quite easily.

Entities are not meant for concurrent usage, and even trying to reason about them effectively is going to be near impossible.

A session is meant to be a scope of work. In fact, look up the Unit of Work pattern, which is exactly how you are expected to use the session.

The best features are the ones you never knew were there: You can’t do everything

time to read 2 min | 241 words

In the previous posts in this series, I talked about the kinds of features that we build into RavenDB. Things that you never even notice, making your life easier.

One feature we don’t have is an HTTPS to HTTP downgrade. What do I mean by that? Assume that you have a RavenDB instance that is running using HTTP, and a client attempts to connect to it using HTTPS. Remember that we are assuming that the access is made on the same port. So the client wrote “https://my.raven.database:8080” instead of “http://my.raven.database:8080”.

If the opposite happened, an HTTP request hitting an HTTPS endpoint, we would detect that and give a clear error to the user. But the other way around? We don’t do that. Why not?

Well, the reasoning is very simple. If you connect to an HTTP endpoint using HTTPS, the first packet on the wire wants to do SSL negotiation. However, we don’t have a certificate that we can use here, so we can’t even start the negotiation process.

We could try generating a self-signed certificate on the fly and answering the request with an error. But at that point, the client will likely already have errored at a low level, because the self-signed certificate is not trusted.

Another point against implementing this feature is that HTTP endpoints typically become HTTPS, but rarely the other way around.
