Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Errors, exceptions and faults, oh my!

time to read 10 min | 1947 words

If we could code for the happy path only, I think that our lives would be much nicer. Errors are hard, because you keep having to deal with them, and even basic issues in error handling can take down systems that are composed of thousands of nodes.

I went out to look at research around error handling rates, and I found this paper. It says that about 3% of code (C#, mind you) is error handling. However, it counts only the code inside catch / finally as error handling. My recent foray into C allows me another data point. The short version, with no memory handling, is 30 lines of code; the long version, with error handling, is over a hundred.

If I had to guess, I would say that error handling is at least 10–15%, and I wouldn't be surprised by 25–30%. In C# and similar languages, a centralized error handling strategy can help a lot in this regard, I think.

Anyway, let’s explore a few options for error handling:

The C way – return codes. This sucks. I think that this is universally known to suck. In particular, there is no rhyme or reason to return codes. Sometimes you need to check for INVALID_HANDLE_VALUE, sometimes for a value that is different from zero. Sometimes the return code is the error code. At other times you need to call a separate function to get it. It also forces you into a very localized error handling mode. All error handling has to be done all the time, which can easily lead either to a single forgotten return code causing issues down the line (forgetting to check the fsync() return code caused data corruption in Postgres, for example) or to really bad code where you lose sight of what is actually going on, because there is so much error handling that the real functionality goes into hiding.
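To make this concrete, here is roughly what that dance looks like when calling Win32 from C# (a minimal sketch, with the P/Invoke declaration abbreviated): the failure is signaled by a magic handle value, and the actual error code has to be fetched with a separate call.

using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class ReturnCodeStyle
{
    // Win32 CreateFile signals failure with INVALID_HANDLE_VALUE; the actual
    // error code has to be fetched separately (GetLastError / errno style).
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern SafeFileHandle CreateFile(
        string fileName, uint access, uint share, IntPtr securityAttributes,
        uint creationDisposition, uint flagsAndAttributes, IntPtr templateFile);

    const uint GENERIC_READ = 0x80000000;
    const uint OPEN_EXISTING = 3;

    static void Main()
    {
        var handle = CreateFile("data.bin", GENERIC_READ, 0, IntPtr.Zero,
            OPEN_EXISTING, 0, IntPtr.Zero);
        if (handle.IsInvalid) // the INVALID_HANDLE_VALUE convention
        {
            // the error code comes from a separate call, not the return value
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }
        using (handle)
        {
            // actual work; every single call below needs its own check as well
        }
    }
}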

The return code model also doesn't compose very well in the case of complex operations failing midway. It doesn't provide contextual information or allow you to get stack traces easily. Each of these is important if you want to have a good error handling strategy (and a good debugging / troubleshooting experience).

So the C way of doing things is out. What are we left with? We have a few options:

  • Go with multiple return values
  • Rust with Option<T>, Result<T>
  • Node.js with callbacks
  • C# / Java with exceptions

Let’s talk about the Go approach for a bit. I think that this is universally loathed as being very similar to the C method, and it causes a lot of code repetition. On the other hand, at least we don’t have GetLastError() / errno to deal with. And one advantage of Go in this regard is that the defer statement allows you to handle state much more cleanly (you can just return, and any resources will be cleaned up). This means that the code may be repetitive to write, but it is much easier to review.

The problem with this approach is that it is hard to compose errors. Imagine a method that needs to read a string from the network, parse a number from the string and then update a value in a file. Without error handling, this looks like so:

I haven’t even written the file handling path, mostly because it got too tiring. In this case, there are so many things that can go wrong. The code above handles failure to make the request, failure to read the value from the server, failure to parse the string, etc. With a file, you need to handle failure to open the file, read its content, parse it, do something with the value from the server and the file value, and then serialize the value back to bytes to be written to the file. About every other word in that previous sentence requires some form of error handling. And the problem is that when we have a complex system, we don’t just need to handle errors, we need to compose them so they make sense.

An EPERM error from somewhere is pretty useless, so having the file name is a huge help in figuring out what the problem was. But knowing that the error happened because we tried to save the data to the on-disk cache gives me the proper context for the error. The problem with errors is that they can happen very deep in the code path, while the policy for handling such errors belongs much higher up the stack.

Rust’s approach to errors is cleaner than Go’s: you don’t have multiple return values, the result is wrapped in a Result / Option value that you need to explicitly handle. Rust also contains some syntactic sugar (the ? operator) to make this pretty easy to write.

However, Rust error handling just plain sucks when you try to actually compose errors. Imagine the case where I want to do several operations, some of which may fail. I need to report success if all of them passed, but an error if any of them failed. For a bit more complexity, we need to provide good context for the error, so the error isn’t something as simple as “int parse failure” but carries enough detail to know that it was an int parse failure on the sixth line of a particular file that belongs to a certain operation.

The reason I say that Rust sucks for this is that consuming errors is pretty simple. But producing them? The suggestion to library authors is to implement your own Error type. That means that you need to implement the Display trait manually, and you need to write a separate From implementation for each error that you want to compose up. If your code suddenly needs to handle a new error type, you deal with that by writing a lot of boilerplate code. Any change in the error enum requires touching multiple places in the code, violating SRP. You can use Box<Error>, it seems, but in that case you just have “an error occurred”, and it is complex to get back the real error and act on it.

A major complication of all the “return something” options is the fact that they usually don’t provide you with a stack trace. I think that having a stack trace in the error is extremely helpful for actually analyzing a problem and being able to tell what actually happened.

Callbacks, as used by Node.js, are pretty horrible. On the one hand, it is much easier to provide the context, because you are called from the error site and can check your current state. However, there is only so much that you can do in such a case, and state management is a pain. Callbacks have proven to be pretty hard to program with, and the industry as a whole is moving to the async/await model instead. This gives you a sequential-like mechanism and a much better way to reason about the actions of the system.
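To show the difference in C# terms (a sketch; DownloadWithCallback is a made-up helper, not a real API):

using System;
using System.Net.Http;
using System.Threading.Tasks;

class CallbacksVsAwait
{
    // Callback style (node.js-ish): the error surfaces on a different code path,
    // with only whatever state you remembered to capture in the closure.
    static void DownloadWithCallback(string url, Action<string> onDone, Action<Exception> onError)
    {
        var client = new HttpClient();
        client.GetStringAsync(url).ContinueWith(t =>
        {
            client.Dispose();
            if (t.IsFaulted)
                onError(t.Exception.InnerException);
            else
                onDone(t.Result);
        });
    }

    // async/await style: errors flow through the same sequential code,
    // so plain try/catch (and the stack trace) still make sense.
    static async Task DownloadWithAwait(string url)
    {
        using (var client = new HttpClient())
        {
            try
            {
                var body = await client.GetStringAsync(url);
                Console.WriteLine(body.Length);
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine($"Request to {url} failed: {e.Message}");
            }
        }
    }
}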

Finally, we have exceptions. There are actually several different models for exceptions. You have Java with checked exceptions, with the associated baggage there (cannot change the interface, require explicit handling, etc.). There is the Pony language, which has “exceptions”. That is a really strange choice of implementation. Pony has exceptions for flow control, but it doesn’t give you any context about the actual error, just that one happened. The proper way of handling errors in Pony is to return a union of the result and the possible errors (similar to how Rust does it, although the syntax looks nicer and there is less work).

I’m going to talk about C#’s exceptions. Java’s exceptions, except for some of them being checked, are pretty much the same.

Exceptions have the nice property that they are easily composable; it is easy to decide to handle some errors and to pass others up the chain. Generic error handling is also easy. Exceptions are problematic because they break the flow of the code. An exception in one location can be handled somewhere completely different, and there is no way for you to see that when looking at the code. In fact, I’m not even aware of any IDE / tooling that can provide you with this insight.
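For example, here is the kind of composition I mean, as a sketch with invented method and file names: catch the low-level error where the file name is known, wrap it with the intent of the operation, and let the policy live higher up the stack.

using System;
using System.IO;

class DiskCache
{
    public static void Save(string cacheFile, byte[] data)
    {
        try
        {
            File.WriteAllBytes(cacheFile, data);
        }
        catch (UnauthorizedAccessException e) // the .NET face of EPERM
        {
            // attach the context the operator actually needs: which file,
            // and what we were trying to accomplish when it failed
            throw new InvalidOperationException(
                $"Failed to save the server response to the on-disk cache: '{cacheFile}'", e);
        }
    }
}

A caller several layers up can catch the wrapped exception, log it with full context and a stack trace, and apply whatever policy it wants, without the low-level code knowing anything about it.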

In languages with exceptions, you can also get exceptions at pretty much any location, which means that you need to write exception-safe code to make sure that an exception doesn’t leave your code in an inconsistent state. There is also a decidedly non-trivial cost to exceptions. To start with, many optimizations are inhibited by try blocks, and throwing exceptions is often very expensive. Part of that is the fact that we need to capture the oh-so-valuable stack trace, of course.
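Here is a tiny sketch of what exception safety means in practice (a hypothetical class, just to show the shape of the problem):

using System.Collections.Generic;

public class CountedBag<T>
{
    private readonly List<T> _items = new List<T>();
    private int _count;

    // NOT exception safe: if Add() throws (an OutOfMemoryException, say),
    // _count no longer matches _items and the object is silently corrupted.
    public void AddUnsafe(T item)
    {
        _count++;
        _items.Add(item);
    }

    // Exception safe: mutate our own state only after everything
    // that can throw has already succeeded.
    public void AddSafe(T item)
    {
        _items.Add(item);
        _count++;
    }

    public int Count => _count;
}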

There is also another aspect to error handling to consider. There are many cases where you don’t care about errors. Any time that you have generic framework code that calls into user code, for example. An HTTP handler is a good example of that. You call the user’s code to handle the request, and you don’t care about errors; you simply catch the error and return a 500 / error message to the client. Any error handling strategy must handle both scenarios: the “I really care about every single detail and have a separate error handling code path for everything” one and the “I just want to know if there is an error and print it, nothing else” one.
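In ASP.NET Core terms, that generic framework-level handling is roughly the following (a sketch, not production code):

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class ErrorHandlingMiddleware
{
    private readonly RequestDelegate _next;

    public ErrorHandlingMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task Invoke(HttpContext context)
    {
        try
        {
            await _next(context); // the user's code
        }
        catch (Exception e)
        {
            // framework-level policy: we don't care which error it was,
            // we just report it and keep serving other requests
            context.Response.StatusCode = 500;
            await context.Response.WriteAsync(e.Message);
        }
    }
}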

In theory, I really love the Rust error handling mechanism, but the complexity of composability and generic handling means that it is a lot less convenient to actually consume and produce errors. Exceptions are great in terms of composability and the amount of detail they provide, but they also break the flow of the code and introduce separate, invisible code paths that are hard to reason about in many cases. On the other hand, exceptions allow you to bubble errors upward natively and easily, until you get to a location that can apply a particular error handling policy.

A good example comes from a recent issue we had to deal with. When running on a shared drive, a file delete isn’t going to be processed immediately; there is a gap of time in which the delete command seems to have succeeded, but attempting to re-create the file will fail with EEXISTS (and trying to open the file will give you ENOENT, so that’s fun). In this case, we throw the error up the stack. In our use case, we hit this situation only when dealing with temporary files, and given that they are temporary, we can detect this scenario and use another file name to avoid the issue. So we catch a FileNotFoundException and retry with a different file name. This goes through four or five layers of code and was pretty simple to figure out and implement.
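The shape of the fix is something like the following sketch (simplified, with invented names; this is not the actual RavenDB code):

using System;
using System.IO;

public static class TempFiles
{
    public static FileStream Create(string directory)
    {
        // On some shared drives a deleted file lingers for a while: creating a
        // file with the same name fails, yet opening it reports "not found".
        // These are temporary files, so the exact name doesn't matter and we
        // can simply retry with a fresh one.
        for (var attempt = 0; attempt < 5; attempt++)
        {
            var path = Path.Combine(directory, $"temp-{Guid.NewGuid():N}.tmp");
            try
            {
                return new FileStream(path, FileMode.CreateNew,
                    FileAccess.ReadWrite, FileShare.None);
            }
            catch (FileNotFoundException)
            {
                // the "deleted, but not really gone yet" case; pick another name
            }
        }
        throw new IOException("Could not create a temporary file after 5 attempts");
    }
}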

Doing that with error codes is hard, and adding another member to the Error type will likely have cascading implications for the rest of the code. On the other hand, throwing a new exception type from a method can also break the contract: explicitly in languages like Java and implicitly in languages like C#. In fact, with C#, for example, the implied assumption is always: “can throw the following exceptions for known error cases, and other exceptions for unexpected ones”. This is similar to checked exceptions vs. runtime exceptions in Java, but in this case it is the implicit default, and it gives you more freedom overall when writing your code. Checked exceptions sound great, but they have proven to be a problem for developers in practice.

Oh well, I guess I won’t be able to solve the error handling problem perfectly in a single blog post.

I want to see the QA process that catches this bug!

time to read 2 min | 344 words

When we get bug reports from the field, we routinely also do a small assessment to figure out why we missed the issue in our own internal tests and on the way to production.

We just got a bug report like that: RavenDB is not usable at all on a Raspberry Pi because of an error about non-ASCII usage.

This is strange. To start with, we test on the Raspberry Pi. To be rather more exact, we test on the same hardware and software combination that the user was running on. And what is this non-ASCII stuff? We don’t have any such thing in our code.

As we investigated, we figured out that the root cause was that we were trying to pass a non-ASCII value in the headers of the request. That didn’t make sense; the only things we write to the request in this case are well-defined values, such as numbers and constant strings, all of which should be ASCII. What was going on?

After a while, the mystery cleared. In order to reproduce this bug, you need the following preconditions:

  • A file hashed to a negative Int64 value.
  • A system whose culture is set to sv-SE (Swedish).
  • Run on Linux.

This is detailed in this issue. On Linux (and not on Windows), when using the Swedish culture, negative numbers are formatted as ”−1” and not “-1”.

For those of you with sharp eyes, you’ll have noticed that this is U+2212 (minus sign) and not U+002D (hyphen-minus). On Linux, for Unicode knows what reason, this is used as the negative sign. I would complain, but my native language has „.

Anyway, the fix was to force the usage of the invariant culture when converting the Int64 to a string for the header, which is pretty obvious. We are also exploring how to fix this in a more global manner.
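In other words, something like this (a simplified sketch of the idea, not the actual RavenDB code; the hash value here is just an example):

using System;
using System.Globalization;

class HeaderFormatting
{
    static void Main()
    {
        long hash = -8790931447658356063; // in the real code this comes from hashing a file

        // Culture sensitive: under sv-SE on Linux the sign is rendered as U+2212,
        // which is not a legal ASCII character for an HTTP header value.
        Console.WriteLine(hash.ToString());

        // Culture independent: always the ASCII hyphen-minus, U+002D.
        Console.WriteLine(hash.ToString(CultureInfo.InvariantCulture));
    }
}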

But I keep coming back to the set of preconditions that is required. Sometimes I wonder how we missed a bug; in this case, I can only say that I would have been surprised if we had found it.

RavenDB 4.1 Features: Cluster wide ACID transactions

time to read 5 min | 903 words

One of the major features coming up in RavenDB 4.1 is the ability to do a cluster wide transaction. Up until this point, RavenDB’s transactions were applied at each node individually and then sent over to the rest of the cluster. This follows the distributed model outlined in the Dynamo paper. In other words, writes are important; always accept them. This works great for most scenarios, but there are a few cases where the user might wish to explicitly choose consistency over availability. RavenDB 4.1 brings this to the table in what I consider to be a very natural manner.

This feature builds on the already existing compare exchange feature in RavenDB 4.0. The idea is simple. You can package a set of changes to documents and send them to the cluster. This set of changes will be applied to all the cluster nodes (in an atomic fashion) if they have been accepted by a majority of the nodes in the cluster. Otherwise, you’ll get an error and the changes will never be applied.

Here is the command that is sent to the server.

[image]

RavenDB ensures that this transaction will only be applied after a majority confirmation. So far, that is nice, but you could do pretty much the same thing with write assurance, a feature RavenDB has had for over five years. Where it gets interesting is the fact that you can make the operations in the transaction conditional: they will not be executed unless a certain (cluster wide) state has an expected value.

Remember that I said that cluster wide transactions build upon the compare exchange feature? Let’s see what we can do here. What happens if we want to state that a user’s name must be unique, cluster wide? Previously, we had the unique constraints bundle, but that didn’t work so well in a cluster and was removed in 4.0. Compare exchange was meant to replace it, but it was hard to use with document modifications, because you didn’t have a single transaction boundary. Well, now you do.

Let’s see what I mean by this:
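Something along these lines (a rough C# reconstruction based on the description below, since the original shows a screenshot; treat the exact API shape as approximate):

using Raven.Client.Documents;
using Raven.Client.Documents.Session;

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public static class UniqueUsernames
{
    public static void Register(IDocumentStore store)
    {
        using (var session = store.OpenSession(new SessionOptions
        {
            TransactionMode = TransactionMode.ClusterWide
        }))
        {
            var user = new User { Name = "Arava" };
            session.Store(user, "users/arava");

            // reserve the username, cluster wide, as part of the same transaction
            session.Advanced.ClusterTransaction.CreateCompareExchangeValue(
                "usernames/Arava", "users/arava");

            session.SaveChanges(); // all or nothing, across the cluster
        }
    }
}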

As you can see, we have a new command there: “ClusterTransaction.CreateCompareExchangeValue”. This is adding another command to the transaction. A compare exchange command. In this case, we are saying that we want to create a new value named “usernames/Arava” and set its value to the document id.

Here is the command that is sent to the server:

[image]

At this point, the server will accept this transaction and run it through the cluster. If a majority of the nodes are available, it will be accepted. This is just like before. The key here is that we are going to run all the compare exchange commands first. Here is the end result of this code:

[image]

We add both the compare exchange and the document (and the project document not shown) here as a single operation.

Here is the kicker: what happens if we run this code again?

You’ll get the following error:

Raven.Client.Exceptions.ConcurrencyException: Failed to execute cluster transaction due to the following issues: Concurrency check failed for putting the key 'usernames/Arava'. Requested index: 0, actual index: 1243

Nothing is applied and the transaction is rolled back.

In other words, you now have a way to perform consistent concurrency checks cluster wide, even in a distributed system. We made sure that a common scenario like uniqueness checks would be trivial to implement. The feature allows you to do in-transaction manipulation of the compare exchange values and to ensure that document changes will only be applied if all the compare exchange operations (and you can have more than one) have passed.

We envision this being used for uniqueness, of course, but also for high value operations where consistency is more important than availability. A good example would be creating an order for a seat in a play. Multiple customers might try to purchase the same seat at the same time, and you can use this feature to ensure that you don’t double book it*. If you manage to successfully claim the seat, your order document is updated and you can proceed. Otherwise, the whole thing rolls back.
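On the calling side, that would look something like this rough sketch (the helper and key names are invented, and the same caveat about the exact client API applies):

using Raven.Client.Documents;
using Raven.Client.Documents.Session;
using Raven.Client.Exceptions;

public class Order
{
    public string Id { get; set; }
}

public static class SeatBooking
{
    public static bool TryBookSeat(IDocumentStore store, string seatKey, Order order)
    {
        try
        {
            using (var session = store.OpenSession(new SessionOptions
            {
                TransactionMode = TransactionMode.ClusterWide
            }))
            {
                // claim the seat and create the order in one cluster wide transaction
                session.Store(order);
                session.Advanced.ClusterTransaction.CreateCompareExchangeValue(seatKey, order.Id);
                session.SaveChanges();
            }
            return true; // the seat is ours, continue with payment, etc.
        }
        catch (ConcurrencyException)
        {
            return false; // someone else claimed it first; nothing was applied
        }
    }
}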

This can significantly simplify workflows where you might have a failure mid-operation, by giving you a transactional guarantee across the whole cluster.

A cluster transaction can only delete or put documents; you cannot use a patch. This is because the result of the cluster transaction must be self-contained and repeatable. A document modified by a cluster transaction may also take part in replication (including external replication). In fact, documents modified by cluster transactions behave just like normal documents. However, conflicts between documents modified by cluster transactions and modifications that weren’t made by a cluster transaction are always resolved in favor of the cluster transaction’s modifications. Note that there can never be a conflict between modifications made by cluster transactions; they are guaranteed proper sequence and ordering by the nature of running them through the consensus protocol.

* Yes, I know that this isn’t how it actually works, but it is a nice example.

RavenDB 4.1 Features: Explain that choice

time to read 2 min | 277 words

One of the things that we do in RavenDB is try to expose as much as possible of the internal workings and logic inside RavenDB. In this case, the relevant feature we are trying to expose is the inner workings of the query optimizer.

Consider the following query, running on a busy system.

[image]

This will go to the query optimizer, which needs to select the appropriate index to run this query on. However, this process is somewhat of a black box from the outside. Let me show you how RavenDB externalizes that decision.

[image]

You can see that there were initially three index candidates for this. The first one doesn’t index FirstName, so it was ruled out immediately. That gave us a choice of two suitable indexes.

The query optimizer selected the index that has the higher number of fields. This is done to route queries away from narrower indexes, so that they can be retired sooner.

This is a simple case; there are many other factors that may play into the query optimizer’s decision, such as when an index is stale because it was just created. The query optimizer will then choose another index until the stale index catches up with all its work.

To be honest, I mostly expect this to be of use when we explain how the query optimizer works. Of course, if you are investigating “why did you use this index and not that one” in production, this feature is going to be invaluable.

RavenDB Security Vulnerability Advisory

time to read 3 min | 533 words

You can read the full details here. The short of it is that we discovered a security vulnerability in RavenDB. This post tells a story. For actionable operations, see the previous link and upgrade your RavenDB instance to a build that includes the fix.

Timeline:

  • June 6 – A routine code review inside RavenDB exposes a potential flaw in sanitizing external input. It is escalated and confirmed to be a security bug. Further investigation classifies it as a CRITICAL issue. A lot of sad faces show up on our Slack channels. The issue has the trifecta of security problems:
    • It is remotely exploitable.
    • It is on in the default configuration.
    • It provides privilege escalation (and hence, remote code execution).
  • June 6 – A fix is implemented. This is somewhat complicated by the fact that we don’t want it to look like a security fix, to avoid drawing attention to the issue.
  • June 7 – The fix goes through triple code review by independent teams.
  • June 7 – An ad hoc team goes through all related functionality to see if similar issues are still present.
  • June 8 – Fixed version is deployed to our production environment.

We had to make a choice here: whether to alert all users immediately, or to first provide the fix and urge them to upgrade (while opening them up to attacks in the meanwhile). We also wanted to avoid the fix, re-fix, for-real-this-time cycle that comes from rushing too often.

As this was discovered internally and there are no indications that it is known and/or exploited in the wild, we chose the more conservative approach and ran our full “pre-release” cycle, including a full 72-96 hours in a production environment serving live traffic.

  • June 12 – The fix is now available in a publicly released version (4.0.5).
  • June 13 – Begin notification of customers. This was done by:
    • Emailing all RavenDB 4.0 users. One of the reasons that we ask for registration even for the free community edition is exactly this: we want to be able to notify users when such an event occurs.
    • Publishing a security notice on our website.
    • Pushing a notification to all vulnerable RavenDB nodes warning about this issue. Here is what this looks like:
      [image]
  • Since June 13 – Monitoring of deployed versions and checking for vulnerable builds still in use.
  • June 18 – This blog post and a public notice on the mailing list to get more awareness of this issue. The website will also carry the following notice for the next couple of weeks to make sure that everyone knows that they should upgrade:
    [image]

We are also going to implement a better method to push urgent notices like that in the future, to make sure that we can better alert users. We have also inspected the same areas of the code in earlier versions and verified that this is a new issue and not something that impacts older versions.

I would be happy to hear what more we can do to improve both our security and our security practices.

And yes, I’ll discuss the actual vulnerability in detail in a month or so.

The Incredibles II

time to read 2 min | 273 words

I just got back from watching The Incredibles 2. The previous movie was a favorite of mine from the first viewing, and it is one of the few movies that I can actually bear to watch multiple times. I was hoping for a sequel almost from the moment I finished the first movie, and it took over a decade to get it.

I actually sat down with my 3 year old daughter to watch the first movie before I went to see the second one. I’m not sure how much she got from it, although she is very fond of trains and really loved the train scene (and then kept asking where the train was). It is unusual for me to actually “prepare” to see a movie, by the way. But it did mean that I had the plot sharp in my head and that I could directly compare the two movies.

First, in terms of the plot: it was funny, especially since I have a kid now and could appreciate a lot more of the not-so-subtle digs at parenthood.

Second, in terms of visuals: wow, it improved by a lot. The original movie held up really well visually over the past 12 years, but the new one is visibly better in this regard.

Also, at this point I nearly got a heart attack, because a talking book (again, my daughter’s) started neighing at me in the middle of the night just as I got into the house.

Highly recommended.

The case of the missing writes in Docker (a Data Corruption story)

time to read 6 min | 1017 words

[image]

We started to get reports from users running RavenDB on Docker that there are situations where RavenDB reports that there has been a data corruption event. You can see what this looks like on the right. As you can see, this ain’t a happy camper. In fact, this is a pretty scary one. The kind you see in movies that air on Friday the 13th.

The really strange part was that this is one of those errors that really should never be possible. RavenDB has a lot of internal checks, including for things that really aren’t supposed to happen. The idea is that it is better to be safe than sorry when dealing with your data. So we got this scary error, and we looked into it hard. This is the kind of error that gets top priority internally, because it touches the core of what we do: keeping data safe.

The really crazy part was that we couldn’t find any data loss event. It took a while until we were able to narrow it down to Docker, so we were checking a lot of stuff in the meantime. And when we finally began to suspect Docker, it got even crazier. At some point, we were able to reproduce this more or less at will. Spin up a Docker instance, write a lot of data, wait a bit, write more data, see the data corruption message. What was crazy about that was that we were able to confirm that there wasn’t any actual data corruption.

We started diving deeper into this, and it looked like we fell down a very deep crack. Eventually we figured out that you need the following scenario to reproduce this issue:

  • A Linux Docker instance.
  • Hosted on a Windows machine.
  • Using an external volume to store the data.

That led us to explore exactly how Docker does volume sharing. In a Linux / Linux or Windows / Windows setup, that is pretty easy; it basically re-routes namespaces between the host and the container. In a Linux container running on a Windows machine, the external volume is using CIFS. In other words, it is effectively running on a network drive, even if the network is local to the machine.

It turned out that the reproduction is not only very specific to a particular deployment, but also to a particular I/O pattern.

The full C code reproducing this can be found here. It is a bit verbose because I handled all errors. The redacted version that is much more readable is here:

This can be executed using:

And running the following command:

docker run --rm -v $PWD:/wrk gcc /wrk/setup.sh

As you can see, what we do is the following:

  • Create a file and ensure that it is pre-allocated
  • Write to the file using O_DIRECT | O_DSYNC
  • We then read (using another file descriptor) the data

The write operations are sequential, and so are the reads; however, the read operation will read past the written area. This is key. At this point, we write to the file again, to an area that we have already read.

We then attempt to re-read the data that was just written, but instead of getting the data, we get just zeroes. What I believe is going on is that we are hitting cached data. Note that this is using system calls, not any userland cache.
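For illustration only, here is the same access pattern rendered in C# (the actual repro is C code using O_DIRECT | O_DSYNC, which FileStream cannot express, so FileOptions.WriteThrough is only a rough stand-in; file names and sizes are made up):

using System;
using System.IO;

class CacheCoherencyRepro
{
    static void Main()
    {
        const string path = "test.journal";
        const int block = 4096;
        var data = new byte[block];
        Array.Fill(data, (byte)'x');
        var buffer = new byte[block * 2];

        using (var writer = new FileStream(path, FileMode.Create, FileAccess.Write,
                   FileShare.ReadWrite, block, FileOptions.WriteThrough))
        using (var reader = new FileStream(path, FileMode.Open, FileAccess.Read,
                   FileShare.ReadWrite, block, FileOptions.None))
        {
            writer.SetLength(block * 16);          // pre-allocate the file

            writer.Write(data, 0, block);          // write block 0
            writer.Flush();

            reader.Read(buffer, 0, buffer.Length); // read past the written area

            writer.Write(data, 0, block);          // write block 1, an area we already read
            writer.Flush();

            reader.Position = block;               // re-read the data just written
            reader.Read(buffer, 0, block);
            Console.WriteLine(buffer[0] == (byte)'x' ? "fresh data" : "stale zeroes");
        }
    }
}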

I reported this to Docker as a bug. I actually believe that this will happen whenever we use a CIFS share (a shared drive) to run this scenario.

The underlying issue is that we have a process that reads through the journal file and applies it, at the same time that transactions are writing to it. We effectively read the file until we are done, forcing the file data into the cache. The writes, which are using direct I/O, are going to bypass that cache, and we are going to have to wait for the change notification from CIFS to know that the cached data needs to be invalidated. That turns this issue into a race condition of data corruption, of sorts.

The reason that we weren’t able to detect data corruption after the fact was that there was no data corruption. The data was properly written to disk; we were just misled by the operating system when we tried to read it and got stale results. The good news is that even after catching the operating system cheating on us with the I/O system, RavenDB handles things with decorum. In other words, we immediately commit suicide on the relevant database. The server process shuts down the database, registers an alert and tries again. At this point, we rely on the fact that we are crash resistant and effectively replay everything from scratch. The good thing about this is that we do much better the second time around (likely because there is enough time to get the change event and clear the cache). And even if we don’t, we are still able to recover the next time around.

Running Linux containers on Windows is a pretty important segment for us: developers using Docker to host RavenDB, and it makes a lot of sense that they will be using external volumes. We haven’t gotten around to testing it out, but I suspect that CIFS writes over a “normal” network might exhibit the same behavior. That isn’t actually a good configuration for a database for a lot of other reasons, but it is still something that I want to at least be able to limp along on. Even with no real data loss, an error like the one above is pretty scary and can cause a lot of hesitation and fear for users.

Therefore, we have changed the way we handle I/O in this case: we’ll avoid using the two file descriptors and hold a bit more data in memory for the duration. This gives us more control, is actually likely to give us a small perf boost, and avoids the problematic I/O pattern entirely.

.NET Core 2.1 broke my software, thank you very much!

time to read 1 min | 137 words

We just upgraded our stable branch to .NET Core 2.1. The process was pretty smooth overall, but we did get the following exchange in our internal Slack channel.

It went something like this:

  • is it known that import doesn't work ?
  • As you can imagine, Import is pretty important for us.
  • no
  • does it work on your machine ?
  • checking,,,
  • what's an error?
  • no error.
  • so UI is blocked?
  • [image]
  • do you have any errors in dev tools console?
  • `TypeError: e is undefined`
    doesn't says to me much
  • same thing in incognito
  • export doesn't work either
  • lol the reason is: dotnet core 2.1
  • the websockets are faster and I had race in code
    will push fix shortly

There you have it, .NET Core 2.1 broke our code. Now I have to go and add Thread.Sleep somewhere…

Inside RavenDB 4.0 book is Done, Done

time to read 1 min | 98 words

And now the book is another big step closer to actually being completed. All editing has been completed, and we did a full pass through the book. All the content is written and there isn’t much left to do at all.

We are now sending this for production work, and once that is done, I can announce this project complete. Of course, by that time, I’ll have to start writing about the new features in RavenDB 4.1, but that is a story for another day.

You can get the updated bits here; as usual, I would really appreciate any feedback.
