Ayende @ Rahien


Graphs in RavenDB: The overall design

time to read 5 min | 863 words

Note: This series of posts is about a planned feature and explores how we go about building it. It is meant to solicit feedback and get more eyes on the idea; things aren’t set in stone and we don’t have a firm release date for this.

We have been wanting to add graph queries to RavenDB for several years now, but more important things always got in the way. That didn’t prevent us from discussing the idea internally and sketching out a few options. We are now looking at this more seriously, and I thought that sharing the details of our deliberations would be interesting and likely to garner some valuable feedback. I’m going to assume that the reader is at least somewhat familiar with the notion of graph data and graph queries.

Probably the most well known graph database is Neo4J, which provides the notion of nodes and edges, both of which have a type and a set of (flat) properties. This allows you to define a model of arbitrary complexity. That works if your model is purely graph based, but it doesn’t work for RavenDB, whose users are used to the document model. On the surface, this looks like a minor detail. RavenDB has documents, which can have any shape, including embedded values and collections inside them. Neo4J, on the other hand, models things differently. The simplest example that I can think of is Orders and Order Lines, where you’ll have the following models:

[Image: the Orders / Order Lines model as separate nodes and edges in Neo4J vs. a single Order document in RavenDB]

Both models have the same information, but each element in the Neo4J graph is an independent node that is linked to the others. With RavenDB, on the other hand, we have a single document that embeds a lot of the information directly. Note that the image doesn’t show that RavenDB also has other documents; the products, for example, are separate documents.

Graph databases are often used as the basis of recommendation engines, fraud detection, etc. But they are usually used to augment the capabilities of the system, rather than as the primary data store of the application. RavenDB, on the other hand, is most frequently deployed as the primary data store. We want to give our users the ability to perform graph operations, but we don’t want to lose anything that makes RavenDB useful and easy to use.

We initially thought about having the following definition:

  • Each document is (implicitly) a node in the graph.
  • You can call Link(src, dest, type, attributes) to create an edge between any two documents (see the sketch after this list).
  • Provide the usual graph queries on top of that.
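For clarity, here is a sketch of what that abandoned API might have looked like. This is purely hypothetical; the feature was never built, and the parameter types are my own guess based on the bullet above.

```csharp
// Hypothetical only: this API was considered and then dropped, it does not exist in RavenDB.
// The signature follows the bullet above; the parameter types are assumptions.
void Link(string src, string dest, string type, object attributes);

// Usage would have looked roughly like:
// Link("orders/2", "products/77", "Contains", new { Quantity = 3 });
```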

We started exploring this implementation, but it quickly led to mounting complexity. From the point of view of the user, it meant additional work: you would have to maintain your document model and the edges at the same time. That allows you to do some interesting things, but it is also likely to cause complications down the line, and very likely to cause issues when the document model and the graph model disagree with one another. Other issues relate to how you handle graphs in a distributed manner. How do you deal with the creation of an edge between two documents on one node when one of them was deleted on another?

We pushed in that direction for a while, because it was the obvious thing to do, but it turned out to be a bad idea that didn’t play well with the rest of RavenDB. The worst part was that you might modify the document properties but not define the edge, which leads to inconsistency. This was very easy to do.

The next thing we played with was to remove the Link() call and allow the user to define a background operation that would go and create the links between documents automatically whenever they were updated. This would allow us to avoid any inconsistencies between the data in the documents and the links between them. After thinking about this for a while, we went ahead with this approach, but removed the requirement for a background operation.

RavenDB will be able to use your existing document model as the graph model as well. In other words, in the model above, you have the orders/2 document, which has two links, one for each of the products. This gives us both a well defined document model, with its familiar Domain Driven Design architecture, and the ability to hop along all the pre-existing links that we already have in the model.
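To make that concrete, here is a rough sketch of the document model implied above. The class and property names are my own, loosely following RavenDB’s Northwind sample data (which the orders/2 id suggests); the point is that the string references are ordinary document ids that double as graph edges.

```csharp
using System.Collections.Generic;

// Sketch only: property names are assumptions. The string references below are
// plain document ids; in the proposed design they also serve as the graph edges.
public class Order
{
    public string Id { get; set; }            // e.g. "orders/2"
    public string Company { get; set; }       // an edge to a Company document
    public List<OrderLine> Lines { get; set; }
}

public class OrderLine
{
    public string Product { get; set; }       // an edge to a Product document
    public int Quantity { get; set; }
    public decimal PricePerUnit { get; set; }
}
```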

I’ll discuss the querying model and how it all plays together in a future post. For now, I want to show you what this looks like when we want to do a typical graph operation, friends of friends:

[Image: a friends of friends query over the document model]

More details will come in the next post…

Debug considerations for high level system architecture

time to read 4 min | 602 words

I ran into this tweet:

This resonated very strongly with me, because when we architected RavenDB 4.0, one of the key considerations was debuggability. RavenDB instances often run for months on end, usually restarted only to apply updates to the OS or the database. They often run in production environments where it is not possible to do any meaningful debugging. We rely heavily on resolving issues through minidumps, core dumps, etc. Part of the work we did in architecting RavenDB 4.0 was to sit down and think about supporting the system in production.

For many of the core components, async was right out. Part of that was because of issues relating to the unpredictability of async execution: we want certain things to always happen first, and we want to avoid thread pool starvation, growth policies, etc. But primarily, we were sick and tired of getting a dump (or even just pausing a running instance while debugging a complex situation) and having to manually reconstruct the state of the system. Parallel Stacks alone is an amazing feature for figuring out what is going on in a complex system.

The design of RavenDB called for any long lived task to run on a dedicated thread. These threads are named, so if you stop in the debugger, you can very quickly see what is actually going on there. This is also useful for things like accounting for memory, CPU time, etc. We had a problem in a particular component that was leaking memory at a rate of 144 bytes per second, just under 12 MB per day. This is something that is very easy to lose in the noise. But because we do memory accounting on a per-thread basis, it was easy to go to a system that had been running for a few weeks and see that this particular thing had 500MB of memory in use, when we expected maybe 15MB.

We still use async for handling short term operations, for example processing a single request, because these are fast and if there are problems with them, we’ll usually see them already executing.

I’m really happy with this decision, since it provided us many dividends down the line. We planned this for production, to be honest, but it ended up really helpful in normal debugging as well.

This also allows us to take advantage of the fact that a thread that is not runnable is effectively free (aside from some memory, of course), so we can dedicate a full thread to these long running tasks and greatly simplify everything. An index in RavenDB always has its own dedicated thread, which is woken up if there is anything that the index needs to process. This means that the indexing code is simple and isolated, and we can apply policies at the index level easily. For example, if I have an index with a low priority, I can just adjust the thread’s priority and let the OS do the hard work of scheduling it accordingly.
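To illustrate the idea (this is a sketch, not RavenDB’s actual code), dedicating a named thread to a long running task and adjusting its priority can be as simple as:

```csharp
using System;
using System.Threading;

// A minimal sketch of the "dedicated, named thread per long lived task" approach.
public static class LongRunningTask
{
    public static Thread Start(string name, Action work,
        ThreadPriority priority = ThreadPriority.Normal)
    {
        var thread = new Thread(() => work())
        {
            Name = name,          // the name shows up directly in the debugger and in dumps
            IsBackground = true,
            Priority = priority   // e.g. BelowNormal for a low priority index
        };
        thread.Start();
        return thread;
    }
}

// Usage (illustrative):
// LongRunningTask.Start("Index: Orders/ByCompany", RunIndexingLoop, ThreadPriority.BelowNormal);
```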

Async simplifies the programming model significantly, but it also comes at a cost in system complexity and maintenance overhead. Figuring out that you have a request stuck on a task that will never return, for example, is never pleasant. The same thing with blocking operations is immediately obvious. That is a benefit that should absolutely not be discounted.

Transactional Patterns: Conversation vs. Batch

time to read 6 min | 1136 words

When I designed RavenDB, I had a very particular use case at the forefront of my mind. That scenario was a business application talking to a database, usually as a web application.

These kinds of applications have a particular style of communication with the user. As you can see below, there are two very distinct operations: show the user the data, followed by some “think time” (seconds at minimum, but it can be much longer), and then followed by an action.

[Image: the request flow: show the user the data, some think time, then an action]

This shouldn’t really be a surprise for anyone who has developed any kind of application in the last decade or two, so why do I mention it explicitly? I mention it because of the nature of the communication between the application and the database.

Some databases use a conversation pattern with the application. In terms of API, this will look something like this:

  • BeginTransaction()
  • Update()
  • Insert()
  • Commit()

This is a very natural model and should be quite familiar to most developers. The alternative to this method is to use batches:

  • SaveChanges( [Update, Insert] )

I want to use this post to talk about the difference between the two styles and how that impacts your work. Relational databases use the conversation style, while RavenDB uses the batch style. On the surface, it looks like it would be more complex to use RavenDB to achieve the same task, but there is very little difference in the API as far as the user is concerned. In both cases, the code looks very much the same:
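The original post embedded a code snippet here; as a stand-in, this is a minimal sketch of what such RavenDB session code typically looks like (the Order and Shipment entities and the ids are illustrative, and store is an already initialized DocumentStore):

```csharp
// Nothing is sent to the server until SaveChanges() is called; it then ships the
// update and the insert as a single transactional batch.
using (var session = store.OpenSession())
{
    var order = session.Load<Order>("orders/2");
    order.Status = "Shipped";                  // the update

    session.Store(new Shipment                 // the insert (Shipment is illustrative)
    {
        OrderId = order.Id,
        ShippedAt = DateTime.UtcNow
    });

    session.SaveChanges();                     // one request, one transaction
}
```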

Behind the scenes, however, the RavenDB code will send just a single request to the server, while a relational database will need four separate commands to execute the transaction. In many cases, you can send all of these commands to the server in a single roundtrip, but that is an optimization that doesn’t always work and often isn’t applied even when it is possible.

Sidebar: Reducing server roundtrips

Why is the reduction in server roundtrips so important? Because it has a lot of implications for the overall performance of the system. In many cases the cost of making a remote query from the application to the database far outstrips the cost of actually executing the query. This ties closely to the Fallacies of Distributed Computing. Latency isn’t zero, even though when you develop locally it certainly seems like it is.

The primary goal of this design in RavenDB was to reduce the number of network roundtrips that your application must endure. Because in the vast majority of cases your application is going to treat “show data” and “modify data” as two separate operations (often separated by a long idle time), there is a lot of value in having the database interaction model match what you will actually be doing.

As it turned out, there are some additional advantages (and disadvantages, which I’ll cover a bit later) to this approach, beyond just the obvious reduction in the number of server roundtrips.

When the server gets all the operations that need to be done in a single request, it can apply all of them at once. For that matter, it can choose how to apply them in the most optimal order. This gives the database server a lot more chances for optimization. It is similar to going to the supermarket with a list of items to purchase vs. a treasure hunt. When you have the full list, you can decide to pick things up based on how close they are on the shelves. If you only get the next instruction after you complete the previous one, you have no opportunity for optimization.

When using the conversation style, durability and state management become more complex as well. Relational databases typically use some variation of ARIES for their journals. This is because they need to record information about ongoing transactions that haven’t yet been committed. This adds significant complexity to the amount of work that is required from the database engine. Furthermore, when running in a distributed system, you need to share this transaction state (which hasn’t yet been committed!) across the nodes to allow failover of the transaction if the server fails. With the conversation style, you need to support concurrent transactions all operating at the same time and potentially reading and modifying the same data. This leads to a great deal of code that is required to properly manage locking and latching inside the database engine.

On the other hand, batch mode gives the server all the operations in the transaction in a single go. This means that failover can simply be sending the batch of operations to another node, without the need to share complex state between them. It means that the database server has all the required information and can make decisions based on it. For example, if there are no data dependencies, it can execute the operations in the transaction in whatever order it desires, leading to more optimal execution time. The database can also mix & match operations from different transactions into a single batch (as long as it keeps the externally visible behavior consistent, of course) to optimize things even further.

There are two major disadvantages to batch mode. The first is that there is usually a strict separation of reads from writes, which means that you usually can’t have a single consistent read/modify operation that stays in the same transaction. The second issue is similar: because you need to generate all the operations ahead of time, you can’t decide what operations to execute based on the data you read, at least not in the same transaction. The typical solution for that is to send a script in the batch. This script can then read and modify data in the same context, apply logic, etc. The important thing here is that the script runs inside the server, already inside the transaction. This means that you don’t pay network round trip times for such operations.
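As an illustration, here is a sketch of sending such a script along with the write, so the read / modify logic runs inside the server-side transaction. This assumes the PatchOperation / PatchRequest API of the RavenDB client; the exact names can vary between client versions, and the ids and logic are made up.

```csharp
// Sketch: the script runs on the server, inside the transaction, so it can read the
// current state, apply logic and modify the document without extra roundtrips.
store.Operations.Send(new PatchOperation("orders/2", changeVector: null,
    new PatchRequest
    {
        Script = @"if (this.Status === 'Pending') { this.Status = 'Approved'; }"
    }));
```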

On the other hand, it means that you need to write potentially complex logic in the database’s scripting language, rather than your own platform, which you’ll likely prefer.

Luckily, for most scenarios, especially with web applications, you don’t need to execute complex logic on the server side. You can usually just send the commands you need in a single batch and be done with it. Often, just having optimistic concurrency is enough to get you the consistency you want, with scripting reserved for the more exceptional cases.

RavenDB’s usage scenario was meant to make the common operations easy and the hard stuff possible. I think that we got it right and ended up with an API that is functional, highly performant, and one that has withstood the test of time very well.

The iterative design process: Query parameters example

time to read 4 min | 660 words

When we start building a feature, we often have a pretty good idea of what we want and how to get there. And then we actually start building it, and we often end up with something that is quite different (and usually much better). It has gotten to the point where we aren’t even trying to write hard specs and detailed designs beyond the exploratory level. For example, in the design of RavenDB 4.0, there was not even a mention of RQL. That ended up being a very late addition to the codebase, but it improved RavenDB significantly. On the other hand, the low level mechanisms of zero copy documents, from Voron all the way to the network, were designed up front, but only at a fairly high level.

In this post, I want to talk about query parameters in RavenDB. Actually, let me be more specific: we have query parameters, but what we don’t have (or rather, didn’t have, because it will have been merged by the time you read this post) is the ability to run parameterized queries from the studio. We always meant to have that capability, but we ran out of time with the 4.0 release. As we are gearing up to the 4.1 release, we are clearing the table of the major-minor issues (major in terms of impact, minor in terms of the amount of work required). Query parameters in the studio are one such example. Here is what this looks like:

[Image: a parameterized query in the studio]

My first thought was to just build something like this:

[Image: a mockup with a separate grid for defining the query arguments]

Give the user the ability to define arguments and be done with it. The task was assigned to one of our developers and I expected to get a PR in a short while.

This particular developer has a tendency to consider not just the task at hand but also other aspects of the problem. He didn’t want the user to have to manually specify each argument, since that has poor ergonomics. Instead, he wanted the studio to figure it out on its own and help the user. So the first thing he did was detect the arguments (regex: “\$\w+”) and present them in the grid. Then there was the issue of how to deal with edits, etc. Then he ran into another problem: types. Query parameters can be more than just strings; they can be any JSON data type.
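The studio itself is written in TypeScript, but just to illustrate the detection step, here is a sketch in C# using the regex mentioned above (the query text is made up):

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Illustration only: find the distinct parameter names that appear in the query text.
var queryText = "from Orders where Company = $company and Freight > $minFreight";
var parameters = Regex.Matches(queryText, @"\$\w+")
    .Cast<Match>()
    .Select(m => m.Value)
    .Distinct()
    .ToList();   // ["$company", "$minFreight"]
```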

Here is what he came up with:

[Image: the query parameters declared as text directly above the query]

Instead of having to define the query parameters in a separate location, you just put them right in. Using a parameters grid involves pointing and clicking with the mouse, entering possibly complex values (such as long arrays), and in general much more work than just having them right above the query.

Note that this is a studio only feature; queries from the client API already have a way to specify arguments properly. So the next question is how we are going to pass the arguments to the server. Remember, this is only in the studio, so we can take quite a few shortcuts. In this case, we’ll simply snip off the entire first section of the query text (which contains the query parameters). We can do that by going from the start of the query to the first from or declare keyword. We do some basic pre-processing to turn “$name = …“ into “results.$name = …“ and then just execute this code in the browser, giving us a JS object with all the parameters that we can then send to the server.

The next stage is to make this discoverable, by detecting parameters whose value is not provided and giving the user a quick fix to add them.

Playing with graphs and logic systems

time to read 3 min | 554 words

Recently I have been playing with graphs a bit, trying to understand them in more depth. Because I learn much better by doing, I thought that I would build a toy graph query engine to see how that works. I loaded the MovieLens small data set into a set of C# classes and started playing with them.

Here is what the source data looks like:
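The original code isn’t reproduced here, so this is a rough approximation of what such classes might look like, based on the fields in the MovieLens small data set (movies.csv and ratings.csv):

```csharp
using System.Collections.Generic;

// Sketch only: names and shapes are my own approximation, not the post's actual code.
public class Movie
{
    public int MovieId;
    public string Title;
    public string[] Genres;
    public List<Rating> Ratings = new List<Rating>();  // ratings of this movie
}

public class User
{
    public int UserId;
    public List<Rating> Ratings = new List<Rating>();  // ratings made by this user
}

public class Rating
{
    public User User;
    public Movie Movie;
    public double Value;   // 0.5 .. 5.0
}
```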

I’m not dealing with typical issues, such as how to fetch the data, optimizing indexes, etc. Instead, I want to focus solely on the problem of finding patterns in the graph.

Here is a simple example of a pattern in the graph:

(userA:User)-[:Rated]->(movie:Movie)<-[:Rated]-(userB:User)

The syntax is called Cypher, which is commonly used for graph queries.

What we are trying to find here is a set of triads. User A who rated a movie that was also rated by user B. The result of this query is a list of tuples matching (userA, movie, userB).

This is really similar to the way I remember learning Prolog, so I thought about giving it a shot and solving the problem in this way.

The first thing to do is to break the query itself into independent steps:

(userA:User)-[:Rated]->(movie:Movie) AND (userB:User)-[:Rated]->(movie:Movie)

Note that in this case, the first and second queries are exactly the same, but now they are somewhat easier to reason about. We just need to do the match ups properly; here is how I would write the code:
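Again, the original snippet isn’t shown here; a sketch of that match-up, using the classes above and an allRatings list (an assumption) holding every Rating, might look like this:

```csharp
using System.Linq;

// Written deliberately as a join over all ratings, which makes the
// cartesian-product flavor of the query obvious.
var triads =
    from r1 in allRatings                      // (userA)-[:Rated]->(movie)
    join r2 in allRatings                      // (userB)-[:Rated]->(movie)
        on r1.Movie equals r2.Movie
    where r1.User != r2.User
    select (userA: r1.User, movie: r1.Movie, userB: r2.User);
```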

This query can take a while to run, because even on the small data set (with just 100,004 recommendations and 671 users) there are over 6.2 million such connections. And yes, I used a join intentionally, because it showcases the interesting problem of the cartesian product.

Now, these queries aren’t really interesting and they can be quite expensive. A better query would be to find the set of movies that were rated by both user 1 and user 306. This can be done as simply as changing the previous code’s starting location:
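A sketch of that change, starting from the two users’ own ratings instead of the full ratings list (the users collection is an assumption):

```csharp
// Movies rated by both user 1 and user 306.
var user1   = users.First(u => u.UserId == 1);
var user306 = users.First(u => u.UserId == 306);

var sharedMovies =
    from r1 in user1.Ratings
    join r2 in user306.Ratings
        on r1.Movie equals r2.Movie
    select r1.Movie;
```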

Again, this is a pretty simple scenario. A more complex one would be to find a list of movies a particular user has not rated that were rated by people who liked the same movies as this user. As a query, this will look roughly like this:

(userA:User)-[:Rated(Rating >= 4)]->(:Movie)<-[:Rated(Rating >= 4)]-(userB:User) AND (userB:User)-[:Rated(Rating >= 4)]->(notRatedByA:Movie) AND NOT (userA:User)-[:Rated]->(notRatedByA:Movie)

Note that this merely specifies the first part: find me users that liked the same movies as userA. The second part is a bit more complex; we want to find movies rated by the second users and exclude movies rated by the first. Let’s break it into its component parts, shall we?

Here is the code for the first clause:  (userA:User)-[:Rated(Rating >= 4)]->(:Movie)<-[:Rated(Rating >= 4)]-(userB:User)
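A sketch of that clause, continuing with the same classes (userA is assumed to be the User we start from):

```csharp
// Users who liked (rating >= 4) at least one movie that userA also liked.
var likedSameMovies =
   (from r1 in userA.Ratings
    where r1.Value >= 4
    from r2 in r1.Movie.Ratings
    where r2.Value >= 4 && r2.User != userA
    select (userA: userA, userB: r2.User))
   .Distinct();
```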

As you can see, the output of this code is a set of (userA, userB) pairs. Now, let’s go to the second one, shall we? We already have a match on userB in this case, so we can start evaluating that. Here is the next stage: (userB:User)-[:Rated(Rating >= 4)]->(notRatedByA:Movie)
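A sketch of that stage, feeding off the (userA, userB) pairs from the previous step:

```csharp
// For each matched userB, the movies they liked; these are only candidates,
// the "not rated by userA" exclusion comes in the next stage.
var candidates =
    from pair in likedSameMovies
    from r in pair.userB.Ratings
    where r.Value >= 4
    select (userA: pair.userA, movie: r.Movie);
```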

Now we have the last stage, where we need to filter things out:
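A sketch of the final filter, removing anything userA has already rated:

```csharp
using System.Collections.Generic;
using System.Linq;

// Keep only movies that userA has not rated at all.
var ratedByA = new HashSet<Movie>(userA.Ratings.Select(r => r.Movie));

var recommendations = candidates
    .Where(c => !ratedByA.Contains(c.movie))
    .Select(c => c.movie)
    .Distinct()
    .ToList();
```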

And now we have the final results.

For me, thinking about these kinds of queries as “fill in the blanks” makes the most sense.

Pruning issues and the idle bin

time to read 4 min | 666 words

Part of the job of a product owner is to pay attention to the list of issues in the issue tracker. Not just to get a feeling for the cadence of the project, but to have an impact on its direction.

Paying attention to the issues doesn’t mean just tracking which bugs are still open, mind. Consider the case of a product owner with the release due date looming on the horizon: you need to start looking at the list of remaining issues and take active steps to make sure that you are going to get done more or less on time.

The usual rules apply; choose any two of:

  • Speed
  • Quantity
  • Quality

In other words, your team can deliver more features on time if you are willing to sacrifice quality. On the other hand, they can keep the quality high and the same number of features, but the due date will have to move.

As an aside, it is possible to get all three of these aspects at once, but only for a very short amount of time (a few days to a week or two at most), and at a very high long term cost.

One of the things that I have observed is that in some cases, a lot of the complexity and work is in the last 2%, where all the polish work and rough edge cases lurk. In some respects, this is actually a really good thing, because it gives the product owner the chance to remove features that won’t usually have an explicit impact on the users. A good example of this in RavenDB would be the amount of time and effort we put into the intellisense feature for RQL queries in the studio. That falls under the Nice To Have set of features. It is unlikely that we’ll get many upset users if the intellisense isn’t up to par with something like Visual Studio or ReSharper, so beyond getting some basic functionality right, we can defer improvements there if we don’t have the extra capacity to complete them by the expected date.

I’m sure that you can think of other examples in your own projects. Note that this requires you to understand what exactly your users value your software for. In the case of RavenDB, adding more query functionality and speeding up overall system performance rank much higher than adding extra smarts to intellisense that is mostly used during exploration / demos.

On the other hand, the effect of pushing such features down the road accumulates over time. In other words, if you keep your priorities straight and select which features should go into the product, you will defer the small fries over and over. At some point, you’ll need to make a decision about them. You can either decide that they don’t make sense anymore or that they are never really going to be important enough to actually put in the “let’s get this done” queue.

Alternatively, you might want to put them in the idle bin. In other words, whenever you have an idle portion in your development, you can peek into the idle bin and pick up some tasks from there. That is also a good place for a new team member to start from. These are tasks that are minor and not that important, after all, so they can use them to learn the codebase. In fact, we have used this in the past as the task bin for interns. That is usually a really good fit, for the same reason that they are good tasks for a new team member, with the added benefit that they are usually well scoped, and if the intern messes up, you didn’t lose too much.

Regardless, the idle bin notion is important, because otherwise your future tasks queue is going to grow larger and larger, and it will be ever harder to figure out what tasks actually matter.

Dealing with massively distributed data flows

time to read 4 min | 610 words

Imagine that you are the owner of Gary’s Shoes, and that you want to get data from your multitude of stores into a centralized location. You’ll use that data to make decisions, predict future trends, etc. Given that each store must operate independently, you have a server in each location that will push its changes up to (and get updates from) the HQ cluster. You can see an example of this kind of setup in this post.

This works quite well, but it does require the user to be aware of a potential issue. When you have a massively distributed data flow process set up, you also need to pay attention to the quiet in the noise. What do I mean by that?

One of our customers has RavenDB deployed to tens of thousands of locations worldwide. At any given time, at least some of those locations are going to be unavailable. In some locations, part of closing down for the day means literally flipping the master switch on electricity for the entire building. In others, you might have someone tripping over the router, or some local or regional network outage.

Part of the strategy for dealing with such a data set, coming from so many separate locations, is the need to monitor when we aren’t getting data. The fact that most of our locations give us near real time data is very powerful for the business. But you also need to see where you aren’t getting the data from and set up proper alerts and monitoring for the missing data. From a business perspective, it is also advisable to surface that kind of detail all the way to the user. If you are going to order inventory for the stores in a particular state, but the two major stores in the area are down because of a network issue and have been down for two days now, you want to be aware of that and realize that you are working with out of date data.

To be honest, the issue isn’t so much about two days of lag in the case of a once in a blue moon type of error. In the scenario outlined above, in pretty much all business scenarios that I can think of, you won’t really see any impact on the decision making of the organization.

The killer is when you have some sort of problem that goes on for a while. A DNS update that was missed because of a bad DNS cache policy, for example. Now your updates to HQ go into the void on a consistent basis. On the other hand, everything else continues to function properly, both locally and at HQ. If this isn’t accounted for, it is easy to miss it for a long period of time. I have seen such a case that was only discovered when the year’s end numbers didn’t quite match what they were supposed to be. Given that this was the second year in a row this happened, the investigation found that a network issue had indeed caused a very long term topology failure. This was actually properly reported, in a log file that no one ever read.

Lesson learned: make sure that part of your data flow strategy accounts for such things and brings them to the users’ attention. Actually resolving the issue was a network configuration change that took minutes, and the entire dataset was synchronized within a few hours afterward. But finding out that there was even a problem took effectively forever.

Modeling Milk: A discussion on domain modeling

time to read 2 min | 342 words

I recently had a discussion at work about the complexity of modeling data in real world systems. I used the example of a bottle of milk in the discussion, and I really like it, so I thought it would make for a good blog post.

Consider a supermarket that sells milk. In most scenarios, this is not exactly a controversial statement. How would you expect the system to model the concept of milk? The answer turns out to be quite complex, in practice.

To start with, there is no one system here. A supermarket is composed of many different departments that work together to achieve the end goal. Let’s try to list some of the most prominent ones:

  • Cashier
  • Stock
  • Warehouse
  • Product catalog
  • Online

Let’s see how each of these thinks about milk, shall we?

The cashier rings up a specific bottle of milk, but aside from that, they don’t actually care. Milk is fungible (assuming the same expiry date). The cashier doesn’t care which particular milk carton was sold, only that milk was sold.

The stock clerks care somewhat about the specific milk cartons, but mostly because they need to make sure that the store doesn’t sell any expired milk. They might also need to remove milk cartons that don’t look nice (crumpled, etc).

The warehouse cares about the number of milk cartons that are in stock on the shelves and in the warehouse, as well as predicting how much should be ordered.

The product catalog cares about the milk as a concept, the nutritional values, its product picture, etc.

The online team cares about presenting the data to the user, mostly in the same way as the product catalog, until it hits the shopping cart / actual order. The online team also does prediction, based on past orders, and may suggest shopping carts or items to be purchased.

All of these departments are talking about the same “thing”, or so it appears, but it looks, behaves, and is acted upon in very different ways.
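To make the point a bit more concrete, here is a small sketch (all names are mine, not from any real system) of how the same milk might be modeled separately per department, rather than as one shared class:

```csharp
using System;

// Product catalog: milk as a concept.
public class CatalogProduct
{
    public string Sku;               // e.g. "milk-1l-3pct"
    public string Name;
    public string NutritionalInfo;
    public string ImageUrl;
}

// Cashier: a fungible line on a receipt; no specific carton is tracked.
public class ReceiptLine
{
    public string Sku;
    public int Quantity;
    public decimal Price;
}

// Stock / warehouse: concrete cartons, with expiry dates and counts.
public class StockBatch
{
    public string Sku;
    public DateTime ExpiryDate;
    public int QuantityOnShelf;
    public int QuantityInWarehouse;
}
```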

Working with legacy embedded types inside documents

time to read 2 min | 338 words

Databases hold data for long periods of time. Very often, they keep the data for longer than a single application generation. As such, one of the tasks that RavenDB has to take care of is the ability to process data from older generations of the application (or even from a completely different application).

For the most part, there isn’t much to it, to be honest. You process the JSON data and can either conform to whatever is in the database or use your platform’s tooling to rename things as needed. For example:
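The original example isn’t shown here; a minimal sketch of the “use your platform’s tooling” option, assuming the Json.NET attributes that the RavenDB .NET client honors, would be:

```csharp
using Newtonsoft.Json;

// The documents on disk still use the old property name, while the entity
// exposes the new one (names here are illustrative).
public class Customer
{
    public string Id { get; set; }

    [JsonProperty("FullName")]         // the old name stored in existing documents
    public string Name { get; set; }   // the new name used by the application
}
```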

There are a few wrinkles still. You can use RavenDB with dynamic JSON objects, but for the most part, you’ll use entities in your application to represent the documents. That means that we need to store the type of the entities you use. At the top level, we have metadata elements such as:

  • Raven-Clr-Type
  • Raven-Java-Class
  • Raven-Python-Type
  • Etc…

This is something that you can control, using the Conventions.FindClrType event. If you change the class name or assembly, you can use that to tell RavenDB how to treat the old values. This requires no changes to your documents and only a single modification to your code.

A more complex scenario happens when you are using polymorphic behavior inside your documents. For example, let’s imagine that you have an Order document with an internal property called Payment, which can be any of the following types:

  • Legacy.CreditCardPayment
  • Legacy.WireTransferPayment
  • Legacy.PayPalPayment

How do you load such a document? If you try to just deserialize it, you’ll get a deserialization error. The type information for the polymorphic property is encoded in the document, and you’ll need these legacy types to successfully load the document.

Luckily, there is a simple solution. You can customize the JSON serializer like so:
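A minimal sketch of that customization, assuming the client’s convention hook for the underlying Json.NET serializer (the exact convention name can differ between client versions):

```csharp
var store = new DocumentStore
{
    Urls = new[] { "http://localhost:8080" },   // illustrative
    Database = "Orders"
};

// Plug in a serialization binder that knows about the legacy type names.
store.Conventions.CustomizeJsonSerializer = serializer =>
    serializer.SerializationBinder = new LegacyPaymentTypesBinder();

store.Initialize();
```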

And the implementation of the binder is straightforward from that point:
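A sketch of such a binder, mapping the legacy type names embedded in the documents to whatever types the current application uses (the target types here are assumptions):

```csharp
using System;
using Newtonsoft.Json.Serialization;

public class LegacyPaymentTypesBinder : DefaultSerializationBinder
{
    public override Type BindToType(string assemblyName, string typeName)
    {
        switch (typeName)
        {
            case "Legacy.CreditCardPayment":   return typeof(CreditCardPayment);
            case "Legacy.WireTransferPayment": return typeof(WireTransferPayment);
            case "Legacy.PayPalPayment":       return typeof(PayPalPayment);
            default:                           return base.BindToType(assemblyName, typeName);
        }
    }
}
```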

In this manner, you can decide to keep the existing data as is or migrate it slowly over time.

Errors, exceptions and faults, oh my!

time to read 10 min | 1947 words

If we could code for the happy path only, I think that our lives would have been much nicer. Errors are hard, because you keep having to deal with them, and even basic issues in error handling can take down systems that are composed of thousands of nodes.

I went out to look at research about error handling rates, and I found this paper. It says that about 3% of code (C#, mind you) is error handling. However, it counts only the code inside catch / finally blocks as error handling. My recent foray into C gives me another data point: the short version, with no memory handling, is 30 lines of code; the long version, with error handling, is over 100.

If I had to guess, I would say that error handling is at least 10 – 15%, and I wouldn’t be surprised by 25 – 30%. In C# and similar languages, a centralized error handling strategy can help a lot in this regard, I think.

Anyway, let’s explore a few options for error handling:

The C way – return codes. This sucks. I think that this is universally known to suck. In particular, there is no rhyme or reason to return codes. Sometimes you need to check for INVALID_HANDLE_VALUE, sometimes for a value that is different from zero. Sometimes the return code is the error code; in other cases you need to call a separate function to get it. It also forces you into a very localized error handling mode. All error handling must be done all the time, which can easily lead either to a single forgotten return code causing issues down the line (forgetting to check the fsync() return code caused data corruption in Postgres, for example) or to really bad code where you lose sight of what is actually going on because there is so much error handling that the real functionality goes into hiding.

The return code model also doesn’t compose very well in the case of complex operations failing midway. It doesn’t provide contextual information or allow you to get stack traces easily. Each of these is important if you want to have a good error handling strategy (and a good debugging / troubleshooting experience).

So the C way of doing things is out. What are we left with? We have a few options:

  • Go with multiple return values
  • Rust with Option<T>, Result<T>
  • Node.js with callbacks
  • C# / Java with exceptions

Let’s talk about the Go approach for a bit. I think that it is universally loathed as being very similar to the C method and causing a lot of code repetition. On the other hand, at least we don’t have GetLastError() / errno to deal with. And one advantage of Go in this regard is that the defer statement allows you to handle state much more cleanly (you can just return and any resource will be cleaned up). This means that the code may be repetitive to write, but it is much easier to review.

The problem with this approach is that it is hard to compose errors. Imagine a method that needs to read a string from the network, parse a number from the string and then update a value in a file. Without error handling, this looks like so:

I haven’t even written the file handling path, mostly because it got too tiring. In this case, there are so many things that can go wrong. The code above handles failure to make the request, failure to read the value from the server, failure to parse the string, etc. With a file, you need to handle failure to open the file, read its content, parse it, do something with the value from the server and the file value, and then serialize the value back to bytes to be written to the file. About every other word in the previous statement requires some form of error handling. And the problem is that when we have a complex system, we don’t just need to handle errors, we need to compose them so they make sense.

An EPERM error from somewhere is pretty useless, so having the file name is a huge help in figuring out what the problem was. But knowing that the error actually happened because we tried to save the data to the on-disk cache gives me the proper context for the error. The problem with errors is that they can happen very deep in the code path, while the policy for handling such errors belongs much higher in the stack.

Rust’s approach to errors is cleaner than Go’s: you don’t have multiple return values, but rather the result is wrapped in a Result / Option value that you need to explicitly handle. Rust also contains some syntax sugar to make this pretty easy to write.

However, Rust error handling just plain sucks when you try to actually compose errors. Imagine the case where I want to do several operations, some of which may fail. I need to report success if all of them passed, but an error if any of them errored. For a bit more complexity, we need to provide good context for the error, so the error isn’t something as simple as “int parse failure” but carries enough detail to know that it was an int parse failure on the sixth line of a particular file that belongs to a certain operation.

The reason I say that Rust sucks for this is that consuming errors is pretty simple, but producing them? The suggestion to library authors is to implement your own Error type. That means that you need to implement the Display trait manually, and you need to write a separate From implementation for each error that you want to compose. If your code suddenly needs to handle a new error type, you deal with that by writing a lot of boilerplate code. Any change to the error enum requires touching multiple places in the code, violating SRP. You can use Box<Error>, it seems, but in that case you just have “an error occurred” and it is complex to get back the real error and act on it.

A major complication with all the “return something” options is the fact that they usually don’t provide you with a stack trace. I think that having a stack trace in the error is extremely helpful for actually analyzing a problem and being able to tell what actually happened.

Callbacks, such as those used by node.js, are pretty horrible. On the one hand, it is much easier to provide the context, because you are called from the error site and can check your current state. However, there is only so much that you can do in such a case, and state management is a pain. Callbacks have proven to be pretty hard to program with, and the industry as a whole is moving to the async/await model instead. This gives you a sequential-looking mechanism and a much better way to reason about the actions of the system.

Finally, we have exceptions. There are actually several different models for exceptions. You have Java with checked exceptions, with the associated baggage there (you cannot change the interface, they require explicit handling, etc). There is the Pony language, which has “exceptions”. That is a really strange choice of implementation. Pony has exceptions for flow control, but it doesn’t give you any context about the actual error, just that one happened. The proper way of handling errors in Pony is to return a union of the result and the possible errors (similar to how Rust does it, although the syntax looks nicer and there is less work).

I’m going to talk about C#’s exceptions. Java’s exceptions, except for some of them being checked, are pretty much the same.

Exceptions have the nice property that they are easily composable; it is easy to decide to handle some errors and to pass others up the chain. Generic error handling is also easy. Exceptions are problematic because they break the flow of the code. An exception in one location can be handled somewhere completely different, and there is no way for you to see that when looking at the code. In fact, I’m not even aware of any IDE / tooling that can provide you this insight.

In languages with exceptions, you can also have an exception thrown at pretty much any location, which means that you need to write exception-safe code to make sure that an exception doesn’t leave your code in an inconsistent state. There is also a decidedly non-trivial cost to exceptions. To start with, many optimizations are mitigated by try blocks, and throwing exceptions is often very expensive. Part of that is the fact that we need to capture the oh so valuable stack trace, of course.

There is also another aspect of error handling to consider. There are many cases where you don’t care about errors: any time that you have generic framework code that calls into user code. An HTTP handler is a good example of that. You call the user’s code to handle the request, and you don’t care about errors. You simply catch the error and return a 500 / message to the client. Any error handling strategy must handle both scenarios: the “I really care about every single detail and want a separate error handling code path for everything” one and the “I just want to know if there is an error and print it, nothing else” one.
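As an illustration of the second scenario, here is a sketch of framework code that calls user code and maps any failure to a 500 (the RequestContext type is made up for the example):

```csharp
using System;
using System.Threading.Tasks;

public async Task HandleRequest(RequestContext context, Func<RequestContext, Task> userHandler)
{
    try
    {
        await userHandler(context);   // the user's code; we don't care what goes wrong inside
    }
    catch (Exception e)
    {
        context.StatusCode = 500;
        await context.WriteAsync(e.ToString());   // report it, nothing else
    }
}
```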

In theory, I really love the Rust error handling mechanism, but the complexity of composability and generic handling means that it is a lot less convenient to actually consume and produce errors. Exceptions are great in terms of composability and the amount of detail they provide, but they also break the flow of the code and introduce separate, invisible code paths that are hard to reason about in many cases. On the other hand, exceptions allow you to bubble errors upward natively and easily, until you get to a location that can apply a particular error handling policy.

A good example comes from a recent issue we had to deal with. When running on a shared drive, a file delete isn’t processed immediately; there is a gap of time in which the delete command seems to have succeeded, but attempting to re-create the file will fail with EEXISTS (and trying to open the file will give you ENOENT, so that’s fun). In this case, we throw the error up the stack. In our use case, we only hit this situation when dealing with temporary files, and given that they are temporary, we can detect this scenario and use another file name to avoid the issue. So we catch a FileNotFoundException and retry with a different file name. This goes through four or five layers of code and was pretty simple to figure out and implement.
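A sketch of that policy (not RavenDB’s actual code): when opening a temporary file trips over the shared drive’s delayed delete, retry with a fresh name.

```csharp
using System;
using System.IO;

static Stream OpenTempFile(string directory)
{
    for (var attempt = 0; ; attempt++)
    {
        var path = Path.Combine(directory, $"scratch-{Guid.NewGuid()}.tmp");
        try
        {
            return File.Open(path, FileMode.CreateNew, FileAccess.ReadWrite);
        }
        catch (FileNotFoundException) when (attempt < 5)
        {
            // A pending delete on the shared drive; pick another name and try again.
        }
    }
}
```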

Doing that with error codes is hard, and adding another member to the Error type will likely have cascading implications for the rest of the code. On the other hand, throwing a new exception type from a method can also break the contract, explicitly in languages like Java and implicitly in languages like C#. In fact, with C#, for example, the implied assumption is always: “this can throw the following exceptions for known error cases, and other exceptions for unexpected ones”. This is similar to checked exceptions vs. runtime exceptions in Java, but here it is the implicit default and it gives you more freedom overall when writing your code. Checked exceptions sound great, but they have proven to be a problem for developers in practice.

Oh well, I guess I won’t be able to solve the error handling problem perfectly in a single blog post.
