Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:


+972 52-548-6969

, @ Q j

Posts: 6,783 | Comments: 48,880

filter by tags archive
time to read 3 min | 505 words

Computation during indexes open up some nice  features when we are talking about data modeling and working with your data. In this post, I want to discuss predicting the future with it. Let’s see how we can do that, shall we?

Consider the following document, representing a (simplified) customer model:


We have a customer that is making monthly payments. This is a pretty straightforward model, right?

We can do a lot with this kind of data. We can obviously compute the lifetime value of a customer, based on how much they paid us. We already did something very similar in a previous post, so that isn’t very interesting.

What is interesting is looking into the future. Let’s see how we can start simple, but figuring out what is the next charge rate for this customer. For now, the logic is about as simple as it can be. Monthly customers pay by month, basically. Here is the index:


I’m using Linq instead of JS here because I’m dealing with dates and JS support for dates is… poor.

As you can see, we are simply looking at the last date and the subscription, figuring out how much we paid the last three times and use that as the expected next payment amount. That can allow us to do nice things, obviously. We can now do queries on the future. So finding out how many customers will (probably) pay us more than 100$ on the 1st of Feb both easy and cheap.

We can actually take this further, though. Instead of using a simple index, we can use a map/reduce one. Here is what this looks like:


And the reduce:


This may seem a bit dense at first, so let’s de-cypher it, shall we?

We take the last payment date and compute the average of the last three payments, just as we did before. The fun part now is that we don’t compute just the single next payment, but the next three. We then output all the payments, both existing (that already happened) and projected (that will happen in the future) from the map function. The reduce function is a lot simpler, and simply sum up the amounts per month.

This allows us to effectively project data into the future, and this map reduce index can be used to calculate expected income. Note that this is aggregated across all customers, so we can get a pretty good picture of what is going to happen.

A real system would probably have some uncertainty factor, but that touches on business strategy more than modeling, so I don’t think we need to go into that here.

time to read 4 min | 613 words

imageIn my last post on the topic, I showed how we can define a simple computation during the indexing process. That was easy enough, for sure, but it turns out that there are quite a few use cases for this feature that go quite far from what you would expect. For example, we can use this feature as part of defining and working with business rules in our domain.

For example, let’s say that we have some logic that determine whatever a product is offered with a warranty (and for how long that warranty is valid). This is an important piece of information, obviously, but it is the kind of thing that changes on a fairly regular basis. For example, consider the following feature description:

As a user, I want to be able to see the offered warranty on the products, as well as to filter searches based on the warranty status.

Warranty rules are:

  • For new products made in house, full warranty for 24 months.
  • For new products from 3rd parties, parts only warranty for 6 months.
  • Refurbished products by us, full warranty, for half of new warranty duration.
  • Refurbished 3rd parties products, parts only warranty, 3 months.
  • Used products, parts only, 1 month.

Just from reading the description, you can see that this is a business rule, which means that it is subject to many changes over time. We can obviously create a couple of fields on the document to hold the warranty information, but that means that whenever the warranty rules change, we’ll have to go through all of them again. We’ll also need to ensure that any business logic that touches the document will re-run the logic to apply the warranty computation (to be fair, these sort of things are usually done as a subscription in RavenDB, which alleviate that need).

Without further ado, here is the index to implement the logic above:

You can now query over the warranty types and it’s duration, project them from the index, etc. Whenever a document is updates, we’ll re-compute the warranty status and update the index.

This saves you from having additional fields in your model and greatly diminish the cost of queries that need to filter on warranty or its duration (since you don’t need to do this computation during the query, only once, during indexing).

If the business rule definition changes, you can update the index definition and RavenDB will effectively roll out your change to the entire dataset. That is nice, but even though I’m writing about cool RavenDB features, there are some words of cautions that I want to mention.

Putting queryable business rules in the database can greatly ease your life, but be wary of putting too much business logic in there. In general, you want your business logic to reside right next to the rest of your application code, not running in a different server in a mode that is much harder to debug, version and diagnose. And if the level of complexity involved in the business rule exceed some level (hard to define, but easy to know when you hit it), you should probably move from defining the business rules in an index to a subscription.

A RavenDB subscription allow you to get all changes to documents and apply your own logic in response. This is a reliable way to process data in RavenDB, this runs in your own code, under your own terms, so it can enjoy all the usual benefits of… well, being your code, and not mine. You can read more about them in this post and of course, the documentation.

time to read 3 min | 536 words

imageIn my previous post I talked about an interesting challenge, distributing data among many different nodes, each of which can act independently on this data. Sadly, the best example for this scenario is distributing ads, but I hope you’ll excuse me on that.

The first thing to realize about this task that this is basically a synchronization problem over anything else. We define a certain location as the primary one, the one that will accept all the modifications to the data. Each of the nodes will then simply need to connect to that location every now and then and get all the updates.

“Basically a synchronization problem” is like saying that getting a P1 emergency bug at Friday night just before the movie starts is a “tad annoying”. The problem is that to do synchronization properly, you have to model your data properly, make sure that you keep track of changes and be able to send partial changes down the wire efficiently. That is not a simple task at all.

In this post, I want to offer another option to handle this. Using Raft. This is a strange use case for a consensus algorithm, I’ll admit. I guess that technically you could run a Raft consensus over 10,000 nodes. Just don’t expect it to be making any sort of decisions. So why am I offering Raft for a scenario when we have that many nodes? Because Raft, at its core, is a way to achieve consensus on a distributed log, that is all. And no one says that you must get that distributed log only via Raft directly.

The idea is basically to have a single source of truth. This can be a single server or it can be a Raft cluster with 3 – 7 nodes in it. All writes in the system are going to go there. The actual process is very well understood and there are multiple ways to do that. The simplest one to consume is likely rqlite. The log, in the case of rqlite, is going to be the SQL statements that are going to be applied to a sqlite database.

But how does this solve the problem of distributing the data? The answer is simple, we already have a way to disseminate distributed state, the log itself. What is going to happen is that you’ll have each of the nodes in the edge connect to the cluster and ask for a copy of the log as of the last committed entry that they have. When they get that, they can apply these statements to their own local copy of sqlite and know that they are now up to date with the state of the system at that time frame.

This approach skips over the need to architect your data for sync (which is hard) and push all of that complexity down the stack to your infrastructure. If the number of nodes you have is large enough, you might need to introduce mirrors to reduce the load. But that fits very nicely into the architecture without really needing to change something.

time to read 4 min | 648 words

imageBefore I start discussing this topic, I want to talk a bit about the speed of light. That pesky limit basically means that there in an inherent lag in passing information between any two points in space. On your daily life, you can mostly ignore it. The human brain is far too slow to perceive it, and even if you are working with computers, you can usually ignore the speed of light for anything less than about 500 miles.

But the speed of light is merely the hard upper limit of our ability to send information from one location to another. In practice, the lag time between any two computers connected to a network is much higher. In fact, if you are a gamer, you are very well acquainted with that fact.

Let’s assume that we have a need to do the following:

  • Hold a mutable state of some kind.
  • Modify it in a consistent manner.
  • Distribute it to many locations, where many is at least hundreds and potentially tens of thousands of locations.

Let’s break this up a bit to its component parts. A mutable state basically means that we can modify it over time, and that these modifications shows up in all locations. The speed of light (and network lag) ensures that we can’t have this happen instantaneously. And, of course, the killer requirement here is that we need to do this in a consistent manner.

I find it really hard to talk about abstract problems concretely so let’s use an example. We have a set of tax rules that determine how certain products should be taxed. The ruleset is big, changes on a regular basis and it is quite important to get things right. Oh, and we also need to do push those rules to tens of thousands of locations that would independently compute taxes for purchases.

In this case, the solution is quite easy. We have a single source of truth that all the locations can pull the tax ruleset from (directly or through mirrors). We’ll also ensure that tax rules are applied only at some future date from their creation, to allow time for all these locations to be updated. If a location hasn’t had an updated ruleset for over a certain amount of time, they can refuse to process any payments until they get an update. This example has really very little for us to do in term of design.

A better example would have multiple actors changing the data as we work with it. An actual example for when this is required can be when we have an ad provider that needs to run ads in many (physical locations). Each location is actually an IoT device that includes a computer, a screen and a camera. Each such device decides independently what ads should be shown (for example, if the camera see a baby stroller, they show ads for baby toys).

I actually had to search hard for an example that would work for this scenario but I think this is a good  one. We obviously have a lot of activity on the ad provider with many people registering and modifying ads settings. We need that portion of our business to be consistent (this is what we are charging people for, after all). However, given the probably high amount of ad devices spread all over, we can’t rely on them always being connected to a central server or even on them being connected at all.

And even if there is no connectivity and nothing new to show, we must show something. Even if we aren’t getting paid for showing the ad, not showing something or (actually worse) showing an error is something that we can’t tolerate.

Given all of these constraints, how do would you build such a system?

time to read 3 min | 403 words


Consider the graph on the right. I already talked about this graph when I wrote about permission based graph queries.

In this post, I want to show off another way to deal with the same problem, but without using graph queries, and using only the capabilities that we have in RavenDB 4.1.

The idea is that, given a user, I want to be able to issue a query for all the issues that this user has access to, either directly (like Sunny in the graph), via a group (like Max, via project-x group) or via a recursive group, like (Nati, via project-x –> team-nati groups).

As you can image from the name of this post, this requires recursion. You can read the documentation about this, but I thought to spice things up and use several features all at once.

Let’s look at the following index (Issues/Permissions):

This is an JS index, which has a map() function over the Issues collection. For each of the issues, we index the Users for the issue and the groups (recursively) that are allowed to access it.

Here is what the output of this index will be for our issue in the graph:


Now, let’s look at how we can query this, shall we?


This query has two clauses, either we are assigned directly or via a group. The key here is in the recurse_groups() and inside that, the load() call in the index. It scans upward through the defined groups and their parents until we have a simple structure in the index that is easily searchable.

RavenDB will ensure that whenever a document that is referenced by a load() in the index is updated, all the documents that are referencing it will be re-indexed. In the case we have here, whenever a group is updated, we’ll re-index all the relevant issues to match the new permissions structure.

One of the core principles of RavenDB is that you can push more work to the indexing and keep your queries fast and simple. This is a good example of how we can arrange the data in such a way that we can push work to background indexing in a pretty elegant manner.

time to read 4 min | 760 words

imageYou might be familiar with Moore’s law, which states that the number of transistors in a dense integrated circuit doubles about every two years. In effect, that performance doubles every 24 months. For many years, that has certainly held true. But that hasn’t been the case for the past 10 years or so. Even when Moore’s law held true, there was a snag. Wirth’s law is also in effect (as an aside, read his article “A Plea for Lean Software”, 23 years, and it holds true today) and Wirth’s law states that software is slower quicker than hardware is becoming faster. It’s good to be a software developer, because even when the CPU clock speed doesn’t jump all the time, we still get more CPUs to play with. A common approach to handle performance issues today is just to throw more parallelism at the problem until it shuts up.

To a certain extent, this make a perfect economic sense. In the calculus between developers’ salaries and the cost of hardware, you’ll usually find that buying a couple extra servers is drastically cheaper than spending another 6 months improving the performance of the system. Jeff Atwood wrote about the topic a decade ago and I think that he is still very much correct, to a degree. There are other factors to consider, which is the overhead over time and in particular in the cloud. One of the major factors to take into account here is that when you are running in the cloud, you aren’t running on your servers and you are charged on usage. That can change the math by quite a bit.

For example, if I bought a couple of servers, the number of IO operations that I make is pretty much meaningless to me. If I hitting the disk ten times a second or ten thousands times a second, I’m paying the same cost. Oh, sure, I might need to buy a better hard disk to get 10,000 IOPS, but that is a one time cost, and usually not that meaningful in the grand scheme of things. But when you are on the cloud, getting higher IOPS cost more, and it cost more over time. In the same sense, in my data center, the cost of querying a database is zero. In the cloud, you will typically be charged some (miniscule) amount on a per query basis. Nothing to worry about, except that your software still work according to Wirth’s law, so you are making more queries than you should, which means that you are charged for each and every one of them.

I used to make my living by being a database performance consultant. I would go to a customer, look at how they are using their database and optimize the access pattern. It was common to see 90%+ savings in the number of queries for common operations. I was doing that because that directly translated to better responsiveness of the application. Applications that respond faster are more pleasant to use and there are numerous studies about faster applications generating higher revenues. I remember talking to clients and explaining to them why they should invest in the overall performance of the application before they hit the total resource depletion that would take them down.

Today, in the cloud, I would have a much simpler task. Let’s assume a simple application using CosmosDB, as an example. With 200 page views / sec on the site, and each page view generating 80 requests to the database, that gives us a total of 16,000 requests a second, which translates to an end of the month bill of about 10,000$. I’m using the 80 queries / page view as a reasonable low ball estimate, mind. Drupal, for example, does 300 – 400 queries per page view and it is easy to get to these numbers without paying attention. Dropping the number of queries per page to 10 (which is usually pretty easy to do with proper queries, attention to details and some caching) gives you a database bill that is around than 1,000$.

Over the span of a year, that is enough to pay a full time developer that can go and find other places where your software can be improved. And unlike before, you don’t need to justify with studies or any indirect causation. You can point directly to the bottom line in an invoice and show how much money is saved.

time to read 10 min | 1862 words

imageI spent some idle time thinking about the topic lately, and I think that this can be a pretty good blog post. The goal is to have a generic application level network protocol for client/server and server/server communication. There are a lot of them out there, and this isn’t actually design work that I intend to implement. It is a thought exercise that run through a lot of the reasoning behind the design.

There are network protocol that are specific for a purpose, and that is reflected in a lot of hidden assumptions they have. I’m trying to conceive a protocol that would be generic enough to be used for a wide variety of cases. Here are the key points in my thinking.

  • Don’t invent the wheel
  • Security is a must
  • RPC is the most common scenario
  • Don’t cause problem for the client / server by messing the protocol
  • Debuggability can’t be bolted on
  • Push model should also work

By not re-inventing the wheel I mean that this should be relatively simple to implement. That pretty much limits us to TCP as the underlying mechanism. I’m actually going to specify a stream based communication protocol, instead, though. With the advent of QUIC, HTTP/3, etc, that might actually be useful. But the whole idea is that the underlying abstraction that we want to rely on is a connection between two nodes that is a stream. All the issue of packet ordering, retries, congestions, etc are to be handled at that level.

At this day and age, security is no an optional requirement, and that should be incorporated into the design of the system from the get go. I absolutely adore TLS, and it solves a whole bunch of problems for us at the same time. It give us a secure channel, it handles authentication on both ends and is is both widely understood and commonly used. This means that selecting TLS as the security mechanism, we aren’t limiting any clients.  So the raw protocol we rely on is TLS/TCP, with authentication done using client certificates.

By far the most common usage for a network protocol is the request/reply model. You can see it in HTTP, SMTP, POP3 and most other network protocol. There is a problem with this model, though. A simply request/reply protocol is going to cause scalability and management issues for the users. What do I mean by that? Look at HTTP as a great example. It is a simple request/reply protocol, and that fact has caused a lot of complexity for users. If you want to send several requests in parallel, you need multiple connections, and head of line queue is a real problem. This impact both client and servers and can cause a great deal of hardship for all. Indeed, this is why HTTP/2 allows framing and to send multiple requests without specifying the order in which the server reply to them.

A better model would be to break that kind of dependency, and I’m likely going to be modeling at least some of that on the design of HTTP/2.

Speaking of which, HTTP/2 is a binary protocol, which is great if you have the entire internet behind you. If you are designing a network protocol that isn’t going to be natively supported by all and sundry, you are going to need to take into account the debuggability of the solution. The protocol I have in mind is a text based protocol and should be usable from the command line by using something like:

openssl s_client -connect my_server:4833

This will give you what is effectively a shell into the server, and you should be able to write commands there and get their results. I used to play around a lot with network protocols and being able to telnet to a server and manually play with the commands is an amazing experience. As a side affect of this, it also means that having a trace of the communication between client and server will be amazingly useful for diagnostics down the line. For that matter, for certain industries, being able to capture the communication trace might be an absolute requirement for auditing purposes (who did what and when).

So, here is what we have so far:

  • TLS/TCP as the underlying protocol
  • Text based so we can manually

What we are left with, though, is what is the actual data on the wire going to look like?

I’m not going to be too fancy, and I want to stick closely to stuff that I know that works. The protocol will use messages as the indivisible unit of communication. A message will have the following structure (using RavenDB as the underlying model):

GET employees/1-A employees/2-B
Timeout: 30
Sequence: 293
Include: ReportsTo

PUT “document with spaces”
Sequence: 294
Body: chunked

<<binary data 39 bytes in len>>

So, basically, we have a line oriented protocol (each line separated by \r\n, and limited to a well known maximum size). A message starts with a command line, which has the following structure:

cmd (token) args[] (token)

Where token is either a sequence of characters without whitespace or a quoted string if it contains whitespace.

Following the command line, you have the headers, which pretty much follow the design of HTTP headers. They are used to pass additional information, such as the timeout for that particular command, command specific data (like the Include header on the first command) or protocol details (like specifying a timeout for that particular command or that the second command has a body and how to read it). A command ends with an empty line, and then you have an optional body.

The headers here serve a very important role. As you can see, they are key for protocol flexibility and enabling versioning. It give us a good way to add additional data after the first deployment without breaking everything.

Because we want to support people typing this manually, we’ll probably need to have some way to specify message bodies that a human can type on their own without having to compute sizes upfront. This is something that will likely be needed only for human input, so we can probably define a terminating token that would work, not focusing on this because it isn’t a mainline feature, but I wanted to mention this because debuggability isn’t a secondary concern.

You might have noticed a repeated header in the commands I sent. The Sequence header. This one is optional (when human write it) but will be very useful for tracing, so tools will always add it. A stream connection is actually composed of two channels, the read and the write. On the read side of this protocol, we read a full command from the network, hand it off to something else to process it and go right back to reading from the network. This design is suitable for the event based systems that has proven to be so useful to scale the amount of work a server can handle. Because we can start reading the next command while we process the current ones, we greatly reduce the number of connections we require and enable a lot more parallel work.

Once a message has been processed, the reply is sent back to the client. Note that there is no requirement that the replies will be sent in the same order as the requests that initiated them. That means that an expensive operation on the server side isn’t going to block cheaper operations that came after it, which is again, important for the overall speed of the system. It also lends itself quite nicely for an event loop based processing. The sequence number for the request is used in the reply to ensure that the client can properly correlate the reply to the relevant request.

On the client side, you write a command to the network. When reading from the network, you need to keep track of the sequence number you sent and route it back to the right caller. The idea here is that on the client side, you may have a single connection that is shared among several threads, reducing the number of overall connections you need and getting better overall utilization from the network.

A nice property of this design is that you don’t have to do things this way. If you don’t want to support concurrent requests / replies, just have a single connection and wait to read the reply from the server whenever you make a request. That give you the option of simple stateful approach, but also an easy upgrade path down the line if / when you need it. The fact that the mental model of the user is request/reply is a great help, to be honest, even if this isn’t what is actually going on. This greatly reduce the amount of complexity that a user need to keep in their head.

Some details on the protocol would need gentle massaging, to ensure that a human on the command line can type reasonable commands, but that is fairly straightforward. The text based nature of the communication also lends itself nicely to tracing / audits. At the client or server levels, we can write <connection-id>.trace file that will have all the reads and writes on that connection. During debugging, you can just tail <connection-id> the right file and see exactly what is going on, or just zip them for achieve for auditing.

Speaking of zipping, let’s consider the following command:


This command on the connection can do things like change how we encode the data, in this case, ask the server to use gzip for all reads and writes from now on. The server can reply with a message (uncompressed) that it is now switching compression and everything from that point forward will be compressed. Note that unlike HTTP compression, we can get the benefits of compression across multiple requests, and given that most requests / reply have a lot of the same structure, likely benefit quite us by a lot.

The last topic I listed is the notion of push operations and how they should be handled. Given that we don’t have a strict request/reply model, there is an obvious way for the server to send additional data “out of band”. A client can request to be notified by the server of certain things, and the server will make a note on that and just send the data back at some later time. There is obviously the need to correlate the push notification to the original request, but that is why we have the headers for. A simple CorrelationId header on the original request and the push notification will be sufficient for the client side to be able to route that to the right callback.

I think that this is pretty much it, this should cover enough to give you a clear idea about what is required and I believe that it is enough for a thought exercise. There are a lot of other details that should probably be answered, for example, how do you deal with very large responses (break them to multiple messages, I would assume, to avoid holding up the connection for other requests), but that should be the gist of it.

time to read 8 min | 1516 words

imageA couple of weeks ago, GitHUb had a major outage, lasting over 24 hours and resulted in wide spread disruption of many operations for customers. A few days after everything was fixed, they posted their analysis on what happened, which makes for a really good read.

The pebble that started all of this was a connection disruption that lasted 43 seconds(!). A couple of months ago I talked about people who say that you can assume that distributed failures are no longer meaningful. The real world will keep serving up examples of weird / strange / nasty stuff to your productions systems, and you need to handle that. Quoting from the original post:

Therefore: the question becomes: how much availability is lost when we guarantee consistency? In practice, the answer is very little. Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available. Only the minority partition must become unavailable. Therefore, for the reduction in availability to be perceived, there must be both a network partition, and also clients that are able to communicate with the nodes in the minority partition (and not the majority partition). This combination of events is typically rarer than other causes of system unavailability.

So no, not really. There is a good point here on the fact only the minority portion of the system must become unavailable, but given typical production deployment, any disconnect between data centers will cause a minority portion to be visible to clients and become unavailable.

The actual GitHub issues that are discussed in the post are a lot more interesting. First, we have the obvious problem that most applications assume that their database access is fast and they make multiple such calls during the processing of a single request (sometimes, many calls). This is just another example of the Fallacies of Distributed Computing in action. RavenDB has a builtin detection for that and a host of features that allow you to go to the database server once, instead of multiple times. In such a case, even if you need to failover to a remote server, you won’t pay the roundtrip costs multiple times.

However, this is such a common problem that I don’t think that it deserve much attention. There isn’t much that you can do about it without careful consideration and support from the whole stack. Usually, this happens on projects when you have a strong leader that institute a performance budget and enforce that. This has costs of its own and usually it is cheaper to just not failover across data center boundaries.

The next part that I find really interesting is that the system that GitHub uses for managing topologies is not consistent but is required to be. The problem is that there is an inherent delay between their orchestrator re-organizing the cluster after a failure and when the failure actually occurs. That would have been fine, if they had a way to successfully merge histories, but that is not the case. In fact, looking at just the information that they have published (and ignoring that I have the benefit of hindsight) the issue is glaringly obvious.

A deep dive (and a fascinating read) into how GitHub handles high availability talks about the underlying details and expose the root cause. You cannot layer distinct distributed architectures on top of one another and expect to come up with a good result. Here is what happens in a master crash scenario:

In a master crash scenario:

  • The orchestrator nodes detect failures.
  • The orchestrator/raft leader kicks off a recovery. A new master gets promoted.
  • orchestrator/raft advertises the master change to all raft cluster nodes.
  • Each orchestrator/raft member receives a leader change notification. They each update the local Consul’s KV store with the identity of the new master.
  • Each GLB/HAProxy has consul-template running, which observes the change in Consul’s KV store, and reconfigures and reloads HAProxy.
  • Client traffic gets redirected to the new master.

I read this and feel a bit queasy, because the master crash scenario is not the interesting bit. That is the easy part. The really hard part is how you manage things when you have a network disruption, with both sides still up and functioning. In fact, that is exactly what happened to GitHub. In this case, on the minority side, their orchestrator cannot get a majority (so cannot make any forward process). However, the rest of the system cannot proceed, the whole thing stops at either the first or second stage.

That means that the rest of the system will continue to write to the old master, resulting in a conflict. And this is where things gets complicated. The issue here is that with MySQL (and most other systems that relies on log replication) you must have a single master at any given time. That is an absolute requirement. If you got to the point where you had two writes with divergent histories, you are in for selecting which one you’ll accept (and what data you’ll discard) and trying to manually fix things after the fact.

The proper way to handle something like this would have been to use Raft to actually send the commands themselves to the server. This ensures a consistent set of statements that run in the same order for all servers. Rqlite is a great example of this, where you can get consistent and distributed system on top of individual components. That would be the proper way to do it, mind, not the way anyone would do it.

You wouldn’t be able to get any reasonable performance from the system using this kind of approach. Rqlite, for example, talks about being able to get 10 – 200 operations per second. I’m going to assume that GitHub has a need for something better than that. So the underlying distributed architecture looks like this:

  • MySQL write master with multiple read-only secondaries using binlog.
  • Orchestrator to provide health monitoring and selection of new write primary (consistent using Raft)
  • Underlying infrastructure that uses (different) Raft to store routing configuration.

If you break Orchestrator’s ability to make decisions (easy, just create a partition), you take away the ability to change the write master, and if the failure mode you are dealing with is not a failed master (for example, you have partition) you are going to accept new writes to the old master.  That breaks completely the whole idea of binlog replication, of course, so you are sort of stuck at that point. In short, I think that Orchestrator is something that was meant to solve an entirely different problem, it was meant to deal with the failure of a single node, not to handle a full data center partition.

When looking at such incidents, I always compare to what would have happened if RavenDB was used instead. This is not really fair in this case because RavenDB was designed upfront to be a distributed database. RavenDB doesn’t really have the concept a a write master. For simplicity’s sake, we usually try to direct all writes to a single node for each database, because this simplify how you usually work. However, but any node can accept writes and will distribute it to the rest of the nodes in the cluster. In a situation like the one GitHub faced, both sides of the partition would keep accepting writes (just like happened in GitHub’s case with MySQL).

The difference is what will happen when the partition is healed. Both sides of the partition will update the other with the data that is missing on the other side. Any conflicting writes (by which I mean writes on both sides of the partition to the same document or documents) will be detected and resolved automatically. Automatic resolution is very important to keeping everything up and running. This can be a custom resolution policy defined by the user or arbitrary by RavenDB. Regardless of the conflict resolution policy, the administrator will be notified about the conflicts and can review the actions taken by RavenDB and decide what to do about that.

In GitHub’s case, their busiest cluster had less than a thousand writes in the time period in question. Most of which aren’t going to conflict. I would expect the timeline with RavenDB to be:

  • 2018 October 21 22:52 UTC – initial network partition, lasting 43 seconds
  • 2018 October 21 22:54 UTC – internal monitoring alert about any conflicts (but I consider this unlikely)
  • 2018 October 21 23:00 UTC – issue resolved, go home

The difference is mostly because RavenDB was designed to live in these kind of environment, deployed in multiple data centers and actually handling, in the real world and with very little assistance, the task of keeping applications up and running without blowing things up. It is quite literally one of the basic building blocks we have, so it shouldn’t be surprising that we are pretty good at it.

time to read 4 min | 606 words

imageI have been writing software at this point for over twenty years, and I want to believe that I have learned a few things during that timeframe.

And yet, probably the hardest thing for me is to start writing from scratch. If there is no code already there, it is all too easy to get lost in the details and not actually be able to get anywhere.

An empty source file is full of so many options, and any decision that I’ll make is going to have very long lasting impact. Sometimes I look at the keyboard and just freeze, unable to proceed because I know, with a 100% certainty, that whatever I’ll produce isn’t going to be up to my own standards. In fact, it is going to suck, for sure.

I think that about 90% of the things I have written so far are stuff that I couldn’t write today. Not because I lack the knowledge, but because I have far greater understanding of the problem space and I know that trying to solve it all is such a big task that it is not possible for me to do so. What I need reminding, sometimes, is that I have written those things, and eventually, those things were able to accomplish all that was required of them.

A painter doesn’t just start by throwing paint on canvas, and a building doesn’t grow up by people putting bricks where they feel like. In pretty much any profession, you need to iterate several times to get things actually done. With painters, you’ll typically do a drawing before actually putting paint on canvas. With architects will build a small scale model, etc.

For me, the hardest thing to do when I’m building something new is to actually allow myself to write it out as is. That means, lay out the general structure of the code, and ignore all the other stuff that you must have in order to get to real production worthy code. This means flat our ignoring:

  • Error handling
  • Control of allocations and memory used
  • Select the underlying data structures and algorithms
  • Yes, that means that O(N^2) is just fine for now
  • Logging, monitoring and visibility
  • Commenting and refactoring the code for maintainability over time

All of these are important, but I literally can’t pay these taxes and build something new in the same time.

I like to think about the way I work as old style rendering passes. When I’m done with the overall structure, I’ll go back and add these details. Sometimes that can be a lot of work, but at that point, I actually have something to help me. At a minimum, I have tests that verify that things still work and now I have a good understanding of the problem (and my solution) so I can approach things without having so many unknown to deal with.

A large part of that is that the fact that I didn’t pay any of the taxes for development. This usually means that the new thing is basically a ball of mud, but it is a small ball of mud, which means that if I need to change things around, I have to touch fewer moving parts. A lot fewer, actually. That allow me to explore, figure out what works and doesn’t.

It is also going directly against all of my instincts and can be really annoying. I really want to do a certain piece of code properly, but focusing on perfecting a single door knob means that the whole structure will never see the light of day.

time to read 2 min | 279 words

imageAn interesting challenge with implementing graph queries is that you sometimes get into situations where the correct behavior is counter intuitive.

Consider the case of the graph on the right and the following query:


This will return:

  • Source: Arava, Destination: Oscar

But what would be the value of the Edge property? The answer to that is… complicated.  What we actually return is the edge itself. Let’s see what I mean by that.


And, indeed, the value of Edge in this query is going to be dogs/oscar.


This isn’t very helpful if we are talking about a simple edge like this. After all, we can deduce this from the Src –> Destination pair. This gets more interesting when the edge is more complex. Consider the following query:


What do you this should be the output here? In this case, the edge isn’t the Product property, it is the specific line that match the filter on the edge. Here is what the result looks like:


As you can imagine, knowing exactly what edge led you from one document to another can be very useful when you look at the query results.


  1. Using TLS with Rust: Authentication - 2 hours from now
  2. The role of domain model with CQRS / Event Sourcing - about one day from now
  3. Using TLS in Rust: Going to async I/O with Tokio - 4 days from now
  4. Investigating self inflicted wounds: The SSL failure on the Linux build server - 5 days from now
  5. Using TLS in Rust: tokio ain’t mere mortals - 6 days from now

And 4 more posts are pending...

There are posts all the way to Feb 04, 2019


  1. Using TLS with Rust (4):
    11 Jan 2019 - Part III–Will native tls do the trick?
  2. Data modeling with indexes (3):
    14 Jan 2019 - Predicting the future
  3. Reminder (9):
    03 Jan 2019 - I’ll be in CodeMash is next week
  4. Production postmortem (24):
    25 Dec 2018 - Handled errors and the curse of recursive error handling
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats