Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

time to read 4 min | 735 words

About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB uses X509 certificates for authentication. These are highly secure and are a good answer for our clients who need to host sensitive information or are working in highly regulated environments. However, certificates have a problem: they expire. In particular, if you are following common industry best practices, you’ll replace your certificates every 2 – 3 months. In fact, the common setup of using RavenDB with Let’s Encrypt will do just that. Certificates will be replaced on the fly by RavenDB without the need for administrator involvement.

If you are running inside a single cluster, that isn’t something you need to worry about. RavenDB will coordinate the certificate update between the nodes in such a way that it won’t cause any disruption in service. However, it is pretty common in RavenDB to have multi cluster topologies. Either because you are deployed in a geo-distributed manner or because you are running using complex topologies (edge processing, multiple cooperating clusters, etc). That means that when cluster A replaces its certificate, we need to have a good story for cluster B still allowing it access, even though the certificate has changed.

I outlined our thinking in the previous post, and I got a really good hint: 13xforever suggested that we look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, ran into issues and was replaced (mostly by certificate transparency). With this hint, I started to investigate this further. Here is what I learned:

  • A certificate is composed of some metadata, the public key and the signature of the issuer (skipping a lot of stuff here, obviously).
  • Keys for certificates can be either RSA or ECDSA. In both cases, there is a 1:1 relationship between the public and private keys (in other words, each public key has exactly one private key).

Given these facts, we can rely on the key pair itself to avoid the issues with certificate expiration, distributing new certificates, etc.

Whenever a cluster needs a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client has the private key during the handshake), even if the certificate itself has changed, we can verify that the other side knows the actual secret, the private key.

In other words, we slightly changed the trust model in RavenDB. Instead of trusting a particular certificate, we trust that certificate’s private key. That is what grants access to RavenDB. In this way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.
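
To make that concrete, here is a minimal sketch of the check, using Node’s crypto module purely for illustration (RavenDB is a .NET code base, and this is not its actual implementation): load the trusted certificate and the newly presented one, and compare the public keys rather than the certificates themselves.

    const fs = require('fs');
    const { X509Certificate } = require('crypto');

    // Illustrative file names: the certificate we already trust and the renewed one
    // presented during the TLS handshake.
    const trusted = new X509Certificate(fs.readFileSync('trusted-client.pem'));
    const presented = new X509Certificate(fs.readFileSync('renewed-client.pem'));

    // Compare the SPKI (public key) blobs, not the certificate bytes or thumbprints.
    const spki = cert => cert.publicKey.export({ type: 'spki', format: 'der' });

    if (spki(trusted).equals(spki(presented))) {
        // Same key pair - grant the renewed certificate the same access as the old one.
    } else {
        // Different key pair - treat this as an unknown client.
    }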

This feature means that you can drastically reduce the amount of work that an admin has to do, and it leads to a system that you set up once and that just keeps working.

There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just having the user re-generate a new certificate with the private key isn’t really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate is the same. To be rather more exact, it verifies that the original (trusted by the admin) certificate and the new certificate that was just presented to us are signed by the same chain of issuers, as identified by their public key hashes.
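
A sketch of that verification, again using Node’s crypto module just to show the shape of the check (not RavenDB’s code; it assumes the two chains are handed in as arrays of PEM certificates, leaf first):

    const { createHash, X509Certificate } = require('crypto');

    // The "pin" of a certificate: a hash of its public key, not of the whole certificate.
    const pinOf = pem => {
        const cert = new X509Certificate(pem);
        return createHash('sha256')
            .update(cert.publicKey.export({ type: 'spki', format: 'der' }))
            .digest('base64');
    };

    // The new certificate is acceptable if every link in its chain pins to the same
    // public key hash as the corresponding link in the chain we originally trusted.
    const signedBySameChain = (trustedChain, presentedChain) =>
        trustedChain.length === presentedChain.length &&
        trustedChain.every((pem, i) => pinOf(pem) === pinOf(presentedChain[i]));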

In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we’ll reject that. The model that we have in mind here is trusting a driver’s license. If you have an updated driver’s license from the same source, that is considered just as valid as the original one on file. If the driver license is from Toys R Us, not so much.

Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.

As usual, we welcome your feedback; the previous version of this post got us a great feature, after all.

time to read 4 min | 655 words

This post really annoyed me. Feel free to go ahead and go through it, I’ll wait. The gist of the post, titled “WAL usage looks broken in modern Time Series Databases?”, is that time series databases that use a Write Ahead Log are broken, and that their system, which isn’t using a WAL (but uses Log-Structured Merge, LSM), is also broken, but no more than the rest of the pack.

This post annoyed me greatly. I’m building databases for a living, and for over a decade or so, I have been focused primarily on building a distributed, transactional (ACID) database. A key part of that is actually knowing what is going on in the hardware beneath my software and how to best utilize it. This post was annoying because it makes quite a few really bad assumptions, and then builds upon them. I particularly disliked the outright dismissal of direct I/O, mostly because they seem to be doing that on very partial information.

I’m not familiar with Prometheus, but doing fsync() every two hours basically means that it isn’t on the same plane of existence as far as ACID and transactions are concerned. Cassandra is usually deployed in cases where you either don’t care about some data loss or, if you do, you use multiple replicas and rely on that. So I’m not going to touch that one either.

InfluxDB is doing the proper thing and doing fsync after each write. Because fsync is slow, they very reasonably recommend batching writes. I consider this to be something that the database should do, but I do see where they are coming from.

Postgres, on the other hand, I’m quite familiar with, and the description on the post is inaccurate. You can configure Postgres to behave in this manner, but you shouldn’t, if you care about your data. Usually, when using Postgres, you’ll not get a confirmation on your writes until the data has been safely stored on the disk (after some variant of fsync was called).

What really got me annoyed was the repeated insistence on “data loss or corruption”, which shows a remarkable lack of understanding of how a WAL actually works. Because of the very nature of the WAL, the people who build them all have to consider the nature of a partial WAL write, and there are mechanisms in place to handle it (usually by considering that particular transaction as invalid and rolling it back).
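
For anyone who hasn’t looked inside one, here is a rough sketch of the usual technique (illustrative code, not from any particular engine): each WAL record is framed with its length and a checksum, so a torn write is detected on recovery and the unacknowledged transaction is simply discarded.

    const { createHash } = require('crypto');

    const checksum = buf => createHash('sha256').update(buf).digest().readUInt32LE(0);

    // Frame a record as [length][checksum][payload].
    function frameRecord(payload) {
        const header = Buffer.alloc(8);
        header.writeUInt32LE(payload.length, 0);
        header.writeUInt32LE(checksum(payload), 4);
        return Buffer.concat([header, payload]);
    }

    // On recovery, read records until we hit a torn or corrupted one and stop there.
    // Whatever follows was never confirmed to a client, so dropping it loses nothing.
    function tryReadRecord(wal, offset) {
        if (offset + 8 > wal.length) return null;                  // torn header
        const length = wal.readUInt32LE(offset);
        const expected = wal.readUInt32LE(offset + 4);
        const payload = wal.slice(offset + 8, offset + 8 + length);
        if (payload.length < length) return null;                  // torn payload
        if (checksum(payload) !== expected) return null;           // partial / corrupted write
        return { payload, next: offset + 8 + length };
    }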

The solution proposed in the post is to use SSTable (sorted strings table), which is usually a component in LSM systems. Basically, buffer the data in memory (they use 1 second intervals to write it to disk) and then write it in one go. I’ll note that they make no mention of actually writing to disk safely. So no direct I/O or calls to fsync. In other words, a system crash may leave you a lot worse off than merely 1 second of lost data.  In fact, it is possible that you’ll have some data there, and some not. Not necessarily in the order of arrival.

A proper database engine will:

  • Merge multiple concurrent writes into a single disk operation (see the sketch after this list). In this way, we can handle > 100,000 separate writes per second (document writes, so significantly larger than the typical time series drops) on commodity hardware.
  • Ensure that if any write was confirmed, it actually hit durable storage and can never go away.
  • Properly handle partial writes or corrupted files, in such a way that none of the system’s invariants are violated.
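
Here is a sketch of the first point, write merging. It assumes a single hypothetical appendAndFsync(entries) helper that appends a batch to the journal and calls fsync once:

    const pending = [];
    let flushing = false;

    // Callers enqueue their write and get a promise that resolves only after the data
    // is durably on disk.
    function write(entry) {
        return new Promise((resolve, reject) => {
            pending.push({ entry, resolve, reject });
            flushLoop();
        });
    }

    async function flushLoop() {
        if (flushing) return;
        flushing = true;
        try {
            while (pending.length > 0) {
                // Everything that accumulated while the previous batch was being written
                // goes out in a single disk operation, with a single fsync.
                const batch = pending.splice(0, pending.length);
                try {
                    await appendAndFsync(batch.map(p => p.entry)); // hypothetical helper
                    batch.forEach(p => p.resolve());
                } catch (err) {
                    batch.forEach(p => p.reject(err));
                }
            }
        } finally {
            flushing = false;
        }
    }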

I’m leaving aside major issues with LSM and SSTables, such as write amplification and the inability to handle sustained high loads (because there is never a break in which you can do bookkeeping). Just the portions on WAL usage (which show broken and inefficient use) being used to justify another broken implementation are quite enough for me.

time to read 2 min | 246 words

One of the primary reasons why businesses choose to use workflow engines is that they get pretty pictures that explain what is going on and look like they are easy to deal with. The truth is anything but that, but pretty sells.

My recommended solution for workflow has a lot going for it, if you are a developer. But if you try to show a business analyst this code, they are likely to just throw their hands up in the air and give up. Where are the pretty pictures?

One of the main advantages of this kind of approach is that it is very rigid. You are handling things in the event handlers, registering the next step in the workflow, etc. All of which is very regimented. This is so for a reason. First, it makes it very easy to look at the code and understand what is going on. Second, it allows us to process the code in additional ways.

Consider the following AST visitor, which operates over the same code.

This took me about twenty minutes to write, mostly to figure out the Graphviz notation. It takes advantage of the fact that the structure of the code is predictable to generate the actual flow of actions from the code.
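
The original visitor isn’t reproduced here, but a rough sketch of the idea follows. It assumes the workflow script registers its next step with a hypothetical registerNextStep('handlerName') call, uses esprima for parsing, and only handles top-level handler functions:

    const esprima = require('esprima');

    function toDot(scriptSource) {
        const edges = [];
        let currentHandler = null;

        const visit = node => {
            if (node === null || typeof node !== 'object') return;
            if (node.type === 'FunctionDeclaration') currentHandler = node.id.name;
            if (node.type === 'CallExpression' &&
                node.callee.type === 'Identifier' &&
                node.callee.name === 'registerNextStep' &&
                currentHandler !== null) {
                edges.push(`  "${currentHandler}" -> "${node.arguments[0].value}";`);
            }
            for (const key of Object.keys(node)) visit(node[key]);
        };

        visit(esprima.parseScript(scriptSource));
        return `digraph workflow {\n${edges.join('\n')}\n}`;
    }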

You get to use readable code and maintainable practices and show pretty pictures to the business people.

time to read 3 min | 407 words

In my previous post, I talked about the driving forces toward a scripting solution to workflow behavior, and I presented the following code as an example of such a solution. In this post, I want to focus on the non-obvious aspects of such a design.
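
The exact code isn’t shown here, but the shape under discussion is a set of event handlers that receive the persisted state and the event’s input, mutate the state, and register the next step. A hypothetical sketch (the function names and the registerNextStep helper are illustrative, not an actual RavenDB API):

    // Raised when a customer asks for a loan.
    function onLoanRequested(state, input) {
        state.Status = 'PendingApproval';
        state.Amount = input.Amount;
        registerNextStep('onLoanApproved');      // hypothetical helper
    }

    // Raised when the loan officer approves the request.
    function onLoanApproved(state, input) {
        state.Status = 'Approved';
        state.APR = input.APR;
        registerNextStep('onFundsWithdrawn');
    }

    // Raised when the customer actually withdraws money.
    function onFundsWithdrawn(state, input) {
        state.Withdrawn = (state.Withdrawn || 0) + input.Amount;
    }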

The first thing to note about this code is that it is very structured. You are working on an event based system, and as such, the input / output for the system are highly visible. It also means that we have straightforward ways to deal with complexity. We can break some part of the behavior into a different file or even a different workflow that we’ll call into.

The second thing to note is that workflows tend to be long running processes. In the code above, we have a pretty obvious way to handle state. We get passed a state object, which we can freely modify. Changes to the state object are persisted between event invocations. That is actually a pretty important issue. Because if we store that state inside RavenDB, we also get the ability to do a bunch of other really interesting stuff:

  • You can query ongoing workflows and check their state.
  • You can use the revisions feature inside of RavenDB and be able to track down the state changes between invocations.

The input to the events is also an object, and that means that you can also store that natively, which means that you have full tracing capabilities.

The third important thing to note is that the script is just code, and even in complex cases, it is going to be pretty small. That means that you can run version resistant workflows. What do I mean by that?

Once a workflow process has started, you want to keep it on the same workflow script that it started with. This makes versioning decisions much nicer, and it is very easy for you to deal with changes over time. On the other hand, sometimes you need to fix the script itself (there was a bug that allowed negative APR), in which case you can change it for just the ongoing workflows.

Actual storage of the script can be in Git, or as a separate document inside the database. Alternatively, you may actually want to include the script itself in every workflow. That is usually reserved for industries where you have to be able to reproduce exactly what happened and I wouldn’t recommend doing this in general.

time to read 6 min | 1018 words

I got a great comment on my previous post about using Map/Reduce indexes in RavenDB for event sourcing. The question was how to handle time sensitive events or ordered events in this manner. The simple answer is that you can’t; RavenDB intentionally doesn’t expose anything about the ordering of the documents to the index. In fact, given the distributed nature of RavenDB, even the notion of ordering documents by time becomes really hard.

But before we close the question as “cannot do that by design”, let’s see why we want to do something like that. Sometimes, this really is just the developer wanting to do things the way they are used to, and there is no need for actually enforcing the ordering of documents. But in other cases, you want to do this because there is a business meaning behind these events. In those cases, however, you need to handle several things that are a lot more complex than they appear. You may be informed of an event long after it actually happened, and you need to handle that.

Our example for this post is going to be mortgage payments. This is a good example of a system where time matters. If you don’t make your payments on time, that matters. So let’s see how we can model this as an event based system, shall we?

A mortgage goes through several stages, but the only two that are of interest for us right now are:

  • Approval – when the terms of the loan are set (how much money, what is the collateral, the APR, etc).
  • Withdrawal – when money is actually withdrawn, which may happen in installments.

Depending on the terms of the mortgage, we need to compute how much money should be paid on a monthly basis. This depends on a lot of factors. For example, if the principal is tied to some baseline, changes to the baseline will change the amount of the principal; the same goes if only some of the amount was withdrawn, if there are late fees, a balloon payment, etc. Because of that, on a monthly basis, we are going to run a computation for the expected amount due for the next month.

And, obviously, we have the actual payments that are being made.

Here is what the (highly simplified) structure looks like:


This includes all the details about the mortgage, how much was approved, the APR, etc.

The following is what the expected amount to be paid looks like:


And here we have the actual payment:


All pretty much bare bones, but sufficient to explain what is going on here.
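
Documents of roughly this shape would fit the description (the collection and field names are illustrative, not the actual documents from the post):

    // Mortgages/6015-A
    {
        "Customer": "customers/1-A",
        "Approved": 250000,
        "APR": 3.6,
        "Collateral": "the property itself"
    }

    // ExpectedPayments/2019-02-6015
    {
        "Mortgage": "Mortgages/6015-A",
        "Month": "2019-02",
        "Principal": 680,
        "Interest": 750,
        "Total": 1430
    }

    // Payments/2019-02-6015
    {
        "Mortgage": "Mortgages/6015-A",
        "Month": "2019-02",
        "Amount": 1430
    }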

With that in place, let’s see how we can actually make use of it, shall we?

Here are the expected payments:


Here are the mortgage payments:


The first thing we want to do is to aggregate the relevant operations on a monthly basis, since this is how mortgages usually work. I’m going to use a map reduce index to do so, and as usual in this series of posts, we’ll use JavaScript indexes to do the deed.
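
The index itself isn’t reproduced here; a sketch of what such an index can look like follows (two maps and a reduce, written in the map() / groupBy().aggregate() style of RavenDB’s JavaScript indexes, using the illustrative document shapes from above):

    map('ExpectedPayments', p => ({
        Mortgage: p.Mortgage,
        Month: p.Month,
        Expected: p.Total,
        Interest: p.Interest,
        Principal: p.Principal,
        Paid: 0,
        InterestPaid: 0,
        PrincipalPaid: 0
    }));

    map('Payments', p => ({
        Mortgage: p.Mortgage,
        Month: p.Month,
        Expected: 0,
        Interest: 0,
        Principal: 0,
        Paid: p.Amount,
        InterestPaid: 0,
        PrincipalPaid: 0
    }));

    groupBy(x => ({ Mortgage: x.Mortgage, Month: x.Month }))
        .aggregate(g => {
            const r = { Mortgage: g.key.Mortgage, Month: g.key.Month,
                        Expected: 0, Interest: 0, Principal: 0, Paid: 0,
                        InterestPaid: 0, PrincipalPaid: 0 };
            for (const v of g.values) {
                r.Expected += v.Expected;
                r.Interest += v.Interest;
                r.Principal += v.Principal;
                r.Paid += v.Paid;
            }
            // Funds allocation for partial payments: whatever was paid covers the
            // interest first, only the remainder goes toward the principal.
            r.InterestPaid = Math.min(r.Paid, r.Interest);
            r.PrincipalPaid = Math.max(0, r.Paid - r.InterestPaid);
            return r;
        });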

Unlike previous examples, now we have real business logic in the index. Most specifically, funds allocation for partial payments: if the amount of money paid is less than the expected amount, we first apply it to the interest, and only then to the principal.

Here are the results of this index:


You can clearly see the mistakes that were made in the payments. In March, the amount due for the loan increased (another installment was taken from the mortgage), but the payments were made based on the old amount.

We aren’t done yet, though. So far we have the status of the mortgage on a monthly basis, but we want to have a global view of the mortgage. In order to do that, we need to take a few steps. First, we need to define an Output Collection for the index, which will allow us to further process the results of this index.

In order to compute the current status of the mortgage, we aggregate both the mortgage status over time and the amount paid by the bank for the mortgage, so we have the following index:
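
A sketch of the idea (illustrative; it assumes the first index writes its output to an artificial MortgagePaymentsByMonth collection and that withdrawals live in a Withdrawals collection, both of which are guesses):

    map('MortgagePaymentsByMonth', m => ({
        Mortgage: m.Mortgage,
        Due: m.Expected - m.Paid,
        Withdrawn: 0,
        Status: 'Current'
    }));

    map('Withdrawals', w => ({
        Mortgage: w.Mortgage,
        Due: 0,
        Withdrawn: w.Amount,
        Status: 'Current'
    }));

    groupBy(x => x.Mortgage).aggregate(g => {
        const due = g.values.reduce((sum, v) => sum + v.Due, 0);
        return {
            Mortgage: g.key,
            Due: due,
            Withdrawn: g.values.reduce((sum, v) => sum + v.Withdrawn, 0),
            Status: due > 0 ? 'PastDue' : 'Current'
        };
    });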

Which gives us the following output:


As you can see, we have a PastDue marker on the loan. At this point, we can make another payment on the mortgage, to close the missing amount, like so:


This will update the monthly mortgage status and then the overall status. Of course, in a real system (I mentioned this is highly simplified, right?) we’ll need to take into account payments made at one time but applied to different periods (which we can handle with an AppliedTo property), and a lot of the actual core logic wouldn’t be in indexes. Please don’t do mortgage logic in RavenDB indexes; that stuff deserves its own handling, in your own code. And most certainly don’t do that in JavaScript. The idea behind this post is to explore how we can handle non-trivial event projection using RavenDB. The example was chosen because I assume most people will be familiar with it and it wasn’t immediately obvious how to go about actually solving it.

If you want to play with this, you can import the following file (Settings > Import Data) to get the documents and index definitions.

time to read 3 min | 500 words

In the previous post I talked about how to use a map reduce index to aggregate events into a final model. This is an interesting use case of indexing, and it can consolidate a lot of complexity into a single place, at which point you can utilize additional tooling available inside of RavenDB.

As a reminder, you can get the dump of the database that you can import into your own copy of RavenDB (or our live demo instance) if you want to follow along with this post.

Starting from the previous index, all we need to do is edit the index definition and set the Output Collection, like so:


What does this do? This tells RavenDB that in addition to indexing the data, it should also take the output of the index and create new documents from it in the ShoppingCarts collection. Here is what these documents look like:


You can see at the bottom that this document is flagged as artificial and coming from an index. The document id is a hash of the reduce key, so changes to the same cart will always go to this document.

What is important about this feature is that once the result of the index is a document, we can operate on it using all the usual tools for indexes. For example, we might want to create another index on top of the shopping carts, like the following example:
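
A sketch of what such an index might look like (illustrative, not the index from the post; it assumes the cart document has a Products array and a non-empty Paid object, that a map can fan out by returning an array of entries, and that returning null skips a document):

    map('ShoppingCarts', cart => {
        if (!cart.Paid || Object.keys(cart.Paid).length === 0)
            return null;                          // skip carts that were never paid for
        return cart.Products.map(p => ({
            Product: p.Product,
            Quantity: p.Quantity,
            Total: p.Quantity * p.Price
        }));
    });

    groupBy(x => x.Product).aggregate(g => ({
        Product: g.key,
        Quantity: g.values.reduce((sum, v) => sum + v.Quantity, 0),
        Total: g.values.reduce((sum, v) => sum + v.Total, 0)
    }));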

In this case, we are building another aggregation. Taking all the paid shopping carts and computing the total sales per product from these. Note that we are now operating on top of our event streams but are able to extract second level aggregation from the data.

Of course, normal indexes on top of the artificial ShoppingCarts collection allow you to do things like: “Show me my previous orders”. In essence, you are using the events for your writes, defining the aggregation to the final model in an index, and then RavenDB takes care of the read model.

Another option to pay attention to is not doing the read model and the full work on the same database instance as your events. Instead, you can output the documents to a collection and then use RavenDB’s native ETL capabilities to push them to another database (which can be another RavenDB instance or a relational database) for further processing.

The end result is a system that is built on dynamic data flow. Add an event to the system, and the index will go through it, aggregate it with other events on the same root and output it to a document, at which point more indexes will pick it up and do further work, ETL will push it to other databases, subscriptions can start operating on it, etc.

time to read 4 min | 679 words

RavenDB uses X509 certificates for many purposes. One of them is to enable authentication by using client certificates. This creates a highly secure authentication method with quite a lot to recommend it. But it does create a problem. Certificates, by their very nature, expire. Furthermore, certificates usually have relatively short expiration times. For example, Let’s Encrypt certificates expire in 3 months. We don’t have to use the same cert we use for server authentication for client authentication as well, but it does create a nice symmetry and simplifies the job of the admin.

Except that with every cert replacement (3 months, remember?) the admin will now need to go to all of the systems that we talk to and update the list of allowed certificates whenever we update the Let’s Encrypt certificate. One of the reasons behind this 3 months deadline is to ensure that you’ll automate the process of cert replacement, so it is obvious that we need a way to automate the process of updating third parties about cert replacements.

Our current design goes like this:

  • This design applies only to the nodes for which we authenticate using our own server certificate (thus excluding Pull Replication, for example).
  • Keep track of all the 3rd parties RavenDB instances that we talk to.
  • Whenever we have an updated certificate, contact each of those instances and let them know about the cert change. This is done using a request that authenticates with the old certificate and provides the new one.
  • The actual certificate replacement is delayed until all of those endpoints have been reached or until the expiration of the current certificate is near.

Things to consider:

  • Certificate updates are written to the audit log. And you can always track the chain of updates backward.
  • Obviously, a certificate can only register a replacement as long as it is active.
  • The updated certificate will have the exact same permissions as the current certificate.
  • A certificate can only ever replace itself with one other certificate. We allow this to be done multiple times, but the newly updated cert will replace the previously updated cert.
  • A certificate cannot replace a certificate that it updated if that certificate has an updated certificate as well.

In other words, consider certificate A that is registered in a RavenDB instance:

  • Cert A can ask the RavenDB instance to register updated certificate B, at which point users can connect to the RavenDB instance using either A or B, until certificate A expires. This is to ensure that during the update process, we won’t end up with some nodes that we need to talk to using cert A and some nodes that we need to talk to using cert B.
  • Cert A can ask the RavenDB instance to register updated certificate C, at which point, certificate B is removed and is no longer valid. This is done in case we failed to update the certificate and need to update with a different certificate.
  • Cert C can then ask the RavenDB instance to register updated certificate D. At this point, certificate A becomes invalid and can no longer be used. Only certs C and D are now active.
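
To make the rules above a little more concrete, here is a sketch of the bookkeeping they imply (illustrative code only, keyed by certificate thumbprint; this is not RavenDB’s implementation):

    const registered = new Map();   // thumbprint -> { replaces, replacedBy }

    function trustCertificate(thumbprint) {
        registered.set(thumbprint, { replaces: null, replacedBy: null });
    }

    // Called when the certificate identified by currentThumbprint asks to register
    // newThumbprint as its replacement.
    function registerReplacement(currentThumbprint, newThumbprint) {
        const current = registered.get(currentThumbprint);
        if (current === undefined) throw new Error('unknown certificate');

        // "A registers C": a previously registered, unused replacement (B) is dropped.
        if (current.replacedBy !== null) registered.delete(current.replacedBy);

        // "C registers D": once a replacement moves the chain forward, the certificate
        // it replaced (A) is no longer valid.
        if (current.replaces !== null) registered.delete(current.replaces);

        current.replacedBy = newThumbprint;
        registered.set(newThumbprint, { replaces: currentThumbprint, replacedBy: null });
    }

    const isAllowed = thumbprint => registered.has(thumbprint);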

More things to consider:

  • Certain certificates, such as the ones exposing Pull Replication, are likely going to be used by many clients. I’m not sure if we should allow certificate replacement there. Given that we usually won’t use the server cert for authentication in Pull Replication, I don’t see that as a problem.
  • The certificate update process will be running on only a single node in the cluster, to avoid concurrency issues.
  • We’ll provide a way to the admin to purge all expired certificates (although, with one update every 3 months, I don’t expect there to be many).
  • We are considering limiting this to non admin certificates only. So you will not be able to update a certificate if it has admin privileges in an automated manner. I’m not sure if this is a security feature or a feel good feature.
  • We’ll likely provide administrator notification that this update has happened on the destination node, and that might be enough to allow updating of admin certificates.

Any feedback you have would be greatly appreciated.

time to read 5 min | 875 words

In this post, I want to take the notion of doing computation inside RavenDB’s indexes to the next stage. So far, we talked only about indexes that work on a single document at a time, but that is just the tip of the iceberg of what you can do with indexes inside RavenDB. What I want to talk about today is the ability to do computations over multiple documents and aggregate them. The obvious example is in the following RQL query:


That is easy to understand; it is a simple aggregation of data. But it can get a lot more interesting. To start with, you can add your own aggregation logic here, which opens up some interesting ideas. Event Sourcing, for example, is basically a set of events on a subject that are aggregated into the final model. Probably the most classic example of event sourcing is the shopping cart. In such a model, we have the following events:

  • AddItemToCart
  • RemoveItemFromCart
  • PayForCart

Here is what these look like, in document form:
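
They would be along these lines (the collection and field names are illustrative, not the actual documents):

    // AddToCartEvents/1-A
    { "CartId": "carts/82-A", "Product": "products/30-A", "Quantity": 2, "Price": 9.99 }

    // AddToCartEvents/2-A
    { "CartId": "carts/82-A", "Product": "products/13-A", "Quantity": 5, "Price": 3.49 }

    // RemoveFromCartEvents/1-A
    { "CartId": "carts/82-A", "Product": "products/13-A", "Quantity": 3 }

    // PayForCartEvents/1-A
    { "CartId": "carts/82-A", "Amount": 26.96 }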


We add a couple of items to the cart, remove excess quantity and pay for the whole thing. Pretty simple model, right? But how does this relate to indexing in RavenDB?

Well, the problem here is that we don’t have a complete view of the shopping cart. We know what the actions were, but not what its current state is. This is where our index comes into play; let’s see how it works.

The final result of the cart should be something like this:


Let’s see how we get there, shall we?

We’ll start by processing the add to cart events, like so:
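
A sketch of what such a map can look like, using the illustrative collection and field names from above (not the exact code from the post):

    map('AddToCartEvents', e => ({
        CartId: e.CartId,
        Products: [{ Product: e.Product, Quantity: e.Quantity, Price: e.Price }]
    }));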

As you can see, the map phase here builds the relevant parts of the end model directly. But we still need to complete the work by doing the aggregation. This is done in the reduce phase, like so:
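
Again a sketch of what such a reduce can look like, rather than the original:

    groupBy(x => x.CartId).aggregate(g => {
        const products = {};            // internal grouping by product, using an object
        for (const entry of g.values) {
            for (const p of entry.Products) {
                const existing = products[p.Product];
                if (existing === undefined) {
                    products[p.Product] = { Product: p.Product, Quantity: p.Quantity, Price: p.Price };
                } else {
                    existing.Quantity += p.Quantity;
                    // Business rule: the customer pays the minimum price they saw
                    // while building the cart.
                    existing.Price = Math.min(existing.Price, p.Price);
                }
            }
        }
        return { CartId: g.key, Products: Object.values(products) };
    });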

Most of the code here is to deal with merging of products from multiple add actions, but even that should be pretty simple. You can see that there is a business rule here. The customer will be paying the minimum price they encountered throughout the process of building their shopping cart.

Next, let’s handle the removal of items from the cart, which is done in two steps. First, we map the remove events:
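
A sketch of that map (illustrative):

    map('RemoveFromCartEvents', e => ({
        CartId: e.CartId,
        Products: [{ Product: e.Product, Quantity: -e.Quantity, Price: 0 }]
    }));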

There are a few things to note here: the quantity is negative and the price is zeroed, which necessitates changes in the reduce as well. Here they are:

As you can see, we now only get the cheapest price above zero, and we’ll remove empty items from the cart. The final step we have to take is to handle the payment events. We’ll start with the map first, obviously.

Note that we added a new field to the output. Just like we set the Products field in the pay-for-cart map to an empty array, we need to update the rest of the maps to include a Paid: {} to match the structure. This is because all the maps (and the reduce) in an index must output the same shape.

And now we can update the reduce accordingly. Here is the third version:

This is almost there, but we still need to do a bit more work to get the final output right. To make things interesting, I changed things up a bit and here is how we are paying for this cart:


And here is the final version of the reduce:

And the output of this is:


You can see that this is a bit different from what I originally envisioned. This is mostly because I’m bad at JavaScript and likely took many shortcuts along the way to make things easy for myself. Basically, it was easier to do the internal grouping using an object than using arrays.

Some final thoughts:

  • A shopping cart is usually going to be fairly small, with a few dozen events in the common case. This method works great for this, but it will also scale nicely if you need to aggregate over tens of thousands of events.
  • A key concept here is that the reduce portion is called recursively on all the items, incrementally building the data until we can’t reduce it any further. That means that the output we get should also serve as the input to the reduce. This takes some getting used to, but it is a very powerful technique.
  • The output of the index is a complete model, which you can use inside your system. In the next post, I’ll discuss how we can more fully flesh this out.

If you want to play with this, you can get the dump of the database that you can import into your own copy of RavenDB (or our live demo instance).

time to read 5 min | 962 words

I got into an interesting discussion about Event Sourcing in the comments for a post and that was interesting enough to make a post all of its own.

Basically, Harry is suggesting (I’m paraphrasing, and maybe not too accurately) a potential solution: keep the model computed from all the events directly in memory. The idea is that you can pretty easily get machines with enough RAM to store stupendous amounts of data in memory. That will give you all the benefits of being able to hold a rich domain model without any persistence constraints. It is also likely to be faster than any other solution.

And to a point, I agree. It is likely to be faster, but that isn’t enough to make this a good solution for most problems. Let me point out a few cases where this fails to be a good answer.

If the only way you have to build your model is to replay your events, then that is going to be a problem when the server restarts. Assuming a reasonably sized data model of 128GB or so, and assuming that we have enough events to build something like that, let’s say about 0.5 TB of raw events, we are going to be in a world of hurt. Even assuming no I/O bottlenecks, I believe that it would be fair to state that you can process the events at a rate of 50 MB/sec. That gives us just under 3 hours to replay all the events from scratch. You can try to play games here, try to read in parallel, replay events on different streams independently, etc. But it is still going to take time.

And enough time that this isn’t a good technology to have without a good backup strategy, which means that you need to have at least a few of these machines and ensure that you have some failover between them. But even ignoring that, and assuming that you can indeed replay all your state from the events store, you are going to run into other problems with this kind of model.

Put simply, if you have a model that is tens or hundreds of GB in size, there are two options for its internal structure. On the one hand, you may have a model where each item stands on its own, with no relations to other items. Or if there are any relations to other items, they are well scoped to a particular root. Call it the Root Aggregate model, with no references between aggregates. You can make something like that work, because you have a good isolation between the different items in memory, so you can access one of them without impacting another. If you need to modify it, you can lock it for the duration, etc.

However, if your model is interconnected, so you may traverse between one Root Aggregate to another, you are going to be faced with a much harder problem.

In particular, because there are no hard breaks between the items in memory, you cannot safely / easily mutate a single item without worrying about access from another item to it. You could make everything single threaded, but that is a waste of a lot of horsepower, obviously.

Another problem with in memory models is that they don’t do such a good job of allowing you to roll back operations. If you run your code mutating objects and hit an exception, what is the current state of your data?

You can resolve that. For example, you can decide that you have only immutable data in memory and replace it atomically. That… works, but it requires a lot of discipline and makes it complex to program against.

Off the top of my head, you are going to be facing problems around atomicity, consistency and isolation of operations. We aren’t worried about durability because this is purely in memory solution, but if we were to add that, we would have ACID, and that does ring a bell.

The in memory solution sounds good, and it is usually very easy to start with, but it suffers from major issues when used in practice. To start with, how do you look at the data in production? That is something that you do surprisingly often, to figure out what is going on “behind the scenes”. So you need some way to peek into what is going on. If your data is in memory only, and you haven’t thought about how to explore it from the outside, your only option is to attach a debugger, which is… unfortunate. Given the reluctance to restart the server (startup time is high), you’ll usually find that you have to provide some scripting that you can run in process to make changes, inspect things, etc.

Versioning is also a major player here. Sooner or later you’ll probably put the data inside a memory mapped file to allow for (much) faster restarts, but then you have to worry about the structure of the data and how it is modified over time.

None of the issues I have raised is super hard to figure out or fix, but in conjunction? They turn out to be a pretty big set of additional tasks that you have to do just to be in the same place you were before you started to put everything in memory to make things easier.

In some cases, this is perfectly acceptable. For high frequency trading, for example, you would have an in memory model to make decisions on as fast as possible as well as a persistent model to query on the side. But for most cases, that is usually out of scope. It is interesting to write such a system, though.

time to read 6 min | 1043 words

I had some really interesting discussions while I was at CodeMash, and a few of them touched on modeling concerns with non-trivial architectures. In particular, I was asked about my opinion on the role of OR/M in systems that mostly do CQRS, event processing, etc.

This is a deep question, because on first glance, your requirements from the database are pretty much just:

INSERT INTO Events(EventId, AggregateId, Time, EventJson) VALUES (…)

There isn’t really the need to do anything more interesting than that. The other side of that is a set of processes that operate on top of these event streams and produce read models that are very simple to consume as well. There isn’t any complexity in the data architecture at all, and joy to the world, etc., etc.

This is true, to an extent. But this is only because you have moved a critical component of your system elsewhere: the beating heart of your business. The logic, the rules, the things that make a system more than just a dumb repository of strings and numbers.

But first, let me make sure that we are on roughly the same page. In such a system, we have:

  • Commands – that cannot return a value (but will synchronously fail if invalid). These mutate the state of the system in some manner.
  • Events – represent something that has (already) happened. Cannot be rejected by the system, even if they represent invalid state. The state of the system can be completely rebuilt from replaying these events.
  • Queries – that cannot mutate the state

I’m mixing here two separate architectures, Command Query Responsibility Segregation and Event Sourcing. They aren’t the same, but they often go hand in hand, and it makes sense to talk about them together.

And because it is always easier for me to talk in concrete, rather than abstract, terms, I want to discuss a system I worked on over a decade ago. That system was basically a clinic management system, and the part that I want to talk about today was the staff scheduling option.

Scheduling shifts is a huge deal, even before we get to the part where it directly impacts how much money you get at the end of the month. There are a lot of rules, regulations, union contracts, agreements and a bunch of other stuff that relates to it. So this is a pretty complex area, and when you approach it, you need to do so with the due consideration that it deserves. When we want to apply CQRS/ES to it, we can consider the following factors:

The aggregates that we have are:

  • The open schedule for two months from now. This is mutable, being worked on by the head nurse, and constantly changes.
  • The proposed schedule for next month. This one is closed and changes only rarely, usually because of big stuff (someone being fired, etc).
  • The planned schedule for the current month, frozen, cannot be changed.
  • The actual schedule for the current month. This is changed if someone doesn’t show to their shift, is sick, etc.

You can think of the first three as various stages of a PlannedSchedule, but the ActualSchedule is something different entirely. There are rules around how much divergence you can have between the planned and actual schedules, which impact compensation for the people involved, for example.

Speaking of which, we haven’t yet talked about:

  • Nurses / doctors / staff – which are being assigned to shifts.
  • Clinics – a nurse may work in several different locations at different times.

There is a lot of other stuff that I’m ignoring here, because it would complicate the picture even further, but that is enough for now. For example, regardless of the shifts that a person was assigned to and showed up for, they may have worked more hours (had to come to a meeting, drove to a client) and that complicates payroll, but that doesn’t matter for the scheduling.

I want to focus on two actions in this domain. First, the act of the head nurse scheduling a staff member to a particular shift. And second, the ClockedOut event which happens when a staff member completes a shift.

The ScheduleAt command places a nurse at a given shift in the schedule, which seems fairly simple on its face. However, the act of processing the command is actually really complex. Here are some of the things that you have to do:

  • Ensure that this nurse isn’t scheduled for another shift, either concurrently or too close to another shift at a different address.
  • Ensure that the nurse doesn’t work with X (because issues).
  • Ensure that the role the nurse has matches the required parameters for the schedule.
  • Ensure that the number of double shifts in a time period is limited.

The last one, in particular, is a sinkhole of time. Because at the same time, another business rule says that we must give each nurse N number of shifts in a time period, and yet another dictates how to deal with competing preferences, etc.

So at this point, we have: ScheduleAtCommand.Execute(), and we need to apply logic: complex, changing, business-critical logic.

And at this point, for that particular part of the system, I want to have a full domain, abstracted persistence and be able to just put my head down and focus on solving the business problem.

The same applies for the ClockedOut event. Part of processing it means that we have to look at the nurse’s employment contract, count the amount of overtime worked, compute total number of hours worked in a pay period, etc. Apply rules from the clinic to the time worked, apply clauses from the employment contract to the work, etc. Again, this gets very complex very fast. For example, if you have a shift from 10PM – 6 AM, how do you compute overtime? For that matter, if this is on the last day of the month, when do you compute overtime? And what pay period do you apply it to?

Here, too, I want to have a fully fleshed out model, which can operate in the problem space freely.

In other words, a CQRS/ES architecture is going to have the domain model (and some sort of OR/M) in the middle, doing the most interesting things and tackling the heart of complexity.

