Data modeling with indexes: Event sourcing – Part I
In this post, I want to take the notion of doing computation inside RavenDB’s indexes to the next stage. So far, we talked only about indexes that work on a single document at a time, but that is just the tip of the iceberg of what you can do with indexes inside RavenDB. What I want to talk about today is the ability to do computations over multiple documents and aggregate them. The obvious example is in the following RQL query:
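For example, a simple RQL aggregation (the collection and field names here are illustrative, not the exact ones from the original query):

```
from Orders
group by Company
select Company, count() as Count
```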
That is easy to understand; it is a simple aggregation of data. But it can get a lot more interesting. To start with, you can add your own aggregation logic here, which opens up some interesting ideas. Event sourcing, for example, is basically a set of events on a subject that are aggregated into the final model. Probably the most classic example of event sourcing is the shopping cart. In such a model, we have the following events:
- AddItemToCart
- RemoveItemFromCart
- PayForCart
Here is what these look like, in document form:
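A hypothetical set of such event documents, with assumed field names, could be:

```json
[
  { "Type": "AddItemToCart", "CartId": "carts/294", "ProductId": "products/7",
    "ProductName": "Milk", "Quantity": 3, "Price": 2.50 },
  { "Type": "AddItemToCart", "CartId": "carts/294", "ProductId": "products/12",
    "ProductName": "Bread", "Quantity": 2, "Price": 1.25 },
  { "Type": "RemoveItemFromCart", "CartId": "carts/294", "ProductId": "products/7",
    "Quantity": 1 },
  { "Type": "PayForCart", "CartId": "carts/294", "Method": "CreditCard", "Amount": 7.50 }
]
```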
We add a couple of items to the cart, remove excess quantity and pay for the whole thing. Pretty simple model, right? But how does this relate to indexing in RavenDB?
Well, the problem here is that we don’t have a complete view of the shopping cart. We know what the actions were, but not what its current state is. This is where our index comes into play, so let’s see how it works.
The final result of the cart should be something like this:
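One possible shape for the aggregated cart (field names and values are assumptions for illustration):

```json
{
  "CartId": "carts/294",
  "Products": {
    "products/7":  { "Name": "Milk",  "Quantity": 2, "Price": 2.50 },
    "products/12": { "Name": "Bread", "Quantity": 2, "Price": 1.25 }
  },
  "Paid": { "CreditCard": 7.50 }
}
```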
Let’s see how we get there, shall we?
We’ll start by processing the add to cart events, like so:
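A hypothetical sketch of that map, written as a standalone function so it can run anywhere (in a real RavenDB JavaScript index this logic would be registered with map(); field names are assumptions):

```javascript
// Hypothetical map for AddItemToCart events. It emits a partial cart model:
// a single entry in Products, keyed by the product id.
function mapAddItem(event) {
  return {
    CartId: event.CartId,
    Products: {
      [event.ProductId]: {
        Name: event.ProductName,
        Quantity: event.Quantity,
        Price: event.Price
      }
    }
  };
}
```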
As you can see, the map phase here builds the relevant parts of the end model directly. But we still need to complete the work by doing the aggregation. This is done in the reduce phase, like so:
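A hypothetical sketch of that reduce, again as a standalone function (a RavenDB JavaScript index would express this with groupBy().aggregate(); field names are assumptions):

```javascript
// Hypothetical reduce over the partial cart models: entries for the same
// cart are merged, quantities are summed, and the minimum price wins.
function reduceCarts(values) {
  const result = { CartId: values[0].CartId, Products: {} };
  for (const v of values) {
    for (const id of Object.keys(v.Products)) {
      const p = v.Products[id];
      const existing = result.Products[id];
      if (!existing) {
        result.Products[id] = { Name: p.Name, Quantity: p.Quantity, Price: p.Price };
      } else {
        existing.Quantity += p.Quantity;
        // business rule: the customer pays the minimum price they
        // encountered while building the cart
        existing.Price = Math.min(existing.Price, p.Price);
      }
    }
  }
  return result;
}
```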
Most of the code here deals with merging products from multiple add actions, but even that should be pretty simple. You can see that there is a business rule here: the customer will pay the minimum price they encountered throughout the process of building their shopping cart.
Next, let’s handle the removal of items from the cart, which is done in two steps. First, we map the remove events:
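A hypothetical sketch of the removal map, with the same assumed field names as the rest of this walkthrough:

```javascript
// Hypothetical map for RemoveItemFromCart events: the quantity is negated so
// it cancels out additions, and the price is zeroed so a removal can never
// win the minimum-price rule in the reduce.
function mapRemoveItem(event) {
  return {
    CartId: event.CartId,
    Products: {
      [event.ProductId]: { Quantity: -event.Quantity, Price: 0 }
    }
  };
}
```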
There are a few things to note here: the quantity is negative and the price is zeroed, which necessitates changes in the reduce as well. Here they are:
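A hypothetical second version of the reduce, adjusted for removals (field names are assumptions):

```javascript
// Second version of the hypothetical reduce: the cheapest price above zero
// wins (removals carry a price of zero), and items whose quantity dropped
// to zero or below are removed from the cart entirely.
function reduceCarts(values) {
  const result = { CartId: values[0].CartId, Products: {} };
  for (const v of values) {
    for (const id of Object.keys(v.Products)) {
      const p = v.Products[id];
      let existing = result.Products[id];
      if (!existing) {
        existing = result.Products[id] = { Name: p.Name, Quantity: 0, Price: 0 };
      }
      existing.Quantity += p.Quantity;
      existing.Name = existing.Name || p.Name;
      // cheapest price that is above zero
      if (p.Price > 0 && (existing.Price === 0 || p.Price < existing.Price)) {
        existing.Price = p.Price;
      }
    }
  }
  // drop items that were fully removed from the cart
  for (const id of Object.keys(result.Products)) {
    if (result.Products[id].Quantity <= 0) {
      delete result.Products[id];
    }
  }
  return result;
}
```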
As you can see, we now only keep the cheapest price above zero, and we’ll remove empty items from the cart. The final step we have to take is to handle the payment events. We’ll start with the map first, obviously.
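A hypothetical sketch of the payment map (field names are assumptions):

```javascript
// Hypothetical map for PayForCart events. Products is an empty object so the
// output matches the shape of the other maps; Paid records the payment,
// keyed by the payment method.
function mapPayForCart(event) {
  return {
    CartId: event.CartId,
    Products: {},
    Paid: { [event.Method]: event.Amount }
  };
}
```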
Note that we added a new field to the output. Just like we set the Products field in the pay-for-cart map to an empty object, we need to update the rest of the maps to include a Paid: {} to match the structure. This is because all the maps (and the reduce) in an index must output the same shape.
And now we can update the reduce accordingly. Here is the third version:
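A hypothetical third version, which also merges the new Paid field (field names are assumptions; this version assumes at most one payment event per cart, so a plain object merge suffices):

```javascript
// Third version of the hypothetical reduce: Paid is now part of the shared
// shape, so it must be merged too. With a single payment event per cart,
// a plain object merge is enough.
function reduceCarts(values) {
  const result = { CartId: values[0].CartId, Products: {}, Paid: {} };
  for (const v of values) {
    for (const id of Object.keys(v.Products)) {
      const p = v.Products[id];
      let existing = result.Products[id];
      if (!existing) {
        existing = result.Products[id] = { Name: p.Name, Quantity: 0, Price: 0 };
      }
      existing.Quantity += p.Quantity;
      existing.Name = existing.Name || p.Name;
      // cheapest price above zero still wins
      if (p.Price > 0 && (existing.Price === 0 || p.Price < existing.Price)) {
        existing.Price = p.Price;
      }
    }
    Object.assign(result.Paid, v.Paid);
  }
  for (const id of Object.keys(result.Products)) {
    if (result.Products[id].Quantity <= 0) {
      delete result.Products[id];
    }
  }
  return result;
}
```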
This is almost there, but we still need to do a bit more work to get the final output right. To make things interesting, I changed things up a bit and here is how we are paying for this cart:
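For illustration, assume the cart is now paid in two parts, with hypothetical documents such as:

```json
[
  { "Type": "PayForCart", "CartId": "carts/294", "Method": "Cash",       "Amount": 2.50 },
  { "Type": "PayForCart", "CartId": "carts/294", "Method": "CreditCard", "Amount": 5.00 }
]
```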
And here is the final version of the reduce:
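A hypothetical final version (field names are assumptions): payments are now summed per method, so multiple payment events aggregate correctly even when the reduce is applied recursively to its own output.

```javascript
// Final version of the hypothetical reduce: payments are summed per method,
// so split payments (and repeated use of the same method) are handled.
function reduceCarts(values) {
  const result = { CartId: values[0].CartId, Products: {}, Paid: {} };
  for (const v of values) {
    for (const id of Object.keys(v.Products)) {
      const p = v.Products[id];
      let existing = result.Products[id];
      if (!existing) {
        existing = result.Products[id] = { Name: p.Name, Quantity: 0, Price: 0 };
      }
      existing.Quantity += p.Quantity;
      existing.Name = existing.Name || p.Name;
      // cheapest price above zero wins
      if (p.Price > 0 && (existing.Price === 0 || p.Price < existing.Price)) {
        existing.Price = p.Price;
      }
    }
    // sum payments per method
    for (const method of Object.keys(v.Paid || {})) {
      result.Paid[method] = (result.Paid[method] || 0) + v.Paid[method];
    }
  }
  for (const id of Object.keys(result.Products)) {
    if (result.Products[id].Quantity <= 0) {
      delete result.Products[id];
    }
  }
  return result;
}
```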
And the output of this is:
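With a cart paid partly in cash and partly by credit card, the output would be along these lines (values are illustrative):

```json
{
  "CartId": "carts/294",
  "Products": {
    "products/7":  { "Name": "Milk",  "Quantity": 2, "Price": 2.50 },
    "products/12": { "Name": "Bread", "Quantity": 2, "Price": 1.25 }
  },
  "Paid": { "Cash": 2.50, "CreditCard": 5.00 }
}
```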
You can see that this is a bit different from what I originally envisioned. This is mostly because I’m bad at JavaScript and likely took many shortcuts along the way to make things easy for myself. Basically, it was easier to do the internal grouping using an object than using arrays.
Some final thoughts:
- A shopping cart is usually going to be fairly small, with a few dozen events in the common case. This method works great for that, but it will also scale nicely if you need to aggregate over tens of thousands of events.
- A key concept here is that the reduce portion is called recursively on all the items, incrementally building the data until we can’t reduce it any further. That means that the output we get should also serve as the input to the reduce. This takes some getting used to, but it is a very powerful technique.
- The output of the index is a complete model, which you can use inside your system. In the next post, I’ll discuss how we can flesh this out more fully.
If you want to play with this, you can get the dump of the database that you can import into your own copy of RavenDB (or our live demo instance).
More posts in "Data modeling with indexes" series:
- (22 Feb 2019) Event sourcing–Part III–time sensitive data
- (11 Feb 2019) Event sourcing–Part II
- (30 Jan 2019) Event sourcing–Part I
- (14 Jan 2019) Predicting the future
- (10 Jan 2019) Business rules
- (08 Jan 2019) Introduction
Comments
All fine and dandy, and I know this is only an imaginary web shop, but when adding items to a cart I would like to see the updated price right now, without waiting for the map-reduce to complete. And, since it's about event sourcing, shouldn't the system keep the order of events when processing them?
Rafal, Actually, you typically won't go to the cart after adding an item; you'll go on to purchase something else. In most cases, the lag time between a new event and the map/reduce index running is very short (milliseconds).
And ordering of events is typically something that has to be valid in the domain itself. For example, my bank will order events based on: Business Day and then by Deposit / Withdrawal. In other words, for any business day, first list all the incoming funds, then take out the money. This is done to avoid overdraft fees unless you actually hit an overdraft.
It's not a tiny difference, it's a radical change of application behavior if you switch from sync to async. And you know you can stare at the screen and nothing will update in your cart until you refresh the page, which is not so obvious to everyone. I myself thought many times how clever my async processing was and how great the performance, all at the negligible cost of some delayed updates - a single second doesn't matter at all in normal business. But the users kept complaining about ghosts they see in the system - records that show up where they shouldn't, old information shown right after they have changed it, updates not visible to users after somebody else saved them while talking on the phone... They would not complain about how email works, but they expect immediate and consistent information from a transactional system...
Rafal,
In many event sourcing systems, by definition, the actual event processing in the backend is not tied to the request that generated the event itself. Breaking apart that dependency is one of the explicit goals of the architecture, after all.
Instead, in the UI, you set things up so you will get the right results after the computation ends. In RavenDB, you can do that using the Changes API, for example, to watch for updates on a particular index. At that point, you'll send the updates to the page.
And if your system is based around showing static pages, that would be a problem, because whatever you got in the system when the user made the request is what it is. But we are in a world where getting updates after the page load is common and expected, and that is something that you should take into account in your system.
Yep, and now you need additional live updates on the page and who knows what else to keep the view in sync with the data. But IMHO this looks like a bad match from the beginning - trying to mix a web page view with async processing on the server - it means that no matter how fast your map-reduce is and how few milliseconds it takes to update the index, it's always too late, because you've already sent stale data to the user. And such a situation repeats everywhere in the system - one poor decision brings a whole lot of consequences to every element of the software. It's easier to mitigate that in a desktop application, where you can maintain constant UI sync with the database, but doing this in a web app is just... not rational.
Rafal, It is really important to put this technique in context. I absolutely agree that doing something like this for simple CRUD is way past overkill. However, in most complex systems, you are already processing things in something similar to this. What do I mean?
Consider the case of searching for a flight. You put in the dates and locations, and then you get a page that slowly fills up with details. Another example is getting mortgage approval: after filling in all your details, you'll see a page that tells you to wait, etc.
Usually, whenever you have a complex domain, you are already not making decisions on the spot. You are passing them to the backend for further processing. And given humans' attention spans, if you can't do that in a second or so, you have to go async in the UI anyway.
A shopping cart is a great example to show the technical side of things, because it is simple and it is obvious what is going on. It wouldn't really be a good idea to go to that level if you have this simple a cart. But once you go for real things, that is absolutely something that you need to do. Processing your cart at the supermarket is, on the surface, about as simple as this, but it gets really complex because you have to take into account things like reward points, sales, taxes, etc. At that stage, it makes sense to model this explicitly rather than implicitly, and applying event sourcing is one way to do that.
Agreed; my point is that talking about CQRS in the context of a web application is a wrong idea, and I'm not sure why so many people try to pursue it. A web UI is inherently synchronous in character, organized around HTTP requests. And when you POST some data, you expect to also get some data in return, in the same roundtrip. At least that's how we did things so far. Now, CQRS tells me to separate commands and queries, so to really follow this pattern I would have to make two HTTP requests instead of each POST - the first request to send my update, and a second one to query for the data. And if the operation is async, I have no idea when to query for the data to see the effects of my update, so I have to keep querying all the time... So maybe it would hurt less to choose a different example instead of trying to imagine all the painful consequences of doing orthodox CQRS in a web UI.
Rafal, The web UI and the HTTP request/reply model have nothing to do with it. It is very common to submit a request, get a ticket, and then wait / poll on that ticket.
Two requests for each POST is also really common: Post, Redirect, Get (PRG) has been a staple of web behavior for a very long time. It saves you the "resubmit the order on F5" problem, after all, as well as many other things.