Data modeling with indexes: Event sourcing–Part II
In the previous post I talked about how to use a map reduce index to aggregate events into a final model. You can see this on the right. This is an interesting use case of indexing, and it can consolidate a lot of complexity into a single place, at which point you can utilize additional tooling available inside of RavenDB.
As a reminder, you can get a dump of the database and import it into your own copy of RavenDB (or our live demo instance) if you want to follow along with this post.
Starting from the previous index, all we need to do is edit the index definition and set the Output Collection, like so:
What does this do? This tells RavenDB that in addition to indexing the data, it should also take the output of the index and create new documents from it in the ShoppingCarts collection. Here is what these documents look like:
You can see at the bottom that this document is flagged as artificial and coming from an index. The document id is a hash of the reduce key, so changes to the same cart will always go to this document.
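To make the mechanics concrete, here is a minimal Python simulation of the idea, not RavenDB code. The event shape, the metadata keys and the SHA-256 based id are all assumptions made for illustration; RavenDB uses its own hash function and metadata format. The point is only that the reduce output for a given cart always lands in the same document.

```python
import hashlib
from collections import defaultdict

# Hypothetical event documents; the field names are assumptions for illustration.
events = [
    {"Cart": "carts/294-A", "Type": "AddItem", "Product": "products/1", "Quantity": 2},
    {"Cart": "carts/294-A", "Type": "AddItem", "Product": "products/2", "Quantity": 1},
    {"Cart": "carts/294-A", "Type": "Pay", "Amount": 35.0},
    {"Cart": "carts/7-B", "Type": "AddItem", "Product": "products/1", "Quantity": 1},
]


def reduce_cart(cart_id, cart_events):
    """Fold all the events of one cart (the reduce key) into a single model."""
    items = defaultdict(int)
    paid = 0.0
    for e in cart_events:
        if e["Type"] == "AddItem":
            items[e["Product"]] += e["Quantity"]
        elif e["Type"] == "Pay":
            paid += e["Amount"]
    return {"Cart": cart_id, "Items": dict(items), "Paid": paid}


def artificial_id(collection, reduce_key):
    """Derive a deterministic document id from the reduce key.

    SHA-256 is a stand-in here, NOT the hash RavenDB actually uses.
    """
    digest = hashlib.sha256(reduce_key.encode("utf-8")).hexdigest()[:16]
    return f"{collection}/{digest}"


def run_index(all_events, collection="ShoppingCarts"):
    """Group events by cart, reduce each group, and emit one document per cart."""
    grouped = defaultdict(list)
    for e in all_events:
        grouped[e["Cart"]].append(e)

    docs = {}
    for cart_id, cart_events in grouped.items():
        doc = reduce_cart(cart_id, cart_events)
        # Illustrative flags only; the real metadata format may differ.
        doc["Metadata"] = {"Collection": collection, "Artificial": True, "FromIndex": True}
        docs[artificial_id(collection, cart_id)] = doc
    return docs


shopping_carts = run_index(events)
for doc_id, doc in shopping_carts.items():
    print(doc_id, doc["Cart"], doc["Items"], doc["Paid"])
```

Running `run_index` again after appending another event for `carts/294-A` rewrites the same output document, because the id depends only on the reduce key.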
What is important about this feature is that once the result of the index is a document, we can operate on it using all the usual tools for indexes. For example, we might want to create another index on top of the shopping carts, like the following example:
In this case, we are building another aggregation: taking all the paid shopping carts and computing the total sales per product from them. Note that we are still operating on top of our event streams, but we are now able to extract a second-level aggregation from the data.
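As a rough sketch of what such a second-level aggregation computes, here is a plain Python version. The shape of the ShoppingCarts documents (the `Paid` flag, `Items`, `Quantity`, `Price`) is an assumption for illustration and may not match the actual index output.

```python
from collections import defaultdict

# Hypothetical ShoppingCarts documents; the shape is an assumption for illustration.
shopping_carts = [
    {"Cart": "carts/294-A", "Paid": True, "Items": [
        {"Product": "products/1", "Quantity": 2, "Price": 10.0},
        {"Product": "products/2", "Quantity": 1, "Price": 15.0},
    ]},
    {"Cart": "carts/7-B", "Paid": True, "Items": [
        {"Product": "products/1", "Quantity": 1, "Price": 10.0},
    ]},
    {"Cart": "carts/12-C", "Paid": False, "Items": [
        {"Product": "products/2", "Quantity": 5, "Price": 15.0},
    ]},
]


def sales_per_product(carts):
    """Map: emit one entry per item of each paid cart. Reduce: sum per product."""
    totals = defaultdict(lambda: {"Quantity": 0, "Total": 0.0})
    for cart in carts:
        if not cart["Paid"]:
            continue  # unpaid carts do not count as sales
        for item in cart["Items"]:
            entry = totals[item["Product"]]
            entry["Quantity"] += item["Quantity"]
            entry["Total"] += item["Quantity"] * item["Price"]
    return dict(totals)


for product, entry in sorted(sales_per_product(shopping_carts).items()):
    print(product, entry["Quantity"], entry["Total"])
```

The input here is the output of the first aggregation, which is what makes this a second-level aggregation: the events themselves are never read at this stage.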
Of course, normal indexes on top of the artificial ShoppingCarts allow you to do things like: “Show me my previous orders”. In essence, you use the events for your writes, define the aggregation to the final model in an index, and then RavenDB takes care of the read model.
Another option to consider is not doing the read model and the full work on the same database instance as your events. Instead, you can output the documents to a collection and then use RavenDB’s native ETL capabilities to push them to another database (which can be another RavenDB instance or a relational database) for further processing.
The end result is a system that is built on dynamic data flow. Add an event to the system, and the index will go through it, aggregate it with other events on the same root and output the result to a document. At that point more indexes will pick it up and do further work, ETL will push it to other databases, subscriptions can start operating on it, etc.
More posts in "Data modeling with indexes" series:
- (22 Feb 2019) Event sourcing–Part III–time sensitive data
- (11 Feb 2019) Event sourcing–Part II
- (30 Jan 2019) Event sourcing–Part I
- (14 Jan 2019) Predicting the future
- (10 Jan 2019) Business rules
- (08 Jan 2019) Introduction
 



Comments
1) How does Raven handle hash collisions?
2) How would I (efficiently) find a certain cart in this secondary collection? I mean, the secondary document id is some unrelated hash that my application has no knowledge about. My agg-root id would still be the cart name (i.e. 'carts/294-A'). But wouldn't querying the ShoppingCarts collection by cart name trigger a full search of the collection, instead of a much more efficient lookup by doc-id? Adding a new index to it would solve the problem, but I have a gut feeling that it is a bit 'over-engineered' to add an index just to perform a basic lookup by doc-id.
Maybe using some predictable concatenation/aggregation/user-specified-formula/whatever of the reduce key would make this scenario easier to work with, at the price of putting the burden of guaranteeing uniqueness on the user.
Kurbien,
1) This is explicitly handled. In the rare case that you have a hash collision, we will have document ids that look like carts/some-hash/1, carts/some-hash/2, etc., instead of just carts/some-hash, which is the normal behavior. We had to explicitly override the hash generation to be able to test it, since we are also using a high quality hash function.
2) You issue a query on this, and RavenDB will build the index for it. We considered using the raw value or letting the user control it, but that led to a lot of complexity. Note that the actual hash is completely predictable, so you can go from the well known value to the generated document id easily enough. But usually querying it will be easiest, and RavenDB will take care of optimizing access behind the scenes.
This post mentions that the database backup can be imported into the live instance at http://live-test.ravendb.net
However, when creating a new database from a backup, there is no option to upload a backup, only to select a local one.
Thanks for this series. I see some interesting possibilities for offloading rather simple event sourcing operations to RavenDB indexes completely, i.e. without having to adopt more heavyweight infrastructure, libs or frameworks. What’s missing from the picture for me though is how I’d be able to do ordered event processing, e.g. based on timestamp.
I always assumed that one of the core benefits of map reduce is that it scales horizontally because ordering is not important. So what options do I have to do ordered event processing with RavenDB? I have thought of a separate index to build sorted aggregate docs and then work on those (con: doc size), or pushing it to the application using the changes API. Would love to see your take on this, and for you to expand on it in a future post.
Can you elaborate on this? Do you mean querying the index, or the artificial document?
I’m in the process of using ETL to transform events into a read model right now, and wondering about the latency of ETL vs data subscriptions. It looks like they’re both implemented as server tasks, but does one method have a higher update rate than the other?
Dejan, use Settings, then Import, for this.
Ryan,
Querying the artificial documents directly, at which point RavenDB will optimize these queries with an index.
In both cases, the latency from document modification to the subscription / ETL being triggered is pretty much nil. The question is more about the processing time here. ETL typically runs entirely on the server and can push things out faster. However, it operates on the delete/create model, and you don't control the generated ids. Each time the source document is updated, the id of the matching destination document changes.
A subscription will get the documents that were changed, and can then act on them however it wants, which gives you more freedom and flexibility. If you want to write the result back to the server, you can do that, but it will require another round trip. In most cases, I don't think it matters at all.
Johannes, time is pretty hard to handle in a distributed system, and ordered event processing isn't a good idea when you do it in a distributed environment. The main issue is that you may get updates out of order. Now, you can try to sort them by time, but when a new event comes in, you'll have to revert to the state before that event, and then replay everything from that point forward.
In most cases, this isn't actually required because the aggregation doesn't care about the ordering itself. For example, aggregating transactions on an account to get the final tally. Let's take a case where it does matter: paying the mortgage. If you pay a mortgage late, it works very differently than paying on time. However, if you paid on time and the event didn't go through (which happens a lot), you need to reverse all the state changes that were made because of the lateness. I would deal with that in two stages. First, we define an index that operates over the events and uses (loan-id, month) as the key. That index computes the state of the loan in a particular month and outputs it to an artificial collection. Then you have another index that operates on those documents and aggregates the overall state of the loan. This way, a missed payment will show up (and I don't care about the order), and if the money was paid, the first index will be updated and then the second one, resulting in the behavior we want.
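A minimal Python sketch of this two-stage idea follows. It is a simulation, not RavenDB index code, and the event types, field names and the "missed" rule (paid less than due) are assumptions for illustration. Stage one groups by (loan, month), stage two aggregates the monthly documents per loan. Since both stages only sum and compare, the order in which events arrive does not change the result.

```python
from collections import defaultdict

# Hypothetical loan events; names and shapes are assumptions for illustration.
events = [
    {"Loan": "loans/1", "Month": "2019-01", "Type": "PaymentDue", "Amount": 1000},
    {"Loan": "loans/1", "Month": "2019-01", "Type": "PaymentReceived", "Amount": 1000},
    {"Loan": "loans/1", "Month": "2019-02", "Type": "PaymentDue", "Amount": 1000},
]


def monthly_state(all_events):
    """Stage 1: reduce events keyed by (loan, month) into one document per month."""
    months = defaultdict(lambda: {"Due": 0, "Paid": 0})
    for e in all_events:
        state = months[(e["Loan"], e["Month"])]
        if e["Type"] == "PaymentDue":
            state["Due"] += e["Amount"]
        elif e["Type"] == "PaymentReceived":
            state["Paid"] += e["Amount"]
    return [
        {"Loan": loan, "Month": month, "Due": s["Due"], "Paid": s["Paid"],
         "Missed": s["Paid"] < s["Due"]}
        for (loan, month), s in months.items()
    ]


def loan_state(monthly_docs):
    """Stage 2: aggregate the monthly documents into the overall loan state."""
    loans = defaultdict(lambda: {"TotalPaid": 0, "MissedMonths": []})
    for doc in monthly_docs:
        loan = loans[doc["Loan"]]
        loan["TotalPaid"] += doc["Paid"]
        if doc["Missed"]:
            loan["MissedMonths"].append(doc["Month"])
    for loan in loans.values():
        loan["MissedMonths"].sort()
    return dict(loans)


# Before the February payment event arrives, February shows up as missed.
before = loan_state(monthly_state(events))

# The payment event shows up late; its position in the list does not matter.
late_event = {"Loan": "loans/1", "Month": "2019-02", "Type": "PaymentReceived", "Amount": 1000}
after = loan_state(monthly_state([late_event] + events))

print(before)
print(after)
```

Once the late event is included, the February document flips from missed to paid and the overall loan state follows, without any explicit revert-and-replay step.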
I'll write a blog post on this, since this is an interesting topic and requires a proper example.