Using RavenDB for data aggregation from dynamic sources

[Image: four sample documents, one from each of the data sources listed below]

I got an interesting question from a customer recently and thought that it would make for a fun blog post. The issue the customer is facing is that they are aggregating data from many sources, and they need to make sense of all the data in a nice manner. For some of the sources, they have some idea about the data, but in many cases, the format of the data they get is pretty arbitrary.

Consider the image above: we have four different documents, each from a separate source:

  • titles/123-45-678/2022-01-28 – The car title
  • tickets/0000000000000000009-A – A parking ticket that was issued for a car
  • orders/0000000000000000010-A – An order from a garage detailing work done for a customer (which includes some cars)
  • claims/0000000000000000011-A – Details of a rejected insurance claim for a car

We need to make sense of all of this information and present the user with a coherent picture of a given car, drawn from all those sources. The system in question is primarily interested in cars, so what I would like to do is show a “car file”: all the information we have on hand for a particular car. The problem is that this is not trivial to do. In some cases, we have a field with the car’s license plate, but each data source names it differently. In the case of the Order document, the details about the specific service for the car are buried deep inside the document, in a free-form text field.

I can, of course, just index the whole thing and try to do a full-text search on the data. It would work, but can we do better than that?

A license plate in the system has the following format: 123-45-678. Can we take advantage of that?

If you said regex, you now have two problems :-).

Let’s see what we can do about this…

One way to handle this is to create a multi map-reduce index inside of RavenDB, mapping the relevant items from each collection and then aggregating the values by the car’s license plate from all sources. The problem with this approach is that you’ll need to specialize for each and every data source you have. Sometimes, you know what the data is going to look like and can get valuable insight from that, but in other cases, we are dealing with whatever the data providers will give us…

For that reason, I created the following index, which uses a couple of neat techniques all at once to give me insight into the data that I have in the system, without requiring much time or complexity.
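Here is a sketch of the shape of that index, written in RavenDB’s JavaScript index syntax. Treat it as an outline rather than exact code: the collection names, the index wiring, and the summary format are illustrative assumptions, and you’d adapt the regex to your own plate format.

```javascript
// Recursively walk a document, collecting every string that matches
// the license plate pattern, whether the value is the plate itself
// or free-form text that happens to contain one.
function scanLicensePlates(obj, plates) {
    const pattern = /\b\d{3}-\d{2}-\d{3}\b/g;
    for (const key in obj) {
        const val = obj[key];
        if (typeof val === 'string') {
            let m;
            while ((m = pattern.exec(val)) !== null) {
                plates.push(m[0]);
            }
        } else if (val !== null && typeof val === 'object') {
            scanLicensePlates(val, plates);
        }
    }
    return plates;
}

// One map per collection; each emits a separate entry for every
// license plate found anywhere in the document (a fanout map).
map('Titles', doc => scanLicensePlates(doc, []).map(plate => ({
    LicensePlate: plate,
    Summary: ['Title: ' + id(doc)],
    Documents: [id(doc)]
})));
// ...identical maps are registered for 'Tickets', 'Orders' and
// 'Claims', with a matching Summary prefix for each.

// Aggregate everything we found for the same plate.
groupBy(x => x.LicensePlate).aggregate(g => ({
    LicensePlate: g.key,
    Summary: g.values.reduce((acc, v) => acc.concat(v.Summary), []),
    Documents: g.values.reduce((acc, v) => acc.concat(v.Documents), [])
}));
```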

This looks like a lot of code, I know, but the most complex part is the scanLicensePlates() portion. There we define a regex for the license plate format and recursively scan each document, looking for matches.

The idea is that we’ll find a license plate either directly in a field (such as Title.LicensePlate) or embedded in a field’s contents (such as the Orders.Lines.Task field). Regardless of where we find the data, in the map phase we’ll emit a separate value for each license plate detected in the document. We’ll then aggregate by the license plate in the reduce phase. Part of the complexity here comes from building a smart summary; here is roughly what the output of this index looks like:
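For the car from the documents above, an output entry would be shaped like this (the summary strings follow the simplified format from the sketch):

```json
{
    "LicensePlate": "123-45-678",
    "Summary": [
        "Title: titles/123-45-678/2022-01-28",
        "Ticket: tickets/0000000000000000009-A",
        "Order: orders/0000000000000000010-A",
        "Claim: claims/0000000000000000011-A"
    ],
    "Documents": [
        "titles/123-45-678/2022-01-28",
        "tickets/0000000000000000009-A",
        "orders/0000000000000000010-A",
        "claims/0000000000000000011-A"
    ]
}
```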

As you can see, the map-reduce index results will give us the following data items:

  • The license plate, obviously (which is how we’ll typically search this index; see the query below)
  • The summary of all the data items we have for this particular license plate, which is likely what we’ll want to show to the user
  • The ids of all the documents related to this license plate, so we can link back to the full source documents
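Assuming we deploy this as an index named 'Cars/Search' (the name is my own here), pulling up the full car file becomes a single cheap lookup:

```sql
// RQL query; 'Cars/Search' is an assumed index name
from index 'Cars/Search'
where LicensePlate = '123-45-678'
```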

The nice thing about this approach is that we are able to extract actionable information from the system with very little overhead. If we get new types of data sources in the future, they’ll merge seamlessly into the final output of this index.

Of course, if you know more about the data you are working with, you can probably extract more interesting information. For example, we may want to show the current owner of the car, which we can extract from the latest title document we have. Or we may want to compute how many previous owners a particular vehicle has had, etc.
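For instance, if the title documents carry the owner and an issue date (Owner and IssuedAt are assumed field names here), the Titles map can emit them and the reduce can keep the newest one:

```javascript
// Sketch only: Owner and IssuedAt are assumed fields on the title document.
// The other maps would emit Owner: null, IssuedAt: null, since every map in
// a multi-map/reduce index has to produce the same shape.
map('Titles', doc => scanLicensePlates(doc, []).map(plate => ({
    LicensePlate: plate,
    Owner: doc.Owner,
    IssuedAt: doc.IssuedAt,
    Summary: ['Title: ' + id(doc)],
    Documents: [id(doc)]
})));

groupBy(x => x.LicensePlate).aggregate(g => {
    // Keep the owner from the newest title seen for this plate.
    let newest = null;
    for (const v of g.values) {
        if (v.Owner && (newest === null || v.IssuedAt > newest.IssuedAt)) {
            newest = v;
        }
    }
    return {
        LicensePlate: g.key,
        Owner: newest === null ? null : newest.Owner,
        IssuedAt: newest === null ? null : newest.IssuedAt,
        Summary: g.values.reduce((acc, v) => acc.concat(v.Summary), []),
        Documents: g.values.reduce((acc, v) => acc.concat(v.Documents), [])
    };
});
```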

As a first step toward aggregating information from dynamic data sources, this gives us a lot of power. You can apply this technique in a wide variety of scenarios. If you find yourself doing coarse-grained searches and trying to regex your way to the right data, this sort of approach can drastically improve your performance and make it far easier to build a system that can be maintained over the long run.