Properly getting into jail: Data processing


In this series of blog posts, I have talked a lot about the way data flows, the notion of always talking to a local server, and the difference between each location’s own data and the shared data that comes from the other parts of the system.

Here is how it looks when modeling things:

[Diagram: each location’s own database and the shared database, with ETL and replication between them]

To simplify things, we have the notion of the shared database (which is how I typically get data from the rest of the system) and my own database. Data gets into the shared database using replication, which uses a gossip protocol, is resilient to network errors, can route around them, and so on. The application will only ever write data to its own database, never directly to the shared one. ETL processes will write the data from the application database to the local copy of the shared database, and from there it will be sent across to the other parties.
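To make the outgoing side concrete, here is a minimal sketch of the shape of that ETL step. The helper methods and field names here are hypothetical, purely to illustrate the filtering; in RavenDB the ETL process itself is defined on the server, not in application code.

```python
# Illustrative only: the application writes to its own database, and an
# ETL step publishes a filtered projection to the local copy of the
# shared database. "changed_documents" and "put" are hypothetical helper
# methods, not a specific client API.

def etl_release_workflows(app_db, shared_db):
    # Only data we explicitly choose to publish leaves the application DB.
    for doc in app_db.changed_documents(collection="ReleaseWorkflows"):
        shared_db.put(doc["Id"], {
            "InmateId": doc["InmateId"],
            "BlockId": doc["BlockId"],
            "Status": doc["Status"],
            # Internal bookkeeping fields deliberately stay behind.
        })
    # Gossip-based replication disseminates the shared DB from here.
```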

In terms of input/output, the outgoing process (write locally to the application database, ETL to the local copy of the shared database, automatic dissemination of the data to the rest of the world) is quite simple once you have finished the setup. It means that you don’t really have to think about the way you publish information, but you can still do so in a way that doesn’t constrain the development of the application (no shared database blues here, thank you!). However, that only deals with the outgoing side of things; how are we going to handle incoming data?

We need to remember that a core part of the design is that we aren’t just blindly copying data from the shared database. Even though that data is trusted, we still need to process it and reconcile it with what we have in our own database.

A good example of that is the inmate release workflow that we have already discussed. It is initiated by the Registration Office and sent to all the parties in the prison. Let’s see how a particular block is going to handle the processing of such a core scenario.

The actual workflow for releasing an inmate needs to be handled by many parties. From the block’s perspective, this means physically getting the inmate to the release party and handing over responsibility for that inmate. When the workflow document for the inmate release reaches the block’s shared database, we need to start the internal process inside the block to handle it. We can use RavenDB Subscriptions for this purpose. A subscription is a persistent query: any time a document matches the subscription query, we’ll get the matching document and can operate on it.

Here is what the subscription looks like:

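The original post showed the subscription definition as a screenshot. As a sketch, registering such a persistent query might look like this; the collection and property names are assumptions based on the description below, and create_subscription is a hypothetical helper over the RavenDB client:

```python
# Hypothetical sketch of registering the persistent query. The RQL is
# the interesting part: a filtered query over the shared database that
# keeps matching documents flowing to whoever opens the subscription.
create_subscription(
    name="block-b-release-workflows",
    query="from ReleaseWorkflows where BlockId = 'blocks/B'",
)
```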

Basically, it says “gimme all the release workflows for block B”. The idea of a persistent query is that whenever a new document arrives, if it matches the query, we’ll send it to the process that has this subscription opened. This means that we have a typical latency of a few milliseconds before we process the document in the worker process.

Now, let’s consider what we’ll need to do whenever we get a new release workflow. This can look like this:

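The original post embedded the script as an image; here is a minimal sketch of such a worker instead, assuming hypothetical stand-ins (my_db and the alert stub) rather than any specific client API:

```python
# Sketch of the ingestion worker for the subscription defined above.
# "my_db" is the block's own database; alerting is stubbed out.

def alert_sergeant(reason, workflow):
    # In the real system this would page a person, not print.
    print(f"ALERT ({reason}): inmate {workflow['InmateId']}")

def process_release_batch(batch, my_db):
    for workflow in batch:
        inmate = my_db.get(workflow["InmateId"])
        if inmate is None:
            # A release workflow for an inmate we don't actually hold:
            # someone must figure out why the warrant points at this block.
            alert_sergeant("inmate not in this block", workflow)
        else:
            # The inmate is ours: record the release order in our own
            # database (never the shared one) and start the handover.
            my_db.put(f"{workflow['InmateId']}/release-order", workflow)
            alert_sergeant("prepare inmate for release party", workflow)
```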
I’m obviously skipping stuff here, but you should get the gist (pun intended) of what is going on.

There are a couple of interesting things in here. First, you can see that I’m writing the code here in Python. I could have also used Ruby, node.js, etc.

The idea is that this is an internal ingestion pipeline for a particular workflow, independent of anything else that happens in the system. Basically, the idea is to have an Open to Extension, Closed to Modification kind of system. Integration with the outside world is done through subscriptions that filter the data we care about and integration scripts that operate over the stream of data. I’m using a Python script in this manner because it is easy to show how fluid this can be. I could have used a compiled application in C# or Java just as easily. But the idea in this architecture is that it is possible and easy to modify and manage things on the fly.

The subscription workers ingesting the documents from the subscriptions take the data from the shared database, process and validate it, and then decide what should be done next. For any batch of inmate release workflow documents, we’ll alert the sergeant (either way: we either need to release the inmate, or we need to figure out why the warrant points at us while the inmate is not in our hands).

A more complex script may check all the release workflows, in case the Registration Office’s record of which block the inmate is in is out of date, for example. We can also use these scripts to glue in additional players (preparing the release party to take the inmate, scheduling this in advance, etc.), but we might want to do that in our main app instead of in the scripts, to make what is going on more visible.
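As a sketch of what gluing in such an additional player might look like (the helper name and the scheduled-date field are both assumptions, not anything from the original system):

```python
# Hypothetical extension of the worker above: besides alerting the
# sergeant, also schedule the release party ahead of time.
# "schedule_release_party" is an assumed internal service.

def process_release_batch_extended(batch, my_db):
    for workflow in batch:
        if my_db.get(workflow["InmateId"]) is not None:
            schedule_release_party(
                inmate_id=workflow["InmateId"],
                when=workflow["ScheduledReleaseDate"],  # assumed field
            )
```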

The underlying architecture and shape of the application are quite explicit about the notion of data transfer, though, so it might be a good idea to do it in the scripts. A lot of that depends on whether this is shared functionality, something that is customized per block, etc.

More posts in "Properly getting into jail" series:

  1. (19 Mar 2018) The almighty document
  2. (16 Mar 2018) Data processing
  3. (15 Mar 2018) My service name in Janet
  4. (14 Mar 2018) This ain’t what you’re used to
  5. (12 Mar 2018) Didn’t we already see this warrant?
  6. (09 Mar 2018) The workflow of getting an inmate released
  7. (08 Mar 2018) Services with data sharing instead of RPC or messaging
  8. (07 Mar 2018) The topology of sharing
  9. (06 Mar 2018) Data flow
  10. (02 Mar 2018) It’s not a crime to be in an invalid state
  11. (01 Mar 2018) Physical architecture
  12. (28 Feb 2018) Counting Inmates and other hard problems
  13. (27 Feb 2018) Introduction & architecture