Implementing Background Processes
I got the following question (originally about RavenDB, but I generalized it a bit):
I'm currently working on a open source project where I need background processing. The main scenarios are:
- Processing data from a queue of incoming messages, like processing incoming mail that's put in a queue.
- Processing data from a lot of different web services.
I've worked with scheduling frameworks like quartz.net before to schedule processing but in this case I'm looking at much bigger amounts of processing. It would be nice to add more workers depending on the load like raven db.
I think my main question is what's your experience when building background workers? What should I think about? Is there any framework that can help me?
The first thing to understand is that for data processing, actually implementing queuing is going to be a losing proposition. The absolutely major cost for most data processing task is IO, and the best way to handle that is to handle this via batching. Queues doesn’t really work for this scenario because they make it hard to process a batch of changes in one shot. Queues are natural for “pull from queue, process, move to next message”, which isn’t good when you are processing large amount of information.
The way this is implemented in RavenDB is that I have ensured that there is a cheap way to query by “last updated timestamp”. After that, it means that I am able to issues queries such as:
Give me the next batch of updated documents since update point 121.
Those queries are very cheap (they are fully indexed queries at the storage level).
Following that, each data processing task merely need to keep track of the last update point that it processed. Things get a little more complex when you assume that there can be periods of time where no activity happens, since you want to avoid polling in that scenario.
With RavenDB, if a processing task doesn’t find anything to process, it goes to sleep, and we ensured that this can work by raising a notification whenever the database change, in which case we can wake the waiting tasks. This approach allows us to efficiently process data without waiting for scheduled tasks (which result in update delays), without polling (which consume additional resources) and without complex logic (scheduling, determining what changed, queues, etc).
I find this to be quite an elegant solution.
Comments
"we ensured that this can work by raising a notification whenever the database change"
This works when everything is in one place and the DB is closely aware of the background processor. What if the storage the DB is one entity on one machine (that we don't control, as most of us didn't write our own RavenDB we care about this ;) ), and we have the processor running either as a separate process on same machine or on another machine entirely. Can this notify pattern work? What technology would be used to wake a sleeping .Net processor by the DB (or anything else we want to process stuff from) that isn't away of it?
Isn't the effect of a 'tight' notification the same of a queue?
You are in the risc of doing a single update each time, like with queueing.
Also how do you ensure not to miss a notification, when your task is running already?
// Ryan
So, is this a question or an answer?
Alex K,
There are many ways that you can handle this scenario.
For example, you can use notification triggers in the database, or notify via a pub/sub mechanism.
How you do the notification isn't that important, and worst case scenario, you can fall back to polling.
Rafal,
An answer.
Ryan,
Yes, there is some danger in that. The issue is a balance between freshness and overall speed.
In general, this approach tends to balance itself up pretty well.
Let us assume that the update rate in lower than or equal the processing rate. In that case, this pattern gives you results as fresh as possible.
That is actually a rare scenario, because it indicate low update rates, and the system would basically sleep part of the time, waiting for updates.
If the update rate is greater than the processing rate, it means that the first processing run would pick just the first item, but the second one would pick a full batch, and so on.
Another aspect of this system is that you aren't actually requiring any processing power to handle large updates, since we are don't keep any state for each task, just our current position.
Do you use or have you considered using the TPL? A combination of a framework for queuing combined with the built-in parallel extensions is a nice fit - sorry if this is a little simplistic an answer; just thought I'd throw it in there.
Matt,
It doesn't really solve the problem of how you initiate the process or select the data to process.
Ayende, do you use Lucene to query by "last updated timestamp" or is it some feature built in Raven? If so, could you point to the source code where is it held? Do you have any more optimizations in your DB?
Scooletz,
No, this is part of the storage itself, take a look at GetDocumentsAfter implementation
Comment preview