Dealing with massively distributed data flows

time to read 4 min | 610 words

imageImagine that you are the owner of Gary’s Shoes, and that you want to get data from all of your multitudes of stores into a centralized location. You’ll use that data to make decisions, predict future trends, etc. Given that each store must operate independently, you have a server in each location that will push up it changes (and get updates from) the HQ cluster. You can see an example of this kind of setup in this post.

This work quite well, but it does require the user to be aware of a potential issue. When you have a massively distributed data flow process setup, you need to also pay attention for the quite in the noise. What do I mean by that?

One of our customers have RavenDB deployed to tens of thousands of locations worldwide. At any given time, you are going to have at least some of those locations unavailable. In some locations, part of closing down for the day means literally flipping the master switch on electricity for the entire building. On others, you might have someone tripping over the router or have some local or regional network outage.

Part of the strategy for dealing with such a data set, coming from so many separate locations, is the need to monitor when we aren’t getting data. The fact that on most of our locations we have near real time data is very powerful for the business. But you also need to see where you aren’t getting the data from and setup proper alerts and monitoring for the missing data. From a business perspective, it is also advisable to surface that kind of detail all the way to the user. If you are going to be ordering inventory for the stores in a particular state, but the two major stores in the area are down because of a network issue and has been down for two days now, you want to be aware of that and figure out that you are working on out of date data.

To be honest, the issues isn’t so much about two days of lag in the case of once in blue moon type of error. In the scenario outlined above, in pretty much all business scenarios that I can think of, you won’t really see any impact on the decision making of the organization.

The killer is when you have some sort of a problem that goes on for a while. A DNS update that was missed because of bad DNS cache policy, for example. Now your updates to HQ go into the void in a consistent basis. On the other hand, everything else continue to function properly both locally and for HQ. If this isn’t accounted for, it is easy to miss this for a long period of time. I have seen such a case that was only discovered when the year’s end numbers didn’t quite match up what they were supposed to. Given that this was the second year in a row this happened, the investigation found that some network issue indeed cause a very long term topology failure. This was actually properly reported, in a log file that no one ever read.

Lesson learned, make sure that part of your data flow strategy accounts for such things and bring them to the users’ attention. Actually resolving the issue was a network configuration change that took minutes and the entire dataset was synchronized within a few hours afterward. But finding out that there was even a problem took effectively forever.