What is map/reduce for, anyway?

Mar 15 2010

What is map/reduce for, anyway?

time to read 3 min | 600 words

Yesterday I gave a visual explanation about map/reduce, and the question came up about how to handle computing navigation instructions using map/reduce. That made it clear that while (I hope) what map/reduce is might be clearer, what it is for is not.

Map/reduce is a technique aimed to solve a very simple problem, you have a lot of data and you want to go through it in parallel, probably on multiple machines. The whole idea with the concept is that you can crunch through massive data sets in realistic time frame. In order for map/reduce to be useful, you need several things:

The calculation that you need to run is one that can be composed. That is, you can run the calculation on a subset of the data, and merge it with the result of another subset.

Most aggregation / statistical functions allow this, in one form or another.

The final result is smaller than the initial data set.
The calculation has no dependencies external input except the dataset being processed.
Your dataset size is big enough that splitting it up for independent computations will not hurt overall performance.

Now, given those assumptions, you can create a map/reduce job, and submit it to a cluster of machines that would execute it. I am ignoring data locality and failover to make the explanation simple, although they do make for interesting wrinkles in the implementation.

Map/reduce is not applicable, however, in scenarios where the dataset alone is not sufficient to perform the operation. In the case of the navigation computation example, you can’t really handle this via map/reduce because you lack key data point (the starting and ending points). Trying to computing paths from all points to all other points is probably a losing proposition, unless you have a very small graph. The same applies if you have a 1:1 mapping between input and output. Oh, Map/Reduce will still work, but the resulting output is probably going to be too big to be really useful. It also means that you have a simple parallel problem, not a map/reduce sort of problem.

If you need fresh results, map/reduce isn’t applicable either, it is an inherently a batch operation, not an online one. Trying to invoke map/reduce operation for a user request is going to be very expensive, and not something that you really want to do.

Another set of problems that you can’t really apply map/reduce to are recursive problems. Fibonacci being the most well known among them. You cannot apply map/reduce to Fibonacci for the simple reason that you need the previous values before you can compute the current one. That means that you can’t break it apart to sub computations that can be run independently.

If you data size is small enough to fit on a single machine, it is probably going to be faster to process it as a single reduce(map(data)) operation, than go through the entire map/reduce process (which require synchronization). In the end, map/reduce is just a simple paradigm for gaining concurrency, as such it is subject to the benefits and limitations of all parallel programs. The most basic one of them is Amdahl's law.

Map/reduce is a very hot topic, but you need to realize what it is for. It isn’t some magic formula from Google to make things run faster, it is just Select and GroupBy, run over a distributed network.

Tweet Share Share 4 comments

Tags:

Architecture

Comments

15 Mar 2010
16:32 PM

j23tom

M/R is unusable in MS windows world - the main advantage is they have cheap hardware and cheap OS (Linux). Forget it if you r .NET developer ... or use Mono :-)

16 Mar 2010
09:05 AM

david

There is a good paper on this, covering the problem of merging, and Quoting from the paper:

"This new Map-Reduce-Merge programming model retains Map-Reduce’s many great features, while adding relational algebra to the list of database principles it upholds. It also contains several configurable components that enable many data-processing patterns."

Google "Map-reduce-merge: simplified relational data processing on large clusters " for details.

Its agood read!

16 Mar 2010
09:05 AM

schlachtzeuger

one of your better posts in the last time. i like it

16 Mar 2010
09:09 AM

Dennis

j23tom, Why so?

In the grand scheme of things, a windows license is not that big an expense. Sure, if you have 1000+ machines it is. But this algorithm is equally usable on 5-10 machines, and it will allow you to scale out instead of having to scale up.

I do wish that they had made better support for it, so that it was possible to just do out of the box. But I guess it is not that easy.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB