<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:copyright="http://blogs.law.harvard.edu/tech/rss" xmlns:image="http://purl.org/rss/1.0/modules/image/">
    <channel>
        <title>Architecture</title>
        <link>http://ayende.com/Blog/category/563.aspx</link>
        <description>Architecture</description>
        <language>en-US</language>
        <copyright>Ayende Rahien</copyright>
        <managingEditor>Ayende@ayende.com</managingEditor>
        <generator>Subtext Version 2.0.0.0</generator>
        <item>
            <title>What is map/reduce for, anyway?</title>
            <link>http://ayende.com/Blog/archive/2010/03/15/what-is-mapreduce-for-anyway.aspx</link>
            <description>&lt;p&gt;Yesterday I gave a &lt;a href="http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx"&gt;visual explanation of map/reduce&lt;/a&gt;, and the question came up about how to handle computing navigation instructions using map/reduce. That made it clear that while (I hope) what map/reduce &lt;em&gt;is&lt;/em&gt; might be clearer, what it is &lt;em&gt;for&lt;/em&gt; is not.&lt;/p&gt;  &lt;p&gt;Map/reduce is a technique aimed at solving a very simple problem: you have a &lt;em&gt;lot&lt;/em&gt; of data and you want to go through it in parallel, probably on multiple machines. The whole idea is that you can crunch through massive data sets in a realistic time frame. In order for map/reduce to be useful, you need several things:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;The calculation that you need to run is one that can be composed. That is, you can run the calculation on a subset of the data, and merge it with the result of another subset.    &lt;ul&gt;     &lt;li&gt;Most aggregation / statistical functions allow this, in one form or another.&lt;/li&gt;   &lt;/ul&gt; &lt;/li&gt;    &lt;li&gt;The final result is &lt;em&gt;smaller&lt;/em&gt; than the initial data set.&lt;/li&gt;    &lt;li&gt;The calculation has no dependencies on external input other than the dataset being processed.&lt;/li&gt;    &lt;li&gt;Your dataset is big enough that splitting it up for independent computations will not hurt overall performance.&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;Now, given those assumptions, you can create a map/reduce job, and submit it to a cluster of machines that would execute it. I am ignoring data locality and failover to make the explanation simple, although they do make for interesting wrinkles in the implementation.&lt;/p&gt;  &lt;p&gt;Map/reduce is &lt;em&gt;not&lt;/em&gt; applicable, however, in scenarios where the dataset alone is not sufficient to perform the operation. 
In the case of the navigation computation example, you can’t really handle this via map/reduce because you lack key data points (the starting and ending points). Trying to compute paths from all points to all other points is probably a losing proposition, unless you have a &lt;em&gt;very&lt;/em&gt; small graph. The same applies if you have a 1:1 mapping between input and output. Oh, map/reduce will still work, but the resulting output is probably going to be too big to be really useful. It also means that you have a simple parallel problem, not a map/reduce sort of problem.&lt;/p&gt;  &lt;p&gt;If you need fresh results, map/reduce isn’t applicable either; it is inherently a batch operation, not an online one. Trying to invoke a map/reduce operation for a user request is going to be very expensive, and not something that you really want to do.&lt;/p&gt;  &lt;p&gt;Another set of problems that you can’t really apply map/reduce to are recursive problems, Fibonacci being the most well known among them. You &lt;em&gt;cannot&lt;/em&gt; apply map/reduce to Fibonacci for the simple reason that you need the previous values before you can compute the current one. That means that you can’t break it apart into sub-computations that can be run independently.&lt;/p&gt;  &lt;p&gt;If your data size is small enough to fit on a single machine, it is probably going to be faster to process it as a single reduce(map(data)) operation than to go through the entire map/reduce process (which requires synchronization). In the end, map/reduce is just a simple paradigm for gaining concurrency; as such, it is subject to the benefits and limitations of all parallel programs. The most basic one of them is &lt;a href="http://en.wikipedia.org/wiki/Amdahl's_law"&gt;Amdahl's law&lt;/a&gt;.&lt;/p&gt;  &lt;p&gt;Map/reduce is a &lt;em&gt;very&lt;/em&gt; hot topic, but you need to realize what it is for. 
It isn’t some magic formula from Google to make things run faster; it is just Select and GroupBy, run over a distributed network.&lt;/p&gt;</description>
            <dc:creator>Ayende Rahien</dc:creator>
            <guid>http://ayende.com/Blog/archive/2010/03/15/what-is-mapreduce-for-anyway.aspx</guid>
            <pubDate>Mon, 15 Mar 2010 10:00:00 GMT</pubDate>
            <wfw:comment>http://ayende.com/Blog/comments/11362.aspx</wfw:comment>
            <comments>http://ayende.com/Blog/archive/2010/03/15/what-is-mapreduce-for-anyway.aspx#feedback</comments>
            <slash:comments>4</slash:comments>
            <wfw:commentRss>http://ayende.com/Blog/comments/commentRss/11362.aspx</wfw:commentRss>
        </item>
        <item>
            <title>Map / Reduce – A visual explanation</title>
            <link>http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx</link>
            <description>&lt;p&gt;Map/Reduce is a term commonly thrown about these days. In essence, it is just a way to take a big task and divide it into discrete tasks that can be done in parallel. A common use case for Map/Reduce is in document databases, which is why I found myself thinking deeply about it.&lt;/p&gt;  &lt;p&gt;Let us say that we have a set of documents with the following form:&lt;/p&gt;  &lt;blockquote&gt;   &lt;pre class="csharpcode"&gt;{
  &lt;span class="str"&gt;"type"&lt;/span&gt;: &lt;span class="str"&gt;"post"&lt;/span&gt;,
  &lt;span class="str"&gt;"name"&lt;/span&gt;: &lt;span class="str"&gt;"Raven's Map/Reduce functionality"&lt;/span&gt;,
  &lt;span class="str"&gt;"blog_id"&lt;/span&gt;: 1342,
  &lt;span class="str"&gt;"post_id"&lt;/span&gt;: 29293921,
  &lt;span class="str"&gt;"tags"&lt;/span&gt;: [&lt;span class="str"&gt;"raven"&lt;/span&gt;, &lt;span class="str"&gt;"nosql"&lt;/span&gt;],
  &lt;span class="str"&gt;"post_content"&lt;/span&gt;: &lt;span class="str"&gt;"&amp;lt;p&amp;gt;...&amp;lt;/p&amp;gt;"&lt;/span&gt;,
  &lt;span class="str"&gt;"comments"&lt;/span&gt;: [
    {
      &lt;span class="str"&gt;"source_ip"&lt;/span&gt;: &lt;span class="str"&gt;"124.2.21.2"&lt;/span&gt;,
      &lt;span class="str"&gt;"author"&lt;/span&gt;: &lt;span class="str"&gt;"martin"&lt;/span&gt;,
      &lt;span class="str"&gt;"text"&lt;/span&gt;: &lt;span class="str"&gt;"..."&lt;/span&gt;
    }
  ]
}&lt;/pre&gt;
  &lt;style type="text/css"&gt;&lt;![CDATA[
.csharpcode {
	background-color: #ffffff; font-family: consolas, "Courier New", courier, monospace; color: black; font-size: small
}
.csharpcode pre {
	background-color: #ffffff; font-family: consolas, "Courier New", courier, monospace; color: black; font-size: small
}
.csharpcode pre {
	margin: 0em
}
.csharpcode .rem {
	color: #008000
}
.csharpcode .kwrd {
	color: #0000ff
}
.csharpcode .str {
	color: #006080
}
.csharpcode .op {
	color: #0000c0
}
.csharpcode .preproc {
	color: #cc6633
}
.csharpcode .asp {
	background-color: #ffff00
}
.csharpcode .html {
	color: #800000
}
.csharpcode .attr {
	color: #ff0000
}
.csharpcode .alt {
	background-color: #f4f4f4; margin: 0em; width: 100%
}
.csharpcode .lnum {
	color: #606060
}]]&gt;&lt;/style&gt;&lt;/blockquote&gt;

&lt;p&gt;And we want to answer a question that spans more than a single document. That sort of operation requires aggregation, and over a large amount of data, that is best done using Map/Reduce to split the work.&lt;/p&gt;

&lt;p&gt;Map / Reduce is just a pair of functions, operating over a list of data. In C#, LINQ actually gives us a great chance to do things in a way that makes it very easy to understand and work with. Let us say that we want to be able to get a count of comments per blog. We can do that using the following Map / Reduce queries:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;pre class="csharpcode"&gt;&lt;font color="#0000ff"&gt;from&lt;/font&gt; post &lt;span class="kwrd"&gt;in&lt;/span&gt; docs.posts
&lt;font color="#0000ff"&gt;select&lt;/font&gt; &lt;span class="kwrd"&gt;new&lt;/span&gt; {
  post.blog_id,
  comments_length = post.comments.length
  };

&lt;font color="#0000ff"&gt;from&lt;/font&gt; agg &lt;span class="kwrd"&gt;in&lt;/span&gt; results
&lt;font color="#0000ff"&gt;group&lt;/font&gt; agg by agg.blog_id &lt;font color="#0000ff"&gt;into&lt;/font&gt; g
&lt;font color="#0000ff"&gt;select&lt;/font&gt; &lt;span class="kwrd"&gt;new&lt;/span&gt; {
  blog_id = g.Key,
  comments_length = g.Sum(x=&amp;gt;x.comments_length)
  };&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are a couple of things to note here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The first query is the map query; it maps each input document into the final format.&lt;/li&gt;

  &lt;li&gt;The second query is the reduce query; it operates over a set of results and produces an answer.&lt;/li&gt;

  &lt;li&gt;Note that the reduce query &lt;em&gt;must&lt;/em&gt; return its result in the same format in which it received its input; the reason will be explained shortly.&lt;/li&gt;

  &lt;li&gt;The first value in the result is the key, which is what we are aggregating on (think of the group by clause in SQL).&lt;/li&gt;
&lt;/ul&gt;
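
&lt;p&gt;To make those notes concrete, here is a minimal sketch of the same pair of queries as a small runnable C# program (C# 9 or later). It is only an illustration, not the API of any particular database: the Post and Row records, the MapReduce class, and the sample data are all hypothetical names and values that I made up for this example.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;pre class="csharpcode"&gt;using System;
using System.Linq;

// Hypothetical documents; only the fields that the queries use are modelled.
var posts = new[]
{
    new Post(1342, new[] { "martin", "ayende" }),
    new Post(1342, new[] { "oren" }),
    new Post(7, new string[0]),
};

// Map: one row per document, already in the shape that reduce consumes.
var mapped = (from post in posts
              select new Row(post.BlogId, post.Comments.Length)).ToArray();

// Reduce: sum the comment counts per blog.
var totals = MapReduce.Reduce(mapped);

foreach (var row in totals)
    Console.WriteLine($"blog {row.BlogId}: {row.CommentsLength} comments");
// prints "blog 1342: 3 comments" and then "blog 7: 0 comments"

record Post(int BlogId, string[] Comments);
record Row(int BlogId, int CommentsLength);

static class MapReduce
{
    // Takes rows and returns rows of the same type, so the output can be reduced again.
    public static Row[] Reduce(Row[] rows)
    {
        return (from row in rows
                group row by row.BlogId into g
                select new Row(g.Key, (from r in g select r.CommentsLength).Sum())).ToArray();
    }
}&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because Reduce both accepts and returns Row values, calling it on its own output is always legal, which is the property the third note is about.&lt;/p&gt;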

&lt;p&gt;Let us see how this works. We start by applying the map query to the set of documents that we have, producing this output:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_2.png"&gt;&lt;img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_thumb.png" width="905" height="411" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to start reducing the results. In real Map/Reduce algorithms, we partition the original input and work toward the final result. In this case, imagine that the output of the first step was divided into groups of 3 (so 4 groups overall), and then the reduce query was applied to each group, giving us:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_4.png"&gt;&lt;img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="image" border="0" alt="image" src="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_thumb_1.png" width="647" height="597" /&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;You can see why it is called reduce: for every batch, we apply a sum by blog_id to get a new Total Comments value. We started with 11 rows, and we ended up with just 10. That is where it gets interesting, because we are still not done; we can still reduce the data further.&lt;/p&gt;

&lt;p&gt;This is what we do in the third step, reducing the data further still. That is why the input and output formats of the reduce query must match: we feed the output of several reduce queries in as the input of a new one. You can also see that we have now moved from having 10 rows to having just 7.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_6.png"&gt;&lt;img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="image" border="0" alt="image" src="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_thumb_2.png" width="632" height="593" /&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;And the final step is:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_8.png"&gt;&lt;img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="image" border="0" alt="image" src="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/MapReduceAvisualexplanation_B89B/image_thumb_3.png" width="623" height="341" /&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;And now we are done; we cannot reduce the data any further because all the keys are unique.&lt;/p&gt;
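
&lt;p&gt;The repeated reduce steps can be sketched as a small runnable program. Again, this is an illustration under assumptions rather than a real implementation: the Row record, the Steps class, and the sample counts are hypothetical and are not the numbers from the diagrams, and Enumerable.Chunk assumes .NET 6 or later.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;pre class="csharpcode"&gt;using System;
using System.Linq;

// Hypothetical output of the map step: (blog id, comment count) rows with made-up values.
var mapped = new[]
{
    new Row(1, 4), new Row(2, 1), new Row(1, 2),
    new Row(3, 5), new Row(2, 3), new Row(1, 1),
    new Row(3, 2), new Row(2, 2),
};

// First reduce step: split the mapped rows into batches of 3 and reduce each batch on its own.
var level = (from batch in mapped.Chunk(3)
             select Steps.Reduce(batch)).ToArray();
Console.WriteLine($"{mapped.Length} rows reduced to {Steps.CountRows(level)}");

// Later steps: merge neighbouring batch outputs in pairs and reduce them again,
// until a single batch (and therefore one row per key) is left.
while (level.Skip(1).Any())
{
    var before = Steps.CountRows(level);
    level = (from pair in level.Chunk(2)
             select Steps.Reduce(Steps.Flatten(pair))).ToArray();
    Console.WriteLine($"{before} rows reduced to {Steps.CountRows(level)}");
}

foreach (var row in Steps.Flatten(level))
    Console.WriteLine($"blog {row.BlogId}: {row.CommentsLength} comments");

record Row(int BlogId, int CommentsLength);

static class Steps
{
    // Same input and output type, which is what allows the output to be fed back in.
    public static Row[] Reduce(Row[] rows)
    {
        return (from row in rows
                group row by row.BlogId into g
                select new Row(g.Key, (from r in g select r.CommentsLength).Sum())).ToArray();
    }

    // Concatenates the rows of several batches into one array.
    public static Row[] Flatten(Row[][] batches)
    {
        return (from batch in batches from row in batch select row).ToArray();
    }

    public static int CountRows(Row[][] batches)
    {
        return Flatten(batches).Length;
    }
}&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;With this sample data the row count goes from 8 to 7, then 5, then 3, and the final rows are 7, 6 and 7 comments for blogs 1, 2 and 3. In a real system each batch could be reduced on a different machine; here everything runs in a single process to keep the sketch short.&lt;/p&gt;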

&lt;p&gt;There is another interesting property of Map / Reduce. Let us say that I just added a comment to a post. That would obviously invalidate the results of the query, right?&lt;/p&gt;

&lt;p&gt;Well, yes, but not &lt;em&gt;all&lt;/em&gt; of them. Assuming that I added a comment to the post whose id is 10, what would I need to do to recalculate the right result?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Map Doc #10 again&lt;/li&gt;

  &lt;li&gt;Reduce Step 2, Batch #3 again&lt;/li&gt;

  &lt;li&gt;Reduce Step 3, Batch #1 again&lt;/li&gt;

  &lt;li&gt;Reduce Step 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What is important is that I did &lt;em&gt;not&lt;/em&gt; have to touch quite a bit of the data, making the recalculation effort far cheaper than it would be otherwise.&lt;/p&gt;
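
&lt;p&gt;Here is a rough sketch of that idea, simplified to two batches and a single combining step instead of the four steps shown above. It assumes that the intermediate reduce results are kept around between runs; the Row record, the Incremental class, and the data are hypothetical.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;pre class="csharpcode"&gt;using System;
using System.Linq;

// Hypothetical mapped rows, already split into two batches (values are made up).
var batches = new[]
{
    new[] { new Row(1, 4), new Row(2, 1), new Row(1, 2) },
    new[] { new Row(3, 5), new Row(2, 3), new Row(1, 1) },
};

// Reduce every batch once and keep the intermediate results around.
var cached = (from batch in batches
              select Incremental.Reduce(batch)).ToArray();
Console.WriteLine(Incremental.Describe(Incremental.Combine(cached)));
// prints "1=7, 2=4, 3=5"

// A comment is added to the second document of batch 1, so that document is mapped again:
// its count goes from 3 to 4.
batches[1][1] = new Row(2, 4);

// Only the batch that contains the changed document is reduced again;
// cached[0] is reused untouched.
cached[1] = Incremental.Reduce(batches[1]);
Console.WriteLine(Incremental.Describe(Incremental.Combine(cached)));
// prints "1=7, 2=5, 3=5"

record Row(int BlogId, int CommentsLength);

static class Incremental
{
    public static Row[] Reduce(Row[] rows)
    {
        return (from row in rows
                group row by row.BlogId into g
                select new Row(g.Key, (from r in g select r.CommentsLength).Sum())).ToArray();
    }

    // Final step: reduce the concatenated outputs of all cached batches.
    public static Row[] Combine(Row[][] parts)
    {
        return Reduce((from part in parts from row in part select row).ToArray());
    }

    public static string Describe(Row[] rows)
    {
        return string.Join(", ", from row in rows select $"{row.BlogId}={row.CommentsLength}");
    }
}&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;The saving grows with the number of batches: however many there are, a single changed document costs one map, one batch reduce per level, and the final combine.&lt;/p&gt;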

&lt;p&gt;And that is (more or less) the notion of Map / Reduce.&lt;/p&gt;</description>
            <dc:creator>Ayende Rahien</dc:creator>
            <guid>http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx</guid>
            <pubDate>Sun, 14 Mar 2010 10:00:00 GMT</pubDate>
            <wfw:comment>http://ayende.com/Blog/comments/11353.aspx</wfw:comment>
            <comments>http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx#feedback</comments>
            <slash:comments>19</slash:comments>
            <wfw:commentRss>http://ayende.com/Blog/comments/commentRss/11353.aspx</wfw:commentRss>
        </item>
        <item>
            <title>Traditional architecture makes me flinch</title>
            <link>http://ayende.com/Blog/archive/2010/02/06/traditional-architecture-makes-me-flinch.aspx</link>
            <description>&lt;p&gt;I just finished drawing the following:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/Traditionalarchitecturemakesmeflinch_23A7/image_2.png"&gt;&lt;img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="image" border="0" alt="image" src="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/Traditionalarchitecturemakesmeflinch_23A7/image_thumb.png" width="626" height="144" /&gt;&lt;/a&gt; &lt;/p&gt;  &lt;p&gt;It makes me feel dirty inside to do so, mostly because I really don’t like or believe in building applications in this manner anymore. I would really like to be able to do this:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/Traditionalarchitecturemakesmeflinch_23A7/image_4.png"&gt;&lt;img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="image" border="0" alt="image" src="http://ayende.com/Blog/images/ayende_com/Blog/WindowsLiveWriter/Traditionalarchitecturemakesmeflinch_23A7/image_thumb_1.png" width="525" height="573" /&gt;&lt;/a&gt; &lt;/p&gt;  &lt;p&gt;Unfortunately, I am talking about another subject in the context where I am showing the first architectural diagram, and I need to present only a single new concept at a time.&lt;/p&gt;</description>
            <dc:creator>Ayende Rahien</dc:creator>
            <guid>http://ayende.com/Blog/archive/2010/02/06/traditional-architecture-makes-me-flinch.aspx</guid>
            <pubDate>Sat, 06 Feb 2010 10:00:00 GMT</pubDate>
            <wfw:comment>http://ayende.com/Blog/comments/11305.aspx</wfw:comment>
            <comments>http://ayende.com/Blog/archive/2010/02/06/traditional-architecture-makes-me-flinch.aspx#feedback</comments>
            <slash:comments>18</slash:comments>
            <wfw:commentRss>http://ayende.com/Blog/comments/commentRss/11305.aspx</wfw:commentRss>
        </item>
    </channel>
</rss>