Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

On Hadoop


Yesterday or the day before that I read the available chapters for Hadoop in Action. Hadoop is a Map Reduce implementation in Java, and it includes some very interesting ideas.

The concept of Map Reduce isn't new, but I liked seeing the actual code examples, which made it so much easier to follow what is actually going on. As usual, an In Action book has a lot of stuff in it that relates to getting things started, and since I don't usually work in Java, they were of little interest to me. But the core ideas are very interesting.

It does seem to be limited to a very small set of scenarios: those that need, in essence, to index large sets of data. Some of the examples in the book made sense as theoretical problems, but I think that I am still missing the concrete "order to cash" scenario, seeing how we take a business problem and turn it into a set of technical challenges that can be resolved by using Map Reduce for some part of the problem.
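To make the "index large sets of data" case concrete, here is a minimal sketch of building an inverted index with a map step and a reduce step. It is plain, single-process Python, not Hadoop's actual API and not code from the book; the names `map_doc`, `reduce_postings` and `run_map_reduce`, as well as the sample documents, are made up for illustration.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit a (word, doc_id) pair for every word in one document."""
    for word in text.lower().split():
        yield word, doc_id

def reduce_postings(word, doc_ids):
    """Reduce step: collapse all doc ids seen for a word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def run_map_reduce(docs):
    """Single-process stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, value in map_doc(doc_id, text):
            grouped[word].append(value)
    return dict(reduce_postings(word, ids) for word, ids in grouped.items())

# Hypothetical sample input: document id -> document text.
docs = {
    "doc1": "hadoop is a map reduce implementation",
    "doc2": "map reduce is not a new concept",
}
index = run_map_reduce(docs)
```

In a real cluster the grouping in the middle is what the framework does for you across machines; the map and reduce functions are the only parts you would write yourself.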

As I said, only the first 4 chapters are currently available, and I was reading the early access version, so it is likely that this will be addressed as more chapters come in.


Comments

Sasha Goldshtein

You might want to take a look at DryadLINQ (research.microsoft.com/en-us/projects/DryadLINQ/). It is a framework that extends LINQ to the Dryad distributed execution environment. Basically you write LINQ queries (including action queries) and they are automatically distributed to a cluster.

Chris Patterson

Hadoop, to me at least, is more than just a MR implementation.

Hadoop includes a number of useful subsystems, including HDFS (the Hadoop Distributed File System). HDFS is distributed, replicated storage that feeds the splitting/grouping parts of the MR process.

I've been looking at HDFS as a purely low-tech way of doing long-term document storage. Since all of the documents are identified by a key, quick retrieval is easy and the data is replicated across cheap machines. Since I could then build access methods on top, using MR to get at the data and filter/query the contents, the infrequent projections of data into some sort of document list/report would be easy to build.

I've been spending more time in Java the past few weeks, and it has been nice to just pull down an open source project and use it instead of constantly thinking "Okay, now this is how they did it in Java, maybe I should port it to .NET".

Mind you, I'm not a convert away from .NET, I just see a thriving ecosystem of Java open source projects that are helping us get things done without a lot of pain.

pb

I think most order to cash scenarios don't involve a cluster of processing (though they may be load balanced to some degree), which is why you don't see too many examples like that. The kinds of problems Google has to solve are very different from most business problems. Unless the business scenario involves huge amounts of data that can't be represented in the normal ways, I think you're unlikely to really need all that, and the standard stuff will work fine.

Ayende Rahien

pb,

My point was, I want to see the reasons for why you would do that.

Not how you do it, but what you are doing.
