Differences in Map/Reduce between RavenDB & MongoDB
Ben Foster has a really cool article showing some of the similarities and differences between MongoDB & RavenDB with regards to their map/reduce implementation.
However, there is a very important distinction that was missed. Map/reduce operations are run online in MongoDB, which means that for large collections, map/reduce is going to be very expensive. MongoDB has the option of taking the result of a map/reduce operation and writing it to a collection, so you don't need to run map/reduce jobs all the time. However, that is a snapshot view of the data, not a live view. Ben mentioned that you can do something called incremental map/reduce, but that isn't actually a good idea at all.
Let us look at the following sequence of operations:
    db.items.insert({name: 'oren', ts: 1});
    db.items.insert({name: 'ayende', ts: 2});

    var map = function Map() { emit(this.name, null); };
    var reduce = function(key, val) { return key; };

    db.items.mapReduce(map, reduce, { out: 'distinct_item_names' });
This creates two items and gives me the distinct names in a separate collection. Now, let us see how that works with updates…
    db.items.insert({name: 'eini', ts: 3});

    db.items.mapReduce(map, reduce, { out: { reduce: 'distinct_item_names' }, query: { ts: { $gt: 2 } } });
This is actually nice: mongo is able to merge the previous results with the new results, so you only have to do the work on the new data. But this has several implications:
- You have to keep something like a 'ts' property around to check for new data, and you have to _update_ that ts property on every change.
- You have to run this on a regular basis yourself; mongo won't do that for you (see the sketch after this list).
- It can’t work with deletes.
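To give a feel for what that means in practice, here is a minimal sketch of the kind of glue you end up writing (the mapreduce_state collection and the lastTs bookkeeping are illustrative assumptions, not something MongoDB provides); something external still has to invoke it on a schedule:

    // reuse the map/reduce functions from the example above
    var state = db.mapreduce_state.findOne({ _id: 'distinct_item_names' }) || { lastTs: 0 };

    // only process documents that arrived since the last run
    db.items.mapReduce(map, reduce, {
        out: { reduce: 'distinct_item_names' },
        query: { ts: { $gt: state.lastTs } }
    });

    // remember how far we got, so the next run only sees new documents
    var newest = db.items.find().sort({ ts: -1 }).limit(1).toArray()[0];
    db.mapreduce_state.save({ _id: 'distinct_item_names', lastTs: newest.ts });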
It is the last part that is really painful:
    db.items.remove({name: 'oren'});
Now, there is just no way for you to construct a map/reduce job that would remove the name when it is gone.
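To make that concrete, assuming the collections from the examples above, this is what you are left with after the delete:

    db.items.find({ name: 'oren' });               // no results, the document is gone
    db.distinct_item_names.find({ _id: 'oren' });  // still returns the stale entry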
This sort of thing works very nicely when all you want to do is append data. That is easy. It is a PITA when we are talking about actually using it for live data that can change and be modified.
Contrast that with the map/reduce implementation in RavenDB:
- No need to manually maintain state, the database does it for you.
- No need to worry about updates & deletes, the database does it for you.
- No need to schedule map/reduce job updates, the database does it for you.
- Map/reduce queries are very fast, regardless of data size.
To be frank, the map/reduce implementation in RavenDB is complex, and pretty much all of it comes down to the fact that we don't do stupid stuff like run a map/reduce operation on a large database on every query, and that we support edge case scenarios like data that is actually updated or deleted.
Naturally I'm biased, but it seems to me that trying to use map/reduce in Mongo just means that you have to do a lot of hand holding yourself, while with RavenDB, we take care of everything and leave you free to actually do stuff.
Comments
Really surprised that Mongo doesn't automatically trigger map/reduce runs on document updates and that you have to roll all of that yourself.
Judah, They can't do that. They don't support updates/deletes.
Actually, it's quite simple if you can 'reverse' the mapping operation (for a given key, find all documents matching that key): you just delete the aggregate record with the specified key and run an incremental map-reduce on all matching documents. In today's example, you would delete the aggregate with key='oren' and then run map reduce with a query:
db.items.mapReduce(map,reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'oren' } });
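Spelled out, the full sequence would be something like this (a sketch, reusing the map and reduce functions from the post):

    // drop the aggregate record for the key that disappeared...
    db.distinct_item_names.remove({ _id: 'oren' });
    // ...then re-run the incremental map-reduce over whatever still matches that key
    db.items.mapReduce(map, reduce, { out: { reduce: 'distinct_item_names' }, query: { name: 'oren' } });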
Rafal, Now try to do your suggestion concurrently...
Oh come on, don't change the rules during a game ;) But seriously, I don't think there's a one-size-fits-all solution; even RavenDB's smart incremental method has a worst-case scenario. I'd like to know what it is.
It's worth mentioning that I was able to get the MongoDB map-reduce collections updating automatically (insert/update/delete) by monitoring the MongoDB OpLog (you either need to be running in a replica set or just run a single instance with the --replSet switch).
You can create a tailable cursor (since the OpLog is a capped collection - P.S. capped collections would be a nice addition to RavenDB) and listen for new documents in the OpLog, which can then be used to re-execute an incremental Map-Reduce. I've only done this in node.js, although a few people have had memory issues when attempting it with the C# driver.
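Roughly, the tailing part looks something like this with the node.js driver (a sketch against the current driver API; the database and collection names here are assumptions):

    // Requires a replica set (or a single instance started with --replSet)
    // so that the local.oplog.rs capped collection exists.
    const { MongoClient } = require('mongodb');

    async function tailOplog() {
        const client = await MongoClient.connect('mongodb://localhost:27017');
        const oplog = client.db('local').collection('oplog.rs');

        // Tailable cursor over the capped oplog; awaitData keeps it open,
        // waiting for new entries instead of returning when it reaches the end.
        const cursor = oplog.find(
            { ns: 'test.items' },                    // only ops on the items collection
            { tailable: true, awaitData: true }
        );

        for await (const op of cursor) {
            // op.op is 'i' (insert), 'u' (update) or 'd' (delete); on any change,
            // kick off the incremental map/reduce (or patch the output collection).
            console.log('oplog entry:', op.op, op.o);
        }
    }

    tailOplog().catch(console.error);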
That said, you still need to run this in your own process rather than it being handled by the database natively (which is where it should be, IMO).
For .NET development I've yet to find any compelling reason to move away from RavenDB (other than cost). It would be good to see a comprehensive feature and performance comparison between the two - especially as my new employer currently uses MongoDB and I'd love to sell RavenDB to them.
As I understand it, Couchbase offers incremental map-reduce that automatically supports updates and deletes to underlying documents (though consistency is 'eventual').
http://docs.couchbase.com/couchbase-manual-2.0/#view-operation
Simon, Yes, CouchDB is using a very similar model. I'm not sure what sort of actual implementation behavior they have, but they are better in this regard than MongoDB.
Rafal, the worst case scenario of map/reduce for RavenDB is if it is TOO CONSISTENT. Suppose you have high throughput data collection, such as sensors on a robot. RavenDB will recompute the map/reduce on every single insert, forever. It's possible you could oversaturate IO and create a scenario where the index can never catch up. This would also have severe impacts on the rest of the server. From what was discussed here with Mongo, Mongo could be a better fit, as you could use regular intervals and, say, only recompute the M/R collection every 4 hours.
I would expect you could achieve similar results from Raven using some of its more advanced internals. Likely something along the lines of changing the index priority, perhaps setting it to Abandoned and then back to Normal using a scheduling process.
Chris, Actually, no, that isn't how it works. RavenDB will notice this behavior and switch from a low latency, low throughput mode to a high latency, high throughput one. That means that you'll see batching of operations. And if you only want to run the map/reduce every 4 hours, you can absolutely do that. All you need is to disable indexing on that index. At a later point, you enable it again, and it will catch up.
There is very little point in doing that, though.