Raven Xyz: Trying out some ideas

time to read 8 min | 1426 words

One of the things that we are planning for Raven 3.0 is the introducing of additional options. In addition to having RavenDB, we will also have RavenFS, which is a replicated file system with an eye toward very large files. But that isn’t what I want to talk about today. Today I would like to talk about something that is currently just in my head. I don’t even have a proper name for it yet.

Here is the deal, RavenDB is very good for data that you care about individually. Orders, customers, etc. You track, modify and work with each document independently. If you are writing a lot of data that isn’t really relevant on its own, but only as an aggregate, that is probably not a good use case for RavenDB.

Examples for such things include logs, click streams, event tracking, etc. The trivial example would be any reality show, where you have a lot of users sending messages to vote for a particular candidate, and you don’t really care for the individual data points, only the aggregate. Other things might be to want to track how many items were sold in a particular period based on region, etc.

The API that I had in mind would be something like:

   1: foo.Write(new PurchaseMade { Region = "Asia", Product = "products/1", Amount = 23 } );
   2: foo.Write(new PurchaseMade { Region = "Europe", Product = "products/3", Amount = 3 } );

And then you can write map/reduce statements on them like this:

   1: // map
   2: from purchase in purchases
   3: select new
   4: {
   5:     purchase.Region,
   6:     purchase.Item,
   7:     purchase.Amount
   8: }
  10: // reduce
  11: from result in results
  12: group result by new { result.Region, result.Item }
  13: into g
  14: select new
  15: {
  16:     g.Key.Region,
  17:     g.Key.Item,
  18:     Amount = g.Sum(x=>x.Amount)
  19: }

Yes, this looks pretty much like you would have in RavenDB, but there are important distinctions:

  • We don’t allow modifying writes, nor deleting them.
  • Most of the operations are assumed to be made on the result of the map/reduce statements.
  • The assumption is that you don’t really care for each data point.
  • There is going to be a lot of those data points, and they are likely to be coming in at a relatively high rate.