Ayende @ Rahien

Raven Xyz: Trying out some ideas

One of the things that we are planning for Raven 3.0 is the introduction of additional options. In addition to having RavenDB, we will also have RavenFS, which is a replicated file system with an eye toward very large files. But that isn’t what I want to talk about today. Today I would like to talk about something that is currently just in my head. I don’t even have a proper name for it yet.

Here is the deal: RavenDB is very good for data that you care about individually. Orders, customers, etc. You track, modify and work with each document independently. If you are writing a lot of data that isn’t really relevant on its own, but only as an aggregate, that is probably not a good use case for RavenDB.

Examples of such things include logs, click streams, event tracking, etc. The trivial example would be any reality show, where you have a lot of users sending messages to vote for a particular candidate, and you don’t really care about the individual data points, only the aggregate. Another would be tracking how many items were sold in a particular period, broken down by region.

The API that I had in mind would be something like:

    foo.Write(new PurchaseMade { Region = "Asia", Product = "products/1", Amount = 23 });
    foo.Write(new PurchaseMade { Region = "Europe", Product = "products/3", Amount = 3 });

And then you can write map/reduce statements on them like this:

    // map: project each purchase onto the fields we aggregate on
    // (Product is the property name used in the Write calls)
    from purchase in purchases
    select new
    {
        purchase.Region,
        purchase.Product,
        purchase.Amount
    }

    // reduce: sum the amounts per (Region, Product) pair
    from result in results
    group result by new { result.Region, result.Product }
    into g
    select new
    {
        g.Key.Region,
        g.Key.Product,
        Amount = g.Sum(x => x.Amount)
    }

Yes, this looks pretty much like you would have in RavenDB, but there are important distinctions:

  • We don’t allow modifying writes, nor deleting them.
  • Most of the operations are assumed to be made on the result of the map/reduce statements.
  • The assumption is that you don’t really care for each data point.
  • There are going to be a lot of those data points, and they are likely to be coming in at a relatively high rate.
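
To make those distinctions concrete, here is a minimal in-memory sketch using the reality-show voting example. It assumes a store that exposes only append and read, plus an aggregate that is refreshed separately from the write path. All type and member names (`VoteCast`, `AppendOnlyLog`, `VoteTotals`) are hypothetical, and the sketch is single-threaded and not persistent.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event type for the reality-show example.
public record VoteCast(string Candidate);

// Hypothetical append-only store: entries can be appended and read,
// but there is deliberately no way to modify or delete them.
public sealed class AppendOnlyLog<T>
{
    private readonly List<T> _entries = new();

    public int Count => _entries.Count;

    public void Append(T entry) => _entries.Add(entry);

    // Read entries from a given position onward, in write order.
    public IEnumerable<T> ReadFrom(int position)
    {
        for (var i = position; i < _entries.Count; i++)
            yield return _entries[i];
    }
}

// Keeps per-candidate totals by folding in only the entries appended
// since the last refresh, so no aggregation work happens on the write itself.
public sealed class VoteTotals
{
    private readonly AppendOnlyLog<VoteCast> _log;
    private readonly Dictionary<string, long> _totals = new();
    private int _processedUpTo;

    public VoteTotals(AppendOnlyLog<VoteCast> log) => _log = log;

    public void Refresh()
    {
        foreach (var vote in _log.ReadFrom(_processedUpTo))
        {
            _totals.TryGetValue(vote.Candidate, out var current);
            _totals[vote.Candidate] = current + 1;
            _processedUpTo++;
        }
    }

    // Total as of the last refresh; individual votes are never consulted here.
    public long For(string candidate) =>
        _totals.TryGetValue(candidate, out var total) ? total : 0;
}
```

Readers only ever ask `VoteTotals` for the aggregate, and `Refresh` can run on whatever schedule suits the workload, which keeps appends cheap at a high write rate.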

Thoughts?

Posted By: Ayende Rahien

Comments

Chris Marisic
05/23/2013 01:43 PM by
Chris Marisic

I think this is an excellent idea.

I don't think I would use "Write" as the verb; that could easily confuse users (unless this isn't hanging off the session anyway).

Maybe HighFrequencyWrite? I don't know, I'm struggling for terms here.

Ayende Rahien
05/23/2013 01:45 PM by
Ayende Rahien

Chris, As I said, that isn't something that has been decided, it is all a pretty nebulous concept right now.

Nic Wise
05/24/2013 09:34 AM by
Nic Wise

I very much like the idea, and I'd most likely use it right now if it was available (we do a limited amount of this in Raven already).

Not sure if it would be a feature of Raven, or a new product.... either way, tho...

Graeme Christie
05/24/2013 09:42 AM by
Graeme Christie

Sounds like what you are describing is a Raven event store (as in the store for event sourcing patterns such as CQRS/ES), which I think would be a great idea. Having a Raven event store that could project to a RavenDB for the domain model / read side, using Raven's own publish/subscribe model for consistency, sounds really interesting.

Piers Lawson
05/24/2013 12:03 PM by
Piers Lawson

Nice idea.

Rather than "Write", what about "Record"? And I think the XYZ you record objects in would be a "Log". Logs are well understood as a fast, write-only (i.e. no update), analyse-later concept.

Matt
05/24/2013 12:33 PM by
Matt

What about something along the lines of a materialised view? Every write triggers a function that updates the view?

Jonas
05/24/2013 12:55 PM by
Jonas

This is really exciting! It is something I missed when using RavenDB, and I would use it right away to do analytics. To do these aggregations or queries over large datasets in the past, I’ve been importing data into column databases or running Rhino-ETL jobs to aggregate data, which is very tedious. I could actually see a use for drilling down to see what data points an aggregate is built on.

Karhgath
05/24/2013 12:58 PM by
Karhgath

This is a great idea: RavenES (Event Store? RavenStream?), where you can write and read streams of data related to one Id (ContextId? StreamId? - Log file, Aggregate Root, GPS coordinates, etc.) and aggregate the values in map/reduce. Each item related to an Id has a Revision/Sequence, and it is a read-only, forward-only stream you can access. You could also access substreams (let's say log entries for a specific day, or an aggregate root's events up to a specific revision), but always in order.

What would be cool is if you could easily do an IEnumerable.Aggregate on a stream and it would run server side (for example, rebuild an Aggregate Root from an event stream), or even better, run an aggregation and write the result to RavenDB as a document, something like CreateDocumentFromStream? For logs it would be building a stats document, for GPS location maybe an itinerary, etc.

DaveNay
05/24/2013 01:09 PM by
DaveNay

I think this is a fantastic idea! This use case is exactly why we ended up not using RavenDB in our application. We need to log lots of information quickly, and then perform off-line ad-hoc queries against that data for statistical data regarding production runs.

Khalid Abuhakmeh
05/24/2013 02:13 PM by
Khalid Abuhakmeh

I like the idea, but inevitably people are going to be curious as to how they got a certain result. This means they'll want to dive into smaller subsections of the overall stream. The smallest subset would obviously be one document / item.

The idea is solid, but the execution will be more important.

Ayende Rahien
05/24/2013 02:36 PM by
Ayende Rahien

Nic, We are thinking about making this a separate product.

Ayende Rahien
05/24/2013 02:37 PM by
Ayende Rahien

Piers, Yes, we had a bunch of discussions about this, and I think that might very well be what we end up calling this. Raven Log, and the method would be Append, or something like that.

Ayende Rahien
05/24/2013 02:38 PM by
Ayende Rahien

Matt, That is why I had the map/reduce there. Note that I dislike doing things on the write, better to do that in an async manner.

Ayende Rahien
05/24/2013 02:40 PM by
Ayende Rahien

Karhgath, That is pretty much what we had in mind there, yes. The aggregation is meant to be done in the map/reduce.

Ayende Rahien
05/24/2013 02:40 PM by
Ayende Rahien

Dave, I am not sure about ad hoc queries, that is something that is generally expensive :-)

Ayende Rahien
05/24/2013 02:41 PM by
Ayende Rahien

Khalid, You could get to the individual item, sure. But the question is why, and what you would do with them.

DaveNay
05/24/2013 02:45 PM by
DaveNay

"Ad-Hoc" might be a little too liberal of a term. We have well defined "types" of statistics that we need extracted, but the time-date range is what can shift (i.e. I need a report for last month, last week, last shift, etc)

Khalid Abuhakmeh
05/24/2013 02:56 PM by
Khalid Abuhakmeh

I guess the better question is what this product will allow you to do?

  1. Will it let you see the evolution toward the final result? You could do this if you had another mechanism for snapshots based on a frequency set in the map/reduce definition. This gives the developer the ability to set up some form of historical context for their data.

ex. On Monday we saw that we were up 20% from Tuesday. (Graphs).

  2. Will it let you see only the final result?

You could do this if you implemented it with snapshots, or without. Implementing it without snapshots would mean you would only ever know the final result.

Time is the context here, and you can either choose to say all results are in the present or embrace time into the architecture. You could do snapshots for the user, or let the user query and save snapshots into another system (RavenDB?) based on their own approaches: scripting, C# client, Ruby, etc.

This system would be perfect for the MarkedUp team (markedup.com). Maybe you should reach out to them and get their thoughts.

Jason
05/24/2013 04:15 PM by
Jason

This sounds similar to EventStore. Rob Ashton did a recent blog series on using it. http://codeofrob.com/entries/playing-with-the-eventstore.html

How would this new product differ from EventStore?

Luke
05/25/2013 02:03 AM by
Luke

I've been dying for something like this. Would happily buy it yesterday.

Rafal
05/25/2013 07:21 PM by
Rafal

Did you just invent a bloated rrdtool?

Matt Johnson
05/26/2013 05:45 AM by
Matt Johnson

This reminds me of my sensors sample. https://github.com/mj1856/RavenSensors. I'll echo the others by saying that time is of the essence. One thing Raven isn't good at is querying data over an arbitrary time range. You have to predetermine the granularity of the buckets. If you can improve on this in any way, it would be a big deal.

Ayende Rahien
05/26/2013 08:24 AM by
Ayende Rahien

Matt, Arbitrary time ranges are problematic, mostly because they mean that you have to process the entire date range to get something done.

Quinton
05/27/2013 05:41 AM by
Quinton

Any guesstimate when the product will be available?

I actually like the "write" verb, as the record is being written.

Ayende Rahien
05/27/2013 11:00 AM by
Ayende Rahien

Quinton, This is probably going to be in Raven 3.0

Alex
06/20/2013 01:43 PM by
Alex

Any idea when Raven 3.0 will be available? Even as a beta version?

Ayende Rahien
06/20/2013 05:03 PM by
Ayende Rahien

Alex, RavenDB 3.0 is scheduled for Q1 2014

Comments have been closed on this topic.