Raven Xyz: Trying out some ideas
One of the things that we are planning for Raven 3.0 is the introduction of additional options. In addition to RavenDB, we will also have RavenFS, a replicated file system with an eye toward very large files. But that isn’t what I want to talk about today. Today I would like to talk about something that is currently just in my head; I don’t even have a proper name for it yet.
Here is the deal: RavenDB is very good for data that you care about individually. Orders, customers, etc. You track, modify and work with each document independently. If you are writing a lot of data that isn’t really relevant on its own, but only as an aggregate, that is probably not a good use case for RavenDB.
Examples of such things include logs, click streams, event tracking, etc. The trivial example would be any reality show, where you have a lot of users sending messages to vote for a particular candidate, and you don’t really care about the individual data points, only the aggregate. Another example would be tracking how many items were sold in a particular period, broken down by region.
The API that I had in mind would be something like:
foo.Write(new PurchaseMade { Region = "Asia", Product = "products/1", Amount = 23 });
foo.Write(new PurchaseMade { Region = "Europe", Product = "products/3", Amount = 3 });
And then you can write map/reduce statements on them like this:
// map
from purchase in purchases
select new
{
    purchase.Region,
    purchase.Product,
    purchase.Amount
}

// reduce
from result in results
group result by new { result.Region, result.Product } into g
select new
{
    g.Key.Region,
    g.Key.Product,
    Amount = g.Sum(x => x.Amount)
}
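For illustration only, the semantics of that reduce step are a plain group-by-and-sum. Here is a minimal Python sketch of what the server would compute; the field names follow the PurchaseMade example above, and nothing here is an actual Raven API:

```python
from collections import defaultdict

def reduce_purchases(mapped):
    # Group each mapped entry by the composite (Region, Product) key
    # and sum the amounts, mirroring the LINQ group/Sum pair.
    totals = defaultdict(int)
    for entry in mapped:
        key = (entry["Region"], entry["Product"])
        totals[key] += entry["Amount"]
    return [
        {"Region": region, "Product": product, "Amount": amount}
        for (region, product), amount in sorted(totals.items())
    ]

purchases = [
    {"Region": "Asia", "Product": "products/1", "Amount": 23},
    {"Region": "Europe", "Product": "products/3", "Amount": 3},
    {"Region": "Asia", "Product": "products/1", "Amount": 7},
]
print(reduce_purchases(purchases))
```

The point of the sketch is that only the grouped totals survive; the individual data points are not meant to be read back.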
Yes, this looks pretty much like you would have in RavenDB, but there are important distinctions:
- We don’t allow modifying or deleting writes.
- Most of the operations are assumed to be made on the result of the map/reduce statements.
- The assumption is that you don’t really care for each data point.
- There are going to be a lot of these data points, and they are likely to come in at a relatively high rate.
Thoughts?
Comments
I think this is an excellent idea.
I don't think I would use "Write" as the verb; that could easily confuse users (unless this isn't hanging off the session anyway).
Maybe HighFrequencyWrite? I don't know, I'm struggling for terms here.
Chris, As I said, that isn't something that has been decided; it is all a pretty nebulous concept right now.
I very much like the idea, and I'd most likely use it right now if it were available (we do a limited amount of this in Raven already)
Not sure if it would be a feature of Raven, or a new product.... either way, tho...
Sounds like what you are describing is a raven event store... (As in the store for event sourcing patterns such as cqrs/es) Which I think would be a great idea... Having a raven event store that could project to a raven db for the domain model/ read side ... Using ravens own publish/subscribe model for consistency sounds really interesting ...
Nice idea.
Rather than "Write", what about "Record"? And I think the XYZ you record objects in would be a "Log". Logs are well understood as a fast, write-only (i.e. no update), analyse-later concept.
What about something along the lines of a materialised view? Every write triggers a function that updates the view?
This is really exciting! something I missed using RavenDB and would use it right away to do analytics. To do these aggregations or queries over large datasets in the past I’ve been importing data into column databases or running Rhino-ETL jobs to aggregate data, very tedious. I could actually see a use for drilling down to see what data points an aggregate is built on.
This is a great idea, RavenES (Event Store? RavenStream?) where you can write and read streams of data related to one Id (ContextId? StreamId? - Log file, Aggregate Root, GPS coordinates, etc.) and aggregate the values in map/reduce. Each item related to an Id has a Revision/Sequence, and it is a read-only, forward-only stream you can access. You could also access substreams (let's say log entries for a specific day, or an aggregate root's events up to a specific revision) but always in order.
What would be cool is if you could easily do an IEnumerable.Aggregate on a stream and it would run server side (for example, rebuild an Aggregate Root from an event stream), or even better run an aggregation and write the result to RavenDB as a document, something like CreateDocumentFromStream? For logs it would be building a stats document, for GPS location maybe an itinerary, etc.
I think this is a fantastic idea! This use case is exactly why we ended up not using RavenDB in our application. We need to log lots of information quickly, and then perform off-line ad-hoc queries against that data for statistical data regarding production runs.
I like the idea, but inevitably people are going to be curious as to how they got a certain result. This means they'll want to dive into smaller subsections of the overall stream. The smallest subset would obviously be one document / item.
The idea is solid, but the execution will be more important.
Nic, We are thinking about making this a separate product.
Piers, Yes, we have had a bunch of discussions about this, and I think that might very well be what we end up calling this. Raven Log, and the method would be Append, or something like that.
Matt, That is why I had the map/reduce there. Note that I dislike doing things on the write, better to do that in an async manner.
Karhgath, That is pretty much what we had in mind there, yes. The aggregation is meant to be done in the map/reduce.
Dave, I am not sure about ad hoc queries, that is something that is generally expensive :-)
Khalid, You could get to the individual item, sure. But the question is why / what you would do with them.
"Ad-Hoc" might be a little too liberal of a term. We have well defined "types" of statistics that we need extracted, but the time-date range is what can shift (i.e. I need a report for last month, last week, last shift, etc)
I guess the better question is what this product will allow you to do?
ex. On Monday we saw that we were up 20% from Tuesday. (Graphs).
You could do this if you implemented it with snapshots, or without. Implementing it without snapshots would mean you would only ever know the final result.
Time is the context here, and you can either choose to say all results are in the present or embrace time into the architecture. You could do snapshots for the user, or let the user query and save snapshots into another system (RavenDB?) based on their own approaches: scripting, C# client, Ruby, etc.
This system would be perfect for the MarkedUp team (markedup.com). Maybe you should reach out to them and get their thoughts.
This sounds similar to EventStore. Rob Ashton did a recent blog series on using it. http://codeofrob.com/entries/playing-with-the-eventstore.html
How would this new product differ from EventStore?
I've been dying for something like this. Would happily buy it yesterday.
Did you just invent a bloated rrdtool?
This reminds me of my sensors sample. https://github.com/mj1856/RavenSensors. I'll echo the others by saying that time is of the essence. One thing Raven isn't good at is querying data over an arbitrary time range. You have to predetermine the granularity of the buckets. If you can improve on this in any way, it would be a big deal.
Matt, Arbitrary time ranges are problematic, mostly because they mean that you have to process the entire date range to get something done.
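To make the bucketing constraint in this exchange concrete, here is a hedged Python sketch (not a Raven API; all names are illustrative) of aggregating events into buckets whose granularity is fixed up front. A range query can then only be answered at bucket resolution, which is exactly why arbitrary ranges are problematic:

```python
from collections import defaultdict
from datetime import datetime

def bucket_key(ts: datetime) -> datetime:
    # Truncate the timestamp to the start of its hourly bucket.
    # The granularity is fixed when the aggregation is defined.
    return ts.replace(minute=0, second=0, microsecond=0)

def aggregate(events):
    # events: iterable of (timestamp, amount) pairs -> per-bucket totals.
    buckets = defaultdict(int)
    for ts, amount in events:
        buckets[bucket_key(ts)] += amount
    return dict(buckets)

def range_total(buckets, start, end):
    # Sums whole buckets; a range that cuts through the middle of a
    # bucket cannot be answered exactly from the aggregate alone.
    return sum(v for k, v in buckets.items() if start <= k < end)
```

Anything finer than the chosen bucket size would require reprocessing the raw data points, which is the expensive path the post is trying to avoid.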
Any guesstimate of when the product will be available?
I actually like the "write" verb, as the record is being written.
Quinton, This is probably going to be in Raven 3.0
Any idea when Raven 3.0 will be available? Even as a beta version?
Alex, RavenDB 3.0 is scheduled for Q1 2014