Designing a document database: Aggregation Recalculating
One of the more interesting problems with document databases is views, and in particular how we are going to implement views that contain aggregation. In my previous post, I discussed the way we will probably expose this to the users. But it turns out that there are significant challenges in actually implementing the feature itself, not just in the user-visible parts.
For projection views, the problem is actually very simple: when a document is updated or removed, all we have to do is delete the old view items and, if applicable, create new ones.
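As a minimal sketch of that path (the view/store API names here, delete_view_items, project and put_view_item, are hypothetical, not part of the actual design):

```python
def on_document_changed(view, doc_id, new_doc):
    # Drop whatever this document previously contributed to the view.
    view.delete_view_items(doc_id)
    # If the document still exists (None here means it was removed), re-project it.
    if new_doc is not None:
        for item in view.project(new_doc):
            view.put_view_item(doc_id, item)
```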
For aggregation views, the problem is much harder, mostly because it is not clear what the result of adding, updating or removing a document may be. As a reminder, here is how we plan on exposing aggregation views to the user:
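Leaving the exact syntax from the previous post aside, here is a rough, purely illustrative Python sketch of a hypothetical aggregation view that counts comments per blog. The domain is made up; what matters is the contract: reduce emits objects of the same shape that map emits, so its output can later be fed back into reduce.

```python
def map_doc(doc):
    # One mapped item per document; blog_id / comment_count are made-up fields.
    yield {"blog_id": doc["blog_id"], "comment_count": len(doc["comments"])}

def reduce_items(items):
    # Group by blog_id and sum the counts; the output shape matches map's output.
    totals = {}
    for item in items:
        totals[item["blog_id"]] = totals.get(item["blog_id"], 0) + item["comment_count"]
    return [{"blog_id": key, "comment_count": value} for key, value in totals.items()]
```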
Let us inspect this from the point of view of the document database. Let us say that we already have 100,000 documents, and we introduce this view. A background process is going to kick off to transform the documents using the view definition.
The process, in its simplest form, goes like this: map every document using the view definition, then reduce all of the mapped items into the final result.
Note that this is a serial process, which isn't really useful in the real world. Let us see why. I want to add a new document to the system; how am I going to update the view? Well… an easy option would be to simply re-run the whole map/reduce process over all the documents.
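Something like this sketch, reusing the hypothetical map_doc/reduce_items functions from above:

```python
def rebuild_view(all_docs):
    # Re-map every document and reduce the whole thing from scratch.
    mapped = [item for doc in all_docs for item in map_doc(doc)]
    return reduce_items(mapped)
```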
I think you can agree with me that this is not a really good thing to do from a performance perspective. Luckily for us, there are other alternatives. A more accurate representation of the process would be:
We run the map/reduce process in parallel, producing a lot of separate reduced data points. Now we can do the following:
We take the independent reduced results and run a re-reduce process on them. That is why we have the limitation that map & reduce must return objects of the same shape: it lets us use reduce on data that came from map or from reduce, without caring where it came from.
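As a sketch of that idea (the batch size and the use of a process pool are arbitrary choices of the sketch, not part of the design):

```python
from concurrent.futures import ProcessPoolExecutor

def reduce_batch(batch):
    # Map and reduce one batch of documents into a single partial result.
    return reduce_items([item for doc in batch for item in map_doc(doc)])

def build_view(all_docs, batch_size=1000):
    batches = [all_docs[i:i + batch_size] for i in range(0, len(all_docs), batch_size)]
    with ProcessPoolExecutor() as pool:
        # Independent reduced data points, produced in parallel.
        partials = list(pool.map(reduce_batch, batches))
    # Re-reduce the partial results into the final result.
    final = reduce_items([item for partial in partials for item in partial])
    return batches, partials, final
```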
This also means that adding a document is a much easier task. All we need to do is run the map/reduce process on the new document alone. That gives us a single reduced result, and now we can generate the final result very easily:
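In sketch form, reusing the functions above:

```python
def add_document(final_result, new_doc):
    # Map/reduce just the new document...
    new_part = reduce_items(list(map_doc(new_doc)))
    # ...then re-reduce it together with the stored final result.
    return reduce_items(final_result + new_part)
```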
All we have to do is run the reduce on the existing final result and the new result. The answer from that would be identical to the answer from running the full process on all the documents. Things get more interesting, however, when we talk about document update or document removal. Since an update is just a special case of an atomic document removal and addition, I am going to talk about document removal only.
Removing a document invalidates the final aggregation result, but it doesn't necessarily require recalculating the whole thing from scratch. Do you remember the partial reduce results that we mentioned earlier? Those are not only useful for parallelizing the work; they are also very useful in this scenario. Instead of discarding them when we are done with them, we are going to save them as well. They wouldn't be exposed to the user in any way, but they are persisted, and they are going to be useful when we need to recalculate. The fun thing about them is that we don't really need to recalculate everything. All we have to do is recalculate the batch that the removed document resided in, without that document. Once we have the new batch result, we can reduce the whole thing to a final result again.
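A sketch of that recalculation, using the persisted batches and partial results returned by build_view above (the document "id" field is an assumption of the sketch):

```python
def remove_document(batches, partials, batch_index, removed_id):
    # Recalculate only the batch that contained the removed document.
    batches[batch_index] = [doc for doc in batches[batch_index] if doc["id"] != removed_id]
    partials[batch_index] = reduce_batch(batches[batch_index])
    # Re-reduce all of the partial results into a new final result.
    return reduce_items([item for partial in partials for item in partial])
```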
I am guessing that this is going to be… a challenging task to build, but from a design perspective, it looks pretty straightforward.
More posts in "Designing a document database" series:
- (17 Mar 2009) What next?
- (16 Mar 2009) Remote API & Public API
- (16 Mar 2009) Looking at views
- (15 Mar 2009) View syntax
- (14 Mar 2009) Aggregation Recalculating
- (13 Mar 2009) Aggregation
- (12 Mar 2009) Views
- (11 Mar 2009) Replication
- (11 Mar 2009) Attachments
- (10 Mar 2009) Authorization
- (10 Mar 2009) Concurrency
- (10 Mar 2009) Scale
- (10 Mar 2009) Storage
Comments
Okay, map-reduce is very spectacular and appealing, but can you please describe some real-world problem solved using map-reduce on documents? In typical business applications you usually perform operations on single entities and don't aggregate them. Aggregation is usually done when reporting and involves separate report database or OLAP system. I think map-reduce can be used for indexing document data - is it the main reason why you are writing about it?
Rafal,
Reporting scenarios are a major consideration, certainly.
But it is not just that, there are numerous reasons to want to be able to do aggregation in most systems.
Look at the right side of the blog; do you see the category list and the monthly list? Those are aggregations.
In many scenarios, it is important to be able to do so as efficiently as possible.
Leaving that aside, a good reporting story is pretty important, don't you think?
I have a possible scenario of having to handle lots of small databases, mostly with reports on them.
You're right, RDBMS-based systems usually have problems with data aggregation; that's why we're using separate report databases for larger applications. Aggregations done in a transactional system are too heavy for the database server; such systems also usually don't cache query results or partial results, and they perform the aggregation each time data is requested. So map-reduce with automatic caching of partial results would help in such cases. Example: a task management system where each user and group of users has its own 'inbox' for keeping a todo list, and each user has their own dashboard with statistics. If you want to calculate statistics for each logged-in user based on raw transactional data, you'll probably kill the database server.
"...a challenging task to build, but from design perspective, it looks pretty straightforward." I think implementation is the main problem here. Failure at reducing node during calculation and so on. But the idea looks good, thank you for the post :) We have interesting discussion about it at Friday :)
Rafal,
That is why I specified that the aggregation is done as part of a background process.
That way, you can keep serving requests while maintaining the performance of the server.
Evgeny,
Yes, the implementation would be challenging, but not complex, just hard.
Is this the map-reduce algorithm used by Google?
What data would you return to the user while aggregation is being done?
For those like me who aren't as familiar with MapReduce:
http://labs.google.com/papers/mapreduce.html
http://en.wikipedia.org/wiki/MapReduce
I only read the google page because it made sense to me after that, but the wiki page looks like it covers a little more detail.
configurator,
That is a great question, I don't really know.
You also have duplication. If you want to be able to read duplicated data for added efficiency rather than just keeping it as backups, you might decide not to record which copy of the data is considered real -- maybe all copies are the real copy.
But when you want to do this kind of map/reduce thing, you need to know whether to include this entry in the results, and duplicates should be excluded.
This means, though, that when a node goes down, you have to discover that fact, and select another node that contains a copy of its non-duplicate data to replace it.
The alternative is to write your queries in such a way that duplicates can be resolved by the client, but that really isn't the client's concern, and it's inefficient.
@Ayende
"Yes, the implementation would be challenging, but not complex, just hard. "
I agree with complex but not hard. I often tell my clients exactly that.
Programming should never be measured in simple or hard. The variable of time is much more useful and as time holds all solutions, programmers simply have time to solve an issue or not.
Length of time determines cost and whether the solution can afford to be found.
Chris,
Right now, I am not yet considering how to actually get the entire document/view space distributed; it looks much easier to simply replicate things with a sharding algorithm.