
We got an interesting question in the RavenDB discussion group: how to do aggregation on a tree structure?

The task is to build a Work Breakdown Structure, where you have:

  • Projects
  • Major deliverables
  • Sub-deliverables
  • Work packages

The idea is to be able to track EstimatedHours and CompletedHours across the entire tree. For example, let’s say that I have the following:

  • Project: Bee Keeper Chronicle App
  • Major deliverable: App Design
  • Sub-deliverable: Wireframes all screens
  • Work Package: Login page wireframe

Users can add EstimatedHours and CompletedHours at any level, and we want to be able to aggregate the data upward. So the Project “Bee Keeper Chronicle App” should show the total estimated time and the total hours worked across the project and everything beneath it.

The question is how to model & track that in RavenDB. Here is what I think the document structure should look like:


{
    "Name": "Login page wire frame",
    "Parent": {
        "Type": "Subs",
        "Id": "subs/0000000000000000009-A"
    },
    "EsimatedHours": 8,
    "CompletedHours": 3,
    "@metadata": {
        "@collection": "WorkPackages"
    }
}


{
    "Name": "Wire frames all screens",
    "Parent": {
        "Type": "Majors",
        "Id": "major/0000000000000000008-A"
    },
    "EsimatedHours": 20,
    "CompletedHours": 7,
    "@metadata": {
        "@collection": "Subs"
    }
}


{
    "Name": "App Design",
    "Parent": {
        "Type": "Projects",
        "Id": "projects/0000000000000000011-A"
    },
    "EsimatedHours": 50,
    "CompletedHours": 12,
    "@metadata": {
        "@collection": "Majors"
    }
}


{
    "Name": "Bee Keeper Chronicle App",
    "EsimatedHours": 34,
    "CompletedHours": 21,
    "@metadata": {
        "@collection": "Projects"
    }
}
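
With these four sample documents, the roll-up we are after works out like this: the project's totals cover its own hours plus everything beneath it, i.e. EstimatedHours = 34 + 50 + 20 + 8 = 112 and CompletedHours = 21 + 12 + 7 + 3 = 43, while the “App Design” major deliverable rolls up to 50 + 20 + 8 = 78 estimated and 12 + 7 + 3 = 22 completed hours.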

I used a Parent relationship, since that is the most flexible (it allows you to move each item to a completely different part of the tree easily). Now, we need to aggregate the values for all of those items using a Map-Reduce index.

Because all the levels share the same structure, I created the following JS function:


function processWorkBreakdownHours(doc) {
    // The hours recorded directly on this document.
    let hours = {
        EstimatedHours: doc.EstimatedHours,
        CompletedHours: doc.CompletedHours
    };
    // One entry for the document itself...
    let results = [Object.assign({
        Scope: id(doc)
    }, hours)];

    // ...and one entry per ancestor, walking up the Parent chain,
    // so this document's hours roll up to every level above it.
    let current = doc;
    while (current.Parent) {
        current = load(current.Parent.Id, current.Parent.Type);
        results.push(Object.assign({
            Scope: id(current)
        }, hours));
    }
    return results;
}
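
For the “Login page wire frame” document above, this function emits four entries, each carrying that work package's own hours (EstimatedHours: 8, CompletedHours: 3): one for the work package's own id, and one each for subs/0000000000000000009-A, major/0000000000000000008-A, and projects/0000000000000000011-A. The reduce function below then sums these entries per Scope.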

This walks up the hierarchy and adds the document's estimated and completed hours to every level above it. Now we just need to add this file as an Additional Source to an index and call the function for all the relevant collections, like this:


map("WorkPackages",processWorkBreakdownHours);
map("Subs",processWorkBreakdownHours);
map("Majors",processWorkBreakdownHours);
map("Projects",processWorkBreakdownHours);

And the last step is to aggregate across all of them in the reduce function:


groupBy(x => x.Scope).aggregate(g => {
    return {
        Scope: g.key,
        EstimatedHours: g.values.reduce((c, val) => val.EstimatedHours + c, 0),
        CompletedHours: g.values.reduce((c, val) => val.CompletedHours + c, 0)
    };
})

You can see the full index definition here.

The end result is automatic aggregation at all levels. Change one item, and it will propagate upward.
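
To read the roll-up for a given node, you query the map-reduce index by its Scope. For example, assuming the index is named WorkBreakdownHours (the name here is mine, use whatever you called the index), fetching the project's totals looks roughly like this:

from index 'WorkBreakdownHours'
where Scope = 'projects/0000000000000000011-A'

This returns a single result with the summed EstimatedHours and CompletedHours for the project and everything beneath it.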

Considerations:

I’m using load() here, which registers a reference from each indexed document to the parent documents it loads, so when a parent changes, everything that loaded it is re-indexed. The idea is that if we move a Work Package from one Sub-deliverable to another (in the same or a different Major & Project), this index will automatically re-index what is required and get you the right result.

However, that also means that whenever the Major document changes, we’ll have to re-index everything below it (because it might have changed the Project). There are two ways to handle that.

  • Instead of using load(), we’ll use noTracking.load(), which tells RavenDB that when the referenced document changes, we should not re-index.
  • Provide the relevant scopes at the document level, like this:


{
    "Name": "Login page wire frame",
    "Scope": [
       "subs/0000000000000000009-A",
       "major/0000000000000000008-A",
       "projects/0000000000000000011-A"
    ],
    "EsimatedHours": 8,
    "CompletedHours": 3,
    "@metadata": {
        "@collection": "WorkPackages"
    }
}
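
With the second option, the map function no longer needs to walk the tree at all; it just emits one entry per pre-computed scope. Here is a minimal sketch of what that could look like (the function name is mine, and it assumes the Scope array shown above):

function processWorkBreakdownHoursByScope(doc) {
    let hours = {
        EstimatedHours: doc.EstimatedHours,
        CompletedHours: doc.CompletedHours
    };
    // One entry for the document itself, plus one per pre-computed ancestor scope.
    let results = [Object.assign({ Scope: id(doc) }, hours)];
    (doc.Scope || []).forEach(scope => {
        results.push(Object.assign({ Scope: scope }, hours));
    });
    return results;
}

The reduce function can stay exactly the same.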

Note that in this case, moving a node to a different part of the tree becomes more complex, because you have to touch every document beneath it and update its Scope array.

In most cases, that is such a rare event that it shouldn’t be a consideration, but it depends largely on your context.

And there you have it, a simple Map-Reduce index that can aggregate across an entire hierarchy with ease.


When building RavenDB, we occasionally have to deal with some ridiculous numbers in both size and scale. In one of our tests, we ran into an interesting problem. Here are the performance numbers of running a particular query 3 times.

First Run: 19,924 ms

Second Run: 3,181 ms

Third Run: 1,179 ms

Those are not good numbers, so we dug into this to try to figure out what is going on. Here is the query that we are running:


from index 'IntFloatNumbers-Lucene' where Int > 0

And the key here is that this index covers 400 million documents, all of which actually match Int > 0. So this is a pretty complex task for the database to handle, mostly because of the internals of how Lucene works.

Remember that we return both the first page of the results and the total number of matches. So we have to go through the entire result set to find out how many items we have. That is a lot of work.

But it turns out that most of the time here isn’t actually processing the query, but dealing with the GC. Here are some entries from the GC log while the queries were running:


2024-12-12T12:39:40.4845987Z, Type: GC, thread id: 30096, duration: 2107.9972ms, index: 25, generation: 2, reason: Induced
2024-12-12T12:39:53.1359744Z, Type: GC, thread id: 30096, duration: 1650.9207ms, index: 26, generation: 2, reason: Induced
2024-12-12T12:40:07.5835527Z, Type: GC, thread id: 30096, duration: 1629.1771ms, index: 27, generation: 2, reason: Induced
2024-12-12T12:40:20.2205602Z, Type: GC, thread id: 30096, duration: 776.24ms, index: 28, generation: 2, reason: Induced

That sound you heard was me going: Ouch!

Remember that this query actually goes through 400M results. Here are the details about its Memory Usage & Object Count:

  • Number of objects for GC (under LuceneIndexPersistence): 190M (~12.63GB)
  • Managed Memory: 13.01GB
  • Unmanaged Memory: 4.53MB

What is going on? It turns out that Lucene handles queries such as Int>0 by creating an array with all the unique values, something similar to:


string[] sortedTerms = new string[190_000_000];        // one entry per unique term
long[] termPostingListOffset = new long[190_000_000];  // where each term's posting list starts

This isn’t exactly how it works, mind. But the details don’t really matter for this story. The key here is that we have an array with a sorted list of terms, and in this case, we have a lot of terms.

Those values are cached, so they aren’t actually allocated and thrown away each time we query. However, remember that the .NET GC uses a Mark & Sweep algorithm. Here is the core part of the Mark portion of the algorithm:


long _marker;
void Mark()
{
    var currentMarker = ++_marker;

    foreach (var root in GetRoots())
    {
        Mark(root);
    }

    void Mark(object o)
    {
        // already visited
        if (GetMarker(o) == currentMarker)
            return;
        SetMarker(o, currentMarker); // mark as reachable for this cycle

        foreach (var child in GetReferences(o))
        {
            Mark(child);
        }
    }
}

Basically, start from the roots (static variables, items on the stack, etc.), scan the reachable object graph, and mark all the objects in use. The code above is generic, of course (and basically pseudo-code), but let’s consider what the performance will be like when dealing with an array of 190M strings.

It has to scan that entire array, which means the cost of the mark phase is proportional to the number of live objects. And we have quite a lot of those.

The problem was the number of managed objects, so we pulled all of those out. We moved the term storage to unmanaged memory, outside the purview of the GC. As a result, we now have the following Memory Usage & Object Count:

  • Number of objects for GC (under LuceneIndexPersistence): 168K (~6.64GB)
  • Managed Memory: 6.72GB
  • Unmanaged Memory: 1.32GB
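
To make that concrete, here is a minimal sketch of the general idea (my own illustration, not the actual RavenDB code): keep the sorted terms in one unmanaged buffer plus an unmanaged offsets table, so the GC sees a single small object instead of hundreds of millions of strings.

using System;
using System.Runtime.InteropServices;
using System.Text;

public sealed unsafe class NativeTermStore : IDisposable
{
    private byte* _buffer;    // all terms, UTF-8 encoded, back to back
    private long* _offsets;   // start offset of each term inside _buffer
    private readonly int _count;

    public NativeTermStore(string[] sortedTerms)
    {
        _count = sortedTerms.Length;
        long totalBytes = 0;
        foreach (var term in sortedTerms)
            totalBytes += Encoding.UTF8.GetByteCount(term);

        _buffer = (byte*)NativeMemory.Alloc((nuint)totalBytes);
        _offsets = (long*)NativeMemory.Alloc((nuint)((_count + 1L) * sizeof(long)));

        long position = 0;
        for (int i = 0; i < _count; i++)
        {
            _offsets[i] = position;
            byte[] utf8 = Encoding.UTF8.GetBytes(sortedTerms[i]);
            utf8.AsSpan().CopyTo(new Span<byte>(_buffer + position, utf8.Length));
            position += utf8.Length;
        }
        _offsets[_count] = position; // sentinel: term i spans _offsets[i].._offsets[i + 1]
    }

    // Read a term back without materializing a managed string.
    public ReadOnlySpan<byte> GetTerm(int index) =>
        new ReadOnlySpan<byte>(_buffer + _offsets[index],
            (int)(_offsets[index + 1] - _offsets[index]));

    public void Dispose()
    {
        NativeMemory.Free(_buffer);
        NativeMemory.Free(_offsets);
        _buffer = null;
        _offsets = null;
    }
}

The trade-off is that you now own the lifetime of that memory (hence the Dispose), but the mark phase no longer has to traverse any of it.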

Looking at the GC logs, we now have:


2024-12-16T18:33:29.8143148Z, Type: GC, thread id: 8508, duration: 93.6835ms, index: 319, generation: 2, reason: Induced
2024-12-16T18:33:30.7013255Z, Type: GC, thread id: 8508, duration: 142.1781ms, index: 320, generation: 2, reason: Induced
2024-12-16T18:33:31.5691610Z, Type: GC, thread id: 8508, duration: 91.0983ms, index: 321, generation: 2, reason: Induced
2024-12-16T18:33:37.8245671Z, Type: GC, thread id: 8508, duration: 112.7643ms, index: 322, generation: 2, reason: Induced

So the GC time is now in the range of 100ms, instead of several seconds. This change both reduces overall GC pause times and greatly cuts the amount of CPU spent on managing garbage.

Those are still big queries, but now we can focus on executing the query, rather than managing maintenance tasks. Incidentally, those sorts of issues are one of the key reasons why we built Corax, which can process queries directly on top of persistent structures, without needing to materialize anything from the disk.
