Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 405 words

Before we move on to discussing how to implement package search, I wanted to take a bit of time to discuss the we structured the data. In particular, there are a bunch of properties that feel very relational in nature. In particular, these two properties:

  • Tags: Ian_Mercer Natural_Language Abodit NLP
  • Dependencies: AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11

In the current version of NuGet, those properties are actually stored as symbol separated strings. The reason for that? In relational databases, if you want to have a collection, you have to have another table, then join to it, then take care of it, and wake up in the middle of the night to take it to a walk. So people go the obvious route and just concatenate strings and hope for the best. Note that in the dependencies case, we have multi level concatenation.

In RavenDB, we have full fledged support for storing complex objects, so the tags above will become:

image

And what about the dependencies? Those we store in an array of complex objects, like so:

image

RavenDB allows us to store the model in a way that is easy on the eye ,natural to work with and in general making our lives easier.

Let us say that I wanted to add a feature to NuGet, “show me all the packages that use this package”?

image

And allow me to brag a little bit?

image

By the way, just to be sure that everyone has full grasp about what is going on, I am writing this post while on 30,000 feet. The laptop I am using is NOT connected to power, and the data set that I am using is the full NuGet dataset.

Compare the results you get from RavenDB to what you have to do in SQL: Dependencies LIKE ‘%log4net%’

You can kiss your performance goodbye with these sort of queries.

time to read 8 min | 1455 words

The first thing that we will do with RavenDB and the NuGet data is to issue the same logical query as the one used to populate the packages page. As a reminder, here is how it looks:

SELECT        TOP (30) 
          -- ton of fields removed for brevity
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

Despite the apparent complexity ,this is a really trivial query. What is does is say:

  • Give me the first 30 – 60 rows
  • Where IsPrerelease is false
  • Order by the download count and then the id

With Linq, the client side query looks something like this:

var results = Session.Query<Package>()
                         .Where(x=>x.IsPrerelease == false)
                         .OrderBy(x=>x.DownloadCount).ThenBy(x=>x.Id)
                         .Skip(30)
                         .Take(30)
                         .ToList();

Now, I assume that this is what the NuGet code is also doing, it is just that the relational database has made it so they have to go to the data in a really convoluted way.

With RavenDB, to match the same query, I could just issue the following query, but there are subtle differences between how the query works in SQL and how it works in RavenDB. in particular, the data that we have in RavenDB is the output of this query, but it isn’t the raw output. For example, we don’t have the Id column available, which is used for sorting. Now, I think that the logic is meaning to say, “sort by download count descending and then by age ascending”. So old and popular packages are more visible than new and fresh packages.

In order to match the same behavior (and because we will need it to the next post) we will define the following index in RavenDB:

image

And querying it:

image

The really nice thing about this?

image

This is the URL for this search:

/indexes/Packages/Listing?query=IsPrerelease:false&start=0&pageSize=128&aggregation=None&sort=-DownloadCount&sort=Created

This is something that RavenDB can do in its sleep, because it is a very cheap operation. Consider the query plan that would for the SQL query above. You have to join 5 times just to get to the data that you want, paging is a real mess, and the database actually have to work a lot to answer this fiddling little query.

Just to give you some idea here. We are talking about something that conceptually should be the same as:

select top 30 skip 30 * from Data where IsPrerelease = 0

But it get really complex really fast with the joins and the tables and all the rest.

In comparison, in RavenDB, we actually do have just a property match to do. Because we keep the entire object graph in a single location, we can do very efficient searches on it.

In the next post, I’ll discuss the actual way I modeled the data, and then we get to do exciting searches Smile.

time to read 7 min | 1397 words

The first part of actually showing how RavenDB can handle the NuGet scenario was to actually get the data to RavenDB. Luckily, NuGet makes the data accessible using OData, so I quickly hacked up the following program:

using (var store = new DocumentStore
    {
        Url = "http://localhost:8080",
        DefaultDatabase = "Nuget"
    }.Initialize())
{
    string url = "https://nuget.org/api/v2/Packages";
    while (true)
    {
        Console.WriteLine("GET {0}", url);
        if (url == null)
            break;
        var webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Accept = "application/json";
        using (var resp = webRequest.GetResponse())
        using (var strema = resp.GetResponseStream())
        {
            url = WritePackagesToRaven(strema, store);
        }
    }
}

This is imply going to NuGet and asking for the packages in json format. It is very easy for us to work with json data with RavenDB, so that is what we are doing.

The next stage is to actually read the response and write the packages to RavenDB, this is handled here:

private static string WritePackagesToRaven(Stream strema, IDocumentStore store)
{
    var json = RavenJToken.ReadFrom(new JsonTextReader(new StreamReader(strema)))
        .Value<RavenJObject>("d");


    using (var session = store.OpenSession())
    {
        foreach (RavenJObject result in json.Value<RavenJArray>("results"))
        {
            ModifyResult(result);
            session.Advanced.Defer(new PutCommandData
                {
                    Document = result,
                    Metadata = new RavenJObject
                        {
                            {"Raven-Entity-Name", "Packages"}
                        },
                    Key = "packages/" + result.Value<string>("Id") + "/" + result.Value<string>("Version")
                });
        }
        session.SaveChanges();
    }
    return json.Value<string>("__next");
}

I am not really sure why we have this “d” as the beginning of the json results, but that is what NuGet returns. We iterate over the query results, and write all of them to RavenDB.

You might note that we use the Defer() option, which means that we can rely on the session to handle batching for us and only go to the server once, when we call SaveChanges(). We also set the document metadata to be pretty basic, merely indicating that this should go on the Packages collection. Finally, we set the id to be composed of the package id and the version, resulting in a unique and human readable key for the imported package.

Note that we return the next page location, and continue on working on that in the next page loop.

There is one thing that we need to do, the NuGet data is still highly relational, and quite ugly at times. For example, let us take Tags and Dependencies. Here is how they show up in the raw results:

  • Dependencies: AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11
  • Tags: Ian_Mercer Natural_Language Abodit NLP

That isn’t a really nice way to work with the data, so before we save the results to RavenDB, we modify it slightly.

private static void ModifyResult(RavenJObject result)
{
    var tags = result.Value<string>("Tags");
    if (tags != null)
    {
        result["Tags"] =
            new RavenJArray(tags.Split(new[] {' ', ',', ';'}, StringSplitOptions.RemoveEmptyEntries));
    }
    else
    {
        result["Tags"] = new RavenJArray();
    }
    var deps = result.Value<string>("Dependencies");
    if (deps != null)
    {
        result["Dependencies"] =
            new RavenJArray(deps.Split(new[] {'|'}, StringSplitOptions.RemoveEmptyEntries)
                                .Select(s =>
                                    {
                                        var strings = s.Split(':');
                                        return RavenJObject.FromObject(new {Package = strings[0], Version = strings[1]});
                                    }));
    }
    result.Remove("__metadata");
}

Finally, let us take a peek at RavenDB and see how the results look like there:

image

Now that is much more like it.

On my next post, I am going to show how to do some queries againt this data, which currently have about 66,483 results.

time to read 14 min | 2742 words

It appears that NuGet has some perf problems recently and Jeff Handley posted the problematic queries as well as the new, hand optimized queries.

You can see the original problematic queries and the optimized code (still tentative) here.

Hand optimized query to load the pages for the packages page:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id
 

This monster query is actually translated to something like:

Give me the top 30 packages which are not pre released, ordered by the download count and then by their id.

It takes a great deal of complexity to deal with that for one major reason, the data is split up across multiple tables in a way that make it hard get all of it easily. The minor reason is that there is really no good way to do paging in SQL Server (shocking, I know). One would assume that such a basic feature would have a bit more attention.

What is worse is the optimized version of the search feature:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        ((((Packages.IsPrerelease <> cast(1 as bit)))))
                                ((((AND    Packages.IsLatestStable = 1))))
                                ((((AND    Packages.IsLatest = 1))))
                                AND    (
                                        PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~'
                                    OR    PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Title LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Title LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Tags LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Tags LIKE '%ui%' ESCAPE N'~'
                                    )
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

One thing that immediately popped up to me was the use of queries such as "’%jquery%’. This is a flat out killer for performance in relational databases, since they cannot do any indexes on this and are forced to do a table scan.

I decided to take a stub at moving the NuGet data to RavenDB, which is  a much better fit (in my own obviously utterly unbiased opinion). In the next post, we will start with the actual import process, then we get to actual queries.

time to read 1 min | 137 words

Well, it seems that we have reached this stage of maturity that RavenDB is starting to develop it own eco system and 3rd party toolsets. In particular, I am very excited to preset GahbonMining, a Data Mining Server for RavenDB.

What does this means, Data Mining for RavenDB?

Well, it basically means that your app can actually learn from what is going on in the database. For example, if we are talking about StackOverflow, you can train the server on the existing data, and when a user submit a new question, you can use GahbonMining to suggest the appropriate tags for the question. Or suggest similar questions that have already been asked and answered.

GahbonMining is just out in alpha, but it is already showing quite a bit of promise, do take a look, you won’t regret it.

time to read 1 min | 81 words

I am going to be in the Professional .NET 2012 conference in Vienna, Austria next month, and I think we can plan for a special surprise.

Sep 10 is going to be the close off date for new features the next release of RavenDB (feature freeze), and in my talk, I am going to take you through a tour of not only what is RavenDB, but what are all the goodies that you can expect from the next version.

See your there…

time to read 3 min | 552 words

The following represent a stream of thoughts while I was refactoring the replication behavior for RavenDB. It may not be very meaningful, since I am mostly using this to sort things out in my head.

Previously in RavenDB, we created what are known as tombstone documents in RavenDB to represent deleted documents. That caused problem under some specialized conditions (if you had a lot of deletes, you had a lot of tombstone documents to go through). So we changed the way we store tombstone documents to use a separate list that isn’t part of the document storage.

That means that now we need to update the part that accepts a replication request and have it understand what is going on. This is a bit complex because, ignoring concurrency, we have the following states:

  • Item does not exist in the current database.
    • Item was deleted (mark it as deleted? leave it alone?)
    • Item was created/updated (persist it)
  • Item was deleted locally.
    • Item was deleted (should we generate a conflict if the history is not similar?)
    • Item was created/updated (should generate a conflict if the history isn’t consistent).
  • Item exists locally:
    • Item was deleted (need to check for history for conflicts).
    • Item was created/updated (need to check for history for conflicts).
  • Item was modified locally.
    • Item was deleted (need to check for history for conflicts).
    • Item was created/updated (need to check for history for conflicts).
  • Item is conflicted:
    • Item was deleted (generate conflict).
    • Item was created/updated (generate conflict).

We also need to make sure that we have the right behavior when you load a conflicted document, which make for interesting behavior when you have conflicts with deleted documents.

The first thing to do was to actually simply things, instead of having all of those options, I decided to branch early, so we would have one branch for when we replicate a delete and one branch when we replicate a create/update.

  • Replicating delete
    • Item does not exists – ignore
    • Item is conflicted – add to the conflicts (save as standard tombstone document, not as a separate item. That makes it possible to handle the client side conflict resolution easily).
    • Item was deleted locally – merge histories and do not generate conflict even if there is difference.
    • Item wasn’t modified locally – delete it
    • Item was modified locally – create conflict

There were actually to major decisions that made it drastically simpler. The first was to have the delete yes/no branching early. The second is more subtle, but probably even more important. The reason for this change was to remove the cost of storing the deleted tombstones as documents. The problem is what to do when we are actually getting a conflict. Previously, we would save the conflict into the document store, and that would be accessible using the standard RavenDB tooling. But how do you store a conflict between a deleted item and a locally modified item?

I decided that for that specific scenario, we are going to continue storing them as documents. That means that externally, nothing will change. This drastically reduced the complexity in the codebase and the problem set that I had to resolve.

Everything now works, and we are good to go.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}