Ayende @ Rahien

It's a girl

NuGet Perf, Part IV–Modeling the packages

Before we move on to discussing how to implement package search, I wanted to take a bit of time to discuss how we structured the data. There are a bunch of properties that feel very relational in nature, in particular these two:

  • Tags: Ian_Mercer Natural_Language Abodit NLP
  • Dependencies: AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11

In the current version of NuGet, those properties are actually stored as symbol-separated strings. The reason for that? In relational databases, if you want to have a collection, you have to have another table, then join to it, then take care of it, and wake up in the middle of the night to take it for a walk. So people go the obvious route and just concatenate strings and hope for the best. Note that in the dependencies case, we have multi-level concatenation.

In RavenDB, we have full-fledged support for storing complex objects, so the tags above will become:

image

And what about the dependencies? Those we store in an array of complex objects, like so:

image

RavenDB allows us to store the model in a way that is easy on the eye, natural to work with, and in general makes our lives easier.
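On the client side, a minimal sketch of the matching model classes might look like this (the class and property names here are my assumptions, not NuGet's actual code):

using System;
using System.Collections.Generic;

public class Dependency
{
    public string Package { get; set; }
    public string Version { get; set; }
}

public class Package
{
    public string Id { get; set; }
    public string Version { get; set; }
    public bool IsPrerelease { get; set; }
    public int DownloadCount { get; set; }
    public DateTime Created { get; set; }
    public List<string> Tags { get; set; }
    public List<Dependency> Dependencies { get; set; }
}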

Let us say that I wanted to add a feature to NuGet: “show me all the packages that use this package”.

image
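In client code, that query could be expressed along these lines (a sketch, assuming the model above):

using (var session = store.OpenSession())
{
    // All packages that declare a dependency on log4net
    var dependents = session.Query<Package>()
        .Where(x => x.Dependencies.Any(d => d.Package == "log4net"))
        .ToList();
}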

And allow me to brag a little bit?

image

By the way, just to be sure that everyone has a full grasp of what is going on, I am writing this post while at 30,000 feet. The laptop I am using is NOT connected to power, and the data set that I am using is the full NuGet dataset.

Compare the results you get from RavenDB to what you have to do in SQL: Dependencies LIKE ‘%log4net%’

You can kiss your performance goodbye with that sort of query.

NuGet Perf, Part III–Displaying the Packages page

The first thing that we will do with RavenDB and the NuGet data is to issue the same logical query as the one used to populate the packages page. As a reminder, here is how it looks:

SELECT        TOP (30) 
          -- ton of fields removed for brevity
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

Despite the apparent complexity, this is a really trivial query. What it does is say:

  • Give me rows 30 – 60
  • Where IsPrerelease is false
  • Order by the download count and then the id

With Linq, the client side query looks something like this:

var results = Session.Query<Package>()
                         .Where(x=>x.IsPrerelease == false)
                         .OrderByDescending(x=>x.DownloadCount).ThenBy(x=>x.Id)
                         .Skip(30)
                         .Take(30)
                         .ToList();

Now, I assume that this is what the NuGet code is also doing; it is just that the relational database forces it to get at the data in a really convoluted way.

With RavenDB, I could just issue a matching query, but there are subtle differences between how the query works in SQL and how it works in RavenDB. In particular, the data that we have in RavenDB is the output of this query, not the raw tables it joins. For example, we don’t have the Id column available, which is used for sorting. Now, I think that the logic is meant to say “sort by download count descending and then by age ascending”, so old and popular packages are more visible than new and fresh packages.

In order to match the same behavior (and because we will need it for the next post), we will define the following index in RavenDB:

image
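For reference, an index along these lines could be defined from the client like so (a sketch based on the fields used here, assuming the Package class sketched earlier; the actual index in the image may differ):

using System.Linq;
using Raven.Abstractions.Indexing;
using Raven.Client.Indexes;

public class Packages_Listing : AbstractIndexCreationTask<Package>
{
    public Packages_Listing()
    {
        Map = packages => from package in packages
                          select new
                          {
                              package.IsPrerelease,
                              package.DownloadCount,
                              package.Created
                          };

        // Make sure DownloadCount is sorted numerically rather than lexically
        Sort(x => x.DownloadCount, SortOptions.Int);
    }
}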

And querying it:

image
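In client code, that query would look something like this (a sketch matching the URL shown below):

using (var session = store.OpenSession())
{
    var packages = session.Query<Package, Packages_Listing>()
        .Where(x => x.IsPrerelease == false)
        .OrderByDescending(x => x.DownloadCount).ThenBy(x => x.Created)
        .Take(128)
        .ToList();
}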

The really nice thing about this?

image

This is the URL for this search:

/indexes/Packages/Listing?query=IsPrerelease:false&start=0&pageSize=128&aggregation=None&sort=-DownloadCount&sort=Created

This is something that RavenDB can do in its sleep, because it is a very cheap operation. Consider the query plan that would be generated for the SQL query above. You have to join 5 times just to get to the data that you want, paging is a real mess, and the database actually has to work quite a lot to answer this fiddling little query.

Just to give you some idea here. We are talking about something that conceptually should be the same as:

select top 30 skip 30 * from Data where IsPrerelease = 0

But it gets really complex really fast with the joins and the tables and all the rest.

In comparison, in RavenDB, we actually do have just a property match to do. Because we keep the entire object graph in a single location, we can do very efficient searches on it.

In the next post, I’ll discuss the actual way I modeled the data, and then we get to do some exciting searches.

NuGet Perf Problem, Part II–Importing To RavenDB

The first part of actually showing how RavenDB can handle the NuGet scenario was to get the data into RavenDB. Luckily, NuGet makes the data accessible using OData, so I quickly hacked up the following program:

using (var store = new DocumentStore
    {
        Url = "http://localhost:8080",
        DefaultDatabase = "Nuget"
    }.Initialize())
{
    string url = "https://nuget.org/api/v2/Packages";
    while (url != null)
    {
        Console.WriteLine("GET {0}", url);
        var webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Accept = "application/json"; // ask OData for JSON instead of Atom/XML
        using (var resp = webRequest.GetResponse())
        using (var stream = resp.GetResponseStream())
        {
            // Returns the URL of the next page, or null when we are done
            url = WritePackagesToRaven(stream, store);
        }
    }
}

This simply goes to NuGet and asks for the packages in JSON format. It is very easy for us to work with JSON data in RavenDB, so that is what we are doing.

The next stage is to actually read the response and write the packages to RavenDB, this is handled here:

private static string WritePackagesToRaven(Stream stream, IDocumentStore store)
{
    // The payload is wrapped in a "d" property
    var json = RavenJToken.ReadFrom(new JsonTextReader(new StreamReader(stream)))
        .Value<RavenJObject>("d");

    using (var session = store.OpenSession())
    {
        foreach (RavenJObject result in json.Value<RavenJArray>("results"))
        {
            ModifyResult(result);
            // Defer the put; everything is sent in a single batch on SaveChanges()
            session.Advanced.Defer(new PutCommandData
                {
                    Document = result,
                    Metadata = new RavenJObject
                        {
                            {"Raven-Entity-Name", "Packages"}
                        },
                    Key = "packages/" + result.Value<string>("Id") + "/" + result.Value<string>("Version")
                });
        }
        session.SaveChanges();
    }
    // The URL of the next page of results, or null when there are no more
    return json.Value<string>("__next");
}

I am not really sure why we have this “d” at the beginning of the JSON results, but that is what NuGet returns. We iterate over the query results and write all of them to RavenDB.

You might note that we use the Defer() option, which means that we can rely on the session to handle batching for us and only go to the server once, when we call SaveChanges(). We also set the document metadata to be pretty basic, merely indicating that this should go into the Packages collection. Finally, we set the id to be composed of the package id and the version, resulting in a unique and human readable key for the imported package.

Note that we return the next page location, and continue working from it in the next iteration of the loop.

There is one thing that we still need to do: the NuGet data is highly relational, and quite ugly at times. For example, let us take Tags and Dependencies. Here is how they show up in the raw results:

  • Dependencies: AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11
  • Tags: Ian_Mercer Natural_Language Abodit NLP

That isn’t a really nice way to work with the data, so before we save the results to RavenDB, we modify it slightly.

private static void ModifyResult(RavenJObject result)
{
    // Turn the space/comma/semicolon separated Tags string into a proper array
    var tags = result.Value<string>("Tags");
    if (tags != null)
    {
        result["Tags"] =
            new RavenJArray(tags.Split(new[] {' ', ',', ';'}, StringSplitOptions.RemoveEmptyEntries));
    }
    else
    {
        result["Tags"] = new RavenJArray();
    }
    // Turn "Package:Version|Package:Version|..." into an array of {Package, Version} objects
    var deps = result.Value<string>("Dependencies");
    if (deps != null)
    {
        result["Dependencies"] =
            new RavenJArray(deps.Split(new[] {'|'}, StringSplitOptions.RemoveEmptyEntries)
                                .Select(s =>
                                    {
                                        var strings = s.Split(':');
                                        return RavenJObject.FromObject(new {Package = strings[0], Version = strings[1]});
                                    }));
    }
    // Drop the OData metadata, we don't need it in our documents
    result.Remove("__metadata");
}

Finally, let us take a peek at RavenDB and see how the results look there:

image

Now that is much more like it.

In the next post, I am going to show how to do some queries against this data, which currently comes to about 66,483 documents.

NuGet Perf Problems, Part I

It appears that NuGet has been having some perf problems recently, and Jeff Handley posted the problematic queries as well as the new, hand optimized queries.

You can see the original problematic queries and the optimized code (still tentative) here.

Hand optimized query to load the pages for the packages page:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id
 

This monster query is actually translated to something like:

Give me the top 30 packages which are not pre released, ordered by the download count and then by their id.

It takes a great deal of complexity to deal with that for one major reason: the data is split up across multiple tables in a way that makes it hard to get all of it easily. The minor reason is that there is really no good way to do paging in SQL Server (shocking, I know). One would assume that such a basic feature would have gotten a bit more attention.

What is worse is the optimized version of the search feature:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        ((((Packages.IsPrerelease <> cast(1 as bit)))))
                                ((((AND    Packages.IsLatestStable = 1))))
                                ((((AND    Packages.IsLatest = 1))))
                                AND    (
                                        PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~'
                                    OR    PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Title LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Title LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Tags LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Tags LIKE '%ui%' ESCAPE N'~'
                                    )
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

One thing that immediately popped out at me was the use of queries such as ‘%jquery%’. This is a flat out performance killer in relational databases, since they cannot use an index for such a query and are forced to do a table scan.

I decided to take a stab at moving the NuGet data to RavenDB, which is a much better fit (in my own, obviously utterly unbiased, opinion). In the next post, we will start with the actual import process, and then we get to the actual queries.

GahbonMining - a Data Mining Server for RavenDB

Well, it seems that we have reached the stage of maturity where RavenDB is starting to develop its own ecosystem and 3rd party toolsets. In particular, I am very excited to present GahbonMining, a Data Mining Server for RavenDB.

What does this mean, Data Mining for RavenDB?

Well, it basically means that your app can actually learn from what is going on in the database. For example, if we are talking about StackOverflow, you can train the server on the existing data, and when a user submits a new question, you can use GahbonMining to suggest the appropriate tags for the question, or suggest similar questions that have already been asked and answered.

GahbonMining is just out in alpha, but it is already showing quite a bit of promise. Do take a look, you won’t regret it.


Professional .NET in Vienna–Sep 14–and what is NEW in RavenDB

I am going to be at the Professional .NET 2012 conference in Vienna, Austria next month, and I think we can plan for a special surprise.

Sep 10 is going to be the close-off date for new features in the next release of RavenDB (feature freeze), and in my talk, I am going to take you through a tour of not only what RavenDB is, but also all the goodies that you can expect from the next version.

See you there…


Replication Refactoring

The following represents a stream of thoughts while I was refactoring the replication behavior for RavenDB. It may not be very meaningful, since I am mostly using this to sort things out in my head.

Previously in RavenDB, we created what are known as tombstone documents to represent deleted documents. That caused problems under some specialized conditions (if you had a lot of deletes, you had a lot of tombstone documents to go through). So we changed the way we store tombstones to use a separate list that isn’t part of the document storage.

That means that now we need to update the part that accepts a replication request and have it understand what is going on. This is a bit complex because, ignoring concurrency, we have the following states:

  • Item does not exist in the current database.
    • Item was deleted (mark it as deleted? leave it alone?)
    • Item was created/updated (persist it)
  • Item was deleted locally.
    • Item was deleted (should we generate a conflict if the history is not similar?)
    • Item was created/updated (should generate a conflict if the history isn’t consistent).
  • Item exists locally:
    • Item was deleted (need to check for history for conflicts).
    • Item was created/updated (need to check for history for conflicts).
  • Item was modified locally.
    • Item was deleted (need to check for history for conflicts).
    • Item was created/updated (need to check for history for conflicts).
  • Item is conflicted:
    • Item was deleted (generate conflict).
    • Item was created/updated (generate conflict).

We also need to make sure that we have the right behavior when you load a conflicted document, which makes for interesting behavior when you have conflicts with deleted documents.

The first thing to do was to actually simplify things. Instead of having all of those options, I decided to branch early, so we would have one branch for when we replicate a delete and one branch for when we replicate a create/update (a sketch of the delete branch follows the list below).

  • Replicating a delete
    • Item does not exist – ignore.
    • Item is conflicted – add to the conflicts (save as a standard tombstone document, not as a separate item; that makes it possible to handle the client side conflict resolution easily).
    • Item was deleted locally – merge the histories and do not generate a conflict, even if there is a difference.
    • Item wasn’t modified locally – delete it.
    • Item was modified locally – create a conflict.
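Here is a minimal C# sketch of that delete branch; every type and method name below is hypothetical, used only to illustrate the decision table above (this is not the actual RavenDB code):

public enum LocalItemState { Missing, Conflicted, DeletedLocally, Unmodified, Modified }

public interface IReplicationActions
{
    void AddTombstoneAsConflictDocument(); // keep the conflict visible to client side resolution
    void MergeHistories();                 // both sides deleted it, no conflict needed
    void DeleteLocally();
    void CreateConflict();
}

public static class DeleteReplication
{
    public static void Handle(LocalItemState state, IReplicationActions actions)
    {
        switch (state)
        {
            case LocalItemState.Missing:        /* nothing to delete, ignore */            break;
            case LocalItemState.Conflicted:     actions.AddTombstoneAsConflictDocument();  break;
            case LocalItemState.DeletedLocally: actions.MergeHistories();                  break;
            case LocalItemState.Unmodified:     actions.DeleteLocally();                   break;
            case LocalItemState.Modified:       actions.CreateConflict();                  break;
        }
    }
}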

There were actually two major decisions that made it drastically simpler. The first was to have the delete yes/no branching early. The second is more subtle, but probably even more important. The reason for this change was to remove the cost of storing the deleted tombstones as documents. The problem is what to do when we actually get a conflict. Previously, we would save the conflict into the document store, and that would be accessible using the standard RavenDB tooling. But how do you store a conflict between a deleted item and a locally modified item?

I decided that for that specific scenario, we are going to continue storing them as documents. That means that externally, nothing will change. This drastically reduced the complexity in the codebase and the problem set that I had to resolve.

Everything now works, and we are good to go.


Improving Map/Reduce performance in RavenDB

This is another stream of consciousness post, with me trying to figure out what is the best way to resolve a given problem.

Update: I ended up solving this in a drastically different way. I'll post about this later on. I am keeping this here because it is a good post about how I think about a problem.

A while ago, I posted a visual description of how Map/Reduce is supposed to work. That was while I was working on RavenDB map/reduce implementation, but that isn’t actually how the current RavenDB map/reduce works.

Instead, it works like this:

Map phase:

for item in docs:
    for result in reduce(map(item)):
        Persist(item.id, result.key, result)

And the reduce phase:

for result in reduce(results):
     WriteToIndex(result)

There are a few things that are interesting here. First, we have both map and reduce run during the map phase. Why is that?

Well, to do an immediate reduction of the values, of course. This has two major benefits.

  • It means that in the reduce phase, the reduce accepts the output of the reduce – this is important because it prepares us for the next step that we would have to do, multiple reduce steps.
  • Second, it means that if you have multiple results from the same documents that share the same key, they would be reduced immediately, rather than have multiple results with the same key.

You might notice an issue with the code above, though. It seems like we are running the reduce once, and only once.

Indeed, this is how RavenDB 1.0 behaves. It only runs the reduce once, regardless of how many entries you have per key.

Wait! I probably need to explain things. Let us talk for a second about the following map/reduce index:

//map
from order in docs.Orders
from line in order.OrderLines
select new
{
   line.Product,
   line.Qty
}

//reduce
from result in results
group result by result.Product into g
select new
{
    Product = g.Key,
    Qty = g.Sum(x=>x.Qty)
}

Now, if we have an order with two line items for the same product, they would be immediately reduced (on the same document) and saved as the map results.

Now, the reduce key in this case is the line item product. So when we execute the reduce, we load all the map results that share the same reduce key and run them through the reduce function together.

As I said, this is how RavenDB 1.0 behaves. And it works really nicely, except that it behaves less nicely if you have a lot of results for the same reduce key. What happens if we had a really popular product, one that was purchased in a million different orders?

Every time that we would get a new order for this product, we would have to re-reduce the entire set. That means that we would have to re-reduce 1 million items.

We recently got a problem report when one of our customers ran into an issue running map/reduce indexes over the entire US census data. One of the problematic indexes was something like:

//map 
from person in docs.CensusPeople
select new
{
  person.State,
  Count = 1
}

//reduce
from result in results
group result by result.State into g
select new
{
  State = g.Key,
  Count = g.Sum(x=>x.Count)
}

As you can imagine, this sort of thing is going to have a lot of items for the same key.

For example, for California, we would need to run reduce over about 38 million items, and for Texas it would be over 25 million items. This is really not what we had in mind, so we turned to a long standing bug in RavenDB and started to implement multi step reduce.

The issue is how to do so. Ideally, we do not want to have to make any changes between map/reduce indexes that have a large number of items per key and map/reduce indexes that have a small number of items per key.

The question is how to do this, and how to make sure that we don’t affect the overall system performance. As I mentioned before, it is very easy to modify things to fit one particular scenario, while forgetting all about the other scenarios.

Things get interesting after that, and here is where you might get lost, because this part is mostly written from the point of view of the implementers.

The actual behavior of the system in the map phase is more complex, because we need to invalidate old items. It looks more like this:

for item in docs:
    keys = DeleteMapResultsFor(item.id)
    for result in reduce(map(item)):
        keys.Remove(result.key)
        Persist(item.id, result.key, result)
    ReReduceRemovedKeys(keys)

Instead of this, we will have this:

for item in docs:
    result = DeleteMapResultsFor(item.id)
    keys = new HashSet<string>(result.Select(x=>x.Key))
    lookups = result.ToLookup(x=>new {x.id, x.key})

    for result in reduce(map(item)):
        keys.Remove(result.key)
        int bucketId
        if not lookups.TryFind(new { item.Id, result.key}, out bucketId):
            bucketId = -1
        Persist(item.id, result.key, bucketId, result)
    ReReduceRemovedKeys(keys)

Note that we now have the notion of a bucket, and by default that bucket is set to –1. But note that we keep the same bucketId if the item already has one; this will be important later on.

The interesting thing happens in the reduce phase:

def Reduce(string key, int level):
    bool hasMore = true
    bool hadMore = false
    while hasMore:
        results = GetMappedResults(key, level, out hasMore)
        hadMore |= hasMore
        if hasMore:
            newBucketId = GetNewBucketId(key, level)
            UpdateBucketId(results, newBucketId)
        for result in reduce(results):
            Persist(key, level +1, result)
            if not hadMore:
                WriteToIndex(key, result)
    if hadMore:
        ScheduleReduce(key, level +1)

Here is where the important things happen. If we have fewer than 1,024 items for the same key, we just proceed normally; there is nothing to see there.

If we have more than that, then we create a bucket for all of those results and schedule a re-reduce for the next level up.

In other words, it looks like this. Here is the map phase; notice that we start out with all of the bucket ids being –1.

image

When running the reduce phase with level = 0, we get three buckets, like this:

 

image

Which we now need to re-reduce again, this is where we are called with level = 1. Let us assume that the results of bucket 1 & 2 are over 1,024 still, so we will have:

 

image

And finally:

image

 

So far, this looks nice, but there are several practical problems that we still need to solve.

To start with, when does this end? We have users who write map/reduce queries like this:

//map #1
from customer in docs.Customers
select new 
{
   CustomerId = customer.Id,
   CustomerName = customer.Name,
   OrderId = (string)null,
}

// map #2
from order in docs.Orders
select new
{
  order.CustomerId,
  CustomerName = (string)null,
  OrderId = order.Id
}

//reduce
from result in results
group result by result.CustomerId into g
let name = g.FirstOrDefault(x=>x.CustomerName!=null).CustomerName
from item in g
select new
{
   CustomerId = g.Key,
   CustomerName = name,
   item.OrderId
}

This is a frowned upon (but working) way of allowing you to query and sort by the customer name while searching for orders. The problem with this method is that if we have 15,000 orders per customer, we are going to have the same number come out of the reduce phase as well.

Now, the reason this is frowned upon? Because while this is using map/reduce, it isn’t actually… you know… reducing the data. In order to resolve this issue, we are going to make sure that all of the items generated from a single reduce step will always go into the same bucket. This means that we keep pretty much the same behavior as we have now; it is going to be inefficient, but that was always going to be the case.

We are also going to limit the number of levels to three, which still gives us the ability to handle over a billion results (1,024 × 1,024 × 1,024 ≈ 1.07 billion) before a reduce phase would need to see more than 1,024 items at once.

Take the California example: we would have 37,691,912 people, each of them producing a map result, for 37,691,912 map results at bucket –1. Then we have 36,809 buckets for the second level, and finally 35 buckets at the third level, all of which are computed for the final result.

The next step from here is to actually handle updates, which means that we have to keep track of the bucket ids going forward. Say we start by deleting a person; that means that we need to delete their map result, which means that we need to re-reduce the bucket they belong to at the appropriate level, and then the level above it, and so on. In total, we would have to compute 1,024 + 1,024 + 35 items, instead of 37,691,912.

Okay, enough talking. Let us see if I have straightened things out enough to actually be able to implement this.


Businesses that are a joy to work with: Skills Matter

I don’t believe that I have ever done this before, but I was just completely blown away by Skills Matter.

Please note, I never actually dealt with them as a student, although I got great feedback from the people I taught about them in that capacity. I am talking here solely about dealing with them as a service provider.

I have been working with them since 2008, and we have a great working relationship. I routinely deal with many businesses, and almost always there is this… friction. I usually have great rapport with the technical people with whom I work, but when it comes to dealing with other departments, it can be… annoying.

With Skills Matter, that was never the case. They go out of their way to make it easy and fun to work for them. I can’t talk about the “straw that broke the camel’s back” that triggered this post, but I can talk about some other things.

From spreading the word about RavenDB and NHibernate by organising talks with me for the Skills Matter community, to helping me with a payment dispute that I had with a hotel (one that I reserved, but they took care of all the details of getting my money back and saved me tons of international calls and angst), to being willing and able to accommodate screw ups (oops, I missed the plane) in the most pleasant way possible, and all the stuff they do for the developer community.

In my time working with them, I found them to be honest, hardworking, professional, ethical and in general Very Good People.


RavenDB Awesome Feature of the Day, Formatted Indexes

There is a chance that you’ll look at me strangely for calling this the “Feature of the Day”. But it is actually quite an important little feature.

Here is the deal, let us say that you have the following index:

public class Orders_Search : AbstractIndexCreationTask<Order, Orders_Search.ReduceResult>
{
    public class ReduceResult
    {
        public string Query { get; set; }
        public DateTime LastPaymentDate { get; set; }
    }

    public Orders_Search()
    {
        Map = orders => from order in orders
                        let lastPayment = order.Payments.LastOrDefault()
                        select new
                        {
                            Query = new object[]
                            {
                                order.FirstName, 
                                order.LastName, 
                                order.OrderNumber, 
                                order.Email, 
                                order.Email.Split('@'),
                                order.CompanyName,
                                order.Payments.Select(payment => payment.PaymentIdentifier),
                                order.LicenseIds
                            },
                            LastPaymentDate = lastPayment == null ? order.OrderedAt : lastPayment.At
                        };
    }
}

And you are quite happy with it. But that is the client side perspective. We don’t have any types on the server, so you can’t just execute this there. Instead, we send a string representing the index to the server. That string is actually the output of the Linq expression, which looks like this:

image

This is… somewhat hard to read, I think you’ll agree. So we had some minimal work done to improve this, and right now what you’ll get is (you’ll likely see it roll off the screen, that is expected):

docs.Orders
    .Select(order => new {order = order, lastPayment = order.Payments.LastOrDefault()})
    .Select(__h__TransparentIdentifier0 => new {Query = new System.Object []{__h__TransparentIdentifier0.order.FirstName, __h__TransparentIdentifier0.order.LastName, __h__TransparentIdentifier0.order.OrderNumber, __h__TransparentIdentifier0.order.Email, __h__TransparentIdentifier0.order.Email.Split(new System.Char []{'@'}), __h__TransparentIdentifier0.order.CompanyName, __h__TransparentIdentifier0.order.Payments
    .Select(payment => payment.PaymentIdentifier), __h__TransparentIdentifier0.order.LicenseIds}, LastPaymentDate = __h__TransparentIdentifier0.lastPayment == null ? __h__TransparentIdentifier0.order.OrderedAt : __h__TransparentIdentifier0.lastPayment.At})

This is still quite confusing, actually. But still better than the alternative.

As I said, it seems like a little thing, but those things are important. An index whose compiled form is hard for a user to understand is a support issue for us. We needed to resolve this.

The problem is that source code beautifying is non-trivial. I started playing with parsers a bit, but it was all way too complex. Then I had an epiphany. I didn’t actually care about the code, I just wanted it formatted nicely. There aren’t many C# code beautifiers around, but there are a lot for JavaScript.

I started with the code from http://jsbeautifier.org/, which Rekna Anker had already ported to C#. From there, it was a matter of making sure that, for my purposes, the code generated the right output. I had to teach it C# idioms such as @foo, null coalescing and lambda expressions, but that sounds harder than it actually was. With that done, we got this output:

docs.Orders.Select(order => new {
    order = order,
    lastPayment = order.Payments.LastOrDefault()
}).Select(__h__TransparentIdentifier0 => new {
    Query = new System.Object[] {
        __h__TransparentIdentifier0.order.FirstName,
        __h__TransparentIdentifier0.order.LastName,
        __h__TransparentIdentifier0.order.OrderNumber,
        __h__TransparentIdentifier0.order.Email,
        __h__TransparentIdentifier0.order.Email.Split(new System.Char[] {
            '@'
        }),
        __h__TransparentIdentifier0.order.CompanyName,
        __h__TransparentIdentifier0.order.Payments.Select(payment => payment.PaymentIdentifier),
        __h__TransparentIdentifier0.order.LicenseIds
    },
    LastPaymentDate = __h__TransparentIdentifier0.lastPayment == null ? __h__TransparentIdentifier0.order.OrderedAt : __h__TransparentIdentifier0.lastPayment.At
})

And this is actually much better. Still not good enough, mind you; we can do better than that. It is a simple change:

docs.Orders.Select(order => new {
    order = order,
    lastPayment = order.Payments.LastOrDefault()
}).Select(this0 => new {
    Query = new System.Object[] {
        this0.order.FirstName,
        this0.order.LastName,
        this0.order.OrderNumber,
        this0.order.Email,
        this0.order.Email.Split(new System.Char[] {
            '@'
        }),
        this0.order.CompanyName,
        this0.order.Payments.Select(payment => payment.PaymentIdentifier),
        this0.order.LicenseIds
    },
    LastPaymentDate = this0.lastPayment == null ? this0.order.OrderedAt : this0.lastPayment.At
})

And now we have got something far more readable.
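For the curious, that last rename pass can be as simple as a single regex replace along these lines (a sketch; the actual implementation may differ):

using System.Text.RegularExpressions;

public static class IndexPrettyPrinter
{
    // __h__TransparentIdentifier0 -> this0, __h__TransparentIdentifier12 -> this12, ...
    public static string SimplifyTransparentIdentifiers(string indexSource)
    {
        return Regex.Replace(indexSource, @"__h__TransparentIdentifier(\d+)", "this$1");
    }
}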


Über Profiler v2.0–Private Beta is open

Well, it took a while, but the next version of the NHibernate Profiler is ready to go to a private beta.

We go to private beta before the product is done, because we want to get as much early feedback as we can.

We have 10 spots for NHibernate Profiler v2.0 and 10 spots for Entity Framework Profiler v2.0.

If you are interested, please send me an email about this.


The “features” that no one talks about make all the difference

You might have noticed that I am talking a lot about support and operations recently. This is because we have been doing a lot of work in that area, making sure that RavenDB is even more forthcoming and open about what is going on.

This week it has been about making sure that the shutdown phase of RavenDB is as transparent as it can be. Debugging those sorts of issues is a PITA, because you very rarely stop to consider them. But we got some feedback from customers about a common set of issues there.

Under some circumstances, shutting down RavenDB might take a while. We identified several things that can cause this, mostly indexing in progress.

The first thing we did was change the way we do indexing to allow aborting indexing during shutdown, but there is still a set of operations that might take a while and that we have to complete even in a shutdown scenario.

For example, we might be just in the middle of flushing to disk, and we really want that to complete successfully (otherwise we would need to run an expensive check on startup).

Therefore, we added this:

image

You’ll still have to wait, sure. But now if you watch the logs you can see why, and have some understanding about how long this is going to take.


Feature dependency chains

I wanted to improve the RavenDB query optimizer, so it can merge similar indexes into a single index.

It is actually fairly simple to do (conceptually). But that leaves us with the problem of what to do with the indexes that we just superseded.

The whole point here is to reduce the number of indexes, not to create an index that will never be used.

So the next step was to make sure that we can cleanup unused auto indexes. That is great…

Except that we don’t have any way of tracking unused indexes, so before we could do that, we have to add tracking for query times.

Okay, let us do that.

Wait, we don’t have a place to put those in, and anyway, we don’t want to save it all the time, that would mean IO for every query, and what about concurrent queries?

So we need to add a way to save this, and make sure that there is some batching going on in there, so we won’t go to the disk all the time. And wait, there is this other feature over there that needs something similar, so we might as well solve both problems at once…

Okay, let us add that.

Now, where were we again?

The annoying thing about this? You only realize that those are actually problems when you run some deep analysis on the actual feature request. And usually you have to go through multiple cycles to get it all out.

It is also very easy to go down one branch, all the way to the bottom, then go up to another branch, which was opened up because of the work you just did. Makes for a good workday, but makes it harder to predict what you are going to do today.


OH: This is beautiful

I just said that when I got this in the debugger:

image

I think that I need to get off the computer, and I’ll do so, just as soon as I am done with this feature.

The next BIG Thing in RavenDB 1.2

Is something that you probably wouldn’t even notice, to be perfectly honest. We are going to work on our map/reduce implementation.

This is freakishly complex, because we need to do updatable, persistent map/reduce. It got so complex that I decided that I can’t really implement this on my own in the RavenDB solution and moved to spiking the solution in isolation.

If you care, you can look at this here. There would be additional optimizations to worry about in RavenDB, but it is pretty much all there, in less than 400 lines of code.

I couldn’t find anything out there which was nearly as useful. Most of the map/reduce implementations are about distributing the work load and scheduling it. None of them really deal with the notion of updatable map/reduce results.

Note that the Storage layer there exists both to show that we can persist and restart from any point, and because it has critical behavior of its own (for example, scheduling).

I’ll probably do a set of posts about this, but for now, here is the source, have fun poking at it: https://github.com/ayende/updatable-persistent-map-reduce


I hate Hello World questions

That is what I call things like this:

image

It was part of a question I was asked, and the question contained things like:

I've a field (Field2) in MyClass as nested Dictionary and i have an index for Field 1.

There are several issues with this style of question:

  • It makes it harder to answer, because you have to keep a mapping of that in your head.
  • There is no meaning to the question; we can’t figure out whether this is a good or bad scenario.
  • Usually the problem description contains references to the real model, which I haven’t seen and don’t know anything about.

Annoying.


It isn’t a feature that is killing you, it is the intersection of features

Over time, projects get more features. And that is fine, as long as they are orthogonal features. It is when those features overlap that they are really putting the hurt on us.

For example, with the recent Changes API addition to RavenDB, one of the things that was really annoying is that in order to actually implement this feature, I had to implement it for:

  • Embedded
  • Client Server
  • Sharded
  • Silverlight

And that one is the easy case. We have bugs right now that are happening because people are using two or three bundles at the same time; each of them works fine on its own, but in conjunction, you get strange results.

What should happen when the Unique Constraints bundle creates an internal document when you have the Versioning bundle enabled? How about when we add replication to the mix?

I am not sure if I have any good ideas about the matter. Most of the things that we do are orthogonal to one another, but when used in combination, they actually have to know about their impact on other things.

My main worry is that as time goes by, we have more & more of those intersections. And that adds to the cost of maintaining and supporting the product.

Awesome RavenDB feature of the Day, Encryption

Another upcoming feature for the next RavenDB release is full encryption support. We got several requests for this feature from customers that are interested in using RavenDB to store highly critical data. Think financial records, health information, etc.

For those sorts of applications, there are often regulatory concerns about the data at rest. I’ll state upfront that I think that a lot of those regulations make absolutely no sense from a practical standpoint, but…

The end result is that we can never put any plaintext data on disk, which results in some interesting problems for our use case. To start with, it is fairly easy to go about encrypting documents. They are independent from one another and are usually read / written in a single chunk. In fact, several articles have been published on exactly how to do that. The problem is with indexes, which are not read as a whole; in fact, we have to support random reads through them.

In the currently released version, you can encrypt documents using a custom bundle, but you can’t get real encryption across the board. Things like in-flight documents, partial map/reduce data and the indexes themselves will not be encrypted, and will be saved in plain text format, even with a custom bundle.

In the next version of RavenDB (this feature will be available for the Enterprise version), we have made sure that all of that just works. Everything is encrypted, and there is no plain text data on disk for any reason. RavenDB will transparently encrypt/decrypt the data for you when it is actually sent to disk.

By default, we use AES-128 to encrypt the data (you can change that if you want, but there is a not insignificant performance hit if you move to AES-256, and AES-128 is just as secure, barring a quantum computer).

The funny part (or not so funny part) is that the actual process of encrypting the data was a relatively straightforward process. We had to spend a lot more time & effort on the actual management aspect of this feature.

For example, encryption requires an encryption key, so how do you manage that?

In RavenDB, we have two types of configuration: server wide, which is usually located in the App.config file, and database specific, which is located in the system database. For the App.config file, we support encrypting the file using DPAPI, via the standard .NET config file encryption system. For database specific values, we provide our own support for encrypting the values using DPAPI.
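To give a feel for the key management side, here is a minimal sketch of protecting an encryption key with DPAPI using the standard .NET API (illustrative only, not RavenDB's actual code):

using System.Security.Cryptography;

public static class EncryptionKeyProtection
{
    // Protect the raw key with DPAPI before writing it to configuration
    public static byte[] Protect(byte[] rawKey)
    {
        return ProtectedData.Protect(rawKey, null, DataProtectionScope.LocalMachine);
    }

    // Unprotect it again when the server needs to encrypt/decrypt data
    public static byte[] Unprotect(byte[] protectedKey)
    {
        return ProtectedData.Unprotect(protectedKey, null, DataProtectionScope.LocalMachine);
    }
}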

So, the end result is:

  • Your documents and indexes are encrypted when they are on disk using strong encryption.
  • You can use a server wide or database specific key for the encryption (for that matter, you turn on/off encryption at the database level).
  • Your encryption key is guarded using DPAPI.
  • Obviously, you should back up the encryption key, because we have no way of recovering your data without it.
  • The data is safely encrypted on disk, and the OS guarantees that no one can access the encryption key.

And, finally: you get to tick off the “no plaintext data at rest” checkbox and move on to doing actual feature development.


Self “joins” in RavenDB

A customer had an interesting problem on the mailing list. He had the following model:

public class Case
{
    public string Id { get; set; }
    public string[] CasesCitedByThisCase { get; set; }
    public string Name { get; set; }
    // other interesting properties
}

And he needed to be able to query and sort by the case’s properties, but also by its importance. A case’s importance is determined by how many other cases cite it.

This seems hard to do, because you cannot access other documents during standard indexing, so you have to use map/reduce to make this work. The problem is that it is really hard to do this with a plain map/reduce index, because you need two disparate sets of information from the documents.

This is why we have multi map, and we can create the appropriate index like this:

public class Cases_Search : AbstractMultiMapIndexCreationTask<Cases_Search.Result>
{
    public class Result
    {
        public string Id { get; set; }
        public int TimesCited { get; set; }
        public string Name { get; set; }
    }

    public Cases_Search()
    {
        AddMap<Case>(cases =>
                        from c in cases
                        select new
                        {
                            c.Id,
                            c.Name,
                            TimesCited = 0
                        }
            );

        AddMap<Case>(cases =>
                        from c in cases
                        from cited in c.CasesCitedByThisCase
                        select new
                        {
                            Id = cited,
                            Name = (string)null,
                            TimesCited = 1
                        }
            );

        Reduce = results =>
                    from result in results
                    group result by result.Id
                    into g
                    select new
                    {
                        Id = g.Key,
                        Name = g.Select(x => x.Name).FirstOrDefault(x => x != null),
                        TimesCited = g.Sum(x => x.TimesCited)
                    };
    }
}

We are basically doing two passes on the cases: one to get the actual case information, the other to get the information about the cases it cites. We then take all of that information and reduce it together, resulting in the output that we wanted.

The really nice thing about this? This being a RavenDB index, all of the work is done once, and queries on this are going to be really cheap.
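For example, finding the most cited cases might look something like this on the client (a sketch):

using (var session = store.OpenSession())
{
    // The reduce results are served straight from the index
    var mostCited = session.Query<Cases_Search.Result, Cases_Search>()
        .OrderByDescending(x => x.TimesCited)
        .Take(10)
        .ToList();
}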
