Ayende @ Rahien

Sep 17 2012

NuGet Perf, The Final Part – Load Testing – Source Code

time to read 1 min | 79 words

This is just some logistical cleanup.

The code for the entire series can be found here: https://github.com/ayende/nuget.perf

No, I will not do a similar SQL version. If you want to write one, I would be very interested in seeing it, but that isn’t something that I intend to do.

Yes, it is a simple and trivial implementation, but that was pretty much the whole point. Being able to get to that scale without actually doing anything special is what we strive for in RavenDB.

Sep 17 2012

NuGet Perf, The Final Part – Load Testing – Results ^ 2

time to read 4 min | 790 words

After seeing how well RavenDB does in perf testing, I decided to take it up a notch.

Starting from 10 users, with a step duration of 1 sec, add 50 users for each step, all the way to 3,000.
Start with a warm up period of 20 seconds, then run the test for 10 minutes.
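
As a rough sketch of what that step pattern implies, the ramp can be worked out with a few lines of arithmetic. Python is used here only as a calculator; the helper name `users_at` is made up for illustration, and the sketch assumes the load simply stops growing once it hits the maximum.

```python
import math

START_USERS = 10      # initial user count, from the settings above
STEP_USERS = 50       # users added on each step
STEP_SECONDS = 1      # step duration
MAX_USERS = 3000      # upper bound on the user load


def users_at(second: int) -> int:
    """Simulated user count `second` seconds into the ramp (hypothetical helper)."""
    steps_taken = second // STEP_SECONDS
    return min(START_USERS + STEP_USERS * steps_taken, MAX_USERS)


# Number of steps needed before the cap is reached.
steps_to_max = math.ceil((MAX_USERS - START_USERS) / STEP_USERS)

if __name__ == "__main__":
    for second in (0, 30, 59, 60):
        print(second, users_at(second))
    print("seconds to reach the maximum:", steps_to_max * STEP_SECONDS)
```

Under these assumptions the cap is reached about a minute into the run, so most of the ten minutes is spent at the full 3,000 users.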

Let us see what happens, okay?

Just to be clear, this is a RavenDB application running with three thousand concurrent users, on an off-the-shelf laptop, while I was busy doing other stuff.

One word of warning beforehand: because I ran everything on a single machine, just simulating that many users significantly slowed down how fast RavenDB could react. Basically, the code for managing the perf test took so many resources that RavenDB had to fight to get enough to actually answer the queries.

Scared yet? Here are the results in graph form.

Now you can actually see some fluctuations in the graphs. The number of users grows and grows until we get to 3,000, and we end up with an average response time of 0.37 seconds.

Again, I remind you, we have done zero optimizations and this is idiomatic RavenDB code. And we were able to serve requests at a frankly pretty amazing rate of speed.

And here they are in full detail:

Load Test Summary

Test Run Information

Load test name	LoadTest1
Description
Start time	04/09/12 15:28:48
End time	04/09/12 15:38:48
Warm-up duration	00:00:20
Duration	00:10:00
Controller	Local run
Number of agents	1
Run settings used	Load

Overall Results

Max User Load	3,000
Tests/Sec	196
Tests Failed	0
Avg. Test Time (sec)	14.3
Transactions/Sec	0
Avg. Transaction Time (sec)	0
Pages/Sec	741
Avg. Page Time (sec)	0.37
Requests/Sec	741
Requests Failed	0
Requests Cached Percentage	0
Avg. Response Time (sec)	0.37
Avg. Content Length (bytes)	3,080

Key Statistic: Top 5 Slowest Pages

URL (Link to More Details)	95% Page Time (sec)
Page 1	0.83
Page 0	0.82
Page 2	0.82
Page 1	0.82
http://localhost:52688/api/search	0.81

Key Statistic: Top 5 Slowest Tests

Name	95% Test Time (sec)
Browsing	20.8
BrowseAndSearch	19.8
Searching	12.9

Test Results

Name	Scenario	Total Tests	Avg. Test Time (sec)
Browsing	Load	31,843	17.4
BrowseAndSearch	Load	33,989	16.8
Searching	Load	51,650	10.8

Page Results

URL (Link to More Details)	Scenario	Test	Avg. Page Time (sec)	Count
Page 2	Load	Browsing	0.40	32,338
Search yui	Load	Searching	0.39	52,597
Page 1	Load	Browsing	0.39	32,627
http://localhost:52688/api/search	Load	BrowseAndSearch	0.39	68,576
Page 0	Load	Browsing	0.38	32,803
Search grid	Load	Searching	0.38	52,283
Page 1	Load	BrowseAndSearch	0.37	34,766
Page 0	Load	BrowseAndSearch	0.36	34,982
Search debug	Load	Searching	0.35	51,991
Search ravendb	Load	Searching	0.33	51,846

Transaction Results

Name	Scenario	Test	Response Time (sec)	Elapsed Time (sec)	Count

System Under Test Resources

Machine Name	% Processor Time	Available Memory at Test Completion (Mb)

Controller and Agents Resources

Machine Name	% Processor Time	Available Memory at Test Completion (Mb)
RAVEN	85.4	1,203

Errors

Type	Subtype	Count	Last Message

Note that the reason for the high CPU usage is that the tests and RavenDB were running on the same machine.
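
As a sanity check on the summary, Little's Law (average number in the system = throughput × average time in the system) can be applied to the reported averages. The sketch below uses Python purely as a calculator. It assumes the averages are close to steady-state figures, which is only roughly true because the first minute or so of the run is spent ramping up.

```python
# Figures copied from the "Overall Results" table above.
tests_per_sec = 196
avg_test_time_sec = 14.3
pages_per_sec = 741
avg_page_time_sec = 0.37
max_user_load = 3000

# Little's Law: N = X * R (throughput times average time in the system).
users_running_tests = tests_per_sec * avg_test_time_sec
requests_in_flight = pages_per_sec * avg_page_time_sec

print(f"users busy running a test: {users_running_tests:.0f} of {max_user_load}")
print(f"requests in flight       : {requests_in_flight:.0f}")
```

The first figure lands a little under the 3,000 user maximum, which suggests the summary is internally consistent. The second suggests that only a few hundred of those users are waiting on a response at any given instant, with the rest sitting between requests.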

Sep 17 2012

NuGet Perf, The Final Part – Loading Testing – Results

time to read 5 min | 862 words

The test was run locally (no network involved) on a Lenovo W520 laptop with 8 cores, 8 GB of RAM and an SSD. The storage engine we used was Esent, Safe Transactions. Default RavenDB configuration, running in console, with logging disabled.

We took the most obvious approach, both in the code we wrote and in how we tested it. I am pretty sure that I’ll get a lot of helpful suggestions about the load testing. The code is available here, and you are more than welcome to take it for a spin and get your own results. What is important for me to note is that we have done exactly zero performance tuning. That applies to the index we use, to the code that we wrote, to everything. I just wrote things down and didn’t worry about performance, even though this code is going to go through a load test.

Why don’t I worry about it? Because RavenDB is set up to do the Right Thing. It will optimize itself without you needing to take care of that.

With that said, here are the test results:

You can see that the red line is the number of users we have, and we have this worrying green line that seems to go crazy…

Except that this is actually the number of pages served. The part that we care about is actually the Avg. Page Time, and that is the blue line.

This line, however, is basically flat no matter the load.

Here are the test results in detail:

Load Test Summary

Test Run Information

Load test name	LoadTest1
Description
Start time	04/09/12 14:16:38
End time	04/09/12 14:21:38
Warm-up duration	00:00:20
Duration	00:05:00
Controller	Local run
Number of agents	1
Run settings used	Run Settings1

Overall Results

Max User Load	300
Tests/Sec	20.0
Tests Failed	0
Avg. Test Time (sec)	12.5
Transactions/Sec	0
Avg. Transaction Time (sec)	0
Pages/Sec	77.1
Avg. Page Time (sec)	0.0062
Requests/Sec	77.1
Requests Failed	0
Requests Cached Percentage	0
Avg. Response Time (sec)	0.0062
Avg. Content Length (bytes)	3,042

Key Statistic: Top 5 Slowest Pages

URL (Link to More Details)	95% Page Time (sec)
Page 0	0.018
Page 0	0.018
Page 2	0.014
http://localhost:52688/api/search	0.014
Search ravendb	0.014

Key Statistic: Top 5 Slowest Tests

Name	95% Test Time (sec)
Browsing	19.3
BrowseAndSearch	17.6
Searching	10.6

Test Results

Name	Scenario	Total Tests	Avg. Test Time (sec)
Browsing	Load	1,533	16.0
BrowseAndSearch	Load	1,685	15.0
Searching	Load	2,770	9.00

Page Results

URL (Link to More Details)	Scenario	Test	Avg. Page Time (sec)	Count
Page 0	Load	Browsing	0.0072	1,629
Page 0	Load	BrowseAndSearch	0.0071	1,783
http://localhost:52688/api/search	Load	BrowseAndSearch	0.0064	3,443
Search ravendb	Load	Searching	0.0064	2,798
Page 1	Load	Browsing	0.0063	1,617
Page 2	Load	Browsing	0.0063	1,580
Page 1	Load	BrowseAndSearch	0.0063	1,760
Search debug	Load	Searching	0.0055	2,810
Search grid	Load	Searching	0.0055	2,839
Search yui	Load	Searching	0.0054	2,866

Transaction Results

Name	Scenario	Test	Response Time (sec)	Elapsed Time (sec)	Count

System Under Test Resources

Machine Name	% Processor Time	Available Memory at Test Completion (Mb)

Controller and Agents Resources

Machine Name	% Processor Time	Available Memory at Test Completion (Mb)
RAVEN	13.0	1,356

Errors

Type	Subtype	Count	Last Message

You can dig in and look at the data, it is quite interesting. Under the load of 300 users, the average page response time was… 0.0062 seconds.

And RavenDB was using just 13% of the CPU, and that includes the agents running the tests.
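
For anyone who wants to dig into the data, the headline figures can be recomputed from the per-page table. The sketch below is a plain Python calculation over the numbers copied from "Page Results" and the five minute duration; nothing in it comes from the load testing tool itself.

```python
# (request name, test, avg page time in seconds, count), copied from "Page Results".
page_results = [
    ("Page 0", "Browsing", 0.0072, 1629),
    ("Page 0", "BrowseAndSearch", 0.0071, 1783),
    ("http://localhost:52688/api/search", "BrowseAndSearch", 0.0064, 3443),
    ("Search ravendb", "Searching", 0.0064, 2798),
    ("Page 1", "Browsing", 0.0063, 1617),
    ("Page 2", "Browsing", 0.0063, 1580),
    ("Page 1", "BrowseAndSearch", 0.0063, 1760),
    ("Search debug", "Searching", 0.0055, 2810),
    ("Search grid", "Searching", 0.0055, 2839),
    ("Search yui", "Searching", 0.0054, 2866),
]
DURATION_SEC = 5 * 60  # test duration, 00:05:00

total_pages = sum(count for _, _, _, count in page_results)
pages_per_sec = total_pages / DURATION_SEC
# Average page time weighted by how often each request was issued.
weighted_avg_page_time = sum(t * c for _, _, t, c in page_results) / total_pages

print(f"pages served  : {total_pages}")
print(f"pages/sec     : {pages_per_sec:.1f}")
print(f"avg page time : {weighted_avg_page_time:.4f} sec")
```

The recomputed throughput agrees with the reported Pages/Sec, and the weighted average comes out within rounding distance of the reported Avg. Page Time (the per-page figures are themselves rounded).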

In my next post, we will go totally crazy…

Sep 17 2012

NuGet Perf, The Final Part – Load Testing – The Tests

time to read 2 min | 243 words

For the tests, we used the VS 2012 load testing tool.

We defined the following tests:

Just browsing through the packages listing:

Browsing a bit then searching, and then narrowing the search:

And finally, searching a few packages by their id, tags, etc:

I then defined the following load test:

With the following distribution:

Finally, we have the way we actually run the test:

We get 20 seconds of warm up, then 5 minutes of tough load.

In my next post, we will see how we did.

Sep 17 2012

NuGet Perf, The Final Part – Load Testing – Setup

time to read 9 min | 1794 words

So, after talking for so long about the perf issues, here is the final part of this series, in which we actually take this for a spin using load testing.

I built a Web API application to serve as the test bed. It has a RavenController, which looks like this:

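using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using System.Web.Http;
using System.Web.Http.Controllers;
using Raven.Client;
using Raven.Client.Document;
using Raven.Client.Indexes;

public class RavenController : ApiController
{
    private static readonly object storeLock = new object();
    private static IDocumentStore documentStore;

    // Lazily creates the single document store shared by all requests.
    public static IDocumentStore DocumentStore
    {
        get
        {
            if (documentStore == null)
            {
                // Lock on a private object rather than on typeof(RavenController),
                // which any other code could also lock on.
                lock (storeLock)
                {
                    if (documentStore != null)
                        return documentStore;
                    var store = new DocumentStore
                        {
                            Url = "http://localhost:8080",
                            DefaultDatabase = "Nuget"
                        }.Initialize();
                    IndexCreation.CreateIndexes(typeof (Packages_Search).Assembly, store);
                    // Publish the store only after the indexes have been created.
                    documentStore = store;
                }
            }
            return documentStore;
        }
    }

    public IDocumentSession DocumentSession { get; set; }

    // Opens one session per request and saves any changes once the action completes.
    public override async Task<HttpResponseMessage> ExecuteAsync(HttpControllerContext controllerContext, CancellationToken cancellationToken)
    {
        using (DocumentSession = DocumentStore.OpenSession())
        {
            HttpResponseMessage result = await base.ExecuteAsync(controllerContext, cancellationToken);
            DocumentSession.SaveChanges();
            return result;
        }
    }
}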

And now we have the following controllers:

public class PackagesController : RavenController
{
    public IEnumerable<Packages_Search.ReduceResult> Get(int page = 0)
    {
        return DocumentSession.Query<Packages_Search.ReduceResult, Packages_Search>()
            .Where(x=>x.IsPrerelease == false)
            .OrderByDescending(x=>x.DownloadCount)
                .ThenBy(x=>x.Created)
            .Skip(page*30)
            .Take(30)
            .ToList();
    }
}

public class SearchController : RavenController
{
    public IEnumerable<Packages_Search.ReduceResult> Get(string q, int page = 0)
    {
        return DocumentSession.Query<Packages_Search.ReduceResult, Packages_Search>()
            .Search(x => x.Query, q)
            .Where(x => x.IsPrerelease == false)
            .OrderByDescending(x => x.DownloadCount)
                .ThenBy(x => x.Created)
            .Skip(page * 30)
            .Take(30)
            .ToList();
    }
}
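
To make the ordering and paging in those two queries concrete, here is a minimal in-memory sketch. It is written in Python so that it runs without a RavenDB server, and it is only a stand-in: the real queries are executed against the index, not over a list. The dictionary keys and the `get_page` helper are assumptions made for illustration.

```python
PAGE_SIZE = 30  # same page size as the controllers above


def get_page(rows, page=0, page_size=PAGE_SIZE):
    """Return one page of stable packages, most downloaded first."""
    stable = [r for r in rows if not r["is_prerelease"]]
    # Download count descending, then creation date ascending as the tie breaker.
    ordered = sorted(stable, key=lambda r: (-r["download_count"], r["created"]))
    start = page * page_size            # Skip(page * 30)
    return ordered[start:start + page_size]  # Take(30)


if __name__ == "__main__":
    # Made-up data: 100 packages, every tenth one a pre-release.
    rows = [
        {"package_id": f"pkg-{i}", "download_count": i % 50,
         "created": f"2012-01-{1 + i % 28:02d}", "is_prerelease": i % 10 == 0}
        for i in range(100)
    ]
    print([r["package_id"] for r in get_page(rows, page=0)][:3])
```

Asking for a page past the end of the data simply returns an empty list, which is also what `Skip` followed by `Take` does.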

And, just for completeness’ sake, the Packages_Search index looks like this:

public class Packages_Search : AbstractIndexCreationTask<Package, Packages_Search.ReduceResult>
{
    public class ReduceResult
    {
        public DateTime Created { get; set; }
        public int DownloadCount { get; set; }
        public string PackageId { get; set; }
        public bool IsPrerelease { get; set; }
        public object[] Query { get; set; }
    }

    public Packages_Search()
    {
        Map = packages => from p in packages
                          select new
                              {
                                  p.Created, 
                                  DownloadCount = p.VersionDownloadCount, 
                                  p.PackageId, 
                                  p.IsPrerelease,
                                  Query = new object[] { p.Tags, p.Title, p.PackageId}
                              };
        Reduce = results =>
                 from result in results
                 group result by new {result.PackageId, result.IsPrerelease}
                 into g
                 select new
                         {
                             g.Key.PackageId,
                             g.Key.IsPrerelease,
                             DownloadCount = g.Sum(x => x.DownloadCount),
                             Created = g.Select(x => x.Created).OrderBy(x => x).First(),
                             Query = g.SelectMany(x=>x.Query).Distinct()
                         };

        Store(x=>x.Query, FieldStorage.No);
    }
}
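
To make the behaviour of the index concrete, here is an in-memory sketch of the same map and reduce steps. It is written in Python so that it runs on its own, and it is only an illustration: RavenDB maintains the index in the background as documents change, while this sketch recomputes everything in a single pass. The dictionary keys are assumptions that mirror the property names above.

```python
from collections import defaultdict
from datetime import date


def map_package(p):
    """Map step: one index entry per package version."""
    return {
        "package_id": p["package_id"],
        "is_prerelease": p["is_prerelease"],
        "download_count": p["version_download_count"],
        "created": p["created"],
        "query": [p["tags"], p["title"], p["package_id"]],
    }


def reduce_entries(entries):
    """Reduce step: group by (package id, pre-release flag) and aggregate."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e["package_id"], e["is_prerelease"])].append(e)

    results = []
    for (package_id, is_prerelease), group in groups.items():
        query_terms = []
        for e in group:
            for term in e["query"]:
                if term not in query_terms:  # Distinct(), keeping first-seen order
                    query_terms.append(term)
        results.append({
            "package_id": package_id,
            "is_prerelease": is_prerelease,
            "download_count": sum(e["download_count"] for e in group),
            "created": min(e["created"] for e in group),  # oldest version
            "query": query_terms,
        })
    return results


if __name__ == "__main__":
    # Made-up versions of a single package.
    versions = [
        {"package_id": "Example", "is_prerelease": False, "version_download_count": 10,
         "created": date(2012, 1, 5), "tags": "sample", "title": "Example"},
        {"package_id": "Example", "is_prerelease": False, "version_download_count": 5,
         "created": date(2011, 6, 1), "tags": "sample", "title": "Example"},
    ]
    print(reduce_entries(map_package(v) for v in versions))
```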

That is enough setup. In the next post, I’ll discuss the actual structure of the load tests.

Sep 06 2012

NuGet Perf, Part VIII: Correcting a mistake and doing aggregations

time to read 4 min | 610 words

I hope this is the last one, because I can never recall what the next Roman numeral is.

At any rate, it has been pointed out to me that I made an error in importing the data. I assumed that the DownloadCount field that I got from the NuGet API is the download count for the specific package version, but it appears that this is the total download count, across all versions of this package. The actual download number for a specific version is VersionDownloadCount.

That changes things a bit, because the way NuGet sorts things is based on the total download count, not the download count for a specific version. The reason this complicates things is that we aren’t going to store the total download count in all the version documents. First, let us see the sort of query we need to write. In SQL, it would look like this:

select top 30 skip 30 
    Id,
    PackageId,
    Created,
    (select sum(VersionDownloadCount) from Packages all where all.PackageId = p.PackageId) as TotalDownloadsCount
from Packages p
where IsPrerelease = 0
order by TotalDownloadsCount desc, Created

This is a much simplified version of the real query, and most probably something that you can’t actually write this simply in SQL. But it gets the point across.

Note that in order to process this query, the RDBMS would have to first aggregate all of the data (for each row, mind), then do the paging, then give you the results. Sure, you can keep a counter for all the downloads for a package, but considering the fact that downloads are highly parallel and happen all the time, you would spend a lot of time waiting for writers to finish doing their updates.
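
For readers who want to run something close to that query, here is a small sketch using SQLite from Python. SQLite is used only because it ships with Python, and `LIMIT 30 OFFSET 30` stands in for the `top 30 skip 30` shorthand. The table layout and the rows are made up for illustration and are not the real NuGet schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Packages (
        Id INTEGER PRIMARY KEY,
        PackageId TEXT,
        Created TEXT,
        IsPrerelease INTEGER,
        VersionDownloadCount INTEGER
    )""")

# Made-up data: 40 packages with 3 versions each.
rows = [
    (f"pkg-{n:02d}", f"2012-01-{v + 1:02d}", 0, n * 10 + v)
    for n in range(40) for v in range(3)
]
conn.executemany(
    "INSERT INTO Packages (PackageId, Created, IsPrerelease, VersionDownloadCount) "
    "VALUES (?, ?, ?, ?)", rows)

# The correlated subquery recomputes the per-package total for every row,
# and only then can the rows be sorted and paged.
PAGE_QUERY = """
    SELECT p.Id, p.PackageId, p.Created,
           (SELECT SUM(v.VersionDownloadCount)
              FROM Packages v
             WHERE v.PackageId = p.PackageId) AS TotalDownloadsCount
      FROM Packages p
     WHERE p.IsPrerelease = 0
     ORDER BY TotalDownloadsCount DESC, p.Created
     LIMIT 30 OFFSET 30
"""
page = conn.execute(PAGE_QUERY).fetchall()
print(page[0], page[-1])
```

Even in this toy form, the total has to be computed for every candidate row before the sort and the paging can happen, which is the cost described above.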

Instead, with RavenDB, we are going to use a map/reduce index and query on that.

This should be fairly simple to follow. In the map we go over all the packages, and output their package id, whether they are a pre-release, the specific version download count and the date it was created.

In the reduce, we group by the package id and whether it was pre-released or not (I am assuming that we usually don’t want to show the pre-release stuff there).

Finally, we sum up all of the individual package downloads and we output the oldest created date. Using all of that, we can now move to the next step, and actually query that:

There is a small bug here, since I don’t see RavenDB in the results, but I guess I’ll have to wait until I get the updated data from NuGet.

Actually, that is not quite true. For pre-release software, we are pretty high up:

That explains much, RavenDB 1.2 is pretty awesome.

Sep 05 2012

NuGet Perf, Part VII AKA getting results is only half the work

time to read 4 min | 735 words

So far, we have been focusing on various ways to get the raw results from RavenDB: what packages match your queries, and whether we can be really smart about it.

But let us say that we got the results that we wanted. This is still just half the work, because we can give the user additional information about those results. In particular, in this post I am going to talk about facets.

Facets are a way to provide easily understood context to a search, allowing the user to quickly narrow down what he is looking for. In our case, let us take a look at what it takes to add facet support to our NuGet console app. The first thing to do, of course, is to actually define the facets we want to work on. In this case, we care only about the Tags:

using (var session = store.OpenSession())
{
    session.Store(new FacetSetup
        {
            Id = "facets/PackagesTags",
            Facets =
            {
                new Facet
                    {
                        Name = "Tags",
                        MaxResults = 4,
                        Mode = FacetMode.Default,
                        TermSortMode = FacetTermSortMode.HitsDesc
                    }
            },
        });
    session.SaveChanges();
}

When doing a faceted search using this document, we will use the Tags field, with one value for each term found. We want to get the top 4, sorted by their hits.
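
Conceptually, a facet like this boils down to counting, among the documents that match the query, how many carry each tag, and keeping the top few. The sketch below shows that idea in Python over an in-memory list. It is not how RavenDB computes facets internally, and the `tag_facets` helper and the sample data are made up for illustration.

```python
from collections import Counter


def tag_facets(results, max_results=4):
    """Count how many matching packages carry each tag; keep the top ones by hits."""
    hits = Counter(tag for package in results for tag in set(package["tags"]))
    return hits.most_common(max_results)


if __name__ == "__main__":
    # Made-up search results.
    results = [
        {"id": "packages/1", "tags": ["orm", "linq", "sql"]},
        {"id": "packages/2", "tags": ["orm", "linq", "sql"]},
        {"id": "packages/3", "tags": ["orm", "linq"]},
        {"id": "packages/4", "tags": ["orm", "dal"]},
    ]
    for tag, count in tag_facets(results):
        print(f"{tag} [{count:,}]")
```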

And here is how we are actually doing the faceted query:

var facetResults = q.ToFacets("facets/PackagesTags");
foreach (var result in facetResults.Results)
{
    Console.WriteLine();
    Console.Write("{0}:\t", result.Key);
    foreach (var val in result.Value.Values)
    {
        Console.Write("{0} [{1:#,#}] | ", val.Range, val.Hits);
    }
    Console.WriteLine();
}

It is a one liner, with all of the rest of the code dedicated to just printing things out.

Finally, here are the results:

As you can see, searching for “dal”, we can narrow the searches for linq, orm, etc. Searching for events, we get reactive extensions, etc.

Using facets gives the user additional information about his search (including things like “am I close to what I want?”), discoverability over your dataset, and additional tools to explore it.

All in all, I think that this is a pretty neat thing.

Sep 04 2012

NuGet Perf, Part VI AKA how to be the most popular dev around

time to read 8 min | 1490 words

So far, we have imported the NuGet data into RavenDB, seen how we can get it out for the packages page, and then looked into how we can utilize RavenDB features to help us with package search. I think we did a good job there, but we can probably do better still. In this post, I am going to stop showing off things in the Studio and focus on code. In particular, advanced searching options.

We will start from the simplest search possible. Or not, because we are doing full text search and quite a few other things aside even in the base line search. Anyway, here is the skeleton program:

while (true)
{
    Console.Write("Search: ");
    var search = Console.ReadLine();
    if(string.IsNullOrEmpty(search))
    {
        Console.Clear();
        continue;
    }
    using (var session = store.OpenSession())
    {
        var q = session.Query<PackageSearch>("Packages/Search")
            .Search(x => x.Query, search)
            .Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.IsPrerelease == false)
            .As<Package>()
            .OrderByDescending(x => x.DownloadCount).ThenBy(x => x.Created)
            .Take(3);
        var packages = q.ToList();

        foreach (var package in packages)
        {
            Console.WriteLine("\t{0}", package.Id);
        }
    }
}

Now, we are going to run this and see what we get.

So far, so good. Now let us try to improve things. What happens when we search for “jquryt”? Nothing is found, and that is actually pretty sad, because to a human, it is obvious what you are trying to search on.

If you have fat fingers and a tendency to creatively spell words, I am sure you can empathize with this feeling. Luckily for us, RavenDB is going to help. Let us see how:

What?!

How did it do that? Well, let us look at the changes in the code, shall we?

private static void PeformQuery(IDocumentSession session, string search, bool guessIfNoResultsFound = true)
{
    var packages = session.Query<PackageSearch>("Packages/Search")
        .Search(x => x.Query, search)
        .Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.IsPrerelease == false)
        .As<Package>()
        .OrderByDescending(x => x.DownloadCount).ThenBy(x => x.Created)
        .Take(3).ToList();

    if (packages.Count > 0)
    {
        foreach (var package in packages)
        {
            Console.WriteLine("\t{0}", package.Id);
        }
    }
    else if(guessIfNoResultsFound)
    {
        DidYouMean(session, search);
    }
    else
    {
        Console.WriteLine("\tNo search results were found");
    }
}

The only major change was the call to DidYouMean(), so let us see what is going on in there.

private static void DidYouMean(IDocumentSession session, string search)
{
    var suggestionQueryResult = session.Query<PackageSearch>("Packages/Search")
        .Search(x => x.Query, search)
        .Suggest();
    switch (suggestionQueryResult.Suggestions.Length)
    {
        case 0:
            Console.WriteLine("\tNo search results were found");
            break;
        case 1:
            // we may have it filtered because of the other conditions, don't recurse again
            Console.WriteLine("\tSearch corrected to: {0}", suggestionQueryResult.Suggestions[0]);
            Console.WriteLine();

            PeformQuery(session, suggestionQueryResult.Suggestions[0], guessIfNoResultsFound: false);
            break;
        default:
            Console.WriteLine("\tDid you mean?");
            foreach (var suggestion in suggestionQueryResult.Suggestions)
            {
                Console.WriteLine("\t - {0} ?", suggestion);
            }
            break;
    }
}

Here, we ask RavenDB, “we couldn’t find anything with what we had, can you give me some other ideas?” RavenDB can check the actual data that we have on disk and suggest similar alternatives.

In essence, we asked RavenDB for what is nearby, and it provided us with some useful suggestions. Because the suggestions are actually based on the data we have in the db, searches on that will produce correct results.
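
The idea of suggesting from the terms that actually exist can be sketched without a database. The example below swaps in Python's standard `difflib` module for RavenDB's suggestion feature, so the similarity measure is not the one RavenDB uses, and the list of indexed terms is made up. It mirrors the three branches of DidYouMean(): nothing found, a single correction, or several candidates.

```python
import difflib


def did_you_mean(search, indexed_terms, max_suggestions=3):
    """Return (kind, suggestions) based on terms that really exist in the index."""
    suggestions = difflib.get_close_matches(search, indexed_terms, n=max_suggestions)
    if not suggestions:
        return "no-results", []
    if len(suggestions) == 1:
        return "corrected", suggestions   # one candidate: search for it right away
    return "choices", suggestions         # several candidates: let the user pick


if __name__ == "__main__":
    terms = ["jquery", "json", "ravendb", "nhibernate"]  # hypothetical indexed terms
    print(did_you_mean("jquryt", terms))
    print(did_you_mean("zzzz", terms))
```

Because every suggestion is drawn from the indexed terms, searching for a suggestion is guaranteed to hit at least one term in the index, although other filters in the query may still remove the matching documents.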

Note that we have three code paths here. If there is exactly one suggestion, we are going to select that immediately. Let us see what this looks like in practice:

Users tend to fall in love with this sort of feature, and with RavenDB you can provide it in just a few lines of code and with absolutely no hassle.

In my next post (and probably the last in this series) we will discuss even more awesome search features.

Sep 03 2012

NuGet Perf, Part V–Searching Packages

time to read 10 min | 1810 words

Now we get to the good parts, actually doing searches for Packages, not just showing them in packages page, but doing complex and interesting searches. The current (after optimization) query looks like this:

SELECT        TOP (30)
       -- fields removed for brevity
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                                AND    Packages.IsLatestStable = 1
                                AND    Packages.IsLatest = 1
                                AND    (
                                        PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~'
                                    OR    PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Title LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Title LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Tags LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Tags LIKE '%ui%' ESCAPE N'~'
                                    )
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

I can hear the DB whimpering in fear in a dark corner, where it is hiding while it isn’t being flogged by cruel and unusual queries.

Okay, there is a certain amount of hyperbole here, I’ll admit. But at least it is funny.

At any rate, here we have a query that allows the user to search for the latest stable packages by their id, title or tags. To make things interesting for the DB, all queries are using the ‘%jquery%’ form. This is something that practically every single resource you can find about databases will warn you against. You can read why here. I think we can safely assume that the NuGet guys do not use EF Prof, or they wouldn’t go this route.

Actually, I am being unfair here. There really aren’t many other good options when you start to need this sort of thing. Yes, I know of SQL Server Full Text Indexes; they are complex to set up and maintain, and they don’t provide enough facilities to do interesting stuff. They are also more complex to program against. You could maintain your own indexes on the side (Lucene, Fast, etc). Now you have triple the amount of work that you have to do, and the care and maintenance of those isn’t trivial, for either the devs or the ops team.

So I can certainly follow why the decision was made to use LIKE ‘%jquery%’, even though it is a well known problem.

That said, it is the wrong tool for the job, and I think that RavenDB can do a lot more and in more interesting ways as well.

Let us see the index that can handle this sort of query.

What does this index do?

Well, it indexes a bunch of fields to allow them to be searched for by value, but it also does something else that is quite interesting. The Query field in the index takes information from several different fields that are all indexed as one. We also specify that this index will treat the Query field as the target for full text analysis. This means that we can now write the following query:

In code, this would look like this:

var results = session.Query<Package_Search.Request, Package_Search>()
    .Where(x=> x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.IsPrerelease == false)
    .Search(x=>x.Query, userSearchTerms)
    .OrderByDescending(x=>x.DownloadCount).ThenBy(x=>x.Created)
    .Take(30)
    .As<Package>()
    .ToList();

This will generate the query you can see above, and return the first 30 results.

But a lot more is actually happening here, let us look at what actually goes on in the index:

Here you can see the actual terms that were indexed in the database for each of the documents. The reason that this is important is that when the time comes to do searches, we aren’t going to need to do anything as crass as a full table scan, which is what SQL has to do. Instead, all of those terms are located in an index, and we have the <<jquery ui>> search string. We can then do a very simple index lookup (the cost of that is O(logN), if you’ll recall) to find your results.
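
The difference between scanning every row and looking terms up in an index can be sketched in a few lines. The example below is a toy inverted index in Python: the tokenizer is a very rough stand-in for a real full text analyzer, the document ids and field values are made up, and a binary search over a sorted term list stands in for the index lookup.

```python
import bisect
import re
from collections import defaultdict


def tokenize(text):
    """Rough stand-in for an analyzer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build a sorted term list plus a term -> set of document ids map."""
    postings = defaultdict(set)
    for doc_id, fields in documents.items():
        for field in fields:
            for term in tokenize(field):
                postings[term].add(doc_id)
    return sorted(postings), postings


def lookup(term, sorted_terms, postings):
    """Binary search over the sorted terms: O(log N) in the number of terms."""
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return postings[term]
    return set()


def search(query, sorted_terms, postings):
    """Return the documents that contain any of the query terms."""
    found = set()
    for term in tokenize(query):
        found |= lookup(term, sorted_terms, postings)
    return found


if __name__ == "__main__":
    docs = {
        "packages/1": ["jQuery.UI.Combined", "jquery ui"],
        "packages/2": ["NHibernate", "orm"],
        "packages/3": ["jQuery", "javascript"],
    }
    terms, postings = build_index(docs)
    print(sorted(search("jquery ui", terms, postings)))
```

One design consequence worth noting: a term lookup matches whole terms, so `ui` will not match a word that merely contains those letters, which a `LIKE '%ui%'` scan would.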

And of course, we have this guy:

So I am pretty happy about this so far, but we can probably do better. We will see how in our next post.

Aug 31 2012

NuGet Perf, Part IV–Modeling the packages

time to read 3 min | 405 words

Before we move on to discussing how to implement package search, I wanted to take a bit of time to discuss the way we structured the data. There are a bunch of properties that feel very relational in nature. In particular, these two properties:

Tags: Ian_Mercer Natural_Language Abodit NLP
Dependencies: AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11

In the current version of NuGet, those properties are actually stored as symbol separated strings. The reason for that? In relational databases, if you want to have a collection, you have to have another table, then join to it, then take care of it, and wake up in the middle of the night to take it for a walk. So people go the obvious route and just concatenate strings and hope for the best. Note that in the dependencies case, we have multi-level concatenation.

In RavenDB, we have full fledged support for storing complex objects, so the tags above will become:

And what about the dependencies? Those we store in an array of complex objects, like so:

RavenDB allows us to store the model in a way that is easy on the eye, natural to work with, and in general makes our lives easier.
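
As a sketch of the conversion involved, the two symbol separated strings shown earlier can be split into structured values like this. The example is plain Python over the sample values from this post; the `Id` and `Version` field names for a dependency are assumptions made for illustration, not necessarily the names used in the actual documents.

```python
def parse_tags(raw):
    """Split the space separated tag string into a list."""
    return raw.split()


def parse_dependencies(raw):
    """Split 'Id:Version|Id:Version' into a list of objects (field names assumed)."""
    dependencies = []
    for entry in raw.split("|"):
        if not entry:
            continue
        package_id, _, version = entry.partition(":")
        dependencies.append({"Id": package_id, "Version": version})
    return dependencies


tags = parse_tags("Ian_Mercer Natural_Language Abodit NLP")
dependencies = parse_dependencies(
    "AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11")

# "Does this package use log4net?" becomes an exact match on a field
# instead of a substring scan over a concatenated string.
uses_log4net = any(d["Id"] == "log4net" for d in dependencies)

print(tags)
print(dependencies)
print(uses_log4net)
```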

Let us say that I wanted to add a feature to NuGet, “show me all the packages that use this package”?

And allow me to brag a little bit?

By the way, just to be sure that everyone has a full grasp of what is going on, I am writing this post at 30,000 feet. The laptop I am using is NOT connected to power, and the data set that I am using is the full NuGet dataset.

Compare the results you get from RavenDB to what you have to do in SQL: Dependencies LIKE ‘%log4net%’

You can kiss your performance goodbye with this sort of query.

Oren Eini

CEO of RavenDB
