NuGet Perf Problems, part I

time to read 14 min | 2742 words

It appears that NuGet has some perf problems recently and Jeff Handley posted the problematic queries as well as the new, hand optimized queries.

You can see the original problematic queries and the optimized code (still tentative) here.

Hand optimized query to load the pages for the packages page:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

This monster query is actually translated to something like:

Give me the top 30 packages which are not pre released, ordered by the download count and then by their id.

It takes a great deal of complexity to deal with that for one major reason, the data is split up across multiple tables in a way that make it hard get all of it easily. The minor reason is that there is really no good way to do paging in SQL Server (shocking, I know). One would assume that such a basic feature would have a bit more attention.

What is worse is the optimized version of the search feature:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        ((((Packages.IsPrerelease <> cast(1 as bit)))))
                                ((((AND    Packages.IsLatestStable = 1))))
                                ((((AND    Packages.IsLatest = 1))))
                                AND    (
                                        PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~'
                                    OR    PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Title LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Title LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Tags LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Tags LIKE '%ui%' ESCAPE N'~'
                                    )
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

One thing that immediately popped up to me was the use of queries such as "’%jquery%’. This is a flat out killer for performance in relational databases, since they cannot do any indexes on this and are forced to do a table scan.

I decided to take a stub at moving the NuGet data to RavenDB, which is a much better fit (in my own obviously utterly unbiased opinion). In the next post, we will start with the actual import process, then we get to actual queries.

Tweet Share Share 18 comments

Tags:

Comments

28 Aug 2012
09:49 AM

Chris Chilvers

Only took until MSSQL 2012 to get some paging using OFFSET and FETCH. http://www.dbadiaries.com/new-t-sql-features-in-sql-server-2012-offset-and-fetch/

Apparently that's the ANSI standard way of handling it.

Not that the paging helps with the expensive wildcard filter, or having to perform all those joins before the server knows which row number any row actually is.

28 Aug 2012
09:49 AM

Dooh

Looking at the original queries, I am wondering why the new ones are faster. They introduce self-joins, which were not present previously and move scalar computations to the top level - I see no major optimization here.

28 Aug 2012
09:54 AM

flukus

I strongly suspect that the dialog also does client side paging which probably doesn't help.

28 Aug 2012
11:23 AM

Jeff Handley

@Dooh,

The new queries are better because they include a tiny subset of the data in the inner-most select. We just grab the key fields and sorting field, then we sort on that, apply paging to get the records that need to be returned, and then we join back to the table to get the rest of the data.

In the original queries, the entire table is brought through each of the inner selects and SQL was granting enough memory to hold the entire contents of the table for the sort and page operations.

@Ayende,

Note that there's not an actual table scan, but there are index scans. The execution plan of the rewritten query can be seen here:

http://www.twitpic.com/aok5ao/full

28 Aug 2012
11:44 AM

Fujiy

I thing that NuGet site don't touch the database for 99% of requests, or shouldn't

28 Aug 2012
11:52 AM

Wyatt Barnett

Kinda shocking they don't take advantage of SQL full text indexing. Then again I'm not sure how one would get there with the entity framework short of stored procs.

Would also be interesting to know how the other package managers store stuff -- aptitude, ruby gems and the like.

28 Aug 2012
13:11 PM

Andrey Shchekin

Hardcoded URLs in query, weird. No FTS, also weird.

28 Aug 2012
14:45 PM

Chris Eldredge

These past 6 months or so I've been working on NuGet.Server, the default web application for hosting private/internal nuget feeds for your company. The development of that project follows a similar approach of using Lucene to get high quality search and speed. For anyone using NuGet.Server, you can find my custom builds at https://github.com/themotleyfool/NuGet/downloads

28 Aug 2012
16:15 PM

Ayende Rahien

Jeff, Which query plan this is? The search or the package page? I guess that SQL can do an index scan for LIKE, but that is still pretty expensive thing to do.

28 Aug 2012
16:43 PM

Jeff Handley

@Ayende It's for the search query. By having Tags in the index, we get the index scan instead of the table scan. But yes, the index scan is still somewhat expensive overall.

28 Aug 2012
19:11 PM

Rafal

As stated on the NuGet website, there are only 48 000 packages in its database so even a table scan should be fast.

29 Aug 2012
06:25 AM

Jonathan Allen

This seems rather silly to me. Just run the query once an hour and dump the results in a reporting table. Fixing this kind of thing should be a no-brainer for any junior DBA.

29 Aug 2012
07:23 AM

paul carroll

Seems like you really need a proper fulltext search engine like Solr (Lucene) http://lucene.apache.org/solr/. I've implemented it before in no time at all, host it in windows and have to say that it is blindingly fast.... yes it's Java, but it's just brilliant in terms of performance.

As you're using .NET I'd thoroughly recommend this library for interfacing to Solr (https://github.com/mausch/SolrNet), it's extremely well built and maintained

29 Aug 2012
08:29 AM

Jesús López

How or where can I download the NuGet database? I would like to try to optimize that query. I have optimized a lot of SQL Server queries in my job with amazing results, and often far more complex than that one.

29 Aug 2012
08:52 AM

Ayende Rahien

Jesus, It is available through OData, I simply sucked all of that through OData queries.

29 Aug 2012
14:33 PM

geekbeing

@Ayende This "d" you're mentioning actually is a security measure, check out the details: http://haacked.com/archive/2008/11/20/anatomy-of-a-subtle-json-vulnerability.aspx

30 Aug 2012
09:20 AM

Frans Bouma

@paul

sql server has full text search build in. It's automatic and also very fast. The amount of rows we're talking about all fit in memory so any query, even these dreadful EF queries will be quick.

02 Sep 2012
01:48 AM

paul carroll

@Frans Yeah I appreciate that SQL Server has FT search, however after using it myself and having done substantial work with Solr I've found the SQL implementation to be quite cumbersome and no where near as powerful to configure. The usability benefits you get from Solr really set it apart.

When solving tech problems I tend to use a "best tool for the job" approach and not just stick with something because it's familiar to me (not saying that you advocate that BTW).

Anyway, if you get time to play with it I think you'll be pleasantly surprised

Cheers!

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB