Ayende @ Rahien

Refunds available at head office

NuGet Perf Problems, part I

It appears that NuGet has some perf problems recently and Jeff Handley posted the problematic queries as well as the new, hand optimized queries.

You can see the original problematic queries and the optimized code (still tentative) here.

Hand optimized query to load the pages for the packages page:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id
 

This monster query is actually translated to something like:

Give me the top 30 packages which are not pre released, ordered by the download count and then by their id.

It takes a great deal of complexity to deal with that for one major reason, the data is split up across multiple tables in a way that make it hard get all of it easily. The minor reason is that there is really no good way to do paging in SQL Server (shocking, I know). One would assume that such a basic feature would have a bit more attention.

What is worse is the optimized version of the search feature:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        ((((Packages.IsPrerelease <> cast(1 as bit)))))
                                ((((AND    Packages.IsLatestStable = 1))))
                                ((((AND    Packages.IsLatest = 1))))
                                AND    (
                                        PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~'
                                    OR    PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Title LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Title LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Tags LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Tags LIKE '%ui%' ESCAPE N'~'
                                    )
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

One thing that immediately popped up to me was the use of queries such as "’%jquery%’. This is a flat out killer for performance in relational databases, since they cannot do any indexes on this and are forced to do a table scan.

I decided to take a stub at moving the NuGet data to RavenDB, which is  a much better fit (in my own obviously utterly unbiased opinion). In the next post, we will start with the actual import process, then we get to actual queries.

Comments

Chris Chilvers
08/28/2012 09:49 AM by
Chris Chilvers

Only took until MSSQL 2012 to get some paging using OFFSET and FETCH. http://www.dbadiaries.com/new-t-sql-features-in-sql-server-2012-offset-and-fetch/

Apparently that's the ANSI standard way of handling it.

Not that the paging helps with the expensive wildcard filter, or having to perform all those joins before the server knows which row number any row actually is.

Dooh
08/28/2012 09:49 AM by
Dooh

Looking at the original queries, I am wondering why the new ones are faster. They introduce self-joins, which were not present previously and move scalar computations to the top level - I see no major optimization here.

flukus
08/28/2012 09:54 AM by
flukus

I strongly suspect that the dialog also does client side paging which probably doesn't help.

Jeff Handley
08/28/2012 11:23 AM by
Jeff Handley

@Dooh,

The new queries are better because they include a tiny subset of the data in the inner-most select. We just grab the key fields and sorting field, then we sort on that, apply paging to get the records that need to be returned, and then we join back to the table to get the rest of the data.

In the original queries, the entire table is brought through each of the inner selects and SQL was granting enough memory to hold the entire contents of the table for the sort and page operations.

@Ayende,

Note that there's not an actual table scan, but there are index scans. The execution plan of the rewritten query can be seen here:

http://www.twitpic.com/aok5ao/full

Fujiy
08/28/2012 11:44 AM by
Fujiy

I thing that NuGet site don't touch the database for 99% of requests, or shouldn't

Wyatt Barnett
08/28/2012 11:52 AM by
Wyatt Barnett

Kinda shocking they don't take advantage of SQL full text indexing. Then again I'm not sure how one would get there with the entity framework short of stored procs.

Would also be interesting to know how the other package managers store stuff -- aptitude, ruby gems and the like.

Andrey Shchekin
08/28/2012 01:11 PM by
Andrey Shchekin

Hardcoded URLs in query, weird. No FTS, also weird.

Chris Eldredge
08/28/2012 02:45 PM by
Chris Eldredge

These past 6 months or so I've been working on NuGet.Server, the default web application for hosting private/internal nuget feeds for your company. The development of that project follows a similar approach of using Lucene to get high quality search and speed. For anyone using NuGet.Server, you can find my custom builds at https://github.com/themotleyfool/NuGet/downloads

Ayende Rahien
08/28/2012 04:15 PM by
Ayende Rahien

Jeff, Which query plan this is? The search or the package page? I guess that SQL can do an index scan for LIKE, but that is still pretty expensive thing to do.

Jeff Handley
08/28/2012 04:43 PM by
Jeff Handley

@Ayende It's for the search query. By having Tags in the index, we get the index scan instead of the table scan. But yes, the index scan is still somewhat expensive overall.

Rafal
08/28/2012 07:11 PM by
Rafal

As stated on the NuGet website, there are only 48 000 packages in its database so even a table scan should be fast.

Jonathan Allen
08/29/2012 06:25 AM by
Jonathan Allen

This seems rather silly to me. Just run the query once an hour and dump the results in a reporting table. Fixing this kind of thing should be a no-brainer for any junior DBA.

paul carroll
08/29/2012 07:23 AM by
paul carroll

Seems like you really need a proper fulltext search engine like Solr (Lucene) http://lucene.apache.org/solr/. I've implemented it before in no time at all, host it in windows and have to say that it is blindingly fast.... yes it's Java, but it's just brilliant in terms of performance.

As you're using .NET I'd thoroughly recommend this library for interfacing to Solr (https://github.com/mausch/SolrNet), it's extremely well built and maintained

Jesús López
08/29/2012 08:29 AM by
Jesús López

How or where can I download the NuGet database? I would like to try to optimize that query. I have optimized a lot of SQL Server queries in my job with amazing results, and often far more complex than that one.

Ayende Rahien
08/29/2012 08:52 AM by
Ayende Rahien

Jesus, It is available through OData, I simply sucked all of that through OData queries.

geekbeing
08/29/2012 02:33 PM by
geekbeing

@Ayende This "d" you're mentioning actually is a security measure, check out the details: http://haacked.com/archive/2008/11/20/anatomy-of-a-subtle-json-vulnerability.aspx

Frans Bouma
08/30/2012 09:20 AM by
Frans Bouma

@paul

sql server has full text search build in. It's automatic and also very fast. The amount of rows we're talking about all fit in memory so any query, even these dreadful EF queries will be quick.

paul carroll
09/02/2012 01:48 AM by
paul carroll

@Frans Yeah I appreciate that SQL Server has FT search, however after using it myself and having done substantial work with Solr I've found the SQL implementation to be quite cumbersome and no where near as powerful to configure. The usability benefits you get from Solr really set it apart.

When solving tech problems I tend to use a "best tool for the job" approach and not just stick with something because it's familiar to me (not saying that you advocate that BTW).

Anyway, if you get time to play with it I think you'll be pleasantly surprised

Cheers!

Comments have been closed on this topic.