NuGet Perf Problems, part I
It appears that NuGet has some perf problems recently and Jeff Handley posted the problematic queries as well as the new, hand optimized queries.
You can see the original problematic queries and the optimized code (still tentative) here.
Hand optimized query to load the pages for the packages page:
SELECT TOP (30) Paged.PackageRegistrationKey , Paged.Id , Paged.Version , Packages.FlattenedAuthors , Packages.Copyright , Packages.Created , Packages.FlattenedDependencies , Packages.Description , PackageRegistrations.DownloadCount , Packages.ExternalPackageUrl , N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1 , Packages.IconUrl , Packages.IsLatestStable , Packages.Language , Packages.LastUpdated , Packages.LicenseUrl , Packages.Hash , Packages.HashAlgorithm , Packages.PackageFileSize , Packages.ProjectUrl , CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2 , Packages.ReleaseNotes , N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3 , Packages.RequiresLicenseAcceptance , Packages.Summary , CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4 , ISNULL(Packages.Title, PackageRegistrations.Id) AS C5 , Packages.DownloadCount AS DownloadCount1 , cast(0 as float(53)) AS C6 FROM ( SELECT Filtered.Id , Filtered.PackageRegistrationKey , Filtered.Version , Filtered.DownloadCount , row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number] FROM ( SELECT PackageRegistrations.Id , Packages.PackageRegistrationKey , Packages.Version , PackageRegistrations.DownloadCount FROM Packages INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey WHERE Packages.IsPrerelease <> cast(1 as bit) ) Filtered ) Paged INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey INNER JOIN Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version WHERE Paged.[row_number] > 30 ORDER BY PackageRegistrations.DownloadCount DESC , Paged.Id
This monster query is actually translated to something like:
Give me the top 30 packages which are not pre released, ordered by the download count and then by their id.
It takes a great deal of complexity to deal with that for one major reason, the data is split up across multiple tables in a way that make it hard get all of it easily. The minor reason is that there is really no good way to do paging in SQL Server (shocking, I know). One would assume that such a basic feature would have a bit more attention.
What is worse is the optimized version of the search feature:
SELECT TOP (30) Paged.PackageRegistrationKey , Paged.Id , Paged.Version , Packages.FlattenedAuthors , Packages.Copyright , Packages.Created , Packages.FlattenedDependencies , Packages.Description , PackageRegistrations.DownloadCount , Packages.ExternalPackageUrl , N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1 , Packages.IconUrl , Packages.IsLatestStable , Packages.Language , Packages.LastUpdated , Packages.LicenseUrl , Packages.Hash , Packages.HashAlgorithm , Packages.PackageFileSize , Packages.ProjectUrl , CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2 , Packages.ReleaseNotes , N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3 , Packages.RequiresLicenseAcceptance , Packages.Summary , CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4 , ISNULL(Packages.Title, PackageRegistrations.Id) AS C5 , Packages.DownloadCount AS DownloadCount1 , cast(0 as float(53)) AS C6 FROM ( SELECT Filtered.Id , Filtered.PackageRegistrationKey , Filtered.Version , Filtered.DownloadCount , row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number] FROM ( SELECT PackageRegistrations.Id , Packages.PackageRegistrationKey , Packages.Version , PackageRegistrations.DownloadCount FROM Packages INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey WHERE ((((Packages.IsPrerelease <> cast(1 as bit))))) ((((AND Packages.IsLatestStable = 1)))) ((((AND Packages.IsLatest = 1)))) AND ( PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~' OR PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~' OR Packages.Title LIKE '%jquery%' ESCAPE N'~' OR Packages.Title LIKE '%ui%' ESCAPE N'~' OR Packages.Tags LIKE '%jquery%' ESCAPE N'~' OR Packages.Tags LIKE '%ui%' ESCAPE N'~' ) ) Filtered ) Paged INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey INNER JOIN Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version WHERE Paged.[row_number] > 30 ORDER BY PackageRegistrations.DownloadCount DESC , Paged.IdOne thing that immediately popped up to me was the use of queries such as "’%jquery%’. This is a flat out killer for performance in relational databases, since they cannot do any indexes on this and are forced to do a table scan.
I decided to take a stub at moving the NuGet data to RavenDB, which is a much better fit (in my own obviously utterly unbiased opinion). In the next post, we will start with the actual import process, then we get to actual queries.
Comments
Only took until MSSQL 2012 to get some paging using OFFSET and FETCH. http://www.dbadiaries.com/new-t-sql-features-in-sql-server-2012-offset-and-fetch/
Apparently that's the ANSI standard way of handling it.
Not that the paging helps with the expensive wildcard filter, or having to perform all those joins before the server knows which row number any row actually is.
Looking at the original queries, I am wondering why the new ones are faster. They introduce self-joins, which were not present previously and move scalar computations to the top level - I see no major optimization here.
I strongly suspect that the dialog also does client side paging which probably doesn't help.
@Dooh,
The new queries are better because they include a tiny subset of the data in the inner-most select. We just grab the key fields and sorting field, then we sort on that, apply paging to get the records that need to be returned, and then we join back to the table to get the rest of the data.
In the original queries, the entire table is brought through each of the inner selects and SQL was granting enough memory to hold the entire contents of the table for the sort and page operations.
@Ayende,
Note that there's not an actual table scan, but there are index scans. The execution plan of the rewritten query can be seen here:
http://www.twitpic.com/aok5ao/full
I thing that NuGet site don't touch the database for 99% of requests, or shouldn't
Kinda shocking they don't take advantage of SQL full text indexing. Then again I'm not sure how one would get there with the entity framework short of stored procs.
Would also be interesting to know how the other package managers store stuff -- aptitude, ruby gems and the like.
Hardcoded URLs in query, weird. No FTS, also weird.
These past 6 months or so I've been working on NuGet.Server, the default web application for hosting private/internal nuget feeds for your company. The development of that project follows a similar approach of using Lucene to get high quality search and speed. For anyone using NuGet.Server, you can find my custom builds at https://github.com/themotleyfool/NuGet/downloads
Jeff, Which query plan this is? The search or the package page? I guess that SQL can do an index scan for LIKE, but that is still pretty expensive thing to do.
@Ayende It's for the search query. By having Tags in the index, we get the index scan instead of the table scan. But yes, the index scan is still somewhat expensive overall.
As stated on the NuGet website, there are only 48 000 packages in its database so even a table scan should be fast.
This seems rather silly to me. Just run the query once an hour and dump the results in a reporting table. Fixing this kind of thing should be a no-brainer for any junior DBA.
Seems like you really need a proper fulltext search engine like Solr (Lucene) http://lucene.apache.org/solr/. I've implemented it before in no time at all, host it in windows and have to say that it is blindingly fast.... yes it's Java, but it's just brilliant in terms of performance.
As you're using .NET I'd thoroughly recommend this library for interfacing to Solr (https://github.com/mausch/SolrNet), it's extremely well built and maintained
How or where can I download the NuGet database? I would like to try to optimize that query. I have optimized a lot of SQL Server queries in my job with amazing results, and often far more complex than that one.
Jesus, It is available through OData, I simply sucked all of that through OData queries.
@Ayende This "d" you're mentioning actually is a security measure, check out the details: http://haacked.com/archive/2008/11/20/anatomy-of-a-subtle-json-vulnerability.aspx
@paul
sql server has full text search build in. It's automatic and also very fast. The amount of rows we're talking about all fit in memory so any query, even these dreadful EF queries will be quick.
@Frans Yeah I appreciate that SQL Server has FT search, however after using it myself and having done substantial work with Solr I've found the SQL implementation to be quite cumbersome and no where near as powerful to configure. The usability benefits you get from Solr really set it apart.
When solving tech problems I tend to use a "best tool for the job" approach and not just stick with something because it's familiar to me (not saying that you advocate that BTW).
Anyway, if you get time to play with it I think you'll be pleasantly surprised
Cheers!
Comment preview