Ayende @ Rahien

Refunds available at head office

NuGet Perf, Part VIII: Correcting a mistake and doing aggregations

I hope this is the last one, because I can never recall what the next Roman numeral is.

At any rate, it has been pointed out to me that I made an error in importing the data. I assumed that the DownloadCount field that I got from the NuGet API is the download count for the specific package version, but it appears that this is the total download count, across all versions of the package. The actual download count for a specific version is VersionDownloadCount.

That changes things a bit, because the way NuGet sorts things is based on the total download count, not the download count for a specific version. The reason this complicates things is that we aren’t going to store the total download count in all the version documents. First, let us see the sort of query we need to write. In SQL, it would look like this:

select top 30 skip 30
    Id,
    PackageId,
    Created,
    (select sum(VersionDownloadCount) from Packages a where a.PackageId = p.PackageId) as TotalDownloadsCount
from Packages p
where IsPrerelease = 0
order by TotalDownloadsCount desc, Created

This is a much simplified version of the real query, and most probably something that you can’t actually write this simply in SQL. But it gets the point across.
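For anyone who wants to run something close to this, here is a rough sketch using SQLite through Python's sqlite3 module. The table layout and the sample rows are made up for illustration, and `limit`/`offset` stands in for the `top 30 skip 30` shorthand. Treat it as an approximation of the idea, not the real NuGet query.

```python
import sqlite3

# Hypothetical schema: only the columns the query above mentions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table Packages (
        Id integer primary key,
        PackageId text not null,
        Created text not null,
        IsPrerelease integer not null,
        VersionDownloadCount integer not null
    )
    """
)

# Made-up sample rows, one per package version.
rows = [
    (1, "Alpha", "2012-01-01", 0, 100),
    (2, "Alpha", "2012-02-01", 0, 250),
    (3, "Beta", "2012-01-15", 0, 40),
    (4, "Beta", "2012-03-01", 1, 500),
    (5, "Gamma", "2012-02-10", 0, 10),
]
conn.executemany("insert into Packages values (?, ?, ?, ?, ?)", rows)


def page_of_packages(conn, page_size, skip):
    """Return one page of stable versions, ordered by the package-wide total."""
    sql = """
        select
            p.Id,
            p.PackageId,
            p.Created,
            -- Correlated subquery: re-aggregates every version of the package,
            -- pre-releases included, just like the query in the post.
            (select sum(a.VersionDownloadCount)
               from Packages a
              where a.PackageId = p.PackageId) as TotalDownloadsCount
        from Packages p
        where p.IsPrerelease = 0
        order by TotalDownloadsCount desc, p.Created
        limit ? offset ?
    """
    return conn.execute(sql, (page_size, skip)).fetchall()


first_page = page_of_packages(conn, 2, 0)
second_page = page_of_packages(conn, 2, 2)
```

Note that the database cannot return even the first page until it has computed the total for every candidate row and sorted on it.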

Note that in order to process this query, the RDBMS would have to first aggregate all of the data (for each row, mind), then do the paging, then give you the results. Sure, you can keep a counter for all the downloads for a package, but considering the fact that downloads are highly parallel and happen all the time, that means constantly waiting for writers to finish doing their updates.

Instead, with RavenDB, we are going to use a map/reduce index and query on that.

image

This should be fairly simple to follow. In the map we go over all the packages, and output their package id, whether they are a pre-release, the specific version download count and the date it was created.

In the reduce, we group by the package id and whether it was a pre-release or not (I am assuming that we usually don’t want to show the pre-release stuff there).

Finally, we sum up all of the individual package downloads and we output the oldest created date. Using all of that, we can now move to the next step, and actually query that:
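To make the map, reduce and query steps concrete, here is a rough sketch in plain Python. It is not RavenDB's index syntax, and it runs the whole thing in memory in one go. The sample documents, the function names and the result field names are all made up for illustration, and the sort order in the query is an assumption that mirrors the SQL version.

```python
from collections import defaultdict
from datetime import date

# Made-up package version documents, one per version.
packages = [
    {"PackageId": "Alpha", "IsPrerelease": False,
     "VersionDownloadCount": 100, "Created": date(2012, 1, 1)},
    {"PackageId": "Alpha", "IsPrerelease": False,
     "VersionDownloadCount": 250, "Created": date(2012, 2, 1)},
    {"PackageId": "Beta", "IsPrerelease": False,
     "VersionDownloadCount": 40, "Created": date(2012, 1, 15)},
    {"PackageId": "Beta", "IsPrerelease": True,
     "VersionDownloadCount": 500, "Created": date(2012, 3, 1)},
    {"PackageId": "Gamma", "IsPrerelease": False,
     "VersionDownloadCount": 10, "Created": date(2012, 2, 10)},
]


def map_package(doc):
    """Map: project one entry per package version."""
    return {
        "PackageId": doc["PackageId"],
        "IsPrerelease": doc["IsPrerelease"],
        "DownloadCount": doc["VersionDownloadCount"],
        "Created": doc["Created"],
    }


def reduce_entries(entries):
    """Reduce: group by (PackageId, IsPrerelease), sum the downloads
    and keep the oldest created date of each group."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry["PackageId"], entry["IsPrerelease"])].append(entry)
    return [
        {
            "PackageId": package_id,
            "IsPrerelease": is_prerelease,
            "DownloadCount": sum(e["DownloadCount"] for e in group),
            "Created": min(e["Created"] for e in group),
        }
        for (package_id, is_prerelease), group in groups.items()
    ]


def query(index, is_prerelease, skip, take):
    """Filter on the pre-release flag, sort by total downloads (descending)
    and then by created date, and return one page."""
    hits = [r for r in index if r["IsPrerelease"] == is_prerelease]
    hits.sort(key=lambda r: (-r["DownloadCount"], r["Created"]))
    return hits[skip:skip + take]


index = reduce_entries(map_package(doc) for doc in packages)
stable = query(index, False, 0, 30)
prerelease = query(index, True, 0, 30)
```

The reduce output has the same shape as the map output, so it can be run again over its own results. The aggregation happens when the index is built, so the query only has to filter, sort and page over totals that already exist.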

image

There is a small bug here, since I don’t see RavenDB in the results, but I guess I’ll have to wait until I get the updated data from NuGet.

Actually, that is not quite true. For pre-release software, we are pretty high up:

image

That explains a lot. RavenDB 1.2 is pretty awesome.

Comments

09/06/2012 09:35 AM by grega_g

IX

09/06/2012 12:54 PM by Paul Stovell

The real question is when will RavenDB 1.2 become 'stable'? Or is it the Duke Nukem Forever version of RavenDB? :)

09/06/2012 01:14 PM by Andreas Kroll

Hi Ayende,

a lot of people will for sure agree that they'll happily help you count in roman numbers if you continue this interesting series of posts we can indeed learn a lot from.

So as grega_g already posted:

IX X XI XII XIII XIV XV XVI XVII XVIII XIX XX

But you also could look at http://www.novaroma.org/via_romana/numbers.html which explains the numbers and has a handy converter on the right side :-)

Thanks for the entertaining and informative content so far

09/06/2012 04:45 PM by Chris Eldredge

When you query the NuGet feed, each result contains the DownloadCount aggregated across all package versions. For example, this query:

http://nuget.org/api/v2/Packages?$filter=Id%20eq%20'nuget.core'

How would you combine the map/reduce query with a search query to accomplish this same goal?

09/06/2012 10:37 PM by Ayende Rahien

Chris, Wait for it, I have it in a future post.

09/06/2012 10:37 PM by Ayende Rahien

Paul, We have been actively working on 1.2, you can get it right now. It hasn't even been 6 months, I don't think that the comparison is appropriate.

09/06/2012 10:40 PM by Paul Stovell

@Ayende, sorry, no offence intended, I know it's available on the pre-release channels. I'm just excited for it to come to the stable channel so I can start using the features.

While it has the 'unstable' or 'pre-release' tags, I'm hesitant to switch to it in case it causes my customer's computers to explode and I get blamed for using something clearly labelled 'unstable' (even though I know it's far more stable than most software out there).

09/12/2012 06:35 PM by Alexei K

Hey Ayende, any chance you can tag your series of posts with a per-series tag? Like tagging this series as "nuget-perf" or something. Right now most posts just have "raven" as a tag... that is so very useless for filtering. I want to see the list of posts in this series, and I can't really do that without manually scrolling through the recent post list.

I would love to be able to just click "nuget-perf" tag and get all the articles for easy reading.

09/14/2012 07:47 AM by Ayende Rahien

Alexei, That is a great idea, I'll do so.
