NuGet Perf, Part III–Displaying the Packages page
The first thing that we will do with RavenDB and the NuGet data is to issue the same logical query as the one used to populate the packages page. As a reminder, here is how it looks:
SELECT TOP (30)
    -- ton of fields removed for brevity
FROM (
    SELECT Filtered.Id
         , Filtered.PackageRegistrationKey
         , Filtered.Version
         , Filtered.DownloadCount
         , row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
    FROM (
        SELECT PackageRegistrations.Id
             , Packages.PackageRegistrationKey
             , Packages.Version
             , PackageRegistrations.DownloadCount
        FROM Packages
        INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
        WHERE Packages.IsPrerelease <> cast(1 as bit)
    ) Filtered
) Paged
INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey
    AND Packages.Version = Paged.Version
WHERE Paged.[row_number] > 30
ORDER BY PackageRegistrations.DownloadCount DESC
       , Paged.Id
Despite the apparent complexity, this is a really trivial query. What it does is say:
- Give me rows 31 to 60 (the second page of 30)
- Where IsPrerelease is false
- Order by the download count (descending) and then the id
With Linq, the client side query looks something like this:
// Second page of stable packages, most downloaded first
var results = Session.Query<Package>()
    .Where(x => x.IsPrerelease == false)
    .OrderByDescending(x => x.DownloadCount).ThenBy(x => x.Id)
    .Skip(30)
    .Take(30)
    .ToList();
Now, I assume that this is what the NuGet code is also doing. It is just that the relational database has made it so they have to get to the data in a really convoluted way.
With RavenDB, I could just issue that query as-is, but there are subtle differences between how the query works in SQL and how it works in RavenDB. In particular, the data that we have in RavenDB is the output of this query, but it isn’t the raw output. For example, we don’t have the Id column available, which is used for sorting. Now, I think that the logic means to say, “sort by download count descending and then by age ascending”, so that old and popular packages are more visible than new and fresh packages.
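To make that intended ordering concrete, here is a minimal in-memory sketch using plain LINQ to Objects, not RavenDB. The trimmed-down `Package` class and the `PackageListing.GetPage` helper are hypothetical names made up for illustration, and using a `Created` date as the stand-in for "age" is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down package shape (an assumption for this sketch);
// the real documents carry many more fields.
public sealed class Package
{
    public string Id { get; set; }
    public string Version { get; set; }
    public bool IsPrerelease { get; set; }
    public int DownloadCount { get; set; }
    public DateTime Created { get; set; }
}

public static class PackageListing
{
    // Returns one page of stable packages: most downloaded first and,
    // for equal download counts, the oldest package first.
    public static List<Package> GetPage(IEnumerable<Package> packages, int skip, int take)
    {
        return packages
            .Where(p => !p.IsPrerelease)              // drop prerelease packages
            .OrderByDescending(p => p.DownloadCount)  // popular packages first
            .ThenBy(p => p.Created)                   // ties: older before newer
            .Skip(skip)
            .Take(take)
            .ToList();
    }
}
```

Calling `PackageListing.GetPage(packages, 30, 30)` would correspond to the second page of 30 that the SQL above is after.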
In order to match the same behavior (and because we will need it for the next post) we will define the following index in RavenDB:
And querying it:
The really nice thing about this?
This is the URL for this search:
/indexes/Packages/Listing?query=IsPrerelease:false&start=0&pageSize=128&aggregation=None&sort=-DownloadCount&sort=Created
This is something that RavenDB can do in its sleep, because it is a very cheap operation. Consider the query plan that would be generated for the SQL query above. You have to join 5 times just to get to the data that you want, paging is a real mess, and the database actually has to work a lot to answer this fiddling little query.
Just to give you some idea here. We are talking about something that conceptually should be the same as:
select top 30 skip 30 * from Data where IsPrerelease = 0
But it gets really complex really fast with the joins and the tables and all the rest.
In comparison, in RavenDB, we actually do have just a property match to do. Because we keep the entire object graph in a single location, we can do very efficient searches on it.
In the next post, I’ll discuss the actual way I modeled the data, and then we get to do exciting searches.
Comments
Your linq query is definitely not the source of the SQL you're seeing. It joins several tables twice, pages over a subset and joins that subset. Your linq query would not result in this.
The cumbersome way the SQL looks is a result of linq though: normally one would move the isprerelease predicate in the where clause inside the ON clause and simply page over the end result of the query. I don't see why they do it this way. Your linq query would result (normally) in a query which looks like the 'filtered' subset, and move the order by inside the query. After all paging in SQL Server might look cumbersome, but it's a wrapper query you apply to the normal query, where you wrap your normal query with the 'paging wrapper' to get paging. As they don't do that here, it's overly complicated.
I don't know the datamodel of NuGet, but from the looks of it it looks like they used more than 1 table for package storing. One truly wonders why. But then again, it's NuGet, some service which web developers think is 'useful' because they find it useful, forgetting that not everyone does webdevelopment
Frans do you mean to say that NuGet is only useful for people doing webdevelopment? In that case I think you don't know about the breadth of different packages available from NuGet. Some of the more popular ones are Ninject, an IOC container, NUnit, a testing framework, and log4net, a logging library - not really specific to web-development.
@simon No I'm saying that nuget is primarily a solution to a problem webdevs had, but non-web devs didn't have. I mean, a lot of devs simply create a folder in their solution, add 3rd party dlls there and reference them in multiple projects from that folder. the 'recent' tab in add-reference is then more handy than nuget to add references to multiple projects.
Frans, I agree with you. I even go one step further, I don't use 3rd party dlls, i just copy and paste the code off github and codeplex for all the libraries I use into my project. Don't have to worry about all these extra dlls anymore, and I only have to include the classes I need! It doesn't matter if the dll versions are compatible with each other because I get new ones every build!
Why is the download count exactly the same for all the jQuery versions?
Frans, We use nuget in pretty much any project we have now, and we don't do web apps much if at all. We like to get away from having to manage the deps and nuget does a good job at it.
Matt, That is the value we get from NuGet OData, see:
https://nuget.org/api/v2/Packages?$skiptoken='jQuery','0.0.0.0'
As you can see, you have DownloadCount which is the same for all.
What I think I missed is that there is also _VersionDownloadCount_, with the value just for this version, not globally.
While your results are clearly good, you're in no way comparing apples with apples.
The reason the SQL Server version is slow is because the schema is a stinking mess of lots of tables, not because SQL Server is bad and RavenDB is good.
A simple denormalised persisted view along with full-text search would definitely give good results.
Any chance of providing performance data for SQL Server on same or similar machine?
Tim, I don't have the data in SQL format.
Frans and jonnii,
sometimes I cannot believe what I read. You really think it is easier to copy dlls to a directory or even copy code from GitHub to your project than perform an "install-package <name>"??? NuGet really is getting better and better each day. Most packages integrate themselves into solutions very well, so for instance I have IoC ready with one or two install-package commands depending on which container I use. What about dependencies? NuGet pulls all dependencies automatically for me. You would have to do that by hand. What about updates in your case? I just issue an update command for an updated package and get all the benefits of version checking etc.
What is it you dislike about NuGet? I imagine if you'd work on a linux machine you would also not use a package installer like yast to get tools, but rather install them by hand or even download the code and compile it?
This post is get summation of every reason I love RavenDB with modern software development. This post shows everything that is wrong for doing modern software development against relational dbs and how large of an impedance mismatch SQL tables have compared to object graphs.
This post is a great summation*** if i could type.
@Frans, @jonnii, I cannot disagree more with you. To me, Nuget is to package and dependency management, as what version control is to source code. Not to mention Nuget private repositories where you can host your own or 3rd party libraries and have a central point to manage and import from.
But hey, feel free to manually copy DLLs around, create ZIP files of project versions, save them to floppy for backups :)
Surely jonnii is just kidding??
Holding out for the Linux port of the nuget client.
@Frans "I mean, a lot of devs simply create a folder in their solution, add 3rd party dlls there and reference them in multiple projects"
World of development is different now, OSS with fast development cycles needs painless upgrades. If you have not seen it yet, you might have missed the train - I am afraid.
I hope @jonnii is just kidding...
Frans, jonni: What nonsense. It's easier to analyze source code to choose the classes you need than to run one command? And how do you update that copied code?
@Ali What are you talking about? So your project simply takes dependencies on the latest dlls from nuget and if something breaks along the way, because an updated version breaks your code, so what? Not every project can use solely OSS dlls (heck, many projects use only non-OSS dlls), and many projects take a dependency on dll vX.Y and stick with that, because they know it works. Upgrade it 'because nuget says so' is stupid. But hey, I'm not your client, so go ahead. But please don't talk to me like I'm a petty child who doesn't know what software dev looks like. I didn't miss a train, why would I? I'm a professional software developer now for over 18 years, do you really think what's hip and 'new' today is actually 'new' ? haha :D
What I find funny is that if you say you like installed versions over some web-based package site, you suddenly do software dev on a dos box with floppy disks. Like I hit your mother in the face with a baseball bat when I talked about NuGet. Get a life.
ouch, someone seems to be in a bad mood or something.
@Frans nuget doesn't force you to upgrade anything, it's a specific action. Also, I concur with Ayende (and the most of the rest of the world) that nuget is useful for just about any project, not just web development.
@Andreas I didn't say one should choose downloading source over a package install. I just don't see the point of nuget over simply referencing a dll you have on disk. Perhaps it's related to ppl who just do OSS work, but many dlls are closed-source. Try to mix two ways of adding references, it gets cumbersome. Add reference's recent tab is much quicker in that regard. Sure it checks dependencies, but as I said, dependencies of a dll you reference are dependencies you have to research up front anyway. At least for professional projects you're shipping to clients: after all your code then depends on these versions as well. If these dlls update, do you then have to update the dll you directly reference? Most likely yes. Can your project do that? that's to be seen. I wouldn't update referenced dlls 'on the spot' just because there's a new version. At least not in professional projects shipped to clients/customers.
But perhaps in 'modern day' development one doesn't give a f*ck about whether stuff breaks.
I'm not saying nuget doesn't serve a purpose, I just don't see the benefit in my day-to-day work and therefore not the hype around it. But apparently it's forbidden to say so, as it's equal to being stupid.
Frans, We deliver commercial software via nuget. It simplifies the update process, and most importantly, the dependencies process for both us and our clients.
@Frans I do not have more to say - not sure what I can say. All I can say is that I respect you for what you have done with LLBLGen Pro.
@Frans, where I work all of our internal libraries are packaged. TeamCity has a built in nuget server, and if you don't use team city then you can put them on a share drive for everyone to consume.
Something I don't understand here. You are running one query on database and are proud it takes 17ms. But NuGet's database is not hit by one user, but thousands of users. Users that also write to that database. So there is locking happening. Clearly I am misunderstanding why you present those 17ms.
Karep, a) RavenDB doesn't DO locking. Users can write to the DB all day, it doesn't impact read performance. b) RavenDB actually gets faster the more you use it, because it anticipates and optimizes itself based on real world usage.
@Frans,
Pinning a package at a specific nuget version is not that complicated. Install-Package MyPackage -Version x.x.x.x
Almost every other language out there has package management, dunno why .net should be the exception.
very nice