NuGet Perf, Part IV–Modeling the packages
Before we move on to discussing how to implement package search, I wanted to take a bit of time to discuss the we structured the data. In particular, there are a bunch of properties that feel very relational in nature. In particular, these two properties:
- Tags: Ian_Mercer Natural_Language Abodit NLP
- Dependencies: AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11
In the current version of NuGet, those properties are actually stored as symbol separated strings. The reason for that? In relational databases, if you want to have a collection, you have to have another table, then join to it, then take care of it, and wake up in the middle of the night to take it to a walk. So people go the obvious route and just concatenate strings and hope for the best. Note that in the dependencies case, we have multi level concatenation.
In RavenDB, we have full fledged support for storing complex objects, so the tags above will become:
And what about the dependencies? Those we store in an array of complex objects, like so:
RavenDB allows us to store the model in a way that is easy on the eye ,natural to work with and in general making our lives easier.
Let us say that I wanted to add a feature to NuGet, “show me all the packages that use this package”?
And allow me to brag a little bit?
By the way, just to be sure that everyone has full grasp about what is going on, I am writing this post while on 30,000 feet. The laptop I am using is NOT connected to power, and the data set that I am using is the full NuGet dataset.
Compare the results you get from RavenDB to what you have to do in SQL: Dependencies LIKE ‘%log4net%’
You can kiss your performance goodbye with these sort of queries.
Comments
" In the current version of NuGet, those properties are actually stored as symbol separated strings. The reason for that? In relational databases, if you want to have a collection, you have to have another table, then join to it, then take care of it, and wake up in the middle of the night to take it to a walk."
RDBMSs are optimized to the brim to do one thing extremely well: set oriented operations. Just because the NuGet team sucks when it comes to relational modeling, doesn't mean RDBMSs suck in precisely the task they're good at.
Storing the dependencies as symbol strings sounds silly to me, as it requires you to do LIKE operations to do any filtering on them which is extremely slow compared to a clustered index scan if they'd have used simply FK-PK relationships.
One truly wonders whether they really thought things through. I mean, the nuget site itself is more reads than writes, so to display text for a package, one can denormalize the data of a package for display purposes in a special table, e.g. each hour, and still do everything on a single machine with a simple database. One can also decide to refresh this denormalized data if one of the elements involved changes. Gee, how hard would that be?
The nuget service doesn't need to work with strings, but needs to work with actual data, so they can use the actual data (normalized normal tables) with normal FK-PK relationships for the dependencies there. Would be simple joins over simple tables. Changes are for popular packages the queries are still in the query cache of the RDBMS and it doesn't hit anything on disk. As we're not talking about millions of rows, I doubt it the RDBMS has to hit the disk anyway for reads, everything will be in memory.
It really breaths RDBMS n00bism all over the place. Nothing wrong with that, we all had to start somewhere. But it doesn't sound to me there's any reason why this would lead to any performance problem on an RDBMS system at all.
Frans, Well, to start with, that would require an additional join. _All the time_. That tend to be expensive.
No it wouldn't. You only join data when you need it. For display purposes, it might be you want to display the names of assemblies it depends on and their versions it depends on. But that's readonly data which only changes if the package changes. Which is almost never compared to the amount of reads, so you can store the display data in a denormalized table and update the row for the package _when its normalized tables change_. Hardly rocket science. (Hint: if you have a forum and one wants to display a thread of messages, the messages aren't reparsed every time the thread is read, they're parsed and prepared for display only _once_, namely when they're changed/inserted. After that, the display of the messages/thread is done by displaying the processed data. This is the same thing).
For the service you don't need the join at all. When I want to install library X vA.B, I ask nuget for the dll. Nuget gives it to me and performs a simple select with a simple where over the dependencies table with solely IDs and returns the results to the client so the client needs to pull for each dependency of X vA.B the assemblies related to the IDs received from the service. Rinse repeat. No join needed.
Frans, In other words, now I have to start storing the data in multiple locations. Make sure it updates nicely, etc. That is a LOT of extra work, and if things goes out of sync, it creates very hard to figure out bugs. Also, you seems to be ignoring the issue that Nuget only send you the Package Info, the rest happens on the client, not on the server.
Why would it go out of sync? The data the system works with is a normalized datamodel with no redundant data! I am not ignoring the package info nuget sends, it's exactly what I described in my second paragraph.
There's no need to start storing the data in multiple locations.
Simply convert some of those joins into indexed views and let SQL Server handle the work for you.
The nuget schema really is a very poor example of schema design and you could easily fix the performance issues with a better schema (like you're doing by putting it into RavenDB).
This series seems like a fun thought exercise that shows the simplicity of working with RavenDB, but I'm wondering if you're able to actually plug your implementation into a NuGet compatible package feed that works with the NuGet clients.
You mentioned that NuGet uses OData so I'm wondering if your exercise will include exposing your RavenDB package store with a compatible schema.
Simplifying the schema to suit your needs is always nice, but it isn't practical when there are thousands of clients in the wild that expect to use the crufty, old, badly designed schema.
Now for my plug. My company has worked on the performance problems of NuGet Server, which does not use SQL, but still needs to support the same OData schema as nuget.org. If you want a really fast implementation of a NuGet package feed that actually works with NuGet clients, check us out at https://github.com/themotleyfool/NuGet/downloads
Putting an OData facade on the RavenDb services Ayende is confiuring should be trivial. The presentation of the data is not required to be in the same format as the data at rest.
Great work on taking a real world RDBMS problem and showing how a NoSQL solution can make the data storage, search, and retrieval issues trivial as the data reaches scale.
This is a really interesting series. My company is the one that actually built the original NuGet feed and NuGet.org site, as a vendor for Microsoft, and I led our teams efforts on that project. Granted, the site has been rebuilt since then, and I don't know how much it's changed from the original design, but I suspect that the schema has not changed much due to compatibility concerns.
Interesting fact: early on in the development of that project, we actually discussed the idea of using RavenDB. It did occur to us that Raven would be a good fit for the type of data that we needed to store for NuGet (as well as the Orchard gallery and feed which we were building together as one project). We even did a little experimentation with it to see if it would work. However, the main reason we didn't pursue that was the desire to provide an OData feed for the clients.
This was all happening in late 2010. At that time, OData was fairly new and RavenDB itself was also pretty young. We didn't really have any experience with either, and writing an OData provider for Raven did not look to be a realistic goal given our time frame (everything was shipping together in early Jan 2011 - MVC3, Razor, NuGet, WebMatrix, Orchard).
So we used what we knew would work at that time, which was WCF Data Services to provide the OData feed from an EF model on a traditional SQL Server database. I would love to be able to start over and do it all again with Raven. Of course, we certainly wouldn't do it as well as Ayende himself, so it's really fun to follow this series to see how he would approach the problem. :-)
@Jeff - OData exposes an IQueryable over HTTP, so while it's true that you can make a projection over your internal object graph to flatten or change the schema, your query executer will need to be able to translate queries on the projection back to the underlying format. That is by no means trivial. If someone has an example of powering an OData endpoint with RavenDB, I'd like to see it. A quick google search came up empty.
On the other hand, I think WCF Data Services (OData) is hopelessly complicated and it probably isn't worth anyone's time trying to make it work with something other than EF or LINQ to Collections. NuGet should have used a simple, intuitive REST api instead of OData. But they didn't.
I think NuGet's problems are also generated by amount of traffic it receives. Not only it's database schema. In your examples you are then only one client queying the data.Add another 5k users and few hundred writing to a database. It's not that NuGet is not working now. Sometimes it has problems.
@Chris - You can use the new ASP.Net Web API to expose an IQueryable as an OData endpoint. It's a limited subset (filter, sort, skip and take) - but this aligns nicely with RavenDB's dynamic query capability. One can simply return session.Query<Foo>() as an IQueryable and away you go. I've used it several times already and it works quite well.
@Matt - Sounds great. Maybe NuGet should migrate to that. In the meantime, my original point remains.
Chris,
Look here:
https://github.com/themotleyfool/NuGet/blob/master/src/Server/DataServices/Packages.svc.cs
It appears that most of what you need to support this would be just exposing IQueryable on the wire. We can do that already, so I am not really sure where the problem would be.
Great series Ayende! I need to get off my duff and play with RavenDB. :)
I wanted to correct a couple of nits.
The reason we use a symbol separated string for some properties is not because we're using a relational database, but because we chose OData which makes that subquery very very hard. It's querying and filtering was pretty limited when we started this. I like that RavenDB would make it easy to store it in a proper format.
The way you stored dependencies isn't exactly correct. Each dependency actually has a version range. If you just specify a version, it means that version or greater. For example, <dependency id="log4net" version="1.2.11" /> means log4net 1.2.11 or greater. But you could also do this: <dependency id="log4net" version="[1.2.11,2.00)" /> means log4net version 1.2.11 to 2.00 (exclusive).
In most cases, we recommend just specifying a min value. For full details, checkout the docs here: http://docs.nuget.org/docs/reference/versioning
Given that, how would you model that in RavenDB? And how would you query for something like: "Given this package id and version, find me all the dependencies that match the version constraint."
That's something we have to do on the client, but would be interesting to see if RavenDB can do some of that on the server quickly.
Haacked, I would actually store this like this: "[1.2.11,2.00)" Then it is the server that is in charge of translating this to a query. In this case, it would turn it to:
This will give you what you want (with lexicographic comparison).
Honestly, I find OData to be extremely cumbersome format to work with, and I am not sure that it is a good format to expose to the web. Exactly because it make it hard to do those sort of things.
If we store it like that, how would I do the reverse query?
As for OData, I've come to be in agreement with you. We liked the ability to send queryables over the web, but that benefit doesn't override the many costs we've had to deal with. Let's just say there was a lot of arm twisting and this is one of the technology decisions I regret making on NuGet.
Haacked, It is a big more complex:
If you want to do an exact match.
If you want to find something that is defined as [1.2.10, 2.0) as a match for 1.2.11, you would need to work a bit harder. Then I would store the package version as:
{ "Package": "log4net", "Version": "1.2.11" // <-- exact match }
or
{ "Package": "log4net", // [1.2.10, 2.0) - and the other 3 options as well "FromVersionInclusive": "1.2.10", "ToVersionExclusive": "2.0" }
As I said, it is a bit more complex using the API, but it should generate a query that looks something like this:
Which does the work
Is lexagraphic comparison acceptable for version ranges? It's different from the semantic ordering. Consider:
0.9 1.0 1.0-pre 1.1 1.1.1 1.10-pre 1.10 1.2
Those are sorted lexigraphically. Note that 1.2 appears in the wrong location, and "prerelease" packages also appear in the wrong place.
Since SemanticVersion already implements IComparable, my implementation takes advantage of that so that strange fields like this can specify their own sorting algorithms. See https://github.com/themotleyfool/NuGet/blob/lucene/src/Server/Infrastructure/Lucene/LucenePackage.cs#L43
Chris, I don't understand, why are they sorted wrong? Looks right to me. Also, you can play additional games with the versions during indexing, if you need to.
Lexical sort: 0.9 1.0 1.0-pre 1.1 1.1.1 1.10 1.10-pre 1.2
Semantic Version sort: 0.9 1.0-pre 1.0 1.1 1.1.1 1.2 1.10-pre 1.10
Note that "-pre" suffix comes "before" the same version without prefix. E.g. 1.0-alpha is "before" 1.0. Also note that 1.2 comes before 1.10.
(The prerelease versions are somewhat less important since they're often excluded by other query clauses such as item.IsPrerelease == false.)
Comment preview