RavenDB & FreeDB: An optimization opportunity
Update: The numbers in this post are not relevant. I include them here solely so you would have a frame of reference. We have done a lot of optimization work, and the numbers are orders of magnitude faster now. See the next post for details.
The purpose of this post is to setup a scenario, see how RavenDB do with it, and then optimize the parts that we don’t like. This post is scheduled to go about two months after it was written, so anything that you see here is likely already fixed. In future posts, I’ll talk about the optimizations, what we did, and what was the result.
System note: I run those tests on a year old desktop, with all the database activity happening on a single 7200 RPM 300GB disk with 8 GB of RAM. Please don’t get to hung up on the actual numbers, I include them for reference, but real hardware on production system should kick this drastically higher. Another thing to remember is that this was an active system, while all of those operations were running, I was actively working and developing on the machine. The main point is to give us some sort of a metric about where we are, and to see whatever we like this or not.
We keep looking at additional things that we can do with RavenDB, and having large amount of information to tests things with is awesome. Having non fake data is even awesomer, because fake data is predictable data, while real data tend to be much more… interesting.
That is why I decided to load the entire freedb database into RavenDB and see what is happening.
What is freedb?
freedb is a database to look up CD information using the internet. This is done by a client (a freedb aware application) which calculates a (nearly) unique disc ID for a CD in your CD-Rom and then queries the database. As a result, the client displays the artist, CD-title, tracklist and some additional info.
The nice thing about freedb is that you can download their data* and make use of it yourself.
* The not so nice thing is that the data is in free form text format. I wrote a parser for it if you really want to use it, which you can find here: https://github.com/ayende/XmcdParser
So I decided to push all of this data into RavenDB. The import process took a couple of hours (didn’t actually measure, so I am not sure exactly how much), and we ended up with a RavenDB database with: 3,133,903 documents. Memory usage during the import process was ~100 MB – 150 MB (no indexes were present).
The actual size in RavenDB is 3.59 GB with 3.69 GB reserved on the file system.
Starting the database from cold boot takes about 4 seconds.
This is what the document looks like:
A full backup of the database took about 3 minutes, with all of the time dedicate for pure I/O.
Doing an export, using smuggler (on the local machine, 128 document batches) took about 18 minutes and resulted in a 803MB file (not surprising, smuggler output is a compressed file).
Note that we created this in a completely empty database, so the next step was to actually create an index and see how the database behaves. We create the default Raven/DocumentsByEntityName index, and got 5,870 seconds, so just over an hour and a half. For what it worth, this resulted in on disk index with a size of 125MB.
I then tried a much more complex index:
Just to give you some idea, this index gives you full text search support over just about every music cd that was ever made. To be frank, this index scares me, because it means that we have to have index entry for every single track in the world.
After indexing was completed, we ended up with a 700 MB on disk presence. Indexing took about 7 hours to complete. That is a lot, but remember what we are dealing with, we indexed 3.1 million documents, but we actually indexed, 52,561,894 values (remember, we index each and every track). The interesting bit is that while it took a lot of CPU (full text indexing usually does) memory usage was relatively low, it peaked about 300 MB and usually was around the 180MB).
Searching over this index is not as fast as I would like, taking about a second to complete. Then again, the results are quite impressive:
Well, given that this is the equivalent of a 52 million records (in this case, literally records ) , and we are performing full text search, quite nice.
Let us see what happens when do something a little simpler, shall we?
In this case, we are only indexing 3.1 millions documents, and we don’t do full text searches. This index took 2.3 hours to run.
Queries on that are a much more satisfactory rate of starting out at 75 ms and dropping to 5 ms very quickly.