As I said in my previous post, I was tasked with loading 3.1 million files into RavenDB, most of them in the 1 – 2 KB range.
Well, the first thing I did had absolutely nothing to do with RavenDB; it had to do with avoiding this:
As you can see, that is a lot.
But the freedb dataset, as distributed, is actually just this:
This is a tar.bz2 file, which we can read using the SharpZipLib library.
The really interesting thing is that reading the archive (even after adding the cost of decompressing it) is far faster than reading directly from the file system. Most file systems do badly with a large number of small files, and at any rate, it is very hard to optimize the access pattern for a lot of small files.
However, when we are talking about something like reading a single large file? That is really easy to optimize, and it significantly reduces the input I/O cost.
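The actual import used SharpZipLib from .NET; purely as an illustration of the technique, here is the same idea sketched in Python with the standard `tarfile` module (the function name and structure are mine, not from the original code). The point is that we stream one large compressed file sequentially instead of issuing millions of small random reads:

```python
import tarfile


def read_small_files(archive_path):
    """Stream every file entry out of a tar.bz2 archive sequentially.

    One large sequential read is friendly to the disk; the bz2
    decompression cost is paid on the CPU, which is cheap compared
    to seeking across millions of tiny files on the file system.
    """
    # mode "r:bz2" decompresses on the fly while streaming the tar
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue
            data = archive.extractfile(member).read()
            yield member.name, data
```

Each `(name, bytes)` pair can then be fed straight into the import pipeline, so the file system never has to enumerate or open the individual small files at all.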
Just this step reduced the cost of importing by a significant factor: we are talking about roughly twice as fast as before, with a lot less disk activity.