Watch your 6, or is it your I/O? It is the I/O, yes

Mar 28 2012

As I said in my previous post, I was tasked with loading 3.1 million files into RavenDB, most of them in the 1 – 2 KB range.

Well, the first thing I did had absolutely nothing to do with RavenDB. It had to do with avoiding dealing with this:

As you can see, that is a lot.

But when the freedb dataset is distributed, what we have is actually:

This is a tar.bz2, which we can read using the SharpZipLib library.
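
To make the idea concrete, here is a minimal sketch of walking a tar.bz2 entry by entry without ever unpacking it to disk. It uses Python's standard `tarfile` module rather than SharpZipLib, so it illustrates the approach and is not the code used for the import; the function name `iter_tar_bz2` is made up for this example.

```python
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar_bz2(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, contents) for each regular file in a tar.bz2 stream."""
    # "r|bz2" opens the archive as a forward-only stream: the data is
    # decompressed and walked once, front to back, with no seeking.
    with tarfile.open(fileobj=fileobj, mode="r|bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            # In stream mode the contents must be read before moving on
            # to the next member, which is what happens here.
            yield member.name, extracted.read()
```

Each yielded pair could then be handed to whatever performs the actual insert, so the small files never have to exist on the file system at all.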

The really interesting thing is that reading the archive (even after adding the cost of decompressing it) is far faster than reading directly from the file system. Most file systems do badly with large numbers of small files, and at any rate, it is very hard to optimize the access pattern for a lot of small files.

However, when we are talking about something like reading a single large file? That is really easy to optimize and significantly reduces the cost on the input I/O.
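
If you want to check this on your own hardware, a rough harness along the following lines can help. It is only a sketch under stated assumptions: it is in Python rather than .NET, the data is synthetic, and the function names, file count and file size are all invented for the example. Note that freshly written files sit in the OS cache, so a small run like this will not show the cold-disk effect described above; a meaningful comparison needs a much larger data set and a cold cache.

```python
import os
import tarfile
import tempfile
import time


def read_loose_files(directory):
    """Open and read every file in `directory` one by one; return total bytes."""
    total = 0
    for entry in os.scandir(directory):
        if entry.is_file():
            with open(entry.path, "rb") as f:
                total += len(f.read())
    return total


def read_from_archive(archive_path):
    """Stream the tar.bz2 front to back; return total bytes of file contents."""
    total = 0
    with tarfile.open(archive_path, mode="r|bz2") as archive:
        for member in archive:
            if member.isfile():
                total += len(archive.extractfile(member).read())
    return total


def compare(file_count=2000, file_size=1500):
    """Create synthetic files, pack them, and time both read paths.

    Returns {"loose": (bytes_read, seconds), "archive": (bytes_read, seconds)}.
    """
    with tempfile.TemporaryDirectory() as root:
        loose = os.path.join(root, "loose")
        os.mkdir(loose)
        for i in range(file_count):
            with open(os.path.join(loose, f"{i:07d}.txt"), "wb") as f:
                f.write(os.urandom(file_size))

        archive_path = os.path.join(root, "files.tar.bz2")
        with tarfile.open(archive_path, mode="w:bz2") as archive:
            archive.add(loose, arcname="loose")

        results = {}
        for label, reader, target in (
            ("loose", read_loose_files, loose),
            ("archive", read_from_archive, archive_path),
        ):
            start = time.perf_counter()
            bytes_read = reader(target)
            results[label] = (bytes_read, time.perf_counter() - start)
        return results


if __name__ == "__main__":
    for label, (bytes_read, seconds) in compare().items():
        print(f"{label}: {bytes_read} bytes in {seconds:.3f}s")
```

Both paths should report the same byte count; only the timings are of interest, and they depend heavily on the file system, the disk and the cache state.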

Just this step has reduced the cost of importing by a significant factor: the import is about twice as fast as before, with a lot less disk activity.

Tags:

Performance

Comments

28 Mar 2012
10:17 AM

Falhar

Damn, I thought about that too. Now I'm sad I didn't mention it in previous thread.

28 Mar 2012
15:31 PM

Steve Wagner

My thought is that NTFS just plain sucks with that many files, even if they are large. UNIX filesystems do not have that many problems with it.

28 Mar 2012
15:53 PM

configurator

@Steve: ext* would still be a lot slower to read millions of files in sequence than using an archive, assuming you don't care about the order in which you read files. It's because a file's size is measured in pages, and you get lots of free space between the files, so even if the files are all sequential on disk there's a lot of seek time. I think even an uncompressed tar file would be faster than pretty much any OS.

Not to mention that most compression schemes (e.g. zip, gzip, but not bz2) can be decompressed faster than they are read from disk on modern machines, meaning I/O is still the bottleneck even on compressed files.

28 Mar 2012
17:53 PM

Dan

How about just getting a good SSD? :D

28 Mar 2012
18:00 PM

Rafal

Yes, it's a shame modern 'high performance' filesystems (NTFS, I'm talking to you) can't handle 50 thousand small files in a single directory. Some time ago a bug in my company's application that resulted in the creation of thousands of temp files brought down the whole server: IIS, the application and even the file explorer. And it was hard to find too, because the filesystem behaved as if the hardware was failing or the disk's logical structure was corrupt.

29 Mar 2012
07:24 AM

Patrick Huizinga

@Dan: +1 Because if Ayende had gotten a good SSD, the Size on disk of all his files would've been less than 743 MB.

29 Mar 2012
15:23 PM

Darius

"How about just getting a good SSD? :D"

How about getting a decent file system?

30 Mar 2012
08:03 AM

configurator

@Dan, Patrick, Darius: Are you suggesting that Ayende should tell all his customers to install SSDs on their DB servers instead of enabling his software to read tarballs?

06 Apr 2012
07:47 AM

Gian Maria

Having such a huge number of files is a pain for every filesystem, because you have files that are smaller than the size of a cluster on the disk, and this means wasted space and wasted time reading from and writing to disk. Ayende's picture shows that the real size on disk is four times bigger than the file size.

Having them stored as a single compressed file is a good strategy, because you are trading disk activity for the CPU activity needed to decompress the file, and this is a gain because the disk is much slower. The good part is that all compression libraries have the option to create an archive at compression level 0, which avoids spending CPU cycles on decompression while keeping the advantage of handling everything as a single file.

Oren Eini

CEO of RavenDB