You are only as fast as your slowest bottleneck
Chris points out something very important:
“A much better solution would have been to simply put the database on a compressed directory, which would slow down some IO ..."
I don't agree.
Compression needs CPU. We got a lot more IO by switching on compression (there is simply less to write and read). Previously our CPU was at about 40%; now it averages 70%. Compression saves us about 30% per file. After switching on compression, our IO-bound application was about 20% faster.
We are currently planning to switch on compression on all our production servers over Christmas, because using CPU cores for compression is even cheaper than adding hard disks and RAID for performance.
In general, most operations today are mostly IO bound, with the CPU mostly sitting there twiddling the same byte until that byte threatens to sue for harassment. It makes sense to trade off IO for CPU time, because our systems are being starved for IO.
In fact, you can just turn on compression at the file system level in most OSes, and it is likely to result in a significant improvement in application performance, assuming that the data does not already fit in memory.
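The trade-off is easy to see with any general-purpose compressor. A minimal sketch in Python using zlib (the record format below is made up for illustration; real database pages of JSON, log lines, or fixed-width rows compress similarly well):

```python
import zlib

# A hypothetical page of repetitive database records.
page = b"id=000001;status=active;region=eu-west;" * 1000

# Level 6 is zlib's default speed/ratio balance.
compressed = zlib.compress(page, 6)

ratio = len(compressed) / len(page)
print(f"raw: {len(page)} bytes, compressed: {len(compressed)} bytes "
      f"(ratio {ratio:.0%})")

# Every byte saved is a byte the disk never has to write or read;
# the cost is the CPU time spent inside compress/decompress.
assert zlib.decompress(compressed) == page
```

Whether this nets out as a win depends on where the machine is starved: on an IO-bound box the CPU cycles are free, as the numbers above suggest.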
Comments
with the CPU mostly sitting there twiddling the same byte until that byte threatens to sue for harassment.
LOL. I almost fell off the chair.
We are currently in very heated debates; I just noticed that you put the comment on your blog.
We concluded that we will try a different approach: we will implement compression at the application level. The reason is that we send a lot over the wire and could therefore also save a lot of bandwidth. This is much better because we can then remove some latency when communicating with the backend system. But first, we will benchmark it and get the real numbers.
Cloud computing brings interesting times; suddenly CPU/bandwidth/IO becomes somehow tradeable.
I currently don't see compression used extensively. For example, why not create a dictionary and share it between all servers? You don't need to retransmit the dictionary every time.
Chris, that's very interesting. I was thinking about using app-level compression for message queuing: I've built a message bus on top of an SQL database and wanted to use compression to improve database performance. An application usually has only a few types of messages with similar structure (JSON) and repeating contents, so I thought that using a shared compression dictionary would be a better solution than compressing each message individually. However, I didn't find any compression API where I could use an external dictionary, so this is still only an idea. Do you know of such a compression library?
I do not know if compressing the directory where the database files are stored is a good idea. It would be better if the database engine directly supported compression, instead of relying on file system storage.
Compressing data at the application level is something I do not like very much, because it leaves you with unreadable data in the database, so the data can be read only by the application.
Indeed this is a good argument, trading CPU time for size could be a good approach.
Rafal, I'm not aware of any library, but there are lots of OSS ones that wouldn't be that hard to adapt; #ziplib would be a candidate. Currently I'm looking at LZO compression.
Gian, good point. But since I don't use the database as an integration database, I have no problem with data that can only be read by one application.
I've been working with compression and databases for at least 5 years. I can say that you will get data faster... but if you have to update tables, you will hit big bottlenecks. For a read-only data warehouse it is satisfactory.
Yes, sometimes such bottlenecks are hit rarely enough to leave them the way they are. Problems start when you want to push all your flow through that bottleneck ;)