Play broken games, win broken prizes
This post really annoyed me. Feel free to go ahead and read through it, I’ll wait. The gist of the post, titled “WAL usage looks broken in modern Time Series Databases?”, is that time series databases that use a Write Ahead Log are broken, and that their own system, which doesn’t use a WAL (it uses a Log-Structured Merge tree, LSM), is also broken, but no more so than the rest of the pack.
This post annoyed me greatly. I’m building databases for a living, and for over a decade or so I have been focused primarily on building a distributed, transactional (ACID) database. A key part of that is actually knowing what is going on in the hardware beneath my software and how to best utilize it. This post was annoying because it makes quite a few really bad assumptions, and then builds upon them. I particularly disliked the outright dismissal of direct I/O, mostly because they seem to be doing so based on very partial information.
I’m not familiar with Prometheus, but doing fsync() every two hours basically means that it isn’t on the same plane of existence as far as ACID and transactions are concerned. Cassandra is usually deployed in cases where you either don’t care about some data loss or, if you do, you use multiple replicas and rely on that. So I’m not going to touch that one either.
InfluxDB is doing the proper thing and calling fsync after each write. Because fsync is slow, they very reasonably recommend batching writes. I consider this to be something that the database should do, but I do see where they are coming from.
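To make that concrete, here is a minimal sketch (in Go, names are made up) of what batching buys you: the expensive fsync is paid once per batch instead of once per write.

```go
package main

import (
	"log"
	"os"
)

// appendBatch is a hypothetical helper: it writes a whole batch of
// entries and then calls fsync once, so the cost of the flush is
// amortized over every entry in the batch.
func appendBatch(f *os.File, entries [][]byte) error {
	for _, e := range entries {
		if _, err := f.Write(e); err != nil {
			return err
		}
	}
	// One fsync for the whole batch - this is the expensive part.
	return f.Sync()
}

func main() {
	f, err := os.OpenFile("data.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	batch := [][]byte{[]byte("point 1\n"), []byte("point 2\n"), []byte("point 3\n")}
	if err := appendBatch(f, batch); err != nil {
		log.Fatal(err)
	}
}
```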
Postgres, on the other hand, I’m quite familiar with, and the description in the post is inaccurate. You can configure Postgres to behave in this manner, but you shouldn’t, if you care about your data. Usually, when using Postgres, you won’t get a confirmation on your writes until the data has been safely stored on disk (after some variant of fsync was called).
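For reference, these are the PostgreSQL settings that control this behavior (shown with their defaults); you have to explicitly relax them to get the behavior described in the post:

```
# postgresql.conf - durability settings, defaults shown
fsync = on                   # actually flush WAL to disk
synchronous_commit = on      # don't confirm a commit until its WAL records are durable
wal_sync_method = fdatasync  # how the flush is performed (default is platform dependent)
```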
What really got me annoyed was the repeated insistence on “data loss or corruption”, which shows a remarkable lack of understanding of how a WAL actually works. Because of the very nature of a WAL, the people who build them all have to account for partial WAL writes, and there are mechanisms in place to handle them (usually by considering the affected transaction invalid and rolling it back).
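To illustrate (this is a toy sketch, not any particular engine’s record format): each WAL record carries a checksum, and on recovery the log is replayed only up to the first incomplete or mismatching record, so a torn write at the tail becomes a rolled-back transaction rather than corruption.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"io"
)

// A toy WAL record layout: 4-byte length, 4-byte CRC32 of the payload, payload.
func encodeRecord(payload []byte) []byte {
	buf := make([]byte, 8+len(payload))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(buf[4:8], crc32.ChecksumIEEE(payload))
	copy(buf[8:], payload)
	return buf
}

// replayLog returns only the records that are fully written and whose
// checksum matches; a torn record at the tail is simply discarded.
func replayLog(log []byte) [][]byte {
	var records [][]byte
	r := bytes.NewReader(log)
	for {
		header := make([]byte, 8)
		if _, err := io.ReadFull(r, header); err != nil {
			break // end of log, or a torn header
		}
		length := binary.LittleEndian.Uint32(header[0:4])
		sum := binary.LittleEndian.Uint32(header[4:8])
		payload := make([]byte, length)
		if _, err := io.ReadFull(r, payload); err != nil {
			break // torn write: the record body is incomplete
		}
		if crc32.ChecksumIEEE(payload) != sum {
			break // torn or corrupted record: stop replaying here
		}
		records = append(records, payload)
	}
	return records
}

func main() {
	log := append(encodeRecord([]byte("txn 1")), encodeRecord([]byte("txn 2"))...)
	log = append(log, encodeRecord([]byte("txn 3"))[:10]...) // simulate a crash mid-write
	for _, rec := range replayLog(log) {
		fmt.Printf("recovered: %s\n", rec) // txn 1 and txn 2 survive, txn 3 is rolled back
	}
}
```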
The solution proposed in the post is to use SSTables (sorted string tables), which are usually a component in LSM systems. Basically, buffer the data in memory (they use 1 second intervals to write it to disk) and then write it in one go. I’ll note that they make no mention of actually writing to disk safely. So no direct I/O or calls to fsync. In other words, a system crash may leave you a lot worse off than merely 1 second of lost data. In fact, it is possible that you’ll have some data there and some not, and not necessarily in the order of arrival.
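For contrast, here is roughly what “actually writing to disk safely” entails even in the SSTable-only model - a sketch of the usual write-to-temp-file, fsync, rename, fsync-the-directory sequence (the function name and file layout are made up):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// flushSSTable is a sketch of a durable flush: write to a temporary
// file, fsync it, atomically rename it into place, then fsync the
// directory so the rename itself survives a crash. Skip any of these
// steps and a crash can leave you with missing or partial data.
func flushSSTable(dir, name string, data []byte) error {
	tmp := filepath.Join(dir, name+".tmp")
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // flush the file contents
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, filepath.Join(dir, name)); err != nil {
		return err
	}
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync() // flush the directory entry for the rename
}

func main() {
	if err := flushSSTable(".", "000001.sst", []byte("sorted data...")); err != nil {
		log.Fatal(err)
	}
}
```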
A proper database engine will:
- Merge multiple concurrent writes into a single disk operation (see the sketch after this list). In this way, we can handle > 100,000 separate writes per second (document writes, so significantly larger than the typical time series drops) on commodity hardware.
- Ensure that if any write was confirmed, it actually hit durable storage and can never go away.
- Properly handle partial writes or corrupted files, in such a way that none of the system’s invariants are violated.
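Here is a minimal sketch of that first point - group commit, essentially - in Go (the names are made up and a real engine is far more involved): concurrent writers queue their data, a single background writer flushes whatever has accumulated with one write and one fsync, and only then acknowledges all of them.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// pendingWrite is one client's write, plus a channel used to tell the
// client that its data is durably on disk.
type pendingWrite struct {
	data []byte
	done chan error
}

// writer drains everything currently queued, writes it as a single
// batch, calls fsync once, and only then confirms all of the writes.
func writer(f *os.File, queue chan pendingWrite) {
	for first := range queue {
		batch := []pendingWrite{first}
		// Grab whatever else is already waiting, without blocking.
	drain:
		for {
			select {
			case w := <-queue:
				batch = append(batch, w)
			default:
				break drain
			}
		}
		var err error
		for _, w := range batch {
			if _, werr := f.Write(w.data); werr != nil && err == nil {
				err = werr
			}
		}
		if serr := f.Sync(); serr != nil && err == nil {
			err = serr // one fsync covers the whole batch
		}
		for _, w := range batch {
			w.done <- err
		}
	}
}

func main() {
	f, err := os.OpenFile("journal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	queue := make(chan pendingWrite, 1024)
	go writer(f, queue)

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			w := pendingWrite{data: []byte(fmt.Sprintf("write %d\n", i)), done: make(chan error, 1)}
			queue <- w
			if err := <-w.done; err != nil {
				fmt.Println("write failed:", err)
			}
		}(i)
	}
	wg.Wait()
}
```

The more concurrent writers there are, the more writes each fsync covers, which is how you amortize the cost of durability instead of giving it up.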
I’m leaving aside major issues with LSM and SSTables, such as write amplification and the inability to handle sustained high loads (because there is never a break in which you can do bookkeeping). The WAL portions alone (which show broken and inefficient usage), used to justify another broken implementation, are quite enough for me.
Comments
Hi Ayende,
The author of the original article here.
I implicitly assumed a proper fsync after flushing data to the SSTable, so the data in the SSTable cannot be lost or corrupted after a successful flush to disk.
Both issues are addressed:
Write amplification is relieved with efficient data compression. An uncompressed data point occupies at least 24 bytes - 8 bytes for the time series id, 8 bytes for the timestamp and 8 bytes for the float64 value, not counting indexes and metadata for time series tags. For instance, TimescaleDB, built on top of PostgreSQL, occupies 30 bytes per data point - see this post for details. VictoriaMetrics compresses typical data points to 0.4-1 byte, a 30x-75x compression ratio compared to PostgreSQL. This means that a 30-level LSM tree in VictoriaMetrics requires the same amount of disk bandwidth as PostgreSQL or TimescaleDB. A 30-level LSM tree may handle 8^30 = 2^90 rows with a standard 8-way merge. So despite write amplification, VictoriaMetrics uses much lower disk bandwidth compared to PostgreSQL (TimescaleDB) on typical workloads - see this post for details.
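A quick back-of-the-envelope with those numbers (purely illustrative, using only the figures quoted above):

```go
package main

import "fmt"

func main() {
	// Figures quoted above - treat them as illustrative, not measured here.
	const uncompressed = 24.0         // 8B series id + 8B timestamp + 8B float64 value
	const timescalePerPoint = 30.0    // reported TimescaleDB bytes per data point
	compressed := []float64{0.4, 1.0} // claimed VictoriaMetrics bytes per data point
	const rewrites = 30.0             // a 30-level LSM tree rewrites each point ~30 times

	for _, c := range compressed {
		fmt.Printf("%.1f B/point x %.0f rewrites = %.0f B of disk writes per point\n",
			c, rewrites, c*rewrites)
	}
	fmt.Printf("PostgreSQL/TimescaleDB: ~%.0f B written once per point\n", timescalePerPoint)

	// Capacity of a 30-level tree with 8-way merges: 8^30 = 2^90 rows.
	fmt.Println("8^30 = 2^90 rows")
}
```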
VictoriaMetrics slows down data ingestion when too many small SSTables exist, so they could be merged into bigger ones. This way it easily handles sustained high load.
The main question of the article is "why use a WAL in database systems built on LSM trees?". PostgreSQL doesn't use LSM trees, so the question is unrelated to it.
Aliaksandr, Your description of WAL in general was partial / wrong. And the part that you didn't cover was simple: if I don't want to lose any data, I typically want to use a WAL. About your benchmarks, you are running them on a machine with 60GB of RAM and your biggest benchmark was with a file size of 17GB. That isn't a really interesting thing to do, in terms of testing system performance. What happens when you run the same 40M benchmark when you have only 4 GB of RAM?
The reason I'm saying that is that, based on your own metrics, you can already see that you saturated the hard disk. In this case, you're very likely buffering stuff in memory (which you can do because you have so much available RAM). When you run on a system where you need to compete with the writes, that is where you'll most see these costs.
Note that in the case of a WAL, for example, a common tactic is to physically locate it on another hard disk (typically a faster one) to reduce the time the client waits for the transaction to commit. You cannot really do that with LSM.
Nice idea for the next benchmark. I'll run it next time. In theory disk read usage should increase during big SSTable merges, since the data for the source SSTables will be missing from the OS page cache. Disk write usage should stay the same, since the destination SSTable is fully flushed to disk before swapping it with the source SSTables in the LSM tree, regardless of RAM size. That's why the disk bandwidth usage graphs from this article show disk read usage close to zero for all the competitors, since, as you already noted, all the data fits into 60GB of RAM.
The IO usage graphs from the article show plenty of spare IO resources for VictoriaMetrics. TimescaleDB and InfluxDB have much higher IO usage in the benchmarks.
If the database uses both LSM and WAL, such as InfluxDB, Cassandra and Prometheus, then they have to write data to WAL and then eventually persist the data in LSM. So what's the benefit in WAL usage for such databases?
Aliaksandr, I think you'll find that the IO bandwidth is a fixed amount, and the moment you have competing reads & writes, overall performance for both will suffer. This is especially true for HDDs, but also relevant for SSD and NVMe drives. The perf cliff in those cases is both real and quite painful. LevelDB used to have a LOT of issues with this, see HyperLevelDB and friends. RocksDB's approach to the same problem comes with a lot of additional complexity.
A WAL is meant to ensure the durability of data, so you don't lose data in the case of a crash. It also lets you do some pretty cool things overall (log shipping, point in time recovery, etc.).