The rant in this video is an absolute beautiful one. I run into this rant figuring out how MADV_DONTNEED work and I thought I would give some context on why the behavior is exactly what I want. In fact, you can read my reasoning directly in the Linux Kernel source code.
During a transaction, we need to put the memory pages modified by the transaction somewhere. We put them in temporary storage which we call scratch. Once a transaction is committed, we still use the scratch memory for a short period of time (due to MVCC) and then we flush these pages to the data file. At this point, we are never going to use the data on the scratch pages again. Just leaving them around means that the kernel needs to write them to the file they were mapped from. Under load, that can actually be a large portion of the I/O the system is doing.
We handle that by tell the OS that we don’t actually need this memory and it should throw it away and not write it to disk using MADV_DONTNEED. We are still checking whatever this cause us excessive reads when we do that (when the kernel tries to re-read the data from disk).
There are things that seems better, though. There is MADV_REMOVE, which will do the same and also zero (efficiently) the data on disk if needed, so it is not likely to cause page faults when reading it back again. The problem is that this is limited to certain file systems. In particular, SMB file systems are really common for containers, so that is something to take into account.
MADV_FREE, on the other hand, does exactly what we want, but will only work on anonymous maps. Our scratch files use actual files, not anonymous maps. This is because we want to give the memory a backing store in the case of memory overload (and to avoid the wrath of the OOM killer). So we explicitly define them as file (although temporary ones). h