Linux, Debts and Out Of Memory Killer
Imagine that you go to the bank and ask for a $100,000 mortgage. The nice guy in the bank agrees to lend you the money, and since you need to pay that in 5 installments, you take $15,000 to the contractor and leave the rest in the bank until it is needed. The bank is doing brisk business, and promises a lot of customers that they can get their mortgage there. Since most of the mortgages are also taken in installments, the bank never actually has enough money to hand over to all of its borrowers. But it makes do.
Until one dark day when you come to the bank and ask for the rest of the money, because it is time to install the kitchen cabinets, and you need to pay for that. The nice man in the bank tells you to wait a bit, and goes to see if they have any money. At this point, it would be embarrassing to tell you that they don't have any money to give you, because they overcommitted themselves. So the nice man from the bank murders you and buries your body in the desert, to avoid you complaining that you didn't get the money that you were promised. Actually, the nice man might go ahead and kill someone else (robbing them in the process), and then give you their money. You go home happy to your blood stained kitchen cabinets.
That is how memory management works in Linux.
After this dramatic opening, let us get down to what is really going on. Linux has a major problem. Its process model means that it is stuck up a tree and the only way down is via free fall. Whenever a process wants to create another process, the standard method in Linux is to call fork() and then call execv() to execute the new binary. The problem here is what fork() does. It needs to copy the entire process state to the new process. That includes all memory, handles, registers, etc.
Let us assume that we have a process that allocated 1GB of memory for reading and writing, and then called fork(). The way things are set up, it is pretty cheap to create the new process: all we need to do is duplicate the kernel data structures and we are done. However, what happens with the memory that the process allocated? The fork() call requires that both processes have access to that memory, and also that both of them may modify it. That means that we have a copy on write situation. Whenever one of the processes modifies the memory, it forces the OS to copy that piece of memory to another physical memory location and remap the virtual addresses.
This allows the developer to do some really cool stuff. Redis implemented its backup strategy via the fork() call. By forking and then dumping the in memory process state to disk, it can get a consistent snapshot of the system with almost no code. It is the OS that is responsible for maintaining that invariant.
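To make the copy on write behavior concrete, here is a minimal sketch in Python (my choice of language for illustration; the function name fork_and_modify is made up for this example). The child overwrites one byte of a buffer it inherited from the parent, and the parent's copy is left untouched, because the kernel copies the affected page at the moment of the write.

```python
import os


def fork_and_modify(data: bytearray) -> tuple:
    """Fork, let the child overwrite the first byte, and report what each side sees.

    Returns (value the child saw after writing, value the parent still sees).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: writing to an inherited page makes the kernel copy that page first.
        os.close(read_fd)
        data[0] = 42
        os.write(write_fd, bytes([data[0]]))
        os._exit(0)
    # Parent: collect the child's view, then look at our own copy of the buffer.
    os.close(write_fd)
    child_value = os.read(read_fd, 1)[0]
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_value, data[0]


if __name__ == "__main__":
    buf = bytearray(1024 * 1024)  # 1 MB of zeros standing in for the 1GB in the text
    print(fork_and_modify(buf))   # the child sees 42, the parent still sees 0
```

Only the pages that actually get written to are duplicated, which is why the fork() itself is cheap and the bill arrives later.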
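The snapshot trick can be sketched in a few lines. This is not Redis's actual code, just an illustration of the idea under simple assumptions: the state is a plain dictionary, the dump format is JSON, and snapshot_with_fork is a hypothetical name.

```python
import json
import os
import tempfile


def snapshot_with_fork(state: dict, path: str) -> int:
    """Fork a child that writes the state, as it was at fork time, to path.

    Returns the child's pid so the caller can wait for it later.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            # Child: its view of the state is frozen at the moment of the fork.
            with open(path, "w") as f:
                json.dump(state, f)
            status = 0
        finally:
            os._exit(status)
    return pid


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "dump.json")
    state = {"counter": 1}
    pid = snapshot_with_fork(state, path)
    state["counter"] = 2          # the parent keeps mutating its own copy
    os.waitpid(pid, 0)
    with open(path) as f:
        print(json.load(f))       # the snapshot still holds counter == 1
```

The parent never has to lock anything or pause writes. The price is that every page the parent modifies while the child is still writing has to be duplicated.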
Speaking of invariants, it also means that there is absolutely no way that Linux can manage memory properly. If we have 2 GB of RAM on the machine, and we have a 1GB process that fork()-ed, what is going to happen? Well, it was promised 1 GB of RAM, and it got that. And it was also promised by fork() that both processes will be able to modify the full 1GB of RAM. If we also have some other processes taking memory (and assuming no swap for the moment), that pretty much means that someone is going to end up holding the dirty end of the stick.
Now, Linux has a configuration option that would prevent this: vm.overcommit_memory = 2, together with the overcommit ratio. (The details aren't really important. I'm including this here for the nitpickers, and yes, I'm aware that you can set oom_adj = -17 to protect yourself from this issue; that's not the point.) This tells Linux that it shouldn't overcommit. In that case, the fork() call would fail, and you would be left with an effectively crippled system. So, we have the potential for a broken invariant. What is going to happen now?
Well, Linux promised you memory, and after exhausting all of the physical memory, it will start paging to the swap file. But that can be exhausted too. That is when the Out Of Memory Killer gets to play: it takes an axe and starts choosing a likely candidate to be mercilessly murdered. The "nice" thing about this is that there is no control over that, and you might be a perfectly well behaved process that the OOM killer just doesn't like this Monday, so buh-bye!
Looking around, it seems that we aren't the only ones who have run head first into this issue. The Oracle recommendation is to set things up to panic and reboot the entire machine when this happens, and that seems… unproductive.
The problem is that as a database, we aren't really in control of how much we allocate, and we rely on the system to tell us when we do too much. Linux has no facility to do things like warn applications that memory is low, or even let us know by refusing to allocate more memory. Both are things that we already support, and would be very helpful.
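For the nitpickers who want to check their own machine before we answer that, here is a small sketch that reads the two settings from /proc and estimates the commit limit in strict mode. The helper names (read_int, commit_limit_kb) are made up for this example, and the formula is the simplified one (swap plus a percentage of RAM, huge pages ignored).

```python
from pathlib import Path

# Meaning of the values of vm.overcommit_memory.
MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def read_int(path, default=None):
    """Read a single integer from a file, or return default if that is not possible."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return default


def commit_limit_kb(ram_kb, swap_kb, ratio_percent):
    """Approximate commit limit in mode 2: swap + RAM * ratio / 100."""
    return swap_kb + ram_kb * ratio_percent // 100


if __name__ == "__main__":
    mode = read_int("/proc/sys/vm/overcommit_memory")
    ratio = read_int("/proc/sys/vm/overcommit_ratio")
    print("overcommit mode:", mode, "-", MODES.get(mode, "unknown, or not Linux"))
    print("overcommit ratio:", ratio)
    # The 2 GB machine from the text, no swap, with a ratio of 50:
    print("commit limit (kB):", commit_limit_kb(2 * 1024 * 1024, 0, 50))
```

With a ratio of 50 and no swap, the 2 GB machine above may only commit 1 GB in total, so the 1GB process would not be allowed to fork() at all.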
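You can at least peek at who is currently first in line for the axe. The kernel exposes a per process badness value in /proc/<pid>/oom_score, where a higher value means a more likely victim. The sketch below just lists those values; oom_scores is a hypothetical helper name, and the proc_root parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path


def oom_scores(proc_root="/proc"):
    """Return (oom_score, pid) pairs, highest score first."""
    scores = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            score = int((entry / "oom_score").read_text().strip())
        except (OSError, ValueError):
            continue  # the process exited, or we are not allowed to read it
        scores.append((score, int(entry.name)))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    if Path("/proc").is_dir():
        for score, pid in oom_scores()[:5]:
            print(pid, score)
```

This only shows the ranking at one moment. It does not tell you when the killer will strike.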
That is quite annoying.
Comments
I witnessed Linux kernel mailing list discussions about the OOM killer heuristic. Turns out there does not seem to be a good one. Sometimes you want to kill a "runaway" memory hog, sometimes that memory hog is your mission critical database.
In that discussion they were tweaking things around until it all seemed to work in practice... Very brittle.
Have you looked at ulimit?
Carl, That is unrelated
The only time I have seen a system crash because of this was when some smartass 'linux admin' disabled swap everywhere because he heard swap is bad and makes the computer slower. And I thought linux admins were more likely to understand what's going on.
This is why most linux server targeted applications have some form of explicit settings for memory consumption, so that sysadmins can ensure the OS will not pull the axe on them.
But I'm guessing that in Raven, controlling all the buffers and memory allocations according to some self imposed memory settings won't be trivial. I'm curious how much help Mono can provide with this issue... so please do share :) can't wait.
Ah, so then this is why Redis ports for Windows aren't recommended for production use? fork() is not available and it's difficult to actually code that properly?
David, I'm not sure, I wouldn't be surprised if this is just one of many such issues. Moving between platforms is hard
The OOM killer is one of the major reasons why I moved my hosting from Linux to Windows. It was a drop too much. When my server ran out of memory (due to poor recursive code somewhere in PHP, I believe), it went around killing processes and left me with a crippled, half-functional and practically unusable (no ssh, among other things) server. I obviously had to reboot it. This is not nice behavior for a server. What's the point? It's like it broke your server to attract your attention. Windows does it in a more elegant and sane fashion, as far as I know (not that Windows is perfect, it can do worse). So it has to be possible, no? In the end the proper solution would be to put safeguards at the process level. If every process looks at memory as if it was an infinite resource (when it's not) then such things are bound to happen.
Well, Linux memory management certainly has some legacy that doesn't exactly make it very good. However, it's not like no one knows that it is a problem, and that's why cgroups are so handy: there you can define hard, soft and physical+swap limits per process (group), and you can even set an option to not kill on OOM, but rather to wait until memory is available again, e.g. after buffers are cleared.