An outage every 30 minutes
We run a lot of benchmarks internally, and sometimes it feels like there is a roaming band of performance-focused optimizers that goes through the office trying to find underutilized machines. Some people mine bitcoin for fun. In our office, we benchmark RavenDB and try to see if we can either break a record or break RavenDB.
Recently a new machine was… repurposed to serve as a benchmarking server. You could call it a rite of passage for most new machines here. The problem with that machine was that the client would error. Not only would it fail, it would fail at the exact same interval. We tested from multiple clients and multiple machines and found that every 30 minutes on the dot, we had an outage that lasted under one second.
Today I came into the office to the news that the problem had been found:
It seems that after 30 minutes of idle time (no user logged in), the machine would turn off the Ethernet adapter, regardless of whether there were active connections. Shortly afterward it would wake up again, of course, but it would be down just long enough for us to notice.
In fact, I’m really happy that we got an error. I would hate to try to figure out latency spikes because of something like this, and I still don’t know how the team found the root cause.
Comments
I’ll see your sleep and raise you a network switch.
We had a customer who had a switch that was configured to shut down idle connections (those that had not been used for more than half an hour).
This system was entirely on a private network. No outside activity so all connections are from known boxes to known boxes.
Our product was composed of a number of application servers connected to a number of database servers with persistent connection caching.
Of course, at 2 am the load on the system was much, much less than at 2 pm. Because of that, connections in the pool would often sit idle for longer than the 30-minute limit before receiving data again (even using round robin). There would be a thousand connections in use during the afternoon but maybe a hundred for an hour overnight.
So... the switch just silently broke those connections. There were no IP-layer messages sent; it just removed the connection from its internal table. So the database and the PC both still thought the connection was active. Their stacks were entirely happy. They both kept trying to send to each other, but the intervening network just ate all the packets and they would wait forever (or until they timed out). But we had a few thousand connections in the pool that all had to time out when the morning rush happened!
It was the first use for TCP keep-alive that I’ve ever had!
Stephen, the Release It book (https://pragprog.com/book/mnee/release-it) contains a lot of stuff like that. RavenDB does keep-alive on all connections because of this.
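For anyone who has not had to turn this on before, here is a minimal sketch of enabling TCP keep-alive on a client socket in Python. It is not how RavenDB or the product described above does it; the helper name `enable_keepalive` and the timing values are assumptions picked for illustration, with the idle time set well under the 30-minute cutoff from the story. The tuning options are platform-specific, so the sketch only applies the ones the local `socket` module exposes.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keep-alive so an idle connection still sends periodic probes.

    Hypothetical helper; the defaults are illustrative, not recommendations.
    """
    # Portable switch: ask the OS to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Tuning knobs are not available on every platform, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes once probing has started.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is reported dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# Usage: configure the socket before handing it to a connection pool.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With probes going out every few minutes, a device in the middle no longer sees the connection as idle, and if it drops the connection anyway, the probes fail and the socket reports an error instead of hanging until an application-level timeout.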