I just had to troubleshoot a really strange problem in a production system. The situation is pretty simple: at one point a machine had an issue, and shortly afterward it became utterly unable to reach the outside world. Luckily, I was still able to SSH into the machine, but any outgoing connection would fail.
Given that TCP worked (I was using it for SSH), how can that be? I could open a connection to the machine and have bidirectional communication, but I could not initiate any calls from it to the outside world.
Strange doesn’t begin to cover how weird this is. The original issue on the machine, by the way, was that the disk was absolutely full. The reason for the network outage was a mystery.
It took a while to figure out that the full disk had caused an error in systemd-resolved, which then failed to set up DNS properly. I was also unable to make connections via IP address, which was strange, but the problem went away after the disk was cleared.
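For next time, here is a minimal sketch of the three checks I would want to run separately: free disk space, name resolution, and an outgoing TCP connection by raw IP. This is not what I ran during the incident. The function names, along with `example.com` and `1.1.1.1:443` as probe targets, are my own assumptions for illustration, so swap in whatever makes sense for your environment.

```python
import shutil
import socket


def disk_free_fraction(path="/"):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    return usage.free / usage.total if usage.total else 0.0


def can_resolve(hostname):
    """Return True if the system resolver can turn `hostname` into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def can_connect(ip, port, timeout=3.0):
    """Return True if an outgoing TCP connection to ip:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def report(hostname="example.com", ip="1.1.1.1", port=443):
    """Print the result of each check; the default targets are placeholders."""
    print(f"free disk fraction on /: {disk_free_fraction('/'):.1%}")
    print(f"DNS lookup of {hostname}: {can_resolve(hostname)}")
    print(f"TCP connect to {ip}:{port}: {can_connect(ip, port)}")
```

Calling `report()` on the affected machine prints one line per check. If the lookup fails while the connection by IP succeeds, the resolver is the likely suspect. If both fail, as they did for me, the cause is something beyond DNS, and a free disk fraction near zero is a good hint about where to look first.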
I have to say, network failure due to disk errors was not how I expected to start the day.