Production postmortem: The industry at large
The following is a really good study on real-world production crashes:
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
It makes for fascinating reading, especially since they include the details of the root cause of some of the errors. I wasn’t sure whether to cringe or sympathize.
More posts in "Production postmortem" series:
- (07 Apr 2025) The race condition in the interlock
 - (12 Dec 2023) The Spawn of Denial of Service
 - (24 Jul 2023) The dog ate my request
 - (03 Jul 2023) ENOMEM when trying to free memory
 - (27 Jan 2023) The server ate all my memory
 - (23 Jan 2023) The big server that couldn’t handle the load
 - (16 Jan 2023) The heisenbug server
 - (03 Oct 2022) Do you trust this server?
 - (15 Sep 2022) The missed indexing reference
 - (05 Aug 2022) The allocating query
 - (22 Jul 2022) Efficiency all the way to Out of Memory error
 - (18 Jul 2022) Broken networks and compressed streams
 - (13 Jul 2022) Your math is wrong, recursion doesn’t work this way
 - (12 Jul 2022) The data corruption in the node.js stack
 - (11 Jul 2022) Out of memory on a clear sky
 - (29 Apr 2022) Deduplicating replication speed
 - (25 Apr 2022) The network latency and the I/O spikes
 - (22 Apr 2022) The encrypted database that was too big to replicate
 - (20 Apr 2022) Misleading security and other production snafus
 - (03 Jan 2022) An error on the first act will lead to data corruption on the second act…
 - (13 Dec 2021) The memory leak that only happened on Linux
 - (17 Sep 2021) The Guinness record for page faults & high CPU
 - (07 Jan 2021) The file system limitation
 - (23 Mar 2020) high CPU when there is little work to be done
 - (21 Feb 2020) The self signed certificate that couldn’t
 - (31 Jan 2020) The slow slowdown of large systems
 - (07 Jun 2019) Printer out of paper and the RavenDB hang
 - (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
 - (25 Dec 2018) Handled errors and the curse of recursive error handling
 - (23 Nov 2018) The ARM is killing me
 - (22 Feb 2018) The unavailable Linux server
 - (06 Dec 2017) data corruption, a view from INSIDE the sausage
 - (01 Dec 2017) The random high CPU
 - (07 Aug 2017) 30% boost with a single line change
 - (04 Aug 2017) The case of 99.99% percentile
 - (02 Aug 2017) The lightly loaded trashing server
 - (23 Aug 2016) The insidious cost of managed memory
 - (05 Feb 2016) A null reference in our abstraction
 - (27 Jan 2016) The Razor Suicide
 - (13 Nov 2015) The case of the “it is slow on that machine (only)”
 - (21 Oct 2015) The case of the slow index rebuild
 - (22 Sep 2015) The case of the Unicode Poo
 - (03 Sep 2015) The industry at large
 - (01 Sep 2015) The case of the lying configuration file
 - (31 Aug 2015) The case of the memory eater and high load
 - (14 Aug 2015) The case of the man in the middle
 - (05 Aug 2015) Reading the errors
 - (29 Jul 2015) The evil licensing code
 - (23 Jul 2015) The case of the native memory leak
 - (16 Jul 2015) The case of the intransigent new database
 - (13 Jul 2015) The case of the hung over server
 - (09 Jul 2015) The case of the infected cluster
 
