Ideal deployment disaster


Today we botched a production deployment. We discovered an unforeseen complication related to the production network topology, which means that (to simplify) transactions wouldn’t work.

I am happy about this. Not because of the failed production deployment, but because of how it happened. We worked on deploying the app for half an hour, found the problem, then spent the next few hours figuring out the exact cause and our options for fixing it. We actually came up with four ways of resolving the issue, ranging from the really hackish to the one I believe is appropriate, though that one would take some time to implement using the proper process for production changes.

So, why am I happy?

We spent time on that from noon to 3 PM (look Ma, no midnight deployment), and we actually aren’t going live any time soon. The reason for deploying to production was to practice, to see what strange new things would show up in production. They did show up, and we will fix them at our leisure, without trying to solve a hairy problem while staring at the screen with bloodshot eyes and caffeine poisoning. At no point was anyone breathing down our necks, no IT guys to argue with, no emergency change requests, or any of the other crap that usually comes with deploying to production if you aren’t being careful.

Practice the way you play, and since pushing to production is usually such a mess anyway, you want to practice it as soon as possible. And not just by deploying to another environment (we already did that last week) to test your deployment process. You want to deploy to the actual production environment, so you can get permissions, firewalls, subnets and all the rest sorted out up front.
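
As a rough illustration of the kind of thing such a dry run flushes out, here is a minimal sketch (in Python, with hypothetical hostnames and ports, not the ones from this deployment) that checks whether the production box can actually reach the services the app depends on:

```python
import socket

# Hypothetical endpoints the app needs to reach from the production network:
# database, message queue, whatever your topology requires. Placeholders only.
ENDPOINTS = [
    ("db.prod.internal", 1433),
    ("queue.prod.internal", 5672),
]

def check_endpoint(host, port, timeout=5):
    """Try to open a TCP connection; report and return False on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as error:
        print(f"FAILED {host}:{port} -> {error}")
        return False

if __name__ == "__main__":
    results = [check_endpoint(host, port) for host, port in ENDPOINTS]
    if all(results):
        print("All endpoints reachable from this box.")
    else:
        print("Some endpoints unreachable; sort out firewalls/subnets before go-live.")
```

Running something like this from the real production machine, weeks ahead of go-live, is exactly the kind of cheap check that surfaces firewall and subnet surprises while there is still plenty of time to fix them.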