Make infrastructure easier to rebuild than to repair
If you can quickly recreate the app and its environment on demand, you can rebuild instead of repair when things break. Even with a single production server, adopt the discipline. Then redefine "done" so every story is exercised in production-like environments before it ships.
"We used to treat servers like pets — you name them, and when they get sick, you nurse them back to health. Now servers are treated like cattle — you number them, and when they get sick, you shoot them."
// Bill Baker, Distinguished Engineer, Microsoft.
Whenever you change production — config, patch, upgrade — that change has to be replicated everywhere automatically (prod, pre-prod, every new env). No SSH-and-edit. Two patterns that get you there:
Use Puppet, Chef, Ansible, Salt, etc. to reconcile servers to a desired state declared in version control. Runtime config goes through Istio / AWS SSM Parameter Store / similar.
Build a brand-new VM or container image from automated process; deploy it; destroy the old one or rotate it out. Manual changes are not allowed — the only path to prod is through the repo.
Pattern B is what most modern stacks (Kubernetes, ECS, serverless) enforce by default. The result: no drift can creep in.
"Works on box-A, fails on box-B."
One restart and it's gone forever.
"Only Carlos knows how to rebuild that."
Each server unique, none reproducible.
Developers will want to stay on old environments — they're afraid an env update will break something. Update anyway, frequently. The earlier you find env-related breakage, the cheaper the fix. GitHub's 2020 State of the Octoverse report: keeping your software current is the single best way to secure the codebase. Same for infrastructure.
A major hotel chain (covered in the DevOps Handbook 2nd ed.) ran $30B of bookings through a containerized platform. The whole point of the case: at that scale, you cannot maintain pets. Every container is immutable, every deploy creates fresh ones, every problem is fixed by killing and re-spawning. The cattle model is the only model that survives at scale — which means it's the model to start with at small scale, too.
download Case study · Hotel $30B in Containers (2020) · PDF ~660 KBAt the end of each development interval — or more often — every feature must be integrated, tested, working, and potentially shippable, demonstrated in a production-like environment. "Done" isn't "the unit tests pass on my laptop." It's "this is running end-to-end in something that looks like prod, and we could ship it now if we chose to."
By the end of the project, the code has been deployed and run in production-like environments hundreds or thousands of times, which means most of the deploy problems are already found and fixed before they would have hit a customer.
A team SSHs into production to apply a hotfix — 'just this once, we will codify it later.' What happens?
// pick one to verify
A team's definition of done is 'unit tests pass + code review approved.' A senior dev proposes adding 'has been demonstrated in a production-like environment.' What's the upside?
// pick one to verify