Make infrastructure easier to rebuild than to repair

If you can recreate the app and its environment quickly and on demand, you can rebuild instead of repair when things break. Adopt this discipline even if you run only a single production server. Then redefine "done" so that every story is exercised in a production-like environment before it ships.

"We used to treat servers like pets — you name them, and when they get sick, you nurse them back to health. Now servers are treated like cattle — you number them, and when they get sick, you shoot them."

// Bill Baker, Distinguished Engineer, Microsoft.

// immutable infrastructure

Whenever you change production — config, patch, upgrade — that change has to be replicated everywhere automatically (prod, pre-prod, every new env). No SSH-and-edit. Two patterns that get you there:

// pattern A · config management

Use Puppet, Chef, Ansible, Salt, etc. to reconcile servers to a desired state declared in version control. Runtime config goes through Istio / AWS SSM Parameter Store / similar.
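The core idea behind pattern A, reconciling a server to a declared desired state, can be sketched in a few lines. This is a toy model, not how Puppet, Chef, Ansible, or Salt are implemented: the `Server` class, the `reconcile` function, and the package names and versions are all hypothetical, and the "server" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical in-memory stand-in for a real host."""
    name: str
    packages: dict[str, str] = field(default_factory=dict)  # package -> version


def reconcile(server: Server, desired: dict[str, str]) -> list[str]:
    """Bring server.packages in line with desired; return the actions taken."""
    actions = []
    # Install or upgrade anything that does not match the declared version.
    for pkg, version in sorted(desired.items()):
        current = server.packages.get(pkg)
        if current != version:
            actions.append(f"{server.name}: {pkg} {current} -> {version}")
            server.packages[pkg] = version
    # Remove anything that is installed but not declared.
    for pkg in sorted(set(server.packages) - set(desired)):
        actions.append(f"{server.name}: remove {pkg}")
        del server.packages[pkg]
    return actions


# Desired state; in practice this would live in a file under version control.
DESIRED = {"nginx": "1.24.0", "openssl": "3.0.13"}

web1 = Server("web1", {"nginx": "1.22.1", "telnet": "0.17"})
first_run = reconcile(web1, DESIRED)   # upgrades nginx, adds openssl, removes telnet
second_run = reconcile(web1, DESIRED)  # idempotent: nothing left to change
```

The second run returning no actions is the property that matters: running the same declaration repeatedly, on any server, converges to the same state.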

// pattern B · build new, destroy old

Build a brand-new VM or container image from an automated process, deploy it, then destroy the old one or rotate it out. Manual changes are not allowed: the only path to prod is through the repo.

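Pattern B is what most modern stacks (Kubernetes, ECS, serverless) push you toward by default. The result: drift has almost nowhere to creep in, because any manual change disappears the next time the instance is replaced.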
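As a rough illustration of pattern B, the sketch below replaces a whole fleet instead of patching it. Everything here is an assumption made for the example: the `Instance` class, the `roll_out` function, the image tags, and the `healthy` callback stand in for whatever your platform actually provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Instance:
    """An immutable instance: frozen, so it cannot be edited after creation."""
    instance_id: str
    image: str


def roll_out(
    fleet: list[Instance],
    new_image: str,
    healthy: Callable[[Instance], bool],
) -> list[Instance]:
    """Replace every instance with a fresh one built from new_image.

    If any replacement fails its health check, the old fleet is returned unchanged.
    """
    replacements = [
        Instance(f"{new_image}-{n}", new_image) for n in range(len(fleet))
    ]
    if not all(healthy(inst) for inst in replacements):
        return fleet  # keep the old fleet, discard the new instances
    return replacements  # old instances are dropped, never patched


old_fleet = [Instance("app:1.0-0", "app:1.0"), Instance("app:1.0-1", "app:1.0")]
new_fleet = roll_out(old_fleet, "app:1.1", healthy=lambda inst: True)
kept_fleet = roll_out(old_fleet, "app:bad", healthy=lambda inst: False)
```

There is deliberately no function that modifies a running instance. The only way to change what runs is to produce a new image and roll it out.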

// what immutable kills · the snowflake server menagerie

drift: "Works on box-A, fails on box-B."

fragile artifacts: One restart and it's gone forever.

works of art: "Only Carlos knows how to rebuild that."

snowflakes: Each server unique, none reproducible.
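Drift of the "works on box-A, fails on box-B" kind can be made visible by fingerprinting each server's configuration and comparing it with a baseline. The sketch below is a minimal version of that idea. The server names, config keys, and the `fingerprint` and `find_drift` helpers are hypothetical, and a real check would collect the config from the hosts themselves.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable hash of a config: key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(baseline: dict, servers: dict[str, dict]) -> list[str]:
    """Return the names of servers whose config differs from the baseline."""
    expected = fingerprint(baseline)
    return sorted(
        name for name, cfg in servers.items() if fingerprint(cfg) != expected
    )


BASELINE = {"java": "17", "max_conns": 200}
SERVERS = {
    "box-a": {"java": "17", "max_conns": 200},
    "box-b": {"max_conns": 200, "java": "11"},  # hand-edited at some point
    "box-c": {"max_conns": 200, "java": "17"},  # same content, different key order
}
drifted = find_drift(BASELINE, SERVERS)  # only box-b differs from the baseline
```

With immutable infrastructure this check should always come back empty, since every instance is built from the same image.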

// keep pre-prod current

Developers will want to stay on old environments because they're afraid an env update will break something. Update anyway, and do it frequently. The earlier you find env-related breakage, the cheaper the fix. According to GitHub's 2020 State of the Octoverse report, keeping your software current is the single best way to secure the codebase. The same holds for infrastructure.

// case study · hotel co. ran $30b of revenue in containers (2020)

A major hotel chain (covered in the DevOps Handbook 2nd ed.) ran $30B of bookings through a containerized platform. The whole point of the case: at that scale, you cannot maintain pets. Every container is immutable, every deploy creates fresh ones, every problem is fixed by killing and re-spawning. The cattle model is the only model that survives at scale — which means it's the model to start with at small scale, too.

// redefining "done"

At the end of each development interval — or more often — every feature must be integrated, tested, working, and potentially shippable, demonstrated in a production-like environment. "Done" isn't "the unit tests pass on my laptop." It's "this is running end-to-end in something that looks like prod, and we could ship it now if we chose to."
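One way to make this definition concrete is to encode it as an explicit gate. The sketch below is only an illustration: the `Story` class, its field names, and the `missing_for_done` helper are hypothetical, and the three criteria are the ones this section discusses.

```python
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    unit_tests_pass: bool = False
    review_approved: bool = False
    demoed_in_prod_like_env: bool = False


def missing_for_done(story: Story) -> list[str]:
    """List the criteria a story still fails; an empty list means done."""
    checks = {
        "unit tests pass": story.unit_tests_pass,
        "code review approved": story.review_approved,
        "demonstrated in a production-like environment": story.demoed_in_prod_like_env,
    }
    return [name for name, ok in checks.items() if not ok]


laptop_only = Story("search filters", unit_tests_pass=True, review_approved=True)
shippable = Story("search filters", True, True, True)
```

Under this gate, `laptop_only` is still reported as missing the production-like demonstration, while `shippable` has nothing outstanding.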

By the end of the project, the code has been deployed and run in production-like environments hundreds or thousands of times, which means most of the deploy problems are already found and fixed before they would have hit a customer.

Knowledge Check

Question 1/2

A team SSHs into production to apply a hotfix — 'just this once, we will codify it later.' What happens?

Question 2/2

A team's definition of done is 'unit tests pass + code review approved.' A senior dev proposes adding 'has been demonstrated in a production-like environment.' What's the upside?
