Cloud Architecture Guide
DAY_03 / SECTION_03 // SCALE

Consensus & lock services

When multiple machines must agree on one truth — which replica is master, who holds the lock, what config is current — you need a consensus algorithm. The cost of getting this wrong shows up in real-world outages constantly.

LOCK_01

Chubby

Google · the original
Filesystem-like API for coarse-grained locks. Replicated across datacenter locations. Uses the Paxos protocol to reach consensus over an asynchronous network. Many Google services depend on it.
LOCK_02

etcd

Kubernetes' choice
Open-source, written in Go, uses Raft (a newer, simpler algorithm with the same guarantees as Paxos). Kubernetes stores its entire desired cluster state in etcd.
LOCK_03

ZooKeeper

Big-data heritage
Used by Kafka, Hadoop, HBase. Hierarchical namespace (znodes). Older but battle-tested at scale. Increasingly replaced by etcd in newer designs.
// what they enable
master election

5 replicas, only 1 active — pick which one safely.

distributed config

Everyone agrees on the same current values.

leader/follower coord

Replicas know who's primary at any moment.

critical state

The data that must be consistent.

// paxos · the consensus problem at a high level
Paxos protocol diagram
// proposers · acceptors · learners

Multiple processes need to agree on a value despite failures. Proposers propose values, Acceptors vote on them, Learners receive the agreed value. The algorithm is robust against network partitions and failures — distributed storage stays consistent as long as a majority of acceptors can still talk.
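The two phases can be made concrete with a single-decree Paxos sketch, a synchronous, in-memory teaching toy, not an implementation you would deploy. It shows the safety property: once a majority has accepted a value, any later proposer is forced to adopt it.

```go
package main

import "fmt"

// acceptor holds the Paxos acceptor state.
type acceptor struct {
	promised  int    // highest prepare number promised
	acceptedN int    // proposal number of the accepted value (0 = none)
	acceptedV string // accepted value, if any
}

// prepare: phase 1. The acceptor promises to ignore lower-numbered
// proposals, and reports any value it has already accepted.
func (a *acceptor) prepare(n int) (ok bool, accN int, accV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedN, a.acceptedV
}

// accept: phase 2. The acceptor accepts unless it has promised a
// higher-numbered proposal in the meantime.
func (a *acceptor) accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised, a.acceptedN, a.acceptedV = n, n, v
	return true
}

// propose runs both phases; the chosen value may differ from want if
// some value was already accepted by a majority.
func propose(acceptors []*acceptor, n int, want string) (string, bool) {
	majority := len(acceptors)/2 + 1

	// Phase 1: gather promises; adopt the highest already-accepted value.
	promises, bestN, value := 0, 0, want
	for _, a := range acceptors {
		if ok, accN, accV := a.prepare(n); ok {
			promises++
			if accN > bestN {
				bestN, value = accN, accV
			}
		}
	}
	if promises < majority {
		return "", false
	}

	// Phase 2: ask acceptors to accept the (possibly adopted) value.
	accepts := 0
	for _, a := range acceptors {
		if a.accept(n, value) {
			accepts++
		}
	}
	return value, accepts >= majority
}

func main() {
	cluster := []*acceptor{{}, {}, {}, {}, {}}
	v, _ := propose(cluster, 1, "replica-2 is master")
	// A later proposer with a different value converges on the
	// already-chosen one.
	v2, _ := propose(cluster, 2, "replica-4 is master")
	fmt.Println(v, v2 == v)
}
```

The toy leaves out everything that makes real Paxos hard: message loss, retries, duelling proposers, and persistence of acceptor state across crashes.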

// you don't usually implement paxos — you use a lock service that does. raft is a more recent, simpler algorithm with the same guarantees.

// why this matters in practice

If you've ever read a postmortem that says "X became unavailable when ZooKeeper went down" or "the etcd cluster lost quorum" — this is why.

Lock services are a central dependency with cascade potential: if they go down, everything that relies on them for elections, locks, and config goes with them. They're replicated for exactly that reason — but quorum loss (e.g. 2 of 3 nodes down) still takes the whole service offline.
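The quorum arithmetic is worth internalizing: a cluster of n nodes needs a majority of n/2 + 1 to make progress, so it tolerates n minus that many failures. A minimal sketch:

```go
package main

import "fmt"

// quorum returns the majority size for an n-node cluster and how many
// node failures it can survive while still making progress.
func quorum(n int) (majority, tolerated int) {
	majority = n/2 + 1
	return majority, n - majority
}

func main() {
	for _, n := range []int{3, 5} {
		m, f := quorum(n)
		fmt.Printf("%d nodes: quorum=%d, tolerates %d failure(s)\n", n, m, f)
	}
	// 3 nodes: quorum=2, tolerates 1 failure(s)
	// 5 nodes: quorum=3, tolerates 2 failure(s)
}
```

This is also why clusters use odd sizes: 4 nodes need a quorum of 3 and tolerate only 1 failure, the same as 3 nodes, while costing one more machine.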

Knowledge Check
Question 1/1

A team running self-hosted Kubernetes has 3 etcd nodes. One is down for maintenance and another fails. What happens?

// pick one to verify