Observability & scale walkthrough
The same console as Day 1 — different layers. Metrics, alarms, audit, autoscaling. The drama is when an instance goes unhealthy and the ASG replaces it live.
- 01
CloudWatch Metrics
Pick the EC2 instance from Day 1. Find CPU / network / disk. Show the per-AZ split if multi-AZ.
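For anyone who wants the same numbers outside the console, here is a minimal sketch, assuming boto3 is installed and credentials are configured. The helper names `cpu_query` and `fetch_cpu` are hypothetical; the instance ID is whatever Day 1 produced.

```python
from datetime import datetime, timedelta, timezone


def cpu_query(instance_id, minutes=60, period=300, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds covered by each datapoint
        "Statistics": ["Average", "Maximum"],
    }


def fetch_cpu(instance_id):
    """Call CloudWatch. Needs boto3 and AWS credentials, so nothing runs on import."""
    import boto3  # lazy import: the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**cpu_query(instance_id))
    # Datapoints come back unordered, so sort them by time before plotting.
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Swapping `MetricName` for `NetworkIn` or `NetworkOut` gives the network view with the same call.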
- 02
Custom metric
Push one from CLI or Lambda. Talk through what custom metrics cost and when they're worth it.
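A possible shape for the push, as a sketch only. The namespace `Workshop/Demo`, the metric `QueueDepth`, and the `Service` dimension are made-up illustrations, and `publish` assumes boto3 plus credentials.

```python
from datetime import datetime, timezone


def queue_depth_datum(depth, service, now=None):
    """Build one MetricData entry for put_metric_data (all names are illustrative)."""
    return {
        "MetricName": "QueueDepth",
        "Dimensions": [{"Name": "Service", "Value": service}],
        "Timestamp": now or datetime.now(timezone.utc),
        "Value": float(depth),
        "Unit": "Count",
    }


def publish(depth, service="checkout", namespace="Workshop/Demo"):
    """Send the datapoint. Needs boto3 and AWS credentials."""
    import boto3  # lazy import so the builder can be used without boto3

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=[queue_depth_datum(depth, service)],
    )
```

One point for the cost discussion: CloudWatch treats every unique combination of namespace, metric name, and dimension values as a separate metric, so a high-cardinality dimension (a request ID, say) multiplies the number of metrics being billed.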
- 03
Set up an alarm
CPU > 80% for 5 min → SNS email. Discuss alerting hygiene — what should page someone, what shouldn't.
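The alarm on this slide could be expressed roughly as follows. This is a sketch: `cpu_alarm` and `create_alarm` are hypothetical helpers, the topic ARN is assumed to point at an SNS topic with an email subscription already confirmed, and a single 300-second period is one of several ways to encode "for 5 min".

```python
def cpu_alarm(instance_id, topic_arn, threshold=80.0, minutes=5):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": minutes * 60,   # one evaluation window of N minutes
        "EvaluationPeriods": 1,   # alarm after a single breaching window
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic that sends the email
    }


def create_alarm(instance_id, topic_arn):
    """Create or update the alarm. Needs boto3 and AWS credentials."""
    import boto3  # lazy import so the builder can be used without boto3

    boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm(instance_id, topic_arn))
```

For the hygiene discussion, `Period` and `EvaluationPeriods` are the knobs that trade detection speed against noise: a longer window or more periods means fewer false pages but a slower alert.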
- 04
CloudTrail audit
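Find a single API call from Day 1's demo. Show retention and tamper evidence (Event history keeps 90 days of management events; a trail writing to S3 with log file integrity validation covers anything longer), and why this is the source of truth in incidents.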
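The same lookup can be scripted. The sketch below assumes the Day 1 call was `RunInstances` (substitute whatever the demo actually did) and that boto3 and credentials are available; `lookup_args` and `find_calls` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def lookup_args(event_name, days=2, now=None):
    """Build keyword arguments for CloudTrail lookup_events."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "MaxResults": 10,
    }


def find_calls(event_name="RunInstances"):
    """Return (time, user, event) tuples. Needs boto3 and AWS credentials."""
    import boto3  # lazy import so the builder can be used without boto3

    events = boto3.client("cloudtrail").lookup_events(**lookup_args(event_name))["Events"]
    return [(e["EventTime"], e.get("Username"), e["EventName"]) for e in events]
```

Each returned event also carries a `CloudTrailEvent` field holding the full JSON record, which is where the source IP and request parameters live.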
- 05
Auto Scaling Group
Create a launch template. Scale on CPU. Walk through min/max/desired and what each one does.
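To make min/max/desired concrete, here is a toy model. It is a deliberate simplification for discussion, not AWS's actual target tracking algorithm: it scales capacity in proportion to how far CPU is from the target, then clamps the result to the group's bounds. The second function builds the arguments for a real `put_scaling_policy` call; the policy name is made up.

```python
import math


def clamp_desired(desired, minimum, maximum):
    """Scaling never moves capacity outside the [min, max] range."""
    return max(minimum, min(maximum, desired))


def proportional_desired(current, cpu_percent, target_percent, minimum, maximum):
    """Toy model: size the group so average CPU would land near the target."""
    wanted = math.ceil(current * cpu_percent / target_percent)
    return clamp_desired(wanted, minimum, maximum)


def cpu_target_tracking(group_name, target_percent=50.0):
    """Build keyword arguments for Auto Scaling put_scaling_policy."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": "cpu-target-tracking",  # illustrative name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_percent,
        },
    }
```

With two instances at 90% CPU and a 50% target, the toy model asks for four instances; cap the group at three and it stops at three, which is the point of `max`.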
- 06
Application Load Balancer
Attach a target group to the ASG so new instances register with the load balancer automatically. Show health checks, listener rules, and how the LB hides instance churn from clients.
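Health checks are easier to explain with a small simulation. The sketch below models the consecutive-check rule used by target group health checks: a target changes state only after a run of results that disagree with its current state. The threshold values are illustrative, not the defaults of any particular load balancer, and `track_health` is a hypothetical helper.

```python
def track_health(results, healthy_threshold=3, unhealthy_threshold=2, start_healthy=True):
    """Return the target's state after each check (True means healthy)."""
    healthy = start_healthy
    streak = 0  # consecutive results that disagree with the current state
    states = []
    for passed in results:
        if passed == healthy:
            streak = 0  # a result matching the current state resets the run
        else:
            streak += 1
            needed = healthy_threshold if passed else unhealthy_threshold
            if streak >= needed:
                healthy = passed
                streak = 0
        states.append(healthy)
    return states


# One pass, two failures, then three passes in a row.
checks = [True, False, False, True, True, True]
print(track_health(checks))  # [True, True, False, False, False, True]
```

A single failed check between passes never flips the state, which is why one slow response does not pull an instance out of rotation.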
- 07
Take an instance unhealthy
The ASG replaces it. The drama is the lesson: self-healing, watched live, is the teaching moment.
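One way to stage the failure without logging in to the box is to mark the instance unhealthy through the Auto Scaling API and poll the group while it reacts. This is a sketch, assuming boto3, credentials, and a group named by the presenter; `unhealthy_request`, `summarize`, and `break_and_watch` are hypothetical helpers. Stopping the web server so the load balancer check fails is an equally valid route.

```python
def unhealthy_request(instance_id):
    """Build keyword arguments for Auto Scaling set_instance_health."""
    return {
        "InstanceId": instance_id,
        "HealthStatus": "Unhealthy",
        "ShouldRespectGracePeriod": False,  # act now, even inside the grace period
    }


def summarize(group):
    """Reduce one describe_auto_scaling_groups entry to (id, lifecycle, health) rows."""
    return [
        (i["InstanceId"], i["LifecycleState"], i["HealthStatus"])
        for i in group["Instances"]
    ]


def break_and_watch(instance_id, group_name, polls=10, delay=15):
    """Mark the instance unhealthy, then print the group state every few seconds."""
    import time

    import boto3  # lazy import so the helpers above work without boto3

    asg = boto3.client("autoscaling")
    asg.set_instance_health(**unhealthy_request(instance_id))
    for _ in range(polls):
        response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
        print(summarize(response["AutoScalingGroups"][0]))
        time.sleep(delay)
```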
- 08
CloudWatch Logs Insights
Run a query against the log group. Show how observability becomes queryable, not just visible.
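A starting query for the demo might look like the one below. It is a sketch: the `ERROR` pattern and the log group name passed to `run_query` are assumptions about what the Day 1 application writes, and the function needs boto3 and credentials.

```python
def error_query(pattern="ERROR", limit=20):
    """Build a Logs Insights query string for the most recent matching lines."""
    return (
        "fields @timestamp, @message\n"
        f"| filter @message like /{pattern}/\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


def run_query(log_group, minutes=60):
    """Start the query and poll until it finishes. Needs boto3 and AWS credentials."""
    import time

    import boto3  # lazy import so error_query works without boto3

    logs = boto3.client("logs")
    now = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=now - minutes * 60,  # epoch seconds
        endTime=now,
        queryString=error_query(),
    )["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] not in ("Scheduled", "Running"):
            return result
        time.sleep(1)
```

Queries are asynchronous, which is why the sketch polls `get_query_results` instead of expecting rows back from `start_query`.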
From Google SRE — start with these four when instrumenting any service:
Latency: how slow
Traffic: how busy
Errors: how broken
Saturation: how full
// discussion: which of the four are we not covering with this demo, and why?
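To ground the discussion, the four signals can be computed from a handful of request records. This is a toy sketch: the record shape (`ms`, `status`), the use of CPU as the saturation proxy, and the function name `golden_signals` are all assumptions made for illustration.

```python
from statistics import quantiles


def golden_signals(requests, window_seconds, cpu_percent):
    """Summarize one time window of request records into the four signals."""
    latencies = [r["ms"] for r in requests]
    server_errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        # how slow: 95th percentile latency (needs at least two requests)
        "latency_p95_ms": quantiles(latencies, n=20)[-1],
        # how busy: requests per second over the window
        "traffic_rps": len(requests) / window_seconds,
        # how broken: share of requests that failed server-side
        "error_rate": server_errors / len(requests),
        # how full: CPU used here as a stand-in for the scarcest resource
        "saturation": cpu_percent / 100,
    }


sample = [
    {"ms": 120, "status": 200},
    {"ms": 95, "status": 200},
    {"ms": 480, "status": 500},
    {"ms": 110, "status": 200},
]
print(golden_signals(sample, window_seconds=2, cpu_percent=65))
```

Comparing these outputs with what the demo actually alarms on is one way to answer the discussion question.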
- Find any AWS metric, log, or audit event in under a minute
- Set an alarm and explain why it should fire
- Read an Auto Scaling Group config
- Justify what's missing from "good-enough" observability