Why Shell-Native Monitoring Matters Even With Vendors Everywhere
You have Datadog. You have Prometheus. You have CloudWatch. So why write monitoring in shell?
Because the gap between “what your vendor sees” and “what’s actually true” is exactly the surface where outages live. Vendor agents collect what their schema knows about; they don’t know that your nightly batch job emits a sentinel file at /var/lib/jobs/last-success, that your custom build of nginx puts a status JSON at /run/nginx/status.json, or that the real health of your app is “the queue depth in /var/spool/myapp is < 1000.” Shell-native monitoring lets you measure exactly those things.
The four shell-script monitoring patterns:
| Pattern | What it measures | Where it runs | Output |
|---|---|---|---|
| Textfile exporter | Custom metrics for Prometheus | Cron / timer | Files in /var/lib/node_exporter/textfile_collector/ |
| HTTP health probe | Liveness/readiness for LB or k8s | Service container | HTTP 200 or 5xx via simple HTTP server |
| Watchdog | Detect “alive but stuck” | sd_notify or external | systemd restart / alert |
| Push agent | Active reporting to dashboards | Continuous | HTTP POST to ingestion endpoint |
This lesson teaches the discipline of each pattern, the Prometheus exposition format your scripts must produce, the difference between liveness and readiness probes (and why getting it wrong cascades incidents), and a lib/metrics.sh you can source.
The Prometheus Exposition Format (5-Minute Tutorial)
Prometheus scrapes targets that expose metrics in a specific text format. The format is plain text and line-oriented, which makes it simple enough for shell scripts to emit:
# HELP myapp_jobs_total Total jobs processed.
# TYPE myapp_jobs_total counter
myapp_jobs_total{queue="orders",status="success"} 12345
myapp_jobs_total{queue="orders",status="failure"} 17
myapp_jobs_total{queue="billing",status="success"} 9821
# HELP myapp_queue_depth Current queue depth.
# TYPE myapp_queue_depth gauge
myapp_queue_depth{queue="orders"} 42
myapp_queue_depth{queue="billing"} 7
# HELP myapp_request_duration_seconds Request latency.
# TYPE myapp_request_duration_seconds histogram
myapp_request_duration_seconds_bucket{le="0.1"} 1450
myapp_request_duration_seconds_bucket{le="0.5"} 1490
myapp_request_duration_seconds_bucket{le="1.0"} 1500
myapp_request_duration_seconds_bucket{le="+Inf"} 1500
myapp_request_duration_seconds_sum 234.5
myapp_request_duration_seconds_count 1500
The four metric types
| Type | Semantic | Example |
|---|---|---|
| counter | Monotonically increasing; resets only on process restart | requests_total, errors_total |
| gauge | Goes up and down | memory_bytes, queue_depth |
| histogram | Sample distribution into buckets | request_duration_seconds |
| summary | Like histogram but with quantiles computed at the source | Less common in shell |
Format rules
- One sample per line.
- Optional # HELP <name> <text> and # TYPE <name> <type> lines describe the metric.
- Labels go in {key="value",key2="value2"}: comma-separated, double-quoted values.
- The value is a float (integers are accepted).
- An optional trailing timestamp in milliseconds since epoch (rarely used).
- Empty lines and extra whitespace between tokens are ignored.
- The whole exposition must end with a newline.
The format is forgiving about whitespace but strict about structure: a missing newline at the end, or unquoted label values, breaks parsing.
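The rules above can be folded into a small helper so that individual scripts never hand-assemble sample lines. This is a minimal sketch; the function names prom_escape_label and prom_sample are hypothetical and not part of the library shown later in this lesson.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for building exposition-format sample lines.

# Escape a label value: backslash first, then double quote, then newline.
prom_escape_label() {
  local v="$1"
  v=${v//\\/\\\\}
  v=${v//\"/\\\"}
  v=${v//$'\n'/\\n}
  printf '%s' "$v"
}

# usage: prom_sample NAME VALUE [label_key label_value]...
# Prints one sample line terminated by a newline.
prom_sample() {
  local name="$1" value="$2"
  shift 2
  local labels="" sep=""
  while (( $# >= 2 )); do
    labels+="${sep}$1=\"$(prom_escape_label "$2")\""
    sep=","
    shift 2
  done
  if [[ -n "$labels" ]]; then
    printf '%s{%s} %s\n' "$name" "$labels" "$value"
  else
    printf '%s %s\n' "$name" "$value"
  fi
}
```

For example, prom_sample myapp_queue_depth 42 queue orders prints myapp_queue_depth{queue="orders"} 42. Escaping the backslash before the quote matters: doing it in the other order would double the backslashes that the quote escape just added.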
Pattern 1: Textfile Exporter
The simplest pattern. node_exporter (the standard host-metrics agent) has a --collector.textfile.directory flag that picks up any *.prom file from a directory and exposes its contents as part of its scrape output.
# /etc/cron.d/nightly-job-metrics
*/5 * * * * root /opt/myapp/bin/emit-metrics.sh
# /opt/myapp/bin/emit-metrics.sh
#!/usr/bin/env bash
set -Eeuo pipefail
OUT=/var/lib/node_exporter/textfile_collector
TMP=$(mktemp "${OUT}/myapp.prom.XXXXXX")
trap 'rm -f "$TMP"' EXIT
# Compute metrics.
queue_depth=$(find /var/spool/myapp -type f | wc -l)
last_success_age=$(( $(date +%s) - $(stat -c %Y /var/lib/myapp/last-success 2>/dev/null || echo 0) ))
disk_used_pct=$(df --output=pcent /var/lib/myapp | tail -1 | tr -d ' %')
# Emit.
cat >"$TMP" <<EOF
# HELP myapp_queue_depth Pending jobs.
# TYPE myapp_queue_depth gauge
myapp_queue_depth $queue_depth
# HELP myapp_last_success_age_seconds Seconds since last successful run.
# TYPE myapp_last_success_age_seconds gauge
myapp_last_success_age_seconds $last_success_age
# HELP myapp_disk_used_percent Disk usage of /var/lib/myapp.
# TYPE myapp_disk_used_percent gauge
myapp_disk_used_percent $disk_used_pct
EOF
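# Atomic move into place, so the exporter never reads a partial file.
# mktemp creates the file with mode 0600; make it readable by the node_exporter user first.
chmod 0644 "$TMP"
mv "$TMP" "$OUT/myapp.prom"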
trap - EXIT
The atomic-move-from-tmp pattern is critical: node_exporter reads *.prom files at scrape time. If you write directly with >, the exporter can read a half-written file and emit garbage to Prometheus. Always tmp-then-rename in the same directory.
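The tmp-then-rename step can be factored into one reusable function. The sketch below assumes a hypothetical helper name, write_atomic, that reads the file content from stdin; it is an illustration of the pattern, not part of node_exporter.

```shell
#!/usr/bin/env bash
# write_atomic DEST: read stdin into a temp file next to DEST, then rename it.
# The temp file lives in the same directory so the rename stays on one filesystem,
# and its name (DEST.XXXXXX) does not match the *.prom glob the exporter reads.
write_atomic() {
  local dest="$1" tmp
  tmp=$(mktemp "${dest}.XXXXXX") || return 1
  if cat >"$tmp" && chmod 0644 "$tmp" && mv -f "$tmp" "$dest"; then
    return 0
  fi
  rm -f "$tmp"   # clean up on any failure so temp files do not accumulate
  return 1
}
```

Usage: printf 'myapp_up 1\n' | write_atomic /var/lib/node_exporter/textfile_collector/myapp.prom.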
Why textfile is the right pattern for batch / scheduled work
- No HTTP server in your script.
- node_exporter is already running on the host; you piggyback on its endpoint.
- Cron-driven, so works for jobs that run periodically (backups, sync jobs, batch).
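- Survives if your script dies: the .prom file remains, Prometheus keeps scraping the stale values, and you can alert on freshness (node_exporter exposes each file's modification time as node_textfile_mtime_seconds).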
Pattern 2: HTTP Health Probe
For liveness/readiness probes, you need an HTTP endpoint. The dead-simple way is socat or ncat listening on a port and returning a static or computed response:
# /opt/myapp/bin/healthd
#!/usr/bin/env bash
set -Eeuo pipefail
PORT=${PORT:-8080}
while :; do
# ncat -k (--keep-open) keeps listening and runs the handler once per connection;
# the outer loop only restarts ncat if it exits.
ncat -l -p "$PORT" -k -e /opt/myapp/bin/health-handler.sh
done
# /opt/myapp/bin/health-handler.sh — invoked per request
#!/usr/bin/env bash
set -Eeuo pipefail
# Compute health.
last_heartbeat=$(stat -c %Y /var/lib/myapp/heartbeat 2>/dev/null || echo 0)
age=$(( $(date +%s) - last_heartbeat ))
if (( age < 30 )); then
status_code="200 OK"
body='{"status":"healthy","heartbeat_age_seconds":'"$age"'}'
else
status_code="503 Service Unavailable"
body='{"status":"unhealthy","heartbeat_age_seconds":'"$age"'}'
fi
# Read the request line (we don't care about its contents but must consume).
read -r request_line || true
printf 'HTTP/1.1 %s\r\n' "$status_code"
printf 'Content-Type: application/json\r\n'
printf 'Content-Length: %d\r\n' "${#body}"
printf 'Connection: close\r\n'
printf '\r\n'
printf '%s' "$body"
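One detail in the handler is worth isolating: ${#body} counts characters, while Content-Length must be a byte count. The two agree for ASCII bodies but diverge as soon as a body contains multi-byte characters. A minimal sketch, assuming a hypothetical helper named http_respond, that counts bytes with wc -c instead:

```shell
#!/usr/bin/env bash
# http_respond STATUS BODY: write a complete HTTP/1.1 response to stdout.
# Content-Length is computed in bytes (wc -c), not characters.
http_respond() {
  local status="$1" body="$2" len
  len=$(( $(printf '%s' "$body" | wc -c) ))
  printf 'HTTP/1.1 %s\r\n' "$status"
  printf 'Content-Type: application/json\r\n'
  printf 'Content-Length: %d\r\n' "$len"
  printf 'Connection: close\r\n'
  printf '\r\n'
  printf '%s' "$body"
}
```

The handler could then end with http_respond "$status_code" "$body" in place of the six printf calls.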
For a Kubernetes liveness probe:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
Liveness vs Readiness: get this distinction right
Liveness: “is the process alive enough to be useful?” Failed liveness → restart the process. Should rarely fail; a transient failure ≠ kill.
Readiness: “is the process ready to serve traffic right now?” Failed readiness → remove from load-balancer rotation. Can flap freely; useful for “still warming up” or “circuit breaker open.”
Common bug: making liveness check too strict (e.g., requires an external database) — when the DB blips, all replicas fail liveness, k8s restarts them all simultaneously, and the DB blip becomes a full outage.
Rule: liveness checks only test “this process is responsive.” Readiness checks test “I’m fully functional.” External dependency checks belong in readiness, never in liveness.
# /healthz/live — only checks process self
liveness_check() {
# Did our event loop tick recently?
local hb_age
hb_age=$(( $(date +%s) - $(stat -c %Y /var/lib/myapp/heartbeat 2>/dev/null || echo 0) ))
(( hb_age < 60 )) # tolerate up to 60s — restart is expensive
}
# /healthz/ready — checks downstream dependencies
readiness_check() {
# Can we reach the database?
pg_isready -h "$DB_HOST" -p 5432 -t 2 >/dev/null 2>&1 || return 1
# Is the cache warm?
[[ -f /var/lib/myapp/cache.warm ]] || return 1
return 0
}
Pattern 3: Watchdog — Detecting “Alive But Stuck”
A liveness probe answers “is the process responding?” A watchdog answers “is the process making progress?” — a much harder question.
The pattern:
- The main loop writes a “heartbeat” timestamp on every iteration.
- A separate watchdog reads the heartbeat; if it’s stale, the process is stuck.
- Action: kill the process (so systemd restarts it) or trigger an alert.
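The reader side of the three steps above can be sketched as a single predicate. The function name heartbeat_is_stale is hypothetical, and stat -c %Y assumes GNU coreutils, as in the rest of this lesson. A missing heartbeat file is treated as stale here; whether that is right depends on how your service starts up.

```shell
#!/usr/bin/env bash
# heartbeat_is_stale FILE MAX_AGE
# Exit 0 (stale) if FILE is missing or its mtime is more than MAX_AGE seconds old;
# exit 1 (fresh) otherwise.
heartbeat_is_stale() {
  local file="$1" max_age="$2" now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$file" 2>/dev/null) || return 0
  (( now - mtime > max_age ))
}
```

The writer side is just touch on the same file at the end of every loop iteration; the watchdog then calls heartbeat_is_stale /var/lib/myapp/heartbeat 120 and acts only when it succeeds.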
sd_notify watchdog (preferred for systemd-managed services)
# In your service script:
main_loop() {
systemd-notify --ready --status="Started"
while :; do
process_one_batch || break
systemd-notify WATCHDOG=1 --status="Last batch: $(date -u +%FT%TZ)"
sleep 5
done
}
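In the unit file:
[Service]
Type=notify
NotifyAccess=all
WatchdogSec=30
Restart=on-failure
WatchdogSec=30 means that if 30s pass without WATCHDOG=1, systemd considers the service stuck, kills it, and (with Restart=on-failure) restarts it. NotifyAccess=all is needed because systemd-notify runs as a child process of the script; with the default NotifyAccess=main, systemd only accepts notifications from the main PID. The script must call systemd-notify WATCHDOG=1 more often than every 30s. Half the timeout is a good upper bound for the interval (so a 30s watchdog → ping at least every 15s).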
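Rather than hard-coding the ping interval, a script can derive it from the environment: when WatchdogSec is set, systemd exports the timeout to the service as WATCHDOG_USEC (in microseconds). The sketch below assumes a hypothetical function name, watchdog_ping_interval, and falls back to a caller-supplied default when the variable is absent (for example, when the script is run by hand).

```shell
#!/usr/bin/env bash
# watchdog_ping_interval [FALLBACK_SECONDS]
# Print half of the systemd watchdog timeout in whole seconds (minimum 1),
# or the fallback (default 15) when WATCHDOG_USEC is not set.
watchdog_ping_interval() {
  local usec="${WATCHDOG_USEC:-0}" fallback="${1:-15}" secs
  if (( usec > 0 )); then
    secs=$(( usec / 2000000 ))   # microseconds -> seconds, then halved
    if (( secs < 1 )); then
      secs=1
    fi
    echo "$secs"
  else
    echo "$fallback"
  fi
}
```

In the main loop this would replace the fixed sleep, for example sleep "$(watchdog_ping_interval)", provided one batch reliably finishes within the remaining half of the timeout.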
External watchdog (for non-systemd or cron-driven contexts)
# /opt/myapp/bin/watchdog.sh — runs from cron every minute
#!/usr/bin/env bash
set -Eeuo pipefail
HEARTBEAT=/var/lib/myapp/heartbeat
MAX_AGE=120
PID_FILE=/var/run/myapp.pid
if [[ ! -f "$HEARTBEAT" || ! -f "$PID_FILE" ]]; then
echo "watchdog: no heartbeat or pidfile; nothing to do" >&2
exit 0
fi
age=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT") ))
pid=$(cat "$PID_FILE")
if (( age > MAX_AGE )) && kill -0 "$pid" 2>/dev/null; then
echo "watchdog: pid=$pid heartbeat is ${age}s stale; SIGTERM"
kill -TERM "$pid"
sleep 5
if kill -0 "$pid" 2>/dev/null; then
echo "watchdog: pid=$pid still alive; SIGKILL"
kill -KILL "$pid"
fi
fi
The kill-with-grace pattern: SIGTERM first, give 5 seconds, then SIGKILL. SIGTERM lets the process clean up (close DB connections, flush buffers); SIGKILL is the hammer when grace is over.
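A fixed sleep 5 always waits the full grace period, even when the process exits immediately. A variant that polls once per second and escalates only when the grace period runs out might look like the sketch below; terminate_with_grace is a hypothetical name, not an existing tool.

```shell
#!/usr/bin/env bash
# terminate_with_grace PID [GRACE_SECONDS]
# Send SIGTERM, poll once per second, and send SIGKILL only if the process
# is still present after GRACE_SECONDS (default 5).
terminate_with_grace() {
  local pid="$1" grace="${2:-5}" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
  while kill -0 "$pid" 2>/dev/null; do
    if (( waited >= grace )); then
      kill -KILL "$pid" 2>/dev/null || true
      return 0
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}
```

As with the script above, this trusts the pidfile; if PIDs can be reused on your host between the heartbeat going stale and the kill, verify the process identity first.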
Pattern 4: Push-Based Reporting
Some monitoring systems ingest metrics over HTTP rather than scraping. A push agent runs continuously, computes metrics, sends them to the ingestion endpoint:
# /opt/myapp/bin/push-agent.sh
#!/usr/bin/env bash
set -Eeuo pipefail
METRICS_URL="${METRICS_URL:?metrics URL required}"
METRICS_TOKEN="${METRICS_TOKEN:?token required}"
INTERVAL=${INTERVAL:-60}
while :; do
# Build payload.
ts=$(date +%s)
payload=$(jq -nc \
--arg ts "$ts" \
--arg host "$(hostname)" \
--arg cpu "$(awk '{print $1}' /proc/loadavg)" \
--arg mem "$(awk '/MemAvailable:/ {print $2}' /proc/meminfo)" \
'{
timestamp: ($ts | tonumber),
host: $host,
metrics: {
load_1min: ($cpu | tonumber),
memory_available_kb: ($mem | tonumber)
}
}')
# Send. Don't crash on transient failures.
if ! curl -fsS -X POST \
-H "Authorization: Bearer $METRICS_TOKEN" \
-H "Content-Type: application/json" \
--max-time 10 \
--data "$payload" \
"$METRICS_URL"; then
echo "$(date -u +%FT%TZ) push failed" >&2
# Continue; maybe next iteration succeeds.
fi
sleep "$INTERVAL"
done
Run it under systemd with Restart=on-failure so a crash doesn’t silently stop reporting.
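The agent above simply waits for the next interval after a failed push. If losing a whole interval of data is too coarse, one option is to retry a few times with a growing delay before giving up. This is a sketch under that assumption; retry_with_backoff is a hypothetical helper and the attempt count and delays are illustrative.

```shell
#!/usr/bin/env bash
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failure. Returns 0 on success, 1 if every attempt failed.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if (( attempt >= max_attempts )); then
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

In the push loop, the curl call would become the COMMAND, for example retry_with_backoff 3 2 curl -fsS ... inside the same if, so a final failure is still logged and the loop continues. Keep the total retry time well below INTERVAL so retries never overlap the next cycle.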
A Drop-In Library: lib/metrics.sh
# lib/metrics.sh — emit Prometheus textfile metrics from any script.
: "${METRICS_DIR:=/var/lib/node_exporter/textfile_collector}"
: "${METRICS_NAMESPACE:=myapp}"
# Internal: collected metrics buffered in associative arrays (bash 4+).
declare -A METRIC_HELP METRIC_TYPE
declare -a METRIC_LINES
metrics_init() {
METRIC_HELP=()
METRIC_TYPE=()
METRIC_LINES=()
}
# Declare a metric. Idempotent.
metrics_declare() {
local name="$1" type="$2" help="$3"
METRIC_HELP["$name"]="$help"
METRIC_TYPE["$name"]="$type"
}
# Add a sample. labels can be empty.
metrics_set() {
local name="$1" value="$2" labels="${3:-}"
if [[ -n "$labels" ]]; then
METRIC_LINES+=("${name}{${labels}} ${value}")
else
METRIC_LINES+=("${name} ${value}")
fi
}
# Increment a counter (read existing, add). Useful for cron-driven counters.
metrics_inc() {
local name="$1" labels="${2:-}" by="${3:-1}"
local file="${METRICS_DIR}/${METRICS_NAMESPACE}.counters"
local key
if [[ -n "$labels" ]]; then
key="${name}{${labels}}"
else
key="${name}"
fi
# File format: "key value"
local current
current=$(awk -v k="$key" '$1==k {print $2; exit}' "$file" 2>/dev/null || echo 0)
current=${current:-0}
local new=$(( current + by ))
# Atomic update via temp file.
local tmp
tmp=$(mktemp "${file}.XXXXXX")
awk -v k="$key" -v v="$new" '
$1==k {print k, v; found=1; next}
{print}
END { if (!found) print k, v }
' "$file" 2>/dev/null > "$tmp" || echo "$key $new" > "$tmp"
mv "$tmp" "$file"
}
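# Emit all collected metrics atomically.
metrics_emit() {
  local outfile="${METRICS_DIR}/${METRICS_NAMESPACE}.prom"
  local counters="${METRICS_DIR}/${METRICS_NAMESPACE}.counters"
  local tmp line metric base suffix stem
  local -a lines=("${METRIC_LINES[@]}")
  local -A seen=()
  # Counters persisted by metrics_inc are stored as ready-made sample lines;
  # without this step they would never reach the .prom file.
  if [[ -f "$counters" ]]; then
    while IFS= read -r line; do
      if [[ -n "$line" ]]; then lines+=("$line"); fi
    done < "$counters"
  fi
  tmp=$(mktemp "${outfile}.XXXXXX")
  {
    for line in "${lines[@]}"; do
      metric="${line%%[ {]*}"
      # Histogram series (_bucket/_sum/_count) share the HELP/TYPE of the declared base name.
      base="$metric"
      for suffix in _bucket _sum _count; do
        stem="${metric%"$suffix"}"
        if [[ -n "$stem" && "$stem" != "$metric" && -n "${METRIC_TYPE[$stem]:-}" ]]; then
          base="$stem"
          break
        fi
      done
      # Emit HELP/TYPE headers once per metric name.
      if [[ -z "${seen[$base]:-}" ]]; then
        seen[$base]=1
        printf '# HELP %s %s\n' "$base" "${METRIC_HELP[$base]:-}"
        printf '# TYPE %s %s\n' "$base" "${METRIC_TYPE[$base]:-untyped}"
      fi
      printf '%s\n' "$line"
    done
  } > "$tmp" || { rm -f "$tmp"; return 1; }
  # mktemp creates the file 0600; make it readable by node_exporter, then rename.
  # (install copies the file, which is not atomic; mv within one directory is.)
  chmod 0644 "$tmp"
  mv "$tmp" "$outfile"
}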
# ─── Health endpoint helpers ───────────────────────────────────────────────
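health_response_ok() {
  # Keep the default in a variable: in "${1:-{...}}" the first } ends the expansion,
  # so a stray } would be appended to the body whenever an argument is passed.
  local default='{"status":"healthy"}'
  local body="${1:-$default}"
  printf 'HTTP/1.1 200 OK\r\n'
  printf 'Content-Type: application/json\r\n'
  printf 'Content-Length: %d\r\n' "${#body}"
  printf 'Connection: close\r\n\r\n'
  printf '%s' "$body"
}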
health_response_unhealthy() {
local reason="${1:-unhealthy}"
local body="{\"status\":\"unhealthy\",\"reason\":\"$reason\"}"
printf 'HTTP/1.1 503 Service Unavailable\r\n'
printf 'Content-Type: application/json\r\n'
printf 'Content-Length: %d\r\n' "${#body}"
printf 'Connection: close\r\n\r\n'
printf '%s' "$body"
}
# ─── Heartbeat ─────────────────────────────────────────────────────────────
heartbeat_write() {
local file="${1:-/var/lib/myapp/heartbeat}"
date -u +%s > "$file"
}
heartbeat_age() {
local file="${1:-/var/lib/myapp/heartbeat}"
echo $(( $(date +%s) - $(stat -c %Y "$file" 2>/dev/null || echo 0) ))
}
Usage:
. /opt/myapp/lib/metrics.sh
metrics_init
metrics_declare myapp_queue_depth gauge "Pending jobs."
metrics_set myapp_queue_depth "$(find /var/spool/myapp -type f | wc -l)"
metrics_declare myapp_jobs_total counter "Total jobs processed."
metrics_set myapp_jobs_total 12345 'queue="orders",status="success"'
metrics_set myapp_jobs_total 17 'queue="orders",status="failure"'
metrics_emit
Real-World Recipes
Recipe 1: Emit metrics about backup freshness
. /opt/myapp/lib/metrics.sh
metrics_init
backup_dir=/backups
metrics_declare backup_age_seconds gauge "Age of latest backup."
metrics_declare backup_size_bytes gauge "Size of latest backup."
metrics_declare backup_count gauge "Number of backups retained."
for app in myapp app2 app3; do
latest=$(ls -t "$backup_dir/$app/"*.tar.gz 2>/dev/null | head -1)
if [[ -n "$latest" ]]; then
age=$(( $(date +%s) - $(stat -c %Y "$latest") ))
size=$(stat -c %s "$latest")
count=$(ls "$backup_dir/$app/"*.tar.gz 2>/dev/null | wc -l)
metrics_set backup_age_seconds "$age" "app=\"$app\""
metrics_set backup_size_bytes "$size" "app=\"$app\""
metrics_set backup_count "$count" "app=\"$app\""
fi
done
metrics_emit
Schedule via cron */5 * * * *. Prometheus alerts on backup_age_seconds > 86400 per app.
Recipe 2: HTTP health endpoint with multiple dependency checks
#!/usr/bin/env bash
. /opt/myapp/lib/metrics.sh
set -Eeuo pipefail
PORT="${PORT:-8080}"
handle_request() {
read -r request_line || return
# Read remaining headers until empty line.
while IFS= read -r line && [[ "$line" != $'\r' ]]; do :; done
local path
path=$(echo "$request_line" | awk '{print $2}')
case "$path" in
/healthz/live)
# Just check we're processing.
if (( $(heartbeat_age /var/lib/myapp/heartbeat) < 60 )); then
health_response_ok '{"status":"alive"}'
else
health_response_unhealthy "heartbeat stale"
fi
;;
/healthz/ready)
# Check downstream dependencies.
local issues=()
pg_isready -h "$DB_HOST" -t 2 >/dev/null 2>&1 || issues+=("db")
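# Redis does not speak HTTP, so probe it with redis-cli (bounded by timeout) rather than curl.
timeout 2 redis-cli -u "$REDIS_URL" ping >/dev/null 2>&1 || issues+=("redis")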
if [[ ${#issues[@]} -eq 0 ]]; then
health_response_ok '{"status":"ready"}'
else
health_response_unhealthy "deps: ${issues[*]}"
fi
;;
*)
printf 'HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n'
;;
esac
}
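# ncat runs "$0 --handle" once per connection, with the socket on stdin/stdout.
# Without this branch the child would start another listener instead of answering.
if [[ "${1:-}" == "--handle" ]]; then
  handle_request
  exit 0
fi
while :; do
  ncat -l -p "$PORT" -e "$0 --handle"
done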
Recipe 3: External watchdog with metrics
# /opt/myapp/bin/watchdog.sh — runs every minute via cron.
. /opt/myapp/lib/metrics.sh
metrics_init
services=(myapp-api myapp-worker myapp-scheduler)
metrics_declare service_active gauge "Service active state (1=active)."
metrics_declare service_restart_count counter "Service restart count."
for svc in "${services[@]}"; do
if systemctl is-active --quiet "$svc"; then
metrics_set service_active 1 "service=\"$svc\""
else
metrics_set service_active 0 "service=\"$svc\""
# Try to restart.
if systemctl restart "$svc"; then
metrics_inc service_restart_count "service=\"$svc\""
fi
fi
done
metrics_emit
Recipe 4: Histogram-style request latency exporter
# Process a log file, emit latency histogram.
# Log format: "GET /api/users 0.342s 200"
. /opt/myapp/lib/metrics.sh
metrics_init
log=/var/log/myapp/access.log
buckets=(0.1 0.5 1.0 2.0 5.0)
declare -A bucket_counts
total_count=0
total_sum=0
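# Read latencies from the last 10,000 log lines (the sample log format carries no
# timestamp to filter on). Note: a windowed count can go down between runs, while
# histogram buckets are counters, so treat this output as a rough snapshot.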
while IFS=' ' read -r _ _ duration _; do
duration=${duration%s}
total_count=$(( total_count + 1 ))
total_sum=$(awk -v s="$total_sum" -v d="$duration" 'BEGIN { print s + d }')
for bucket in "${buckets[@]}"; do
if (( $(awk -v d="$duration" -v b="$bucket" 'BEGIN { print (d <= b) }') )); then
bucket_counts[$bucket]=$((${bucket_counts[$bucket]:-0} + 1))
fi
done
done < <(tail -10000 "$log")
metrics_declare myapp_request_duration_seconds histogram "Request latency."
for bucket in "${buckets[@]}"; do
metrics_set myapp_request_duration_seconds_bucket "${bucket_counts[$bucket]:-0}" "le=\"$bucket\""
done
metrics_set myapp_request_duration_seconds_bucket "$total_count" 'le="+Inf"'
metrics_set myapp_request_duration_seconds_sum "$total_sum"
metrics_set myapp_request_duration_seconds_count "$total_count"
metrics_emit
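The loop above starts one awk process per log line for the sum and another per bucket, which gets slow on large logs. The same cumulative bucket counts can be computed in a single awk pass. A minimal sketch, assuming a hypothetical function histogram_lines that reads one duration per line on stdin (an optional trailing "s" is tolerated) and prints ready-made sample lines:

```shell
#!/usr/bin/env bash
# histogram_lines NAME BUCKET...
# Read durations from stdin, one per line, and print cumulative _bucket lines
# for each upper bound, followed by +Inf, _sum and _count.
histogram_lines() {
  local name="$1"
  shift
  local buckets="$*"
  awk -v name="$name" -v buckets="$buckets" '
    BEGIN { n = split(buckets, b, " ") }
    {
      d = $1 + 0                      # numeric prefix, so "0.342s" becomes 0.342
      sum += d
      count++
      for (i = 1; i <= n; i++)
        if (d <= b[i] + 0) c[i]++     # buckets are cumulative: count every bound >= d
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s_bucket{le=\"%s\"} %d\n", name, b[i], c[i]
      printf "%s_bucket{le=\"+Inf\"} %d\n", name, count
      printf "%s_sum %.6g\n", name, sum
      printf "%s_count %d\n", name, count
    }'
}
```

With the log format from the recipe, the input could come from tail -10000 "$log" | awk '{print $3}', and the output written through the same atomic tmp-then-rename step.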
Footgun List
- Writing .prom files non-atomically. Always tmp-then-rename in the same dir. Otherwise node_exporter reads a half-written file and emits broken metrics.
- Metrics with high cardinality labels. Per-user-ID labels create millions of time series. Limit labels to bounded sets (status code, queue name, region; not request_id).
- Counter going down. Counters must monotonically increase. If your script computes “errors in last 5 min” and emits as a counter, you’ll see negative deltas. Use a gauge for “current snapshot,” counter for “cumulative since process start.”
- Liveness checking external deps. Cascades failures. Liveness only checks self; readiness checks deps.
- Health endpoint without timeout. A hung DB query freezes the health endpoint, k8s thinks pod is dead, restarts it, and the new pod tries the same query and freezes too. Always timeout dep checks: pg_isready -t 2.
- Watchdog with no grace. Killing on the first late heartbeat is wrong if heartbeats are best-effort. Allow 2–3 missed cycles before action.
- Push agent that crashes on transient send failure. Wrap curl in if/then; log the failure and continue. Don’t set -e your way to silent monitor death.
- Forgetting trailing newline in textfile output. Prometheus parsers may reject; always end the file with a newline.
- Label values with quotes/backslashes/newlines. Escape: \\ for backslash, \" for quote, \n for newline.
- Health endpoint that performs writes. Don’t make /healthz insert a row to test the DB. The probe runs every 10 seconds, so you’d flood the DB. Use read-only checks.
- Mixing systemd-notify watchdog with external watchdog. Pick one. Two watchdogs fighting over the same process leads to flapping restarts.
- Sending raw timestamps as metric values. Prometheus expects floats. date +%s is fine; ISO-8601 strings break parsing.
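Not every dependency client has its own timeout flag the way pg_isready does. For the timeout footgun above, one generic approach is to wrap any check in coreutils timeout. A minimal sketch, with dep_check as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# dep_check SECONDS COMMAND [ARGS...]
# Run a dependency check with a hard time limit. Succeeds only if COMMAND
# exits 0 within SECONDS; output is discarded so it cannot leak into an
# HTTP response being written on stdout.
dep_check() {
  local seconds="$1"
  shift
  timeout "$seconds" "$@" >/dev/null 2>&1
}
```

In a readiness handler this reads as dep_check 2 pg_isready -h "$DB_HOST" || issues+=("db"). Keep the sum of all check timeouts below the probe's timeoutSeconds, or the probe itself times out first.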
Quick-Reference Card
┌─ PROMETHEUS METRIC TYPES ─────────────────────────────────────────────┐
│ counter monotonically increasing (requests_total, errors_total) │
│ gauge goes up and down (queue_depth, memory_bytes) │
│ histogram bucketed sample distribution (request_duration_seconds) │
│ summary quantiles computed at source (less common in shell) │
└────────────────────────────────────────────────────────────────────────┘
┌─ EXPOSITION FORMAT ───────────────────────────────────────────────────┐
│ # HELP <metric> <description> │
│ # TYPE <metric> <type> │
│ <metric>{label="value",...} <number> │
│ Trailing newline required │
└────────────────────────────────────────────────────────────────────────┘
┌─ TEXTFILE EXPORTER ───────────────────────────────────────────────────┐
│ Drop *.prom in /var/lib/node_exporter/textfile_collector/ │
│ ATOMIC WRITE: tmp + mv (never `>`) │
│ node_exporter --collector.textfile.directory=... │
│ Schedule via cron or systemd timer │
└────────────────────────────────────────────────────────────────────────┘
┌─ LIVENESS vs READINESS ───────────────────────────────────────────────┐
│ Liveness: am I responsive? (Failed → restart process) │
│ Only checks self; never external deps │
│ Readiness: am I serving traffic? (Failed → remove from LB) │
│ Can check downstream deps; flapping is OK │
└────────────────────────────────────────────────────────────────────────┘
┌─ WATCHDOG PATTERN ────────────────────────────────────────────────────┐
│ Process writes heartbeat (timestamp file) every iteration │
│ Watcher checks heartbeat freshness │
│ Stale → SIGTERM with grace, then SIGKILL │
│ systemd: Type=notify + WatchdogSec=N + systemd-notify WATCHDOG=1 │
└────────────────────────────────────────────────────────────────────────┘
┌─ KUBERNETES PROBE FIELDS ─────────────────────────────────────────────┐
│ initialDelaySeconds wait before first probe (allow startup) │
│ periodSeconds interval between probes │
│ timeoutSeconds per-probe timeout (set to 2-5) │
│ failureThreshold consecutive failures before action │
│ successThreshold consecutive successes (readiness only) │
└────────────────────────────────────────────────────────────────────────┘
What’s Next
Monitoring tells you the system’s state. Backups protect you when the state is wrong. The next lesson, Backup & Restore Scripts: Integrity, Retention, Immutability & Drill Testing, covers the discipline of backups that actually work — checksumming for integrity, retention with grandfather-father-son schemes, immutable backups via S3 Object Lock, and the practice of regularly restoring from backups to verify they’re real.