Why This Capstone Exists
A shell script that is good enough today is the easiest piece of software to ship. A shell script that is still earning its keep three years from now, after the original author has left and the surrounding system has been refactored twice, is one of the hardest. The difference is not language quality — it’s lifecycle discipline.
This capstone consolidates the 41 preceding lessons into the four artifacts every team should keep on the wall:
- The Production Review Checklist — the questions every PR introducing or modifying a production shell script must answer.
- The Lifecycle Policy — the documented states a script lives through, from prototype to active to deprecated to retired.
- The Standard Metrics Surface — what every production shell script must emit so monitoring sees it.
- The Sunset Criteria — explicit triggers for retiring a script before it becomes a maintenance hazard.
Following these four artifacts turns “shell scripts as a graveyard of one-off tools” into “shell scripts as a sustainable engineering surface.”
The Library Family From The Series
Across L1-L41 we built a layered library of shell helpers. Each script in production should source the libraries relevant to its role:
| Library | From lesson | Purpose |
|---|---|---|
| lib/log.sh | L7 | Structured logging with levels, JSON output, log rotation |
| lib/err.sh | L8 | Error trap, stack trace, cleanup on exit |
| lib/fs.sh | L28 | Atomic file writes, temp-in-same-dir, fsync helpers |
| lib/lock.sh | L21 | flock single-instance, distributed locks via Redis |
| lib/observe.sh | L25 | Tracing helpers, structured events, span-id propagation |
| lib/secrets.sh | L24 | Vault/SSM lookup, credential masking, never-print discipline |
| lib/test.sh | L31 | Bats integration, fixture setup/teardown, assertion helpers |
| lib/metrics.sh | L34 | Prometheus textfile exporter, atomic .prom file writes |
| lib/backup.sh | L35 | sha256 manifests, GFS retention, S3 upload + verify |
| lib/db.sh | L36 | pg_dump/mysqldump pipelines, base backup, PITR drill |
| lib/loganalyze.sh | L37 | Streaming awk, mawk detection, fleet fan-out |
| lib/heal.sh | L38 | Detect-decide-act, idempotency, rate limit, circuit breaker |
| lib/migrate.sh | L39 | Resumable batch, watermark, staging cutover, sample-diff |
| lib/compliance.sh | L40 | Controls-as-tests, JSONL bundles, GPG signing, drift |
| lib/forensics.sh | L41 | Order-of-volatility capture, chain-of-custody log |
A real production deploy keeps these in /usr/local/lib/ with mode 0644, owned by root:root. Scripts source them from the canonical path. Updates ship via the same Ansible/Puppet/cloud-init that ships system config — version-controlled, reviewed, and rolled out with the same discipline as any other infrastructure code.
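Sourcing from the canonical path can be guarded so that a missing or unreadable library fails loudly instead of producing a confusing "command not found" later. The following is a minimal sketch, not part of the series libraries: `lib_path` and the `LIB_DIR` override are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# LIB_DIR is an assumed override hook (handy in tests); production uses the canonical path.
LIB_DIR="${LIB_DIR:-/usr/local/lib}"

# lib_path NAME: print the full path of a library, or fail with a message on stderr.
lib_path() {
  local lib_file="${LIB_DIR}/$1"
  if [[ ! -r "$lib_file" ]]; then
    printf 'fatal: required library not readable: %s\n' "$lib_file" >&2
    return 1
  fi
  printf '%s\n' "$lib_file"
}

# Intended use, at the top level of a script (not inside a function):
#   log_lib="$(lib_path log.sh)"   # errexit aborts here if the library is missing
#   source "$log_lib"
```

The helper only resolves the path and leaves the `source` call at the top level of the script, so anything the library declares stays global.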
The Production Review Checklist
Every PR that introduces or modifies a production shell script must satisfy seven categories. The checklist is intentionally long: most items are one-line yes/no questions, and the explicit list ensures nothing is forgotten.
Category 1: Boilerplate & Shell Mode
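#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail
IFS=$'\n\t'
# Assign first, then mark readonly, so a failing command substitution is not masked (shellcheck SC2155).
SCRIPT_NAME="$(basename "$0")"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME SCRIPT_DIR
# Define the handler before installing the trap that calls it.
on_exit() { local rc=$1; ...; }
trap 'on_exit $?' EXIT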
Category 2: Argument Handling
Category 3: Error Handling
Category 4: Idempotency & Safety
Category 5: Observability
Category 6: Testing
Category 7: Documentation
The Single-Page Reviewer’s Checklist
For pasting into a PR template:
SHELL SCRIPT REVIEW
Boilerplate
[ ] #!/usr/bin/env bash
[ ] set -euo pipefail (or justified)
[ ] EXIT trap for cleanup
[ ] readonly script-self vars
Args & UX
[ ] --help with examples
[ ] --dry-run if mutating
[ ] Required args validated
Error handling
[ ] No silent 2>/dev/null
[ ] stderr for errors
[ ] Documented exit codes
Safety
[ ] Idempotent on re-run
[ ] Atomic file writes (lib/fs.sh)
[ ] No hardcoded secrets
[ ] Inputs sanitized
Observability
[ ] Structured logs
[ ] Heartbeat metric
[ ] Audit log if mutating
Testing
[ ] Bats tests in tests/
[ ] shellcheck clean
[ ] shfmt clean
[ ] CI green
Docs
[ ] Header block
[ ] Runbook updated
[ ] CHANGELOG
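The Boilerplate items in this checklist are mechanical enough to check before a human reviewer looks at the PR. Below is a minimal sketch of such a pre-review check; `check_boilerplate` is a hypothetical helper, and its patterns only recognize the exact spellings used in this lesson.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# check_boilerplate FILE: print one line per missing Boilerplate item.
# Returns 0 if nothing is missing, 1 otherwise.
check_boilerplate() {
  local file="$1" missing=0

  # The shebang must be the first line of the file.
  if [[ "$(head -n 1 "$file")" != '#!/usr/bin/env bash' ]]; then
    echo "missing: #!/usr/bin/env bash shebang"
    missing=1
  fi
  # Accept either the short or the long spelling of strict mode.
  if ! grep -Eq '^set -(euo pipefail|o errexit -o nounset -o pipefail)' "$file"; then
    echo "missing: strict mode (set -euo pipefail)"
    missing=1
  fi
  # Any trap on EXIT counts; the handler itself still needs human review.
  if ! grep -Eq '^trap .* EXIT' "$file"; then
    echo "missing: EXIT trap"
    missing=1
  fi
  return "$missing"
}
```

A check like this does not replace shellcheck or the reviewer; it only catches the cheapest omissions early.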
The Lifecycle Policy
Every production shell script lives through five states. The transitions between states are explicit, owner-driven, and tracked.
┌───────┐  review   ┌─────────────┐  adoption   ┌────────┐
│ DRAFT │──────────▶│ PROVISIONAL │────────────▶│ ACTIVE │
└───────┘           └─────────────┘             └────────┘
                                                    │
                                               deprecation
                                                    ▼
                                              ┌────────────┐  sunset   ┌─────────┐
                                              │ DEPRECATED │──────────▶│ RETIRED │
                                              └────────────┘           └─────────┘
State 1: DRAFT
- Lives in a feature branch or scripts/draft/.
- Author is the only owner.
- No SLA, no monitoring expected.
- Not run from cron / systemd in production.
Exit criteria: code review approval + at least one test.
State 2: PROVISIONAL
- Merged to main but in scripts/provisional/.
- Has Bats tests passing in CI.
- Is shellcheck-clean.
- May run in production but with monitoring marked as “experimental.”
- Has a deadline for promotion to ACTIVE or rollback to DRAFT (typically 30 days).
Exit criteria: 30 days of clean operation + observability metrics in place + runbook written.
State 3: ACTIVE
- The “happy” state. Script is in scripts/active/ (or wherever your prod tree puts it).
- Has documented owner team and on-call runbook.
- Emits metrics; has alert rules.
- Reviewed annually for continued relevance.
- Any change goes through full PR review.
Exit criteria: someone files an issue marking it for deprecation.
State 4: DEPRECATED
- Functioning but flagged for replacement.
- Has a target retirement date.
- Has a documented replacement (a real tool, an upstream library, or “no longer needed”).
- A deprecation warning is logged on every invocation.
- New consumers are blocked at PR review.
Exit criteria: target retirement date passes AND no consumers remain.
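The "warning on every invocation" rule for DEPRECATED scripts can be as small as one function called near the top of the script. This is a minimal sketch: `deprecation_warning` is a hypothetical helper that writes to stderr with plain `printf` so it stays self-contained; a real script would more likely route the message through its logging library.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# deprecation_warning SCRIPT RETIRE_ON REPLACEMENT: emit one warning line on stderr.
# stderr keeps the notice out of any data the script writes to stdout.
deprecation_warning() {
  local script="$1" retire_on="$2" replacement="$3"
  printf 'WARN: %s is DEPRECATED and will be retired after %s; use %s instead\n' \
    "$script" "$retire_on" "$replacement" >&2
}

# Intended use, right after the boilerplate of a deprecated script
# (the date and replacement here are illustrative values):
#   deprecation_warning "$SCRIPT_NAME" "2026-12-31" "Velero"
```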
State 5: RETIRED
- Removed from scripts/active/, archived to scripts/retired/<year>/.
- Cron / systemd entries removed.
- Monitoring rules archived.
- Final commit message records why retired and what replaced it.
- Kept in version control forever (for audit and “we used to do this” forensics) but never deployed.
The Lifecycle Tracker (METADATA file)
Every active script has a sibling <script>.lifecycle.yaml:
name: nightly-backup.sh
state: ACTIVE
owner_team: platform-storage
on_call_runbook: https://runbooks.example.com/nightly-backup
metrics_dashboard: https://grafana.example.com/d/backup
created: 2024-08-12
promoted_to_provisional: 2024-08-15
promoted_to_active: 2024-09-15
last_review: 2026-06-01
replacement_candidate: null
deprecated_after: null
retired_after: null
dependencies:
  - lib/log.sh
  - lib/backup.sh
  - lib/db.sh
notes: |
  Replaces the legacy 'backup-cron.pl' from 2018.
Annual review consists of: open every YAML, ask “is this still earning its keep?”, update last_review. Scripts where the answer is “no” get queued for deprecation.
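Finding the scripts that are due for review can be automated. Here is a minimal sketch, assuming the lifecycle file keeps its scalar fields as flat top-level `key: value` lines as in the example above; `lifecycle_field` and `review_overdue` are hypothetical helpers, and this is not a general YAML parser.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# lifecycle_field FILE KEY: print the value of a top-level "key: value" line.
lifecycle_field() {
  local file="$1" key="$2"
  awk -v key="$key" '
    index($0, key ":") == 1 {
      value = substr($0, length(key) + 2)
      sub(/^[ \t]+/, "", value)
      sub(/[ \t]+$/, "", value)
      print value
      exit
    }' "$file"
}

# review_overdue FILE CUTOFF: succeed if last_review is missing or earlier than CUTOFF.
# CUTOFF is a YYYY-MM-DD date; ISO dates sort lexicographically, so a string
# comparison is enough and no date arithmetic is needed here.
review_overdue() {
  local last_review
  last_review="$(lifecycle_field "$1" last_review)"
  [[ -z "$last_review" || "$last_review" < "$2" ]]
}
```

The caller supplies the cutoff; with GNU date that could be `"$(date -d '1 year ago' +%F)"`, while BSD date needs a different invocation.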
The Owner Departure Trigger
When the owner of a script leaves the team, the script’s state transitions to:
- DEPRECATED if the script is non-critical and no other team member volunteers to own it.
- ACTIVE with new owner if a teammate explicitly accepts the runbook and on-call obligation.
This is the single biggest lever against the “scripts as graveyard” pattern. Without owner-departure triggers, orphan scripts pile up: after a few years of normal turnover, a team can easily be carrying dozens of scripts that nobody owns.
Standard Metrics Every Production Script Should Emit
Every production script emits at least four metrics. With lib/metrics.sh:
# 1. Last run timestamp (success or failure)
metric_set "myapp_script_last_run_seconds" "$(date +%s)" \
  "script=\"$SCRIPT_NAME\""

# 2. Last success timestamp and 3. last exit code, both recorded from the EXIT trap
on_exit() {
  local rc=$1
  if (( rc == 0 )); then
    metric_set "myapp_script_last_success_seconds" "$(date +%s)" \
      "script=\"$SCRIPT_NAME\""
  fi
  metric_set "myapp_script_last_exit_code" "$rc" \
    "script=\"$SCRIPT_NAME\""
}
trap 'on_exit $?' EXIT

# 4. Duration (bash keeps counting SECONDS upward after it is assigned)
SECONDS=0
# ...work happens...
metric_set "myapp_script_duration_seconds" "$SECONDS" \
  "script=\"$SCRIPT_NAME\""

# Optional: domain-specific (rows processed, bytes uploaded, etc.)
# Note: stat -c is GNU stat syntax.
metric_set "myapp_backup_bytes_total" "$(stat -c %s /var/backups/...)" \
  "script=\"$SCRIPT_NAME\""
Standard alert rules to wire up (reuse for every production script):
groups:
  - name: shell-scripts
    rules:
      # Script hasn't run in 25h (expected nightly cron)
      - alert: ScriptStaleRun
        expr: time() - myapp_script_last_run_seconds > 25 * 3600
        for: 10m
        annotations:
          summary: "{{ $labels.script }} hasn't run in over 25 hours"
      # Script ran but failed
      - alert: ScriptLastRunFailed
        expr: myapp_script_last_exit_code != 0
        for: 1h
        annotations:
          summary: "{{ $labels.script }} last run failed with exit code {{ $value }}"
      # Script success is stale (ran recently but kept failing)
      - alert: ScriptLastSuccessStale
        expr: time() - myapp_script_last_success_seconds > 86400 * 2
        for: 30m
        annotations:
          summary: "{{ $labels.script }} hasn't succeeded in 2+ days"
      # Script duration anomaly (took more than 3× the 14-day average)
      - alert: ScriptDurationAnomaly
        expr: myapp_script_duration_seconds > 3 * avg_over_time(myapp_script_duration_seconds[14d])
        for: 0m
        annotations:
          summary: "{{ $labels.script }} took {{ $value }}s, more than 3× its 14-day average"
These four alert rules, applied uniformly across every production script, transform “did the cron run?” from a tribal knowledge question into a monitored property.
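The alerts only work if the metric files reach Prometheus intact. lib/metrics.sh is not reproduced in this lesson, so the following is only a sketch of what a `metric_set` helper could look like, assuming the node_exporter textfile collector reads `*.prom` files from a directory. The `METRICS_DIR` variable, its default path, and the file-naming scheme are assumptions made for this example, not the library's actual behavior.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Assumed location; use whatever directory your textfile collector is configured to read.
METRICS_DIR="${METRICS_DIR:-/var/lib/node_exporter/textfile_collector}"

# metric_set NAME VALUE [LABELS]: write one sample to its own .prom file atomically.
metric_set() {
  local name="$1" value="$2" labels="${3:-}"
  local suffix target tmp_file
  # Key the file on the label set so scripts sharing a metric name do not overwrite each other.
  suffix="$(printf '%s' "$labels" | cksum | cut -d ' ' -f 1)"
  target="${METRICS_DIR}/${name}.${suffix}.prom"
  # Create the temp file in the same directory so the final mv is an atomic rename.
  tmp_file="$(mktemp "${METRICS_DIR}/.${name}.XXXXXX")"
  if [[ -n "$labels" ]]; then
    printf '%s{%s} %s\n' "$name" "$labels" "$value" > "$tmp_file"
  else
    printf '%s %s\n' "$name" "$value" > "$tmp_file"
  fi
  chmod 0644 "$tmp_file"
  mv -f "$tmp_file" "$target"
}
```

Writing to a temp file and renaming means a scrape never sees a half-written sample, which is the property the "atomic .prom file writes" entry in the library table refers to.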
The Sunset Criteria
A script earns retirement when at least one of these triggers fires:
Trigger 1: Replaced By A Real Tool
The script’s job is now done by:
- A purpose-built tool (e.g., the bash-based backup script is replaced by Velero, restic, or rclone).
- An upstream-provided feature (e.g., a custom S3 lifecycle script replaced by S3 lifecycle rules).
- A managed service (e.g., a cron-based DB backup replaced by RDS automated backups).
When this happens, run both in parallel for at least one full operational cycle (one week minimum, one month preferred), verify the new tool’s outputs match the old script’s, then deprecate.
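For file-producing scripts, "verify the outputs match" can be done by comparing checksum manifests of the two output trees. This is a minimal sketch under stated assumptions: both tools write plain files into directories, GNU coreutils and findutils are available (`sha256sum`, `sort -z`, `xargs -r`), and `dir_manifest` and `outputs_match` are hypothetical helper names.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# dir_manifest DIR: print "sha256  ./relative/path" for every file, sorted by path.
dir_manifest() {
  local dir="$1"
  ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# outputs_match OLD_DIR NEW_DIR: succeed only if both trees have identical manifests.
# On mismatch, diff prints the differing entries and the function returns non-zero.
outputs_match() {
  diff <(dir_manifest "$1") <(dir_manifest "$2")
}
```

Tools that embed timestamps or random identifiers in their output will never match byte for byte; for those, compare a normalized form instead.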
Trigger 2: No Consumers
The script’s outputs (files, metrics, alerts) have no remaining consumer. Verify:
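# Find anything still referencing the script (-F: match the name literally, not as a regex)
grep -rF "nightly-backup.sh" /etc /opt /home /var/spool /usr/local/bin
grep -rF "myapp_backup_bytes_total" /etc/prometheus # any alerts?
# Any recent activity? Without --since this always lists at least the commit that created the file.
# The 90-day window is a judgment call; pick one that fits your change cadence.
git log --all --oneline --since="90 days ago" -- scripts/active/nightly-backup.sh
If all three are empty, the script is unused. Deprecate immediately; retire after 30 days.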
Trigger 3: Repeated Failures Without A Fix
If a script’s myapp_script_last_success_seconds has been stale for >30 days and nobody has been able (or willing) to fix it, the script is dead in fact if not in name. Deprecate it and either:
- Find an owner willing to fix it within 14 days, or
- Retire it.
The worst state is “the cron is still listed but the script silently fails every night.” Retirement is the more honest outcome.
Trigger 4: Owner Team Departure With No Successor
Already covered by the owner-departure trigger in the lifecycle policy. If owner_team is empty for >30 days, the script transitions to DEPRECATED automatically. After 90 more days, it retires.
Trigger 5: Annual Review Says “No Longer Needed”
The yearly check on last_review. The owner team explicitly says “this isn’t earning its keep.” Deprecate immediately.
The Retirement Ceremony
The day a script retires:
- Remove from cron / systemd: sudo systemctl disable --now myapp-script.timer
- Move the file: git mv scripts/active/foo.sh scripts/retired/2026/foo.sh
- Update its lifecycle.yaml: state: RETIRED, set retired_after.
- Archive monitoring: move alert rules to prometheus/rules/retired/.
- Final commit message:
  retire: scripts/active/nightly-backup.sh

  Replaced by Velero (https://velero.example.com).
  Velero has run in parallel for 30 days and outputs match.
  No remaining consumers. Cron entry removed.
- Note in team weekly: “Retired script X, total active scripts now N.”
The “total active scripts now N” metric is itself worth tracking. A team where N goes up monotonically is accumulating debt; a team where N stays flat or shrinks is managing its surface.
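N can be computed straight from the lifecycle files rather than counted by hand. Here is a minimal sketch, assuming every script has a sibling `*.lifecycle.yaml` with a top-level `state:` line; `count_active` is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# count_active ROOT: print how many *.lifecycle.yaml files under ROOT declare state: ACTIVE.
count_active() {
  local root="$1" count=0 file
  # NUL-delimited names keep paths with spaces or newlines intact.
  while IFS= read -r -d '' file; do
    if grep -qE '^state:[[:space:]]*ACTIVE[[:space:]]*$' "$file"; then
      count=$((count + 1))
    fi
  done < <(find "$root" -type f -name '*.lifecycle.yaml' -print0)
  echo "$count"
}
```

Emitting this number as a metric on a weekly timer would make the trend visible on the same dashboards as everything else.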
The Anti-Patterns To Watch For (And Reject At Review)
After 41 lessons, these are the patterns that should fail review every time:
Anti-Pattern 1: The “Just A Quick Script” That Lives For 5 Years
Every script in production was once a “just a quick script.” Skip the lifecycle policy at your peril. Reject at review if a draft is being merged without lifecycle.yaml.
Anti-Pattern 2: Silent 2>/dev/null Without Justification
Suppressing stderr hides bugs. Every 2>/dev/null should have a comment: # stderr suppressed because rm prints "no such file" but we don't care.
Anti-Pattern 3: Hardcoded Paths That Break On The Other OS
/proc/sys/kernel/... works on Linux, doesn’t exist on macOS / BSD. dscl works on macOS, doesn’t exist on Linux. Either explicitly target one OS or do feature detection (command -v ... >/dev/null && ...).
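Feature detection with `command -v` might look like the sketch below, which picks whichever SHA-256 tool the host provides: `sha256sum` is typical on Linux, `shasum` on macOS. The function name `sha256_of` is hypothetical.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# sha256_of FILE: print the hex SHA-256 digest using whichever tool is installed.
sha256_of() {
  local file="$1"
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$file" | cut -d ' ' -f 1
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 "$file" | cut -d ' ' -f 1
  else
    # Fail loudly instead of guessing at a third tool.
    echo "no SHA-256 tool found (tried sha256sum, shasum)" >&2
    return 1
  fi
}
```

Detecting the capability rather than the OS name means the function keeps working on a Linux box where only `shasum` happens to be installed.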
Anti-Pattern 4: Globals Named i, tmp, data
Bash has no real namespacing. A for i in ... in a sourced library can collide with the caller’s i. Always use descriptive names in libraries, and local everywhere.
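The collision is easy to reproduce. In this illustrative sketch (all function names are made up for the demo), a library function without `local` overwrites the caller's `i`, while the scoped and prefixed version leaves it alone.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Leaky: "total" and "i" are globals, so they overwrite the caller's variables.
lib_sum_leaky() {
  total=0
  for i in "$@"; do total=$((total + i)); done
  echo "$total"
}

# Safe: every variable is local and carries the library's prefix.
lib_sum_safe() {
  local lib_sum_total=0 lib_sum_item
  for lib_sum_item in "$@"; do lib_sum_total=$((lib_sum_total + lib_sum_item)); done
  echo "$lib_sum_total"
}

demo_collision() {
  i=99
  # Called directly on purpose: inside $(...) the function would run in a
  # subshell and the clobbering would be hidden.
  lib_sum_leaky 1 2 3 >/dev/null
  echo "after leaky: i=$i"   # prints "after leaky: i=3"
  i=99
  lib_sum_safe 1 2 3 >/dev/null
  echo "after safe: i=$i"    # prints "after safe: i=99"
}
```

Because command substitution hides the problem, a leaky library can pass its own tests and still break the first caller that invokes it directly.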
Anti-Pattern 5: Comments That Lie
# fast path — copies in O(1) — but the code does a recursive directory walk. Outdated comments are worse than no comments.
Anti-Pattern 6: Print-Then-Sleep
echo "Restarting..."; sleep 5; restart_thing — if restart_thing fails, the operator has been told a lie. Print after the action succeeds, not before.
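One way to make "print after the action" the default is a small wrapper that runs the command first and reports the outcome second. This is a minimal sketch; `run_step` is a hypothetical helper, and it uses plain `echo` rather than the logging library to stay self-contained.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# run_step DESCRIPTION COMMAND [ARGS...]: run the command, then report what actually happened.
run_step() {
  local description="$1"
  shift
  if "$@"; then
    echo "done: ${description}"
  else
    local rc=$?
    # Failures go to stderr and the original exit code is passed on to the caller.
    echo "FAILED (exit ${rc}): ${description}" >&2
    return "$rc"
  fi
}

# Intended use (restart_thing stands for whatever performs the restart):
#   run_step "restart thing" restart_thing
```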
Anti-Pattern 7: Magic Numbers
sleep 30 — why 30? tail -n 1000 — why 1000? Either a constant with a name (readonly RETRY_BACKOFF_SECONDS=30) or a comment.
Anti-Pattern 8: Unsourced Library Behavior Differences
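source always runs when the line executes, but the scope it runs in matters. A file sourced inside a function executes in that function's scope, so any variable the library creates with declare (without -g) disappears when the function returns; the same file sourced at top level leaves those variables global. If your script sources a library from inside a function, test both paths.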
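The difference between sourcing at top level and sourcing inside a function can be demonstrated with a throwaway library file. This sketch assumes bash 4.2 or newer, because `declare -g` does not exist in older versions such as the bash 3.2 shipped with macOS. All names are illustrative.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# load_lib FILE: source a library from inside a function.
load_lib() {
  # shellcheck source=/dev/null
  source "$1"
}

demo_source_scope() {
  local lib_file
  lib_file="$(mktemp)"
  # A two-line "library": one plain declare, one declare -g.
  printf '%s\n' 'declare LIB_PLAIN="set"' 'declare -g LIB_GLOBAL="set"' > "$lib_file"

  load_lib "$lib_file"
  rm -f "$lib_file"

  # The plain declare was local to load_lib and is gone; declare -g survived.
  echo "LIB_PLAIN=${LIB_PLAIN:-<unset>}"    # prints "LIB_PLAIN=<unset>"
  echo "LIB_GLOBAL=${LIB_GLOBAL:-<unset>}"  # prints "LIB_GLOBAL=set"
}
```

Function definitions and plain `name=value` assignments in a sourced file are not affected; only variables created with `declare` (or `typeset`) without `-g` become function-local.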
Sample Production Script Skeleton
Pull together everything from L1-L41 into a template:
#!/usr/bin/env bash
#
# nightly-cleanup.sh: remove stale temp files from the configured cleanup dirs.
# Owner: platform-ops@example.com
# Runbook: https://runbooks.example.com/nightly-cleanup
# Lifecycle: see nightly-cleanup.lifecycle.yaml
#
# Env:
#   CLEANUP_DRY_RUN   if "true", logs intended actions but does not delete
#   CLEANUP_AGE_DAYS  files older than this are removed (default 7)
#
# Exit codes:
#   0   success
#   1   generic failure
#   2   usage error
#   65  input data invalid (sysexits.h EX_DATAERR)

set -o errexit -o nounset -o pipefail
IFS=$'\n\t'

# Assign first, then mark readonly, so a failing substitution is not masked (SC2155).
SCRIPT_NAME="$(basename "$0")"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME SCRIPT_DIR

source /usr/local/lib/log.sh
source /usr/local/lib/err.sh
source /usr/local/lib/fs.sh
source /usr/local/lib/lock.sh
source /usr/local/lib/metrics.sh

# Not readonly yet: parse_args may override both from the command line.
CLEANUP_AGE_DAYS=${CLEANUP_AGE_DAYS:-7}
CLEANUP_DRY_RUN=${CLEANUP_DRY_RUN:-false}
readonly CLEANUP_DIRS=(/tmp /var/tmp /var/spool/myapp/work)

usage() {
  cat <<EOF
Usage: $SCRIPT_NAME [--dry-run] [--age-days N]
Remove files older than --age-days from configured cleanup dirs.
EOF
}

main() {
  parse_args "$@"
  readonly CLEANUP_AGE_DAYS CLEANUP_DRY_RUN   # frozen once arguments are parsed
  acquire_lock "$SCRIPT_NAME" || { log_warn "another instance running"; exit 0; }
  log_info "starting cleanup; age=${CLEANUP_AGE_DAYS}d dry_run=${CLEANUP_DRY_RUN}"
  metric_set "myapp_cleanup_last_run_seconds" "$(date +%s)" "script=\"$SCRIPT_NAME\""
  SECONDS=0
  local removed_total=0 removed dir
  for dir in "${CLEANUP_DIRS[@]}"; do
    [[ -d "$dir" ]] || { log_warn "missing dir: $dir"; continue; }
    removed=$(cleanup_dir "$dir") || removed=0
    removed_total=$((removed_total + removed))
  done
  metric_set "myapp_cleanup_files_removed_total" "$removed_total" "script=\"$SCRIPT_NAME\""
  metric_set "myapp_cleanup_duration_seconds" "$SECONDS" "script=\"$SCRIPT_NAME\""
  log_info "removed $removed_total files in ${SECONDS}s"
}

# Prints the number of files removed on stdout; log output must not go to stdout here.
cleanup_dir() {
  local dir="$1"
  local count=0 file
  while IFS= read -r -d '' file; do
    # Compare as a string; executing "$CLEANUP_DRY_RUN" would run arbitrary values as commands.
    if [[ "$CLEANUP_DRY_RUN" == "true" ]]; then
      log_debug "would remove: $file"
    else
      rm -f "$file" && count=$((count + 1))
    fi
  done < <(find "$dir" -type f -mtime "+$CLEANUP_AGE_DAYS" -print0)
  echo "$count"
}

parse_args() {
  while (( $# > 0 )); do
    case "$1" in
      --dry-run) CLEANUP_DRY_RUN=true ;;
      --age-days)
        # Without this guard, a missing value trips nounset with an unhelpful error.
        (( $# >= 2 )) || { usage >&2; exit 2; }
        shift
        CLEANUP_AGE_DAYS="$1"
        ;;
      -h|--help) usage; exit 0 ;;
      *) usage >&2; exit 2 ;;
    esac
    shift
  done
  # Required-args validation: the age must be a non-negative integer.
  if [[ ! "$CLEANUP_AGE_DAYS" =~ ^[0-9]+$ ]]; then
    echo "$SCRIPT_NAME: age in days must be a non-negative integer" >&2
    exit 2
  fi
}

on_exit() {
  local rc=$1
  metric_set "myapp_cleanup_last_exit_code" "$rc" "script=\"$SCRIPT_NAME\""
  if (( rc == 0 )); then
    metric_set "myapp_cleanup_last_success_seconds" "$(date +%s)" "script=\"$SCRIPT_NAME\""
  fi
}
trap 'on_exit $?' EXIT

main "$@"
This template covers six of the seven review categories directly:
- Boilerplate ✓
- Argument handling ✓
- Error handling ✓ (set -e, EXIT trap)
- Idempotency & Safety ✓ (file removal is naturally idempotent; single-instance lock, dry-run flag)
- Observability ✓ (4 standard metrics + structured log)
- Documentation ✓ (header block)
The seventh, Testing, lives outside the script. Combined with a Bats test file and a lifecycle.yaml, this is a script ready for ACTIVE state.
The Capstone Quick-Reference Card
THE SEVEN-CATEGORY REVIEW
1. Boilerplate (shebang, set -e, trap)
2. Args & UX (--help, --dry-run, validation)
3. Error handling (stderr, exit codes, no silent suppress)
4. Safety (idempotent, atomic, no secrets, sanitized inputs)
5. Observability (logs, 4 metrics, audit log)
6. Testing (bats, shellcheck, shfmt, CI)
7. Documentation (header, runbook, CHANGELOG)
THE FIVE STATES
DRAFT → PROVISIONAL → ACTIVE → DEPRECATED → RETIRED
Each transition is owner-driven and tracked in lifecycle.yaml
THE FOUR STANDARD METRICS
myapp_*_last_run_seconds (when did it run?)
myapp_*_last_success_seconds (when did it last succeed?)
myapp_*_last_exit_code (what was the outcome?)
myapp_*_duration_seconds (how long did it take?)
THE FIVE SUNSET TRIGGERS
1. Replaced by a real tool (with parallel-run validation)
2. No remaining consumers (grep across infra)
3. Failing for >30 days without fix
4. Owner team departed without successor
5. Annual review says "no longer needed"
THE LIBRARY FAMILY
lib/log.sh, lib/err.sh, lib/fs.sh, lib/lock.sh,
lib/observe.sh, lib/secrets.sh, lib/test.sh,
lib/metrics.sh, lib/backup.sh, lib/db.sh,
lib/loganalyze.sh, lib/heal.sh, lib/migrate.sh,
lib/compliance.sh, lib/forensics.sh
THE EIGHT ANTI-PATTERNS
1. "Just a quick script" without lifecycle.yaml
2. Silent 2>/dev/null without justification
3. OS-hardcoded paths
4. i, tmp, data globals (no namespacing)
5. Comments that lie
6. Print-then-sleep
7. Magic numbers
8. Library behavior differences from sourcing
THE NUMBER THAT MATTERS
Total active scripts: track it weekly. Up = debt; flat = managed.
Closing — What This Course Is Really About
If you’ve read all 42 lessons in order, you now have the equivalent of 4-5 years of senior-engineer apprenticeship in production shell scripting, distilled. But the deeper lesson isn’t any specific pattern.
The deeper lesson is that shell is a serious engineering surface when treated with the same discipline as any compiled language: version-controlled, tested, monitored, reviewed, owned, lifecycle-managed, retired. Most teams treat shell as a graveyard of one-off tools because that is the easy path: write a script, drop it on a host, never think about it again. That’s how you end up with 200 scripts on every box, half of which fail silently every night, and nobody knows which ones still matter.
The investment in lifecycle policy, review checklist, standard metrics, and sunset criteria is not bureaucracy — it’s the cheapest way to keep shell scripts from becoming the most expensive part of your infrastructure five years out.
The series ends here. Use it. Keep the checklists on the wall. Retire scripts ruthlessly. And when in doubt, source lib/log.sh first.