Shell Style Guide Capstone: The Production Review Checklist, Lifecycle Policy, Metrics Every Script Should Emit & The Sunset Criteria For Retiring Scripts Cleanly

Why This Capstone Exists

A shell script that is good enough today is the easiest piece of software to ship. A shell script that is still earning its keep three years from now, after the original author has left and the surrounding system has been refactored twice, is one of the hardest. The difference is not language quality — it’s lifecycle discipline.

This capstone consolidates the 41 preceding lessons into the four artifacts every team should keep on the wall:

The Production Review Checklist — the questions every PR introducing or modifying a production shell script must answer.
The Lifecycle Policy — the documented states a script lives through, from prototype to active to deprecated to retired.
The Standard Metrics Surface — what every production shell script must emit so monitoring sees it.
The Sunset Criteria — explicit triggers for retiring a script before it becomes a maintenance hazard.

Following these four artifacts turns “shell scripts as a graveyard of one-off tools” into “shell scripts as a sustainable engineering surface.”

The Library Family From The Series

Across L1-L41 we built a layered library of shell helpers. Each script in production should source the libraries relevant to its role:

Library	From lesson	Purpose
`lib/log.sh`	L7	Structured logging with levels, JSON output, log rotation
`lib/err.sh`	L8	Error trap, stack trace, cleanup on exit
`lib/fs.sh`	L28	Atomic file writes, temp-in-same-dir, fsync helpers
`lib/lock.sh`	L21	flock single-instance, distributed locks via Redis
`lib/observe.sh`	L25	Tracing helpers, structured events, span-id propagation
`lib/secrets.sh`	L24	Vault/SSM lookup, credential masking, never-print discipline
`lib/test.sh`	L31	Bats integration, fixture setup/teardown, assertion helpers
`lib/metrics.sh`	L34	Prometheus textfile exporter, atomic .prom file writes
`lib/backup.sh`	L35	sha256 manifests, GFS retention, S3 upload + verify
`lib/db.sh`	L36	pg_dump/mysqldump pipelines, base backup, PITR drill
`lib/loganalyze.sh`	L37	Streaming awk, mawk detection, fleet fan-out
`lib/heal.sh`	L38	Detect-decide-act, idempotency, rate limit, circuit breaker
`lib/migrate.sh`	L39	Resumable batch, watermark, staging cutover, sample-diff
`lib/compliance.sh`	L40	Controls-as-tests, JSONL bundles, GPG signing, drift
`lib/forensics.sh`	L41	Order-of-volatility capture, chain-of-custody log

A real production deploy keeps these in /usr/local/lib/ with mode 0644, owned by root:root. Scripts source them from the canonical path. Updates ship via the same Ansible/Puppet/cloud-init that ships system config — version-controlled, reviewed, and rolled out with the same discipline as any other infrastructure code.

The Production Review Checklist

Every PR that introduces or modifies a production shell script must satisfy seven categories. The checklist is intentionally numerous — most categories are one-line yes/no, and the explicit list ensures nothing is forgotten.

Category 1: Boilerplate & Shell Mode

#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail
IFS=$'\n\t'

readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

trap 'on_exit $?' EXIT
on_exit() { local rc=$1; ...; }

Has #!/usr/bin/env bash (not #!/bin/bash for portability across BSD/Linux).
Sets set -euo pipefail (or has documented justification for omission, e.g., triage scripts).
Sets IFS explicitly to avoid surprises on filenames with spaces.
Defines readonly SCRIPT_NAME and SCRIPT_DIR for self-reference.
Has an EXIT trap for cleanup (temp files, locks, network connections).
If the script is bash-only, the shebang says so. If it’s POSIX-portable, says #!/bin/sh and uses POSIX features only.

Category 2: Argument Handling

Uses getopts or a documented argument parser, not positional $1/$2 for non-trivial scripts.
Has a --help / -h that shows usage with examples.
Validates required arguments with explicit error messages (not just “missing argument”).
Has a --dry-run flag for any script that mutates state (per L38 healers, L39 migrations).
Has a --debug / -x flag that enables set -x if needed for troubleshooting.

Category 3: Error Handling

All commands that may fail are either explicitly handled (|| { ... }) or allowed to abort via set -e.
Error messages go to stderr (>&2), not stdout.
Error messages include the failing command, arguments, and a hint on remediation.
Cleanup is in EXIT trap, not duplicated on every failure path.
Exit codes are documented (0=success, 1=generic failure, 2=usage error, etc., per man 1 sysexits if applicable).
No 2>/dev/null swallowing errors silently — if a stderr is suppressed, there’s a comment justifying why.

Category 4: Idempotency & Safety

Re-running the script with the same inputs is safe (gives the same end state, no duplicates).
Mutating operations check existing state before acting ([[ -f "$file" ]] before mv, INSERT ON CONFLICT for SQL).
Destructive operations require a --force or explicit confirmation flag.
Single-instance enforcement via flock if appropriate.
Atomic file writes (temp + rename, per lib/fs.sh) for any file other components read.
No hardcoded credentials, secrets, or API keys (use lib/secrets.sh).
Inputs from external sources (env, args, files, network) are validated/sanitized before use.

Category 5: Observability

Logs to stderr or to syslog/journald, not silently.
Logs include timestamp (ISO 8601), script name, and severity.
Heartbeats to Prometheus textfile collector (per L34) for any long-running or scheduled script.
Records start time, end time, exit code in a metric.
Audit log for actions (per L38, L40) when the script mutates state.
No print-then-fail anti-pattern (see L7) — log levels make ERROR vs DEBUG explicit.

Category 6: Testing

Has a Bats test file under tests/ that covers happy path + error paths.
Tests run in CI (GitHub Actions, GitLab CI, etc.) on every PR.
Lint-clean: shellcheck -S warning script.sh passes.
Format-clean: shfmt -i 2 -ci -s -d script.sh shows no diff.
If the script depends on external state (DB, S3), tests use mocks or a docker-compose dev environment.
If the script can’t be reasonably tested, that’s documented with reasoning (not as default).

Category 7: Documentation

Top-of-file comment block: purpose, author, dependencies, env vars, exit codes.
Inline comments only where the code is non-obvious; no explaining set -e.
If the script is part of a runbook, the runbook references the script and vice versa.
CHANGELOG entry for non-trivial changes.
If the script generates artifacts (reports, bundles), the artifact format is documented.

The Single-Page Reviewer’s Checklist

For pasting into a PR template:

SHELL SCRIPT REVIEW

Boilerplate
  [ ] #!/usr/bin/env bash
  [ ] set -euo pipefail (or justified)
  [ ] EXIT trap for cleanup
  [ ] readonly script-self vars

Args & UX
  [ ] --help with examples
  [ ] --dry-run if mutating
  [ ] Required args validated

Error handling
  [ ] No silent 2>/dev/null
  [ ] stderr for errors
  [ ] Documented exit codes

Safety
  [ ] Idempotent on re-run
  [ ] Atomic file writes (lib/fs.sh)
  [ ] No hardcoded secrets
  [ ] Inputs sanitized

Observability
  [ ] Structured logs
  [ ] Heartbeat metric
  [ ] Audit log if mutating

Testing
  [ ] Bats tests in tests/
  [ ] shellcheck clean
  [ ] shfmt clean
  [ ] CI green

Docs
  [ ] Header block
  [ ] Runbook updated
  [ ] CHANGELOG

The Lifecycle Policy

Every production shell script lives through five states. The transitions between states are explicit, owner-driven, and tracked.

   ┌──────────┐   review   ┌──────────┐   adoption   ┌──────────┐
   │ DRAFT    │───────────▶│ PROVISIONAL│────────────▶│ ACTIVE   │
   └──────────┘             └──────────┘              └──────────┘
                                                            │
                                                       deprecation
                                                            ▼
                                                      ┌──────────┐   sunset   ┌──────────┐
                                                      │ DEPRECATED│──────────▶│ RETIRED  │
                                                      └──────────┘             └──────────┘

State 1: DRAFT

Lives in a feature branch or scripts/draft/.
Author is the only owner.
No SLA, no monitoring expected.
Not run from cron / systemd in production.

Exit criteria: code review approval + at least one test.

State 2: PROVISIONAL

Merged to main but in scripts/provisional/.
Has Bats tests passing in CI.
Has shellcheck-clean.
May run in production but with monitoring marked as “experimental.”
Has a deadline for promotion to ACTIVE or rollback to DRAFT (typically 30 days).

Exit criteria: 30 days of clean operation + observability metrics in place + runbook written.

State 3: ACTIVE

The “happy” state. Script is in scripts/active/ (or wherever your prod tree puts it).
Has documented owner team and on-call runbook.
Emits metrics; has alert rules.
Reviewed annually for continued relevance.
Any change goes through full PR review.

Exit criteria: someone files an issue marking it for deprecation.

State 4: DEPRECATED

Functioning but flagged for replacement.
Has a target retirement date.
Has a documented replacement (a real tool, an upstream library, or “no longer needed”).
A deprecation warning is logged on every invocation.
New consumers are blocked at PR review.

Exit criteria: target retirement date passes AND no consumers remain.

State 5: RETIRED

Removed from scripts/active/, archived to scripts/retired/<year>/.
Cron / systemd entries removed.
Monitoring rules archived.
Final commit message records why retired and what replaced it.
Kept in version control forever (for audit and “we used to do this” forensics) but never deployed.

The Lifecycle Tracker (METADATA file)

Every active script has a sibling <script>.lifecycle.yaml:

name: nightly-backup.sh
state: ACTIVE
owner_team: platform-storage
on_call_runbook: https://runbooks.example.com/nightly-backup
metrics_dashboard: https://grafana.example.com/d/backup
created: 2024-08-12
promoted_to_provisional: 2024-08-15
promoted_to_active: 2024-09-15
last_review: 2026-06-01
replacement_candidate: null
deprecated_after: null
retired_after: null
dependencies:
  - lib/log.sh
  - lib/backup.sh
  - lib/db.sh
notes: |
  Replaces the legacy 'backup-cron.pl' from 2018.

Annual review consists of: open every YAML, ask “is this still earning its keep?”, update last_review. Scripts where the answer is “no” get queued for deprecation.

The Owner Departure Trigger

When the owner of a script leaves the team, the script’s state transitions to:

DEPRECATED if the script is non-critical and no other team member volunteers to own it.
ACTIVE with new owner if a teammate explicitly accepts the runbook and on-call obligation.

This is the single biggest lever against the “scripts as graveyard” pattern. Without owner-departure triggers, every team accumulates 50+ orphan scripts within 5 years.

Standard Metrics Every Production Script Should Emit

Every production script emits at least four metrics. With lib/metrics.sh:

# 1. Last run timestamp (success or failure)
metric_set "myapp_script_last_run_seconds" "$(date +%s)" \
  "script=\"$SCRIPT_NAME\""

# 2. Last success timestamp
trap 'on_exit $?' EXIT
on_exit() {
  local rc=$1
  if (( rc == 0 )); then
    metric_set "myapp_script_last_success_seconds" "$(date +%s)" \
      "script=\"$SCRIPT_NAME\""
  fi
  metric_set "myapp_script_last_exit_code" "$rc" \
    "script=\"$SCRIPT_NAME\""
}

# 3. Duration
SECONDS=0
# ...work happens...
metric_set "myapp_script_duration_seconds" "$SECONDS" \
  "script=\"$SCRIPT_NAME\""

# 4. Domain-specific (rows processed, bytes uploaded, etc.)
metric_set "myapp_backup_bytes_total" "$(stat -c %s /var/backups/...)" \
  "script=\"$SCRIPT_NAME\""

Standard alert rules to wire up (reuse for every production script):

groups:
- name: shell-scripts
  rules:
  # Script hasn't run in 25h (expected nightly cron)
  - alert: ScriptStaleRun
    expr: time() - myapp_script_last_run_seconds > 86400 * 1.04
    for: 10m
    annotations:
      summary: "{{ $labels.script }} hasn't run in over 25 hours"

  # Script ran but failed
  - alert: ScriptLastRunFailed
    expr: myapp_script_last_exit_code != 0
    for: 1h
    annotations:
      summary: "{{ $labels.script }} last run failed with exit code {{ $value }}"

  # Script success is stale (ran recently but kept failing)
  - alert: ScriptLastSuccessStale
    expr: time() - myapp_script_last_success_seconds > 86400 * 2
    for: 30m
    annotations:
      summary: "{{ $labels.script }} hasn't succeeded in 2+ days"

  # Script duration anomaly (took 3× the median)
  - alert: ScriptDurationAnomaly
    expr: myapp_script_duration_seconds > 3 * avg_over_time(myapp_script_duration_seconds[14d])
    for: 0m
    annotations:
      summary: "{{ $labels.script }} took {{ $value }}s, 3× normal"

These four alert rules, applied uniformly across every production script, transform “did the cron run?” from a tribal knowledge question into a monitored property.

The Sunset Criteria

A script earns retirement when at least one of these triggers fires:

Trigger 1: Replaced By A Real Tool

The script’s job is now done by:

A purpose-built service (e.g., the bash-based backup script is replaced by Velero, Restic Server, Rclone Sync).
An upstream provided feature (e.g., a custom S3 lifecycle script replaced by S3 lifecycle rules).
A managed service (e.g., a cron-based DB backup replaced by RDS automated backups).

When this happens, run both in parallel for at least one full operational cycle (one week minimum, one month preferred), verify the new tool’s outputs match the old script’s, then deprecate.

Trigger 2: No Consumers

The script’s outputs (files, metrics, alerts) have no remaining consumer. Verify:

# Find anything still referencing the script
grep -r "nightly-backup.sh" /etc /opt /home /var/spool /usr/local/bin
grep -r "myapp_backup_bytes_total" /etc/prometheus  # any alerts?
git log --all --oneline -- scripts/active/nightly-backup.sh  # any recent activity?

If all three are empty, the script is unused. Deprecate immediately; retire after 30 days.

Trigger 3: Repeated Failures Without A Fix

If a script’s myapp_script_last_success_seconds has been stale for >30 days and nobody has been able (or willing) to fix it, the script is dead in fact if not in name. Deprecate it and either:

Find an owner willing to fix it within 14 days, or
Retire it.

The worst state is “the cron is still listed but the script silently fails every night.” The retirement is more honest.

Trigger 4: Owner Team Departure With No Successor

Already covered in lifecycle. If owner_team is empty for >30 days, the script transitions to DEPRECATED automatically. After 90 more days, it retires.

Trigger 5: Annual Review Says “No Longer Needed”

The yearly check on last_review. The owner team explicitly says “this isn’t earning its keep.” Deprecate immediately.

The Retirement Ceremony

The day a script retires:

Remove from cron / systemd: sudo systemctl disable --now myapp-script.timer.
Move the file: git mv scripts/active/foo.sh scripts/retired/2026/foo.sh.
Update its lifecycle.yaml: state: RETIRED, set retired_after.
Archive monitoring: move alert rules to prometheus/rules/retired/.

Final commit message:

retire: scripts/active/nightly-backup.sh

Replaced by Velero (https://velero.example.com).
Velero has run in parallel for 30 days and outputs match.
No remaining consumers. Cron entry removed.

Note in team weekly: “Retired script X, total active scripts now N.”

The “total active scripts now N” metric is itself worth tracking. A team where N goes up monotonically is accumulating debt; a team where N stays flat or shrinks is managing its surface.

The Anti-Patterns To Watch For (And Reject At Review)

After 41 lessons, these are the patterns that should fail review every time:

Anti-Pattern 1: The “Just A Quick Script” That Lives For 5 Years

Every script in production was once a “just a quick script.” Skip the lifecycle policy at your peril. Reject at review if a draft is being merged without lifecycle.yaml.

Anti-Pattern 2: Silent `2>/dev/null` Without Justification

Suppressing stderr hides bugs. Every 2>/dev/null should have a comment: # stderr suppressed because rm prints "no such file" but we don't care.

Anti-Pattern 3: Hardcoded Paths That Break On The Other OS

/proc/sys/kernel/... works on Linux, doesn’t exist on macOS / BSD. dscl works on macOS, doesn’t exist on Linux. Either explicitly target one OS or do feature detection (command -v ... >/dev/null && ...).

Anti-Pattern 4: Globals Named `i`, `tmp`, `data`

Bash has no real namespacing. A for i in ... in a sourced library can collide with the caller’s i. Always use descriptive names in libraries, and local everywhere.

Anti-Pattern 5: Comments That Lie

# fast path — copies in O(1) — but the code does a recursive directory walk. Outdated comments are worse than no comments.

Anti-Pattern 6: Print-Then-Sleep

echo "Restarting..."; sleep 5; restart_thing — if restart_thing fails, the operator has been told a lie. Print after the action succeeds, not before.

Anti-Pattern 7: Magic Numbers

sleep 30 — why 30? tail -n 1000 — why 1000? Either a constant with a name (readonly RETRY_BACKOFF_SECONDS=30) or a comment.

Anti-Pattern 8: Unsourced Library Behavior Differences

Some libraries source at runtime, some at parse time. source lib/foo.sh inside a function works differently than at top level. Test both if your script does it.

Sample Production Script Skeleton

Pull together everything from L1-L41 into a template:

#!/usr/bin/env bash
#
# nightly-cleanup.sh — Remove stale temp files and rotate cleanup logs.
# Owner: platform-ops@example.com
# Runbook: https://runbooks.example.com/nightly-cleanup
# Lifecycle: see nightly-cleanup.lifecycle.yaml
#
# Env:
#   CLEANUP_DRY_RUN — if "true", logs intended actions but does not delete
#   CLEANUP_AGE_DAYS — files older than this are removed (default 7)
#
# Exit codes:
#   0  — success
#   1  — generic failure
#   2  — usage error
#   65 — input data invalid (sysexits.h EX_DATAERR)

set -o errexit -o nounset -o pipefail
IFS=$'\n\t'

readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

source /usr/local/lib/log.sh
source /usr/local/lib/err.sh
source /usr/local/lib/fs.sh
source /usr/local/lib/lock.sh
source /usr/local/lib/metrics.sh

readonly CLEANUP_AGE_DAYS=${CLEANUP_AGE_DAYS:-7}
readonly CLEANUP_DRY_RUN=${CLEANUP_DRY_RUN:-false}
readonly CLEANUP_DIRS=(/tmp /var/tmp /var/spool/myapp/work)

usage() {
  cat <<EOF
Usage: $SCRIPT_NAME [--dry-run] [--age-days N]
Remove files older than --age-days from configured cleanup dirs.
EOF
}

main() {
  parse_args "$@"
  acquire_lock "$SCRIPT_NAME" || { log_warn "another instance running"; exit 0; }

  log_info "starting cleanup; age=${CLEANUP_AGE_DAYS}d dry_run=${CLEANUP_DRY_RUN}"
  metric_set "myapp_cleanup_last_run_seconds" "$(date +%s)" "script=\"$SCRIPT_NAME\""
  SECONDS=0

  local removed_total=0
  for dir in "${CLEANUP_DIRS[@]}"; do
    [[ -d "$dir" ]] || { log_warn "missing dir: $dir"; continue; }
    local removed
    removed=$(cleanup_dir "$dir") || removed=0
    removed_total=$((removed_total + removed))
  done

  metric_set "myapp_cleanup_files_removed_total" "$removed_total" "script=\"$SCRIPT_NAME\""
  metric_set "myapp_cleanup_duration_seconds" "$SECONDS" "script=\"$SCRIPT_NAME\""
  log_info "removed $removed_total files in ${SECONDS}s"
}

cleanup_dir() {
  local dir="$1"
  local count=0
  while IFS= read -r -d '' file; do
    if "$CLEANUP_DRY_RUN"; then
      log_debug "would remove: $file"
    else
      rm -f "$file" && count=$((count + 1))
    fi
  done < <(find "$dir" -type f -mtime "+$CLEANUP_AGE_DAYS" -print0)
  echo "$count"
}

parse_args() {
  while (( $# > 0 )); do
    case "$1" in
      --dry-run)    CLEANUP_DRY_RUN=true ;;
      --age-days)   shift; CLEANUP_AGE_DAYS="$1" ;;
      -h|--help)    usage; exit 0 ;;
      *)            usage >&2; exit 2 ;;
    esac
    shift
  done
}

on_exit() {
  local rc=$1
  metric_set "myapp_cleanup_last_exit_code" "$rc" "script=\"$SCRIPT_NAME\""
  if (( rc == 0 )); then
    metric_set "myapp_cleanup_last_success_seconds" "$(date +%s)" "script=\"$SCRIPT_NAME\""
  fi
}
trap 'on_exit $?' EXIT

main "$@"

This template hits every category of the review checklist:

Boilerplate ✓
Argument handling ✓
Error handling ✓ (set -e, EXIT trap)
Idempotency ✓ (file removal is naturally idempotent)
Observability ✓ (4 standard metrics + structured log)
Safety ✓ (single-instance lock, dry-run flag)
Documentation ✓ (header block)

Combined with a Bats test file and a lifecycle.yaml, this is a script ready for ACTIVE state.

The Capstone Quick-Reference Card

THE SEVEN-CATEGORY REVIEW
  1. Boilerplate (shebang, set -e, trap)
  2. Args & UX (--help, --dry-run, validation)
  3. Error handling (stderr, exit codes, no silent suppress)
  4. Safety (idempotent, atomic, no secrets, sanitized inputs)
  5. Observability (logs, 4 metrics, audit log)
  6. Testing (bats, shellcheck, shfmt, CI)
  7. Documentation (header, runbook, CHANGELOG)

THE FIVE STATES
  DRAFT → PROVISIONAL → ACTIVE → DEPRECATED → RETIRED
  Each transition is owner-driven and tracked in lifecycle.yaml

THE FOUR STANDARD METRICS
  myapp_*_last_run_seconds       (when did it run?)
  myapp_*_last_success_seconds   (when did it last succeed?)
  myapp_*_last_exit_code         (what was the outcome?)
  myapp_*_duration_seconds       (how long did it take?)

THE FIVE SUNSET TRIGGERS
  1. Replaced by a real tool (with parallel-run validation)
  2. No remaining consumers (grep across infra)
  3. Failing for >30 days without fix
  4. Owner team departed without successor
  5. Annual review says "no longer needed"

THE LIBRARY FAMILY
  lib/log.sh, lib/err.sh, lib/fs.sh, lib/lock.sh,
  lib/observe.sh, lib/secrets.sh, lib/test.sh,
  lib/metrics.sh, lib/backup.sh, lib/db.sh,
  lib/loganalyze.sh, lib/heal.sh, lib/migrate.sh,
  lib/compliance.sh, lib/forensics.sh

THE EIGHT ANTI-PATTERNS
  1. "Just a quick script" without lifecycle.yaml
  2. Silent 2>/dev/null without justification
  3. OS-hardcoded paths
  4. i, tmp, data globals (no namespacing)
  5. Comments that lie
  6. Print-then-sleep
  7. Magic numbers
  8. Library behavior differences from sourcing

THE NUMBER THAT MATTERS
  Total active scripts: track it weekly. Up = debt; flat = managed.

Closing — What This Course Is Really About

If you’ve read all 42 lessons in order, you now have the equivalent of 4-5 years of senior-engineer apprenticeship in production shell scripting, distilled. But the deeper lesson isn’t any specific pattern.

The deeper lesson is shell is a serious engineering surface when treated with the same discipline as any compiled language: version-controlled, tested, monitored, reviewed, owned, lifecycle-managed, retired. Most teams treat shell as a graveyard of one-off tools because it’s easy to do that — write a script, drop it on a host, never think about it again. That’s how you end up with 200 scripts on every box, half of which fail silently every night, and nobody knows which ones still matter.

The investment in lifecycle policy, review checklist, standard metrics, and sunset criteria is not bureaucracy — it’s the cheapest way to keep shell scripts from becoming the most expensive part of your infrastructure five years out.

The series ends here. Use it. Keep the checklists on the wall. Retire scripts ruthlessly. And when in doubt, source lib/log.sh first.