Shell Lesson 42 of 42

Shell Style Guide Capstone: The Production Review Checklist, Lifecycle Policy, Metrics Every Script Should Emit & The Sunset Criteria For Retiring Scripts Cleanly

Why This Capstone Exists

A shell script that is good enough today is the easiest piece of software to ship. A shell script that is still earning its keep three years from now, after the original author has left and the surrounding system has been refactored twice, is one of the hardest. The difference is not language quality — it’s lifecycle discipline.

This capstone consolidates the 41 preceding lessons into the four artifacts every team should keep on the wall:

  1. The Production Review Checklist — the questions every PR introducing or modifying a production shell script must answer.
  2. The Lifecycle Policy — the documented states a script lives through, from prototype to active to deprecated to retired.
  3. The Standard Metrics Surface — what every production shell script must emit so monitoring sees it.
  4. The Sunset Criteria — explicit triggers for retiring a script before it becomes a maintenance hazard.

Following these four artifacts turns “shell scripts as a graveyard of one-off tools” into “shell scripts as a sustainable engineering surface.”

The Library Family From The Series

Across L1-L41 we built a layered library of shell helpers. Each script in production should source the libraries relevant to its role:

Library From lesson Purpose
lib/log.sh L7 Structured logging with levels, JSON output, log rotation
lib/err.sh L8 Error trap, stack trace, cleanup on exit
lib/fs.sh L28 Atomic file writes, temp-in-same-dir, fsync helpers
lib/lock.sh L21 flock single-instance, distributed locks via Redis
lib/observe.sh L25 Tracing helpers, structured events, span-id propagation
lib/secrets.sh L24 Vault/SSM lookup, credential masking, never-print discipline
lib/test.sh L31 Bats integration, fixture setup/teardown, assertion helpers
lib/metrics.sh L34 Prometheus textfile exporter, atomic .prom file writes
lib/backup.sh L35 sha256 manifests, GFS retention, S3 upload + verify
lib/db.sh L36 pg_dump/mysqldump pipelines, base backup, PITR drill
lib/loganalyze.sh L37 Streaming awk, mawk detection, fleet fan-out
lib/heal.sh L38 Detect-decide-act, idempotency, rate limit, circuit breaker
lib/migrate.sh L39 Resumable batch, watermark, staging cutover, sample-diff
lib/compliance.sh L40 Controls-as-tests, JSONL bundles, GPG signing, drift
lib/forensics.sh L41 Order-of-volatility capture, chain-of-custody log

A real production deploy keeps these in /usr/local/lib/ with mode 0644, owned by root:root. Scripts source them from the canonical path. Updates ship via the same Ansible/Puppet/cloud-init that ships system config — version-controlled, reviewed, and rolled out with the same discipline as any other infrastructure code.

The Production Review Checklist

Every PR that introduces or modifies a production shell script must satisfy seven categories. The checklist is intentionally numerous — most categories are one-line yes/no, and the explicit list ensures nothing is forgotten.

Category 1: Boilerplate & Shell Mode

#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail
IFS=$'\n\t'

readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

trap 'on_exit $?' EXIT
on_exit() { local rc=$1; ...; }

Category 2: Argument Handling

Category 3: Error Handling

Category 4: Idempotency & Safety

Category 5: Observability

Category 6: Testing

Category 7: Documentation

The Single-Page Reviewer’s Checklist

For pasting into a PR template:

SHELL SCRIPT REVIEW

Boilerplate
  [ ] #!/usr/bin/env bash
  [ ] set -euo pipefail (or justified)
  [ ] EXIT trap for cleanup
  [ ] readonly script-self vars

Args & UX
  [ ] --help with examples
  [ ] --dry-run if mutating
  [ ] Required args validated

Error handling
  [ ] No silent 2>/dev/null
  [ ] stderr for errors
  [ ] Documented exit codes

Safety
  [ ] Idempotent on re-run
  [ ] Atomic file writes (lib/fs.sh)
  [ ] No hardcoded secrets
  [ ] Inputs sanitized

Observability
  [ ] Structured logs
  [ ] Heartbeat metric
  [ ] Audit log if mutating

Testing
  [ ] Bats tests in tests/
  [ ] shellcheck clean
  [ ] shfmt clean
  [ ] CI green

Docs
  [ ] Header block
  [ ] Runbook updated
  [ ] CHANGELOG

The Lifecycle Policy

Every production shell script lives through five states. The transitions between states are explicit, owner-driven, and tracked.

   ┌──────────┐   review   ┌──────────┐   adoption   ┌──────────┐
   │ DRAFT    │───────────▶│ PROVISIONAL│────────────▶│ ACTIVE   │
   └──────────┘             └──────────┘              └──────────┘
                                                            │
                                                       deprecation
                                                            ▼
                                                      ┌──────────┐   sunset   ┌──────────┐
                                                      │ DEPRECATED│──────────▶│ RETIRED  │
                                                      └──────────┘             └──────────┘

State 1: DRAFT

Exit criteria: code review approval + at least one test.

State 2: PROVISIONAL

Exit criteria: 30 days of clean operation + observability metrics in place + runbook written.

State 3: ACTIVE

Exit criteria: someone files an issue marking it for deprecation.

State 4: DEPRECATED

Exit criteria: target retirement date passes AND no consumers remain.

State 5: RETIRED

The Lifecycle Tracker (METADATA file)

Every active script has a sibling <script>.lifecycle.yaml:

name: nightly-backup.sh
state: ACTIVE
owner_team: platform-storage
on_call_runbook: https://runbooks.example.com/nightly-backup
metrics_dashboard: https://grafana.example.com/d/backup
created: 2024-08-12
promoted_to_provisional: 2024-08-15
promoted_to_active: 2024-09-15
last_review: 2026-06-01
replacement_candidate: null
deprecated_after: null
retired_after: null
dependencies:
  - lib/log.sh
  - lib/backup.sh
  - lib/db.sh
notes: |
  Replaces the legacy 'backup-cron.pl' from 2018.

Annual review consists of: open every YAML, ask “is this still earning its keep?”, update last_review. Scripts where the answer is “no” get queued for deprecation.

The Owner Departure Trigger

When the owner of a script leaves the team, the script’s state transitions to:

This is the single biggest lever against the “scripts as graveyard” pattern. Without owner-departure triggers, every team accumulates 50+ orphan scripts within 5 years.

Standard Metrics Every Production Script Should Emit

Every production script emits at least four metrics. With lib/metrics.sh:

# 1. Last run timestamp (success or failure)
metric_set "myapp_script_last_run_seconds" "$(date +%s)" \
  "script=\"$SCRIPT_NAME\""

# 2. Last success timestamp
trap 'on_exit $?' EXIT
on_exit() {
  local rc=$1
  if (( rc == 0 )); then
    metric_set "myapp_script_last_success_seconds" "$(date +%s)" \
      "script=\"$SCRIPT_NAME\""
  fi
  metric_set "myapp_script_last_exit_code" "$rc" \
    "script=\"$SCRIPT_NAME\""
}

# 3. Duration
SECONDS=0
# ...work happens...
metric_set "myapp_script_duration_seconds" "$SECONDS" \
  "script=\"$SCRIPT_NAME\""

# 4. Domain-specific (rows processed, bytes uploaded, etc.)
metric_set "myapp_backup_bytes_total" "$(stat -c %s /var/backups/...)" \
  "script=\"$SCRIPT_NAME\""

Standard alert rules to wire up (reuse for every production script):

groups:
- name: shell-scripts
  rules:
  # Script hasn't run in 25h (expected nightly cron)
  - alert: ScriptStaleRun
    expr: time() - myapp_script_last_run_seconds > 86400 * 1.04
    for: 10m
    annotations:
      summary: "{{ $labels.script }} hasn't run in over 25 hours"

  # Script ran but failed
  - alert: ScriptLastRunFailed
    expr: myapp_script_last_exit_code != 0
    for: 1h
    annotations:
      summary: "{{ $labels.script }} last run failed with exit code {{ $value }}"

  # Script success is stale (ran recently but kept failing)
  - alert: ScriptLastSuccessStale
    expr: time() - myapp_script_last_success_seconds > 86400 * 2
    for: 30m
    annotations:
      summary: "{{ $labels.script }} hasn't succeeded in 2+ days"

  # Script duration anomaly (took 3× the median)
  - alert: ScriptDurationAnomaly
    expr: myapp_script_duration_seconds > 3 * avg_over_time(myapp_script_duration_seconds[14d])
    for: 0m
    annotations:
      summary: "{{ $labels.script }} took {{ $value }}s, 3× normal"

These four alert rules, applied uniformly across every production script, transform “did the cron run?” from a tribal knowledge question into a monitored property.

The Sunset Criteria

A script earns retirement when at least one of these triggers fires:

Trigger 1: Replaced By A Real Tool

The script’s job is now done by:

When this happens, run both in parallel for at least one full operational cycle (one week minimum, one month preferred), verify the new tool’s outputs match the old script’s, then deprecate.

Trigger 2: No Consumers

The script’s outputs (files, metrics, alerts) have no remaining consumer. Verify:

# Find anything still referencing the script
grep -r "nightly-backup.sh" /etc /opt /home /var/spool /usr/local/bin
grep -r "myapp_backup_bytes_total" /etc/prometheus  # any alerts?
git log --all --oneline -- scripts/active/nightly-backup.sh  # any recent activity?

If all three are empty, the script is unused. Deprecate immediately; retire after 30 days.

Trigger 3: Repeated Failures Without A Fix

If a script’s myapp_script_last_success_seconds has been stale for >30 days and nobody has been able (or willing) to fix it, the script is dead in fact if not in name. Deprecate it and either:

The worst state is “the cron is still listed but the script silently fails every night.” The retirement is more honest.

Trigger 4: Owner Team Departure With No Successor

Already covered in lifecycle. If owner_team is empty for >30 days, the script transitions to DEPRECATED automatically. After 90 more days, it retires.

Trigger 5: Annual Review Says “No Longer Needed”

The yearly check on last_review. The owner team explicitly says “this isn’t earning its keep.” Deprecate immediately.

The Retirement Ceremony

The day a script retires:

  1. Remove from cron / systemd: sudo systemctl disable --now myapp-script.timer.

  2. Move the file: git mv scripts/active/foo.sh scripts/retired/2026/foo.sh.

  3. Update its lifecycle.yaml: state: RETIRED, set retired_after.

  4. Archive monitoring: move alert rules to prometheus/rules/retired/.

  5. Final commit message:

    retire: scripts/active/nightly-backup.sh
    
    Replaced by Velero (https://velero.example.com).
    Velero has run in parallel for 30 days and outputs match.
    No remaining consumers. Cron entry removed.
    
  6. Note in team weekly: “Retired script X, total active scripts now N.”

The “total active scripts now N” metric is itself worth tracking. A team where N goes up monotonically is accumulating debt; a team where N stays flat or shrinks is managing its surface.

The Anti-Patterns To Watch For (And Reject At Review)

After 41 lessons, these are the patterns that should fail review every time:

Anti-Pattern 1: The “Just A Quick Script” That Lives For 5 Years

Every script in production was once a “just a quick script.” Skip the lifecycle policy at your peril. Reject at review if a draft is being merged without lifecycle.yaml.

Anti-Pattern 2: Silent 2>/dev/null Without Justification

Suppressing stderr hides bugs. Every 2>/dev/null should have a comment: # stderr suppressed because rm prints "no such file" but we don't care.

Anti-Pattern 3: Hardcoded Paths That Break On The Other OS

/proc/sys/kernel/... works on Linux, doesn’t exist on macOS / BSD. dscl works on macOS, doesn’t exist on Linux. Either explicitly target one OS or do feature detection (command -v ... >/dev/null && ...).

Anti-Pattern 4: Globals Named i, tmp, data

Bash has no real namespacing. A for i in ... in a sourced library can collide with the caller’s i. Always use descriptive names in libraries, and local everywhere.

Anti-Pattern 5: Comments That Lie

# fast path — copies in O(1) — but the code does a recursive directory walk. Outdated comments are worse than no comments.

Anti-Pattern 6: Print-Then-Sleep

echo "Restarting..."; sleep 5; restart_thing — if restart_thing fails, the operator has been told a lie. Print after the action succeeds, not before.

Anti-Pattern 7: Magic Numbers

sleep 30 — why 30? tail -n 1000 — why 1000? Either a constant with a name (readonly RETRY_BACKOFF_SECONDS=30) or a comment.

Anti-Pattern 8: Unsourced Library Behavior Differences

Some libraries source at runtime, some at parse time. source lib/foo.sh inside a function works differently than at top level. Test both if your script does it.

Sample Production Script Skeleton

Pull together everything from L1-L41 into a template:

#!/usr/bin/env bash
#
# nightly-cleanup.sh — Remove stale temp files and rotate cleanup logs.
# Owner: platform-ops@example.com
# Runbook: https://runbooks.example.com/nightly-cleanup
# Lifecycle: see nightly-cleanup.lifecycle.yaml
#
# Env:
#   CLEANUP_DRY_RUN — if "true", logs intended actions but does not delete
#   CLEANUP_AGE_DAYS — files older than this are removed (default 7)
#
# Exit codes:
#   0  — success
#   1  — generic failure
#   2  — usage error
#   65 — input data invalid (sysexits.h EX_DATAERR)

set -o errexit -o nounset -o pipefail
IFS=$'\n\t'

readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

source /usr/local/lib/log.sh
source /usr/local/lib/err.sh
source /usr/local/lib/fs.sh
source /usr/local/lib/lock.sh
source /usr/local/lib/metrics.sh

readonly CLEANUP_AGE_DAYS=${CLEANUP_AGE_DAYS:-7}
readonly CLEANUP_DRY_RUN=${CLEANUP_DRY_RUN:-false}
readonly CLEANUP_DIRS=(/tmp /var/tmp /var/spool/myapp/work)

usage() {
  cat <<EOF
Usage: $SCRIPT_NAME [--dry-run] [--age-days N]
Remove files older than --age-days from configured cleanup dirs.
EOF
}

main() {
  parse_args "$@"
  acquire_lock "$SCRIPT_NAME" || { log_warn "another instance running"; exit 0; }

  log_info "starting cleanup; age=${CLEANUP_AGE_DAYS}d dry_run=${CLEANUP_DRY_RUN}"
  metric_set "myapp_cleanup_last_run_seconds" "$(date +%s)" "script=\"$SCRIPT_NAME\""
  SECONDS=0

  local removed_total=0
  for dir in "${CLEANUP_DIRS[@]}"; do
    [[ -d "$dir" ]] || { log_warn "missing dir: $dir"; continue; }
    local removed
    removed=$(cleanup_dir "$dir") || removed=0
    removed_total=$((removed_total + removed))
  done

  metric_set "myapp_cleanup_files_removed_total" "$removed_total" "script=\"$SCRIPT_NAME\""
  metric_set "myapp_cleanup_duration_seconds" "$SECONDS" "script=\"$SCRIPT_NAME\""
  log_info "removed $removed_total files in ${SECONDS}s"
}

cleanup_dir() {
  local dir="$1"
  local count=0
  while IFS= read -r -d '' file; do
    if "$CLEANUP_DRY_RUN"; then
      log_debug "would remove: $file"
    else
      rm -f "$file" && count=$((count + 1))
    fi
  done < <(find "$dir" -type f -mtime "+$CLEANUP_AGE_DAYS" -print0)
  echo "$count"
}

parse_args() {
  while (( $# > 0 )); do
    case "$1" in
      --dry-run)    CLEANUP_DRY_RUN=true ;;
      --age-days)   shift; CLEANUP_AGE_DAYS="$1" ;;
      -h|--help)    usage; exit 0 ;;
      *)            usage >&2; exit 2 ;;
    esac
    shift
  done
}

on_exit() {
  local rc=$1
  metric_set "myapp_cleanup_last_exit_code" "$rc" "script=\"$SCRIPT_NAME\""
  if (( rc == 0 )); then
    metric_set "myapp_cleanup_last_success_seconds" "$(date +%s)" "script=\"$SCRIPT_NAME\""
  fi
}
trap 'on_exit $?' EXIT

main "$@"

This template hits every category of the review checklist:

Combined with a Bats test file and a lifecycle.yaml, this is a script ready for ACTIVE state.

The Capstone Quick-Reference Card

THE SEVEN-CATEGORY REVIEW
  1. Boilerplate (shebang, set -e, trap)
  2. Args & UX (--help, --dry-run, validation)
  3. Error handling (stderr, exit codes, no silent suppress)
  4. Safety (idempotent, atomic, no secrets, sanitized inputs)
  5. Observability (logs, 4 metrics, audit log)
  6. Testing (bats, shellcheck, shfmt, CI)
  7. Documentation (header, runbook, CHANGELOG)

THE FIVE STATES
  DRAFT → PROVISIONAL → ACTIVE → DEPRECATED → RETIRED
  Each transition is owner-driven and tracked in lifecycle.yaml

THE FOUR STANDARD METRICS
  myapp_*_last_run_seconds       (when did it run?)
  myapp_*_last_success_seconds   (when did it last succeed?)
  myapp_*_last_exit_code         (what was the outcome?)
  myapp_*_duration_seconds       (how long did it take?)

THE FIVE SUNSET TRIGGERS
  1. Replaced by a real tool (with parallel-run validation)
  2. No remaining consumers (grep across infra)
  3. Failing for >30 days without fix
  4. Owner team departed without successor
  5. Annual review says "no longer needed"

THE LIBRARY FAMILY
  lib/log.sh, lib/err.sh, lib/fs.sh, lib/lock.sh,
  lib/observe.sh, lib/secrets.sh, lib/test.sh,
  lib/metrics.sh, lib/backup.sh, lib/db.sh,
  lib/loganalyze.sh, lib/heal.sh, lib/migrate.sh,
  lib/compliance.sh, lib/forensics.sh

THE EIGHT ANTI-PATTERNS
  1. "Just a quick script" without lifecycle.yaml
  2. Silent 2>/dev/null without justification
  3. OS-hardcoded paths
  4. i, tmp, data globals (no namespacing)
  5. Comments that lie
  6. Print-then-sleep
  7. Magic numbers
  8. Library behavior differences from sourcing

THE NUMBER THAT MATTERS
  Total active scripts: track it weekly. Up = debt; flat = managed.

Closing — What This Course Is Really About

If you’ve read all 42 lessons in order, you now have the equivalent of 4-5 years of senior-engineer apprenticeship in production shell scripting, distilled. But the deeper lesson isn’t any specific pattern.

The deeper lesson is shell is a serious engineering surface when treated with the same discipline as any compiled language: version-controlled, tested, monitored, reviewed, owned, lifecycle-managed, retired. Most teams treat shell as a graveyard of one-off tools because it’s easy to do that — write a script, drop it on a host, never think about it again. That’s how you end up with 200 scripts on every box, half of which fail silently every night, and nobody knows which ones still matter.

The investment in lifecycle policy, review checklist, standard metrics, and sunset criteria is not bureaucracy — it’s the cheapest way to keep shell scripts from becoming the most expensive part of your infrastructure five years out.

The series ends here. Use it. Keep the checklists on the wall. Retire scripts ruthlessly. And when in doubt, source lib/log.sh first.

shellstyle-guidereview-checklistlifecyclesunsetproduction-readinessengineering-disciplinecapstonemetricsdeprecation
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments