Shell Monitoring Agents: Writing Prometheus Exporters, Health Probes, Watchdogs & Liveness/Readiness Endpoints From Bash

Why Shell-Native Monitoring Matters Even With Vendors Everywhere

You have Datadog. You have Prometheus. You have CloudWatch. So why write monitoring in shell?

Because the gap between “what your vendor sees” and “what’s actually true” is exactly the surface where outages live. Vendor agents collect what their schema knows about; they don’t know that your nightly batch job emits a sentinel file at /var/lib/jobs/last-success, that your custom build of nginx puts a status JSON at /run/nginx/status.json, or that the real health of your app is “the queue depth in /var/spool/myapp is < 1000.” Shell-native monitoring lets you measure exactly those things.

The four shell-script monitoring patterns:

Pattern	What it measures	Where it runs	Output
Textfile exporter	Custom metrics for Prometheus	Cron / timer	Files in `/var/lib/node_exporter/textfile_collector/`
HTTP health probe	Liveness/readiness for LB or k8s	Service container	HTTP 200 or 5xx via simple HTTP server
Watchdog	Detect “alive but stuck”	sd_notify or external	systemd restart / alert
Push agent	Active reporting to dashboards	Continuous	HTTP POST to ingestion endpoint

This lesson teaches the discipline of each pattern, the Prometheus exposition format your scripts must produce, the difference between liveness and readiness probes (and why getting it wrong cascades incidents), and a lib/metrics.sh you can source.

The Prometheus Exposition Format (5-Minute Tutorial)

Prometheus scrapes targets that expose metrics in a specific text format. The format is plain text, line-oriented, designed for shell scripts to emit:

# HELP myapp_jobs_total Total jobs processed.
# TYPE myapp_jobs_total counter
myapp_jobs_total{queue="orders",status="success"} 12345
myapp_jobs_total{queue="orders",status="failure"} 17
myapp_jobs_total{queue="billing",status="success"} 9821

# HELP myapp_queue_depth Current queue depth.
# TYPE myapp_queue_depth gauge
myapp_queue_depth{queue="orders"} 42
myapp_queue_depth{queue="billing"} 7

# HELP myapp_request_duration_seconds Request latency.
# TYPE myapp_request_duration_seconds histogram
myapp_request_duration_seconds_bucket{le="0.1"} 1450
myapp_request_duration_seconds_bucket{le="0.5"} 1490
myapp_request_duration_seconds_bucket{le="1.0"} 1500
myapp_request_duration_seconds_bucket{le="+Inf"} 1500
myapp_request_duration_seconds_sum 234.5
myapp_request_duration_seconds_count 1500

The four metric types

Type	Semantic	Example
`counter`	Monotonically increasing; reset only on process restart	`requests_total`, `errors_total`
`gauge`	Goes up and down	`memory_bytes`, `queue_depth`
`histogram`	Sample distribution into buckets	`request_duration_seconds`
`summary`	Like histogram but with quantiles computed at the source	Less common in shell

Format rules

One sample per line.
Optional # HELP <name> <text> and # TYPE <name> <type> lines describe the metric and must come before its samples.
Labels in {key="value",key2="value2"}: comma-separated, double-quoted values.
Value is a float (integers are accepted, as are NaN, +Inf, and -Inf).
An optional trailing timestamp in milliseconds since epoch (rarely used; node_exporter's textfile collector rejects files that contain timestamps).
Empty lines and extra whitespace between tokens are ignored.
The whole exposition must end with a newline.

The format is forgiving about whitespace but strict about structure: a missing newline at the end, or unquoted label values, breaks parsing.
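
The rules above can be checked mechanically before a file ever reaches the exporter. The following is a minimal sketch of such a lint, not an official Prometheus tool: prom_check_file is a hypothetical helper name, and the regular expression is deliberately rough (it does not validate the numeric value and rejects the legal but unusual empty label set {}).

```shell
#!/usr/bin/env bash
# prom_check_file FILE
# Rough structural lint for a Prometheus text-format file (hypothetical helper).
# Returns 0 if the file looks well-formed, 1 otherwise.
prom_check_file() {
  local file="$1" line n=0
  # One label: name="value", with backslash escapes allowed inside the quotes.
  local label='[a-zA-Z_][a-zA-Z0-9_]*="([^"\\]|\\.)*"'
  # One sample: metric{labels} value [timestamp]
  local sample_re="^[a-zA-Z_:][a-zA-Z0-9_:]*(\\{${label}(,${label})*,?\\})? [^ ]+( -?[0-9]+)?\$"

  [[ -s "$file" ]] || { echo "empty or missing: $file" >&2; return 1; }

  # Command substitution strips a trailing newline, so a non-empty result
  # means the last byte of the file is something else.
  if [[ -n "$(tail -c 1 "$file")" ]]; then
    echo "missing trailing newline: $file" >&2
    return 1
  fi

  while IFS= read -r line; do
    n=$(( n + 1 ))
    # Skip blank lines and comment lines (# HELP, # TYPE, plain comments).
    if [[ -z "${line//[[:space:]]/}" || "$line" == \#* ]]; then continue; fi
    if ! [[ "$line" =~ $sample_re ]]; then
      echo "line $n: malformed sample: $line" >&2
      return 1
    fi
  done < "$file"
}
```

Running prom_check_file on the temp file just before the final rename is one place to hook it in, so a malformed file is never published.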

Pattern 1: Textfile Exporter

The simplest pattern. node_exporter (the standard host-metrics agent) has a --collector.textfile.directory flag that picks up any *.prom file from a directory and exposes its contents as part of its scrape output.

# /etc/cron.d/nightly-job-metrics
*/5 * * * * root /opt/myapp/bin/emit-metrics.sh

# /opt/myapp/bin/emit-metrics.sh
#!/usr/bin/env bash
set -Eeuo pipefail

OUT=/var/lib/node_exporter/textfile_collector
TMP=$(mktemp "${OUT}/myapp.prom.XXXXXX")
trap 'rm -f "$TMP"' EXIT

# Compute metrics.
queue_depth=$(find /var/spool/myapp -type f | wc -l)
last_success_age=$(( $(date +%s) - $(stat -c %Y /var/lib/myapp/last-success 2>/dev/null || echo 0) ))
disk_used_pct=$(df --output=pcent /var/lib/myapp | tail -1 | tr -d ' %')

# Emit.
cat >"$TMP" <<EOF
# HELP myapp_queue_depth Pending jobs.
# TYPE myapp_queue_depth gauge
myapp_queue_depth $queue_depth

# HELP myapp_last_success_age_seconds Seconds since last successful run.
# TYPE myapp_last_success_age_seconds gauge
myapp_last_success_age_seconds $last_success_age

# HELP myapp_disk_used_percent Disk usage of /var/lib/myapp.
# TYPE myapp_disk_used_percent gauge
myapp_disk_used_percent $disk_used_pct
EOF

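# mktemp creates the file with mode 0600; make it readable by node_exporter,
# which usually runs as a different user.
chmod 0644 "$TMP"

# Atomic move into place, so the exporter never reads a partial file.
mv "$TMP" "$OUT/myapp.prom"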
trap - EXIT

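The atomic-move-from-tmp pattern is critical: node_exporter reads *.prom files at scrape time. If you write directly with >, the exporter can read a half-written file and emit garbage to Prometheus. Always tmp-then-rename in the same directory: a rename is atomic only within a single filesystem, and the temp file's .prom.XXXXXX name keeps the collector from picking it up before it is complete.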
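
The same tmp-then-rename step can be factored into a small reusable function. This is a sketch under the assumption that the producer writes its metrics to stdout; write_prom_atomic is a hypothetical name, not part of node_exporter.

```shell
#!/usr/bin/env bash
# write_prom_atomic DEST
# Copies stdin to DEST via a temp file created in DEST's own directory, so the
# final rename never crosses a filesystem boundary (hypothetical helper).
write_prom_atomic() {
  local dest="$1" tmp
  tmp=$(mktemp "${dest}.XXXXXX") || return 1
  # mktemp creates 0600 files; widen the mode before publishing.
  if cat > "$tmp" && chmod 0644 "$tmp" && mv -f "$tmp" "$dest"; then
    return 0
  fi
  rm -f "$tmp"
  return 1
}

# Example call (path taken from the script above):
#   printf 'myapp_queue_depth %d\n' 42 |
#     write_prom_atomic /var/lib/node_exporter/textfile_collector/myapp.prom
```

Because the function cleans up its own temp file on failure, the caller does not need an EXIT trap for it.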

Why textfile is the right pattern for batch / scheduled work

No HTTP server in your script.
node_exporter is already running on the host; you piggyback on its endpoint.
Cron-driven, so works for jobs that run periodically (backups, sync jobs, batch).
Survives if your script dies — the .prom file remains; Prometheus sees stale data, alerts on freshness.
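
Alerting on freshness needs a timestamp to compare against. node_exporter already exposes the file's modification time as node_textfile_mtime_seconds; an alternative sketch, shown here, is to write your own "last run" gauge into the file. The function name and the default metric name are illustrative assumptions.

```shell
#!/usr/bin/env bash
# prom_last_run_metric [NAME]
# Prints a gauge holding the current Unix time, for staleness alerts such as
#   time() - myapp_metrics_last_run_timestamp_seconds > 900
# (hypothetical helper; the metric name is an assumption).
prom_last_run_metric() {
  local name="${1:-myapp_metrics_last_run_timestamp_seconds}"
  printf '# HELP %s Unix time at which this file was generated.\n' "$name"
  printf '# TYPE %s gauge\n' "$name"
  printf '%s %s\n' "$name" "$(date +%s)"
}
```

Appending its output to the metrics written by the cron job means a dead script shows up as a growing age rather than as silently stale values.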

Pattern 2: HTTP Health Probe

For liveness/readiness probes, you need an HTTP endpoint. The simplest approach is to have socat or ncat listen on a port and return a static or computed response:

# /opt/myapp/bin/healthd
#!/usr/bin/env bash
set -Eeuo pipefail

PORT=${PORT:-8080}

while :; do
  # ncat -k keeps listening and runs the handler once per connection;
  # the loop only restarts ncat if it ever exits.
  ncat -l -p "$PORT" -k -e /opt/myapp/bin/health-handler.sh
done

# /opt/myapp/bin/health-handler.sh — invoked per request
#!/usr/bin/env bash
set -Eeuo pipefail

# Compute health.
last_heartbeat=$(stat -c %Y /var/lib/myapp/heartbeat 2>/dev/null || echo 0)
age=$(( $(date +%s) - last_heartbeat ))

if (( age < 30 )); then
  status_code="200 OK"
  body='{"status":"healthy","heartbeat_age_seconds":'"$age"'}'
else
  status_code="503 Service Unavailable"
  body='{"status":"unhealthy","heartbeat_age_seconds":'"$age"'}'
fi

# Read the request line (we don't care about its contents but must consume).
read -r request_line || true

printf 'HTTP/1.1 %s\r\n' "$status_code"
printf 'Content-Type: application/json\r\n'
printf 'Content-Length: %d\r\n' "${#body}"
printf 'Connection: close\r\n'
printf '\r\n'
printf '%s' "$body"

A matching Kubernetes liveness probe (the handler above ignores the request path, so /healthz is arbitrary):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

Liveness vs Readiness: get this distinction right

Liveness: “is the process alive enough to be useful?” Failed liveness → restart the process. Should rarely fail; a transient failure ≠ kill.

Readiness: “is the process ready to serve traffic right now?” Failed readiness → remove from load-balancer rotation. Can flap freely; useful for “still warming up” or “circuit breaker open.”

Common bug: making liveness check too strict (e.g., requires an external database) — when the DB blips, all replicas fail liveness, k8s restarts them all simultaneously, and the DB blip becomes a full outage.

Rule: liveness checks only test “this process is responsive.” Readiness checks test “I’m fully functional.” External dependency checks belong in readiness, never in liveness.

# /healthz/live — only checks process self
liveness_check() {
  # Did our event loop tick recently?
  local hb_age
  hb_age=$(( $(date +%s) - $(stat -c %Y /var/lib/myapp/heartbeat 2>/dev/null || echo 0) ))
  (( hb_age < 60 ))   # tolerate up to 60s — restart is expensive
}

# /healthz/ready — checks downstream dependencies
readiness_check() {
  # Can we reach the database?
  pg_isready -h "$DB_HOST" -p 5432 -t 2 >/dev/null 2>&1 || return 1
  # Is the cache warm?
  [[ -f /var/lib/myapp/cache.warm ]] || return 1
  return 0
}
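
Readiness checks that touch the network should also carry a hard time limit, so a hung dependency cannot hang the probe itself. The following is a minimal sketch that wraps any external command in coreutils timeout; run_check is a hypothetical helper, and it only works for external commands, not shell functions.

```shell
#!/usr/bin/env bash
# run_check NAME SECONDS CMD [ARGS...]
# Runs CMD under coreutils `timeout`. On failure or timeout, prints NAME and
# returns 1; on success, prints nothing and returns 0 (hypothetical helper).
run_check() {
  local name="$1" limit="$2"
  shift 2
  if timeout "$limit" "$@" >/dev/null 2>&1; then
    return 0
  fi
  printf '%s\n' "$name"
  return 1
}

# Example: collect the names of failing dependencies for a readiness response.
#   failed=$(run_check db 3 pg_isready -h "$DB_HOST" -p 5432 -t 2 || true)
```

The outer limit is set slightly above the tool's own timeout (3s around pg_isready -t 2) so it acts as a backstop rather than the primary limit.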

Pattern 3: Watchdog — Detecting “Alive But Stuck”

A liveness probe answers “is the process responding?” A watchdog answers “is the process making progress?” — a much harder question.

The pattern:

The main loop writes a “heartbeat” timestamp on every iteration.
A separate watchdog reads the heartbeat; if it’s stale, the process is stuck.
Action: kill the process (so systemd restarts it) or trigger an alert.

sd_notify watchdog (preferred for systemd-managed services)

# In your service script:
main_loop() {
  systemd-notify --ready --status="Started"
  while :; do
    process_one_batch || break
    systemd-notify --status="Last batch: $(date -u +%FT%TZ)" WATCHDOG=1
    sleep 5
  done
}

In the unit file:

[Service]
Type=notify
WatchdogSec=30
Restart=on-failure

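WatchdogSec=30: if 30s pass without WATCHDOG=1, systemd considers the service hung, aborts it (SIGABRT by default), and Restart=on-failure brings it back. The script must send WATCHDOG=1 more often than every 30s, so the sleep plus the slowest batch has to fit inside that window. Half the timeout is a good upper bound for the ping interval (a 30s watchdog → ping at least every 15s; the loop above sleeps 5s between batches). Because systemd-notify runs as a child of the script, the unit may also need NotifyAccess=all for the messages to be accepted, particularly when the service runs unprivileged.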
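
Rather than hard-coding the interval, a script can derive it from WATCHDOG_USEC, the environment variable systemd sets for services that have WatchdogSec configured. A minimal sketch of the "half the timeout" rule, with watchdog_interval as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# watchdog_interval
# Prints the suggested ping interval in whole seconds: half of WATCHDOG_USEC,
# at least 1. Prints 0 when no watchdog is configured (hypothetical helper).
watchdog_interval() {
  local usec="${WATCHDOG_USEC:-0}"
  # Treat anything that is not a plain non-negative integer as "not configured".
  [[ "$usec" =~ ^[0-9]+$ ]] || usec=0
  usec=$(( 10#$usec ))               # force base 10 even with leading zeros
  if (( usec == 0 )); then
    echo 0
    return 0
  fi
  local secs=$(( usec / 2000000 ))   # microseconds -> seconds, then halved
  (( secs >= 1 )) || secs=1
  echo "$secs"
}
```

With WatchdogSec=30 the function prints 15, which could replace the fixed sleep in the loop as long as a batch itself completes well inside the remaining time.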

External watchdog (for non-systemd or cron-driven contexts)

# /opt/myapp/bin/watchdog.sh — runs from cron every minute
#!/usr/bin/env bash
set -Eeuo pipefail

HEARTBEAT=/var/lib/myapp/heartbeat
MAX_AGE=120
PID_FILE=/var/run/myapp.pid

if [[ ! -f "$HEARTBEAT" || ! -f "$PID_FILE" ]]; then
  echo "watchdog: no heartbeat or pidfile; nothing to do" >&2
  exit 0
fi

age=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT") ))
pid=$(cat "$PID_FILE")

if (( age > MAX_AGE )) && kill -0 "$pid" 2>/dev/null; then
  echo "watchdog: pid=$pid heartbeat is ${age}s stale; SIGTERM"
  kill -TERM "$pid"
  sleep 5
  if kill -0 "$pid" 2>/dev/null; then
    echo "watchdog: pid=$pid still alive; SIGKILL"
    kill -KILL "$pid"
  fi
fi

The kill-with-grace pattern: SIGTERM first, give 5 seconds, then SIGKILL. SIGTERM lets the process clean up (close DB connections, flush buffers); SIGKILL is the hammer when grace is over.
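
A fixed sleep 5 always waits the full grace period, even when the process exits at once. A variation, sketched here under the assumption that one-second polling is fine-grained enough, checks once per second and stops waiting as soon as the process is gone. terminate_gracefully is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# terminate_gracefully PID [GRACE_SECONDS]
# Sends SIGTERM, polls once per second for up to GRACE_SECONDS (default 5),
# then sends SIGKILL if the process is still there (hypothetical helper).
terminate_gracefully() {
  local pid="$1" grace="${2:-5}" waited=0
  kill -0 "$pid" 2>/dev/null || return 0        # already gone
  kill -TERM "$pid" 2>/dev/null || return 0     # vanished between the two calls
  while (( waited < grace )); do
    kill -0 "$pid" 2>/dev/null || return 0      # exited within the grace period
    sleep 1
    waited=$(( waited + 1 ))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid" 2>/dev/null || true       # grace is over
  fi
  return 0
}
```

Note that kill -0 only proves that some process with that PID exists; if PIDs may be reused on the host, verify the process identity before signalling.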

Pattern 4: Push-Based Reporting

Some monitoring systems ingest metrics over HTTP rather than scraping. A push agent runs continuously, computes metrics, sends them to the ingestion endpoint:

# /opt/myapp/bin/push-agent.sh
#!/usr/bin/env bash
set -Eeuo pipefail

METRICS_URL="${METRICS_URL:?metrics URL required}"
METRICS_TOKEN="${METRICS_TOKEN:?token required}"
INTERVAL=${INTERVAL:-60}

while :; do
  # Build payload.
  ts=$(date +%s)
  payload=$(jq -nc \
    --arg ts "$ts" \
    --arg host "$(hostname)" \
    --arg cpu "$(awk '{print $1}' /proc/loadavg)" \
    --arg mem "$(awk '/MemAvailable:/ {print $2}' /proc/meminfo)" \
    '{
      timestamp: ($ts | tonumber),
      host: $host,
      metrics: {
        load_1min: ($cpu | tonumber),
        memory_available_kb: ($mem | tonumber)
      }
    }')

  # Send. Don't crash on transient failures.
  if ! curl -fsS -X POST \
       -H "Authorization: Bearer $METRICS_TOKEN" \
       -H "Content-Type: application/json" \
       --max-time 10 \
       --data "$payload" \
       "$METRICS_URL"; then
    echo "$(date -u +%FT%TZ) push failed" >&2
    # Continue; maybe next iteration succeeds.
  fi

  sleep "$INTERVAL"
done

Run it under systemd with Restart=on-failure so a crash doesn’t silently stop reporting.
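
When the ingestion endpoint is down for a while, retrying at the normal interval just produces a stream of failures. One option, sketched here as an assumption rather than something the agent above does, is to double the delay after each failed push and reset it after a success. next_delay and push_once are hypothetical names.

```shell
#!/usr/bin/env bash
# next_delay CURRENT MAX
# Prints CURRENT doubled, capped at MAX (both in whole seconds).
next_delay() {
  local current="$1" max="$2"
  local next=$(( current * 2 ))
  (( next <= max )) || next=$max
  echo "$next"
}

# Sketch of use inside the agent loop (push_once stands for the curl call):
#   delay=$INTERVAL
#   while :; do
#     if push_once; then delay=$INTERVAL; else delay=$(next_delay "$delay" 900); fi
#     sleep "$delay"
#   done
```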

A Drop-In Library: `lib/metrics.sh`

# lib/metrics.sh — emit Prometheus textfile metrics from any script.

: "${METRICS_DIR:=/var/lib/node_exporter/textfile_collector}"
: "${METRICS_NAMESPACE:=myapp}"

# Internal: collected metrics buffered in associative arrays (bash 4+).
declare -A METRIC_HELP METRIC_TYPE
declare -a METRIC_LINES

metrics_init() {
  METRIC_HELP=()
  METRIC_TYPE=()
  METRIC_LINES=()
}

# Declare a metric. Idempotent.
metrics_declare() {
  local name="$1" type="$2" help="$3"
  METRIC_HELP["$name"]="$help"
  METRIC_TYPE["$name"]="$type"
}

# Add a sample. labels can be empty.
metrics_set() {
  local name="$1" value="$2" labels="${3:-}"
  if [[ -n "$labels" ]]; then
    METRIC_LINES+=("${name}{${labels}} ${value}")
  else
    METRIC_LINES+=("${name} ${value}")
  fi
}

# Increment a counter (read existing, add). Useful for cron-driven counters.
metrics_inc() {
  local name="$1" labels="${2:-}" by="${3:-1}"
  local file="${METRICS_DIR}/${METRICS_NAMESPACE}.counters"
  local key
  if [[ -n "$labels" ]]; then
    key="${name}{${labels}}"
  else
    key="${name}"
  fi
  # File format: "key value"
  local current
  current=$(awk -v k="$key" '$1==k {print $2; exit}' "$file" 2>/dev/null || echo 0)
  current=${current:-0}
  local new=$(( current + by ))
  # Atomic update via temp file.
  local tmp
  tmp=$(mktemp "${file}.XXXXXX")
  awk -v k="$key" -v v="$new" '
    $1==k {print k, v; found=1; next}
    {print}
    END { if (!found) print k, v }
  ' "$file" 2>/dev/null > "$tmp" || echo "$key $new" > "$tmp"
  mv "$tmp" "$file"
}

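# Emit all collected metrics atomically, including counters persisted by metrics_inc.
metrics_emit() {
  local outfile="${METRICS_DIR}/${METRICS_NAMESPACE}.prom"
  local counters="${METRICS_DIR}/${METRICS_NAMESPACE}.counters"
  local tmp line metric base family i
  local -a lines=() families=() family_of=()
  local -A seen=()

  # Samples from metrics_set first, then the counters file ("key value" lines).
  lines=(${METRIC_LINES[@]+"${METRIC_LINES[@]}"})
  if [[ -r "$counters" ]]; then
    while IFS= read -r line; do
      if [[ -n "$line" ]]; then lines+=("$line"); fi
    done < "$counters"
  fi

  # Map each sample to its metric family. Histogram and summary samples
  # (_bucket, _sum, _count) belong to the declared base name.
  for line in ${lines[@]+"${lines[@]}"}; do
    metric="${line%%[ {]*}"
    base="${metric%_bucket}"; base="${base%_sum}"; base="${base%_count}"
    case "${METRIC_TYPE[$base]:-}" in
      histogram|summary) family="$base" ;;
      *)                 family="$metric" ;;
    esac
    family_of+=("$family")
    if [[ -z "${seen[$family]:-}" ]]; then
      seen[$family]=1
      families+=("$family")
    fi
  done

  tmp=$(mktemp "${outfile}.XXXXXX") || return 1
  {
    # Keep every family's samples together, directly under its HELP/TYPE lines.
    for family in ${families[@]+"${families[@]}"}; do
      if [[ -n "${METRIC_HELP[$family]:-}" ]]; then
        printf '# HELP %s %s\n' "$family" "${METRIC_HELP[$family]}"
      fi
      printf '# TYPE %s %s\n' "$family" "${METRIC_TYPE[$family]:-untyped}"
      for i in "${!lines[@]}"; do
        if [[ "${family_of[$i]}" == "$family" ]]; then printf '%s\n' "${lines[$i]}"; fi
      done
    done
  } > "$tmp"

  # mktemp creates 0600 files; widen the mode, then rename into place. A rename
  # within one directory is atomic, whereas install(1) copies the file.
  if chmod 0644 "$tmp" && mv -f "$tmp" "$outfile"; then return 0; fi
  rm -f "$tmp"
  return 1
}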

# ─── Health endpoint helpers ───────────────────────────────────────────────

health_response_ok() {
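  # "${1:-{...}}" would end the expansion at the first "}" and append a stray
  # brace to caller-supplied bodies, so the default is applied separately.
  local body="${1:-}"
  if [[ -z "$body" ]]; then body='{"status":"healthy"}'; fi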
  printf 'HTTP/1.1 200 OK\r\n'
  printf 'Content-Type: application/json\r\n'
  printf 'Content-Length: %d\r\n' "${#body}"
  printf 'Connection: close\r\n\r\n'
  printf '%s' "$body"
}

health_response_unhealthy() {
  local reason="${1:-unhealthy}"
  local body="{\"status\":\"unhealthy\",\"reason\":\"$reason\"}"
  printf 'HTTP/1.1 503 Service Unavailable\r\n'
  printf 'Content-Type: application/json\r\n'
  printf 'Content-Length: %d\r\n' "${#body}"
  printf 'Connection: close\r\n\r\n'
  printf '%s' "$body"
}

# ─── Heartbeat ─────────────────────────────────────────────────────────────

heartbeat_write() {
  local file="${1:-/var/lib/myapp/heartbeat}"
  date -u +%s > "$file"
}

heartbeat_age() {
  local file="${1:-/var/lib/myapp/heartbeat}"
  echo $(( $(date +%s) - $(stat -c %Y "$file" 2>/dev/null || echo 0) ))
}

Usage:

. /opt/myapp/lib/metrics.sh
metrics_init

metrics_declare myapp_queue_depth gauge "Pending jobs."
metrics_set    myapp_queue_depth "$(find /var/spool/myapp -type f | wc -l)"

metrics_declare myapp_jobs_total counter "Total jobs processed."
metrics_set    myapp_jobs_total 12345 'queue="orders",status="success"'
metrics_set    myapp_jobs_total 17    'queue="orders",status="failure"'

metrics_emit

Real-World Recipes

Recipe 1: Emit metrics about backup freshness

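# Use a per-script namespace so this script's output file (backups.prom) does
# not overwrite files written by other scripts that source the library.
METRICS_NAMESPACE=backups
. /opt/myapp/lib/metrics.sh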
metrics_init

backup_dir=/backups
metrics_declare backup_age_seconds gauge "Age of latest backup."
metrics_declare backup_size_bytes  gauge "Size of latest backup."
metrics_declare backup_count       gauge "Number of backups retained."

for app in myapp app2 app3; do
  latest=$(ls -t "$backup_dir/$app/"*.tar.gz 2>/dev/null | head -1)
  if [[ -n "$latest" ]]; then
    age=$(( $(date +%s) - $(stat -c %Y "$latest") ))
    size=$(stat -c %s "$latest")
    count=$(ls "$backup_dir/$app/"*.tar.gz 2>/dev/null | wc -l)
    metrics_set backup_age_seconds "$age" "app=\"$app\""
    metrics_set backup_size_bytes  "$size" "app=\"$app\""
    metrics_set backup_count       "$count" "app=\"$app\""
  fi
done

metrics_emit

Schedule via cron (*/5 * * * *). In Prometheus, alert on backup_age_seconds > 86400 per app. An app with no backups at all emits no series, so pair that rule with an absent() alert.

Recipe 2: HTTP health endpoint with multiple dependency checks

#!/usr/bin/env bash
. /opt/myapp/lib/metrics.sh
set -Eeuo pipefail

PORT="${PORT:-8080}"

handle_request() {
  read -r request_line || return
  # Read remaining headers until empty line.
  # Stop at the blank line that ends the headers, whether it arrives as CRLF or bare LF.
  while IFS= read -r line && [[ -n "${line%$'\r'}" ]]; do :; done

  local path
  path=$(echo "$request_line" | awk '{print $2}')

  case "$path" in
    /healthz/live)
      # Just check we're processing.
      if (( $(heartbeat_age /var/lib/myapp/heartbeat) < 60 )); then
        health_response_ok '{"status":"alive"}'
      else
        health_response_unhealthy "heartbeat stale"
      fi
      ;;
    /healthz/ready)
      # Check downstream dependencies.
      local issues=()
      pg_isready -h "$DB_HOST" -t 2 >/dev/null 2>&1 || issues+=("db")
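      # Redis does not speak HTTP, so curl cannot check it; use redis-cli
      # (assumed to be installed) under a hard timeout instead.
      timeout 2 redis-cli -u "$REDIS_URL" ping >/dev/null 2>&1 || issues+=("redis")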
      if [[ ${#issues[@]} -eq 0 ]]; then
        health_response_ok '{"status":"ready"}'
      else
        health_response_unhealthy "deps: ${issues[*]}"
      fi
      ;;
    *)
      printf 'HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n'
      ;;
  esac
}

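# ncat runs "$0 --handle" once per connection with the socket on stdin/stdout,
# so $0 must be an absolute path. Any other invocation starts the listener.
if [[ "${1:-}" == "--handle" ]]; then
  handle_request
  exit 0
fi

while :; do
  ncat -l -p "$PORT" -k -e "$0 --handle"
done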

Recipe 3: External watchdog with metrics

# /opt/myapp/bin/watchdog.sh: runs every minute via cron.
METRICS_NAMESPACE=watchdog   # own output file, so other scripts' metrics are not overwritten
. /opt/myapp/lib/metrics.sh
metrics_init

services=(myapp-api myapp-worker myapp-scheduler)

metrics_declare service_active gauge "Service active state (1=active)."
metrics_declare service_restart_count counter "Service restart count."

for svc in "${services[@]}"; do
  if systemctl is-active --quiet "$svc"; then
    metrics_set service_active 1 "service=\"$svc\""
  else
    metrics_set service_active 0 "service=\"$svc\""
    # Try to restart.
    if systemctl restart "$svc"; then
      metrics_inc service_restart_count "service=\"$svc\""
    fi
  fi
done

metrics_emit

Recipe 4: Histogram-style request latency exporter

# Process a log file, emit a latency histogram.
# Log format: "GET /api/users 0.342s 200"

METRICS_NAMESPACE=latency   # own output file, separate from other scripts
. /opt/myapp/lib/metrics.sh
metrics_init

log=/var/log/myapp/access.log
buckets=(0.1 0.5 1.0 2.0 5.0)

# The log lines carry no timestamp, so the window is "the last 10000 requests"
# rather than a time range. Counts can therefore shrink between runs; treat the
# result as a snapshot, not as a true cumulative Prometheus histogram.
# One awk pass prints "<le> <cumulative count>" per bucket, then +Inf, sum, count.
stats=$(tail -n 10000 "$log" | awk -v list="${buckets[*]}" '
  BEGIN { n = split(list, le, " ") }
  {
    d = $3; sub(/s$/, "", d); d += 0      # third field, trailing "s" removed
    sum += d; count++
    for (i = 1; i <= n; i++) if (d <= le[i] + 0) c[i]++
  }
  END {
    for (i = 1; i <= n; i++) print le[i], c[i] + 0
    print "+Inf", count + 0
    printf "sum %.3f\n", sum
    print "count", count + 0
  }')

metrics_declare myapp_request_duration_seconds histogram "Request latency."
while read -r key value; do
  case "$key" in
    sum|count) metrics_set "myapp_request_duration_seconds_${key}" "$value" ;;
    *)         metrics_set myapp_request_duration_seconds_bucket "$value" "le=\"$key\"" ;;
  esac
done <<<"$stats"

metrics_emit

Footgun List

Writing .prom files non-atomically. Always tmp-then-rename in the same dir. Otherwise node_exporter reads a half-written file and emits broken metrics.
Metrics with high cardinality labels. Per-user-ID labels create millions of time series. Limit labels to bounded sets (status code, queue name, region — not request_id).
Counter going down. Counters must only ever increase. If your script computes “errors in last 5 min” and emits it as a counter, the value will drop between runs; Prometheus treats every drop as a counter reset, so rate() and increase() return bogus spikes. Use a gauge for a “current snapshot” and a counter for “cumulative since process start.”
Liveness checking external deps. Cascades failures. Liveness only checks self; readiness checks deps.
Health endpoint without timeout. A hung DB query freezes the health endpoint, k8s thinks pod is dead, restarts it — and the new pod tries the same query and freezes too. Always timeout dep checks: pg_isready -t 2.
Watchdog with no grace. Killing on the first late heartbeat is wrong if heartbeats are best-effort. Allow 2–3 missed cycles before action.
Push agent that crashes on transient send failure. Wrap curl in if/then; log the failure and continue. Don’t set -e your way to silent monitor death.
Forgetting trailing newline in textfile output. Prometheus parsers may reject; always end the file with a newline.
Label values with quotes/backslashes/newlines. Escape: \\ for backslash, \" for quote, \n for newline.
Health endpoint that performs writes. Don’t make /healthz insert a row to test the DB. The probe runs every 10 seconds — you’d flood the DB. Use read-only checks.
Mixing systemd-notify watchdog with external watchdog. Pick one. Two watchdogs fighting over the same process leads to flapping restarts.
Sending formatted timestamps as metric values. Prometheus expects floats. Unix seconds from date +%s are fine; ISO-8601 strings break parsing.
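
The label-escaping rule from the list above is easy to get wrong by hand. A minimal sketch using bash pattern substitution; prom_escape_label is a hypothetical helper name, and the backslash must be escaped first so the backslashes added by the later steps are not doubled.

```shell
#!/usr/bin/env bash
# prom_escape_label VALUE
# Prints VALUE escaped for use inside a double-quoted label value.
prom_escape_label() {
  local v="$1"
  v=${v//\\/\\\\}       # backslash    -> \\
  v=${v//\"/\\\"}       # double quote -> \"
  v=${v//$'\n'/\\n}     # newline      -> \n
  printf '%s' "$v"
}

# Example:
#   printf 'job_info{name="%s"} 1\n' "$(prom_escape_label "$job_name")"
```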

Quick-Reference Card

┌─ PROMETHEUS METRIC TYPES ─────────────────────────────────────────────┐
│  counter    monotonically increasing (requests_total, errors_total)  │
│  gauge      goes up and down (queue_depth, memory_bytes)             │
│  histogram  bucketed sample distribution (request_duration_seconds)  │
│  summary    quantiles computed at source (less common in shell)      │
└────────────────────────────────────────────────────────────────────────┘

┌─ EXPOSITION FORMAT ───────────────────────────────────────────────────┐
│  # HELP <metric> <description>                                       │
│  # TYPE <metric> <type>                                              │
│  <metric>{label="value",...} <number>                                │
│  Trailing newline required                                           │
└────────────────────────────────────────────────────────────────────────┘

┌─ TEXTFILE EXPORTER ───────────────────────────────────────────────────┐
│  Drop *.prom in /var/lib/node_exporter/textfile_collector/           │
│  ATOMIC WRITE: tmp + mv (never `>`)                                  │
│  node_exporter --collector.textfile.directory=...                    │
│  Schedule via cron or systemd timer                                  │
└────────────────────────────────────────────────────────────────────────┘

┌─ LIVENESS vs READINESS ───────────────────────────────────────────────┐
│  Liveness:  am I responsive? (Failed → restart process)              │
│             Only checks self; never external deps                    │
│  Readiness: am I serving traffic? (Failed → remove from LB)          │
│             Can check downstream deps; flapping is OK                │
└────────────────────────────────────────────────────────────────────────┘

┌─ WATCHDOG PATTERN ────────────────────────────────────────────────────┐
│  Process writes heartbeat (timestamp file) every iteration            │
│  Watcher checks heartbeat freshness                                  │
│  Stale → SIGTERM with grace, then SIGKILL                            │
│  systemd: Type=notify + WatchdogSec=N + systemd-notify WATCHDOG=1    │
└────────────────────────────────────────────────────────────────────────┘

┌─ KUBERNETES PROBE FIELDS ─────────────────────────────────────────────┐
│  initialDelaySeconds   wait before first probe (allow startup)       │
│  periodSeconds         interval between probes                        │
│  timeoutSeconds        per-probe timeout (set to 2-5)                │
│  failureThreshold      consecutive failures before action             │
│  successThreshold      consecutive successes (readiness only)         │
└────────────────────────────────────────────────────────────────────────┘

What’s Next

Monitoring tells you the system’s state. Backups protect you when the state is wrong. The next lesson, Backup & Restore Scripts: Integrity, Retention, Immutability & Drill Testing, covers the discipline of backups that actually work — checksumming for integrity, retention with grandfather-father-son schemes, immutable backups via S3 Object Lock, and the practice of regularly restoring from backups to verify they’re real.

Written by Vinod
