Shell Lesson 35 of 42

Shell Backup & Restore: Integrity Manifests, GFS Retention, Immutable Object-Lock Storage & Drill-Tested Recovery

The Four Pillars (And Why Every Outage Has Failed at Least One of Them)

A backup is not a backup until it is integrity-verified, retention-bounded, immutable from the source, and drill-tested. Every public post-mortem you have read where “we had backups but couldn’t restore” was a failure on one of those four axes:

| Pillar | What it answers | Common failure |
|---|---|---|
| Integrity | “Is the byte stream the same as what we wrote?” | Tape corruption, S3 multipart edge cases, silent disk rot |
| Retention | “How long do we keep what?” | Storage cost spirals, or worse, the only good copy was pruned |
| Immutability | “Can the source attacker (or a bug) delete the backup?” | Ransomware encrypts both prod and backup; bug rm -rf’s an S3 prefix |
| Drill testing | “Can we actually restore in time?” | “Untested backup” is the only honest description until you prove otherwise |

This lesson teaches each pillar with shell scripts that work on Linux servers, including a lib/backup.sh you can source. We use sha256sum manifests for integrity, the GFS (Grandfather-Father-Son) retention scheme for bounded storage, S3 Object Lock and ZFS snapshots for immutability, and an automated restore-drill verifier that runs weekly in CI.

RTO vs RPO: The Two Numbers You Owe Your Stakeholders

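Before you write a single line of backup code, you need two numbers signed off by a stakeholder:

  1. RTO (Recovery Time Objective): the maximum acceptable time between declaring an incident and having the service restored.
  2. RPO (Recovery Point Objective): the maximum acceptable window of data loss, measured backward from the moment of failure.

Without these two numbers you cannot decide between hourly snapshots vs. daily, or between cold tape (cheap, slow) and warm S3 (expensive, fast). State the RTO/RPO targets at the top of every backup script you write.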
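One way to make the two numbers concrete is to declare them at the top of a script and fail when they are not met. The sketch below is illustrative only: the names RPO_SECONDS, RTO_SECONDS, rpo_check, and rto_check, and the 24-hour and 4-hour values, are assumptions rather than figures from this lesson. It relies on GNU find.

```shell
#!/usr/bin/env bash
# Hypothetical targets; replace with the numbers your stakeholders signed off.
RPO_SECONDS=$(( 24 * 3600 ))   # maximum tolerable data loss
RTO_SECONDS=$(( 4 * 3600 ))    # maximum tolerable time to restore

# rpo_check DIR MAX_AGE_SECONDS
# Succeeds if the newest regular file in DIR is at most MAX_AGE_SECONDS old.
rpo_check() {
  local dir="$1" max_age="$2" newest now age
  # %T@ prints the modification time as seconds since the epoch (GNU find).
  newest=$(find "$dir" -maxdepth 1 -type f -printf '%T@\n' | sort -n | tail -n 1)
  [[ -n "$newest" ]] || { echo "RPO FAIL: no backups in $dir" >&2; return 1; }
  now=$(date +%s)
  age=$(( now - ${newest%.*} ))   # drop the fractional part of the timestamp
  if (( age > max_age )); then
    echo "RPO FAIL: newest backup is ${age}s old" >&2
    return 1
  fi
}

# rto_check ELAPSED_SECONDS MAX_SECONDS
# Succeeds if a timed restore finished within the RTO.
rto_check() { (( $1 <= $2 )); }
```

A nightly job could call rpo_check /var/backups "$RPO_SECONDS", and a restore drill could time itself and pass the elapsed seconds to rto_check.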

Pillar 1: Integrity — The sha256sum Manifest Pattern

The single most useful backup primitive is the manifest: a sidecar file listing every backed-up file along with its sha256 hash and size. With a manifest you can:

  1. Verify a backup is byte-identical to source at write time (catch network corruption).
  2. Verify months later that storage hasn’t rotted (catch silent disk failure).
  3. Compare two backups to find what changed (incremental planning).
  4. Prove to auditors that restoration was bit-perfect.

Generating a Manifest

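# Generate manifest of /var/lib/myapp at backup time
manifest_create() {
  local src="$1" dest="$2"
  # -r: do not run sha256sum at all when find returns no files
  # LC_ALL=C: byte-wise sort, so the manifest order is identical on every host
  ( cd "$src" && find . -type f -print0 \
      | xargs -0 -r sha256sum \
      | LC_ALL=C sort -k 2 \
  ) > "$dest"
}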

manifest_create /var/lib/myapp /var/backups/myapp-2026-06-22.manifest

The manifest format is the standard sha256sum format: <hash> <relative-path>. It’s plain text, sorted, and trivially comparable with diff.

Verifying a Manifest at Restore Time

manifest_verify() {
  local src="$1" manifest
  manifest=$(realpath "$2") || return 1   # resolve before cd so relative paths still work
  ( cd "$src" && sha256sum -c --quiet "$manifest" )
}

# Returns 0 if all files match; otherwise non-zero, with each bad file printed as "<path>: FAILED"
manifest_verify /var/restore/myapp /var/backups/myapp-2026-06-22.manifest

Run this immediately after every restore drill and on a regular scrub schedule (yes, monthly, even if no restore is happening). Silent corruption is real, and the only defense is periodic re-checksumming.

Storing the Manifest Separately From the Backup

Anti-pattern: storing the manifest inside the same tar/zip as the data. If the archive corrupts, your verifier corrupts with it. Store the manifest as a sidecar file with a parallel name:

/var/backups/myapp-2026-06-22.tar.zst
/var/backups/myapp-2026-06-22.tar.zst.sha256   # checksum of the tarball itself
/var/backups/myapp-2026-06-22.manifest         # checksum of every file inside
/var/backups/myapp-2026-06-22.manifest.sha256  # checksum of the manifest

You now have a chain of trust:

  1. manifest.sha256 → proves the manifest itself is uncorrupted.
  2. manifest → proves every file is uncorrupted.
  3. tar.zst.sha256 → proves the tarball is uncorrupted. This catches damage the file-level manifest cannot see on its own, such as corrupted tar headers or a truncated archive that fails before any file can be extracted and checked.
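The chain can be checked top-down before any extraction. This is a sketch under stated assumptions: the helper names sidecar_ok and verify_chain are hypothetical, and each .sha256 file is assumed to hold a single sha256sum-format line. It compares hash values instead of running sha256sum -c, so it does not matter which path was recorded inside the sidecar.

```shell
#!/usr/bin/env bash
# sidecar_ok FILE
# Succeeds if FILE's sha256 equals the first field of FILE.sha256.
sidecar_ok() {
  local file="$1" expected actual
  [[ -f "$file" && -f "$file.sha256" ]] || return 1
  expected=$(awk 'NR == 1 { print $1 }' "$file.sha256")
  actual=$(sha256sum < "$file" | awk '{ print $1 }')
  [[ -n "$expected" && "$expected" == "$actual" ]]
}

# verify_chain BASE
# BASE is the archive path without the .tar.zst suffix.
verify_chain() {
  local base="$1"
  sidecar_ok "$base.manifest" || { echo "FAIL: manifest" >&2; return 1; }
  sidecar_ok "$base.tar.zst"  || { echo "FAIL: tarball" >&2; return 1; }
  echo "OK: $base"
}
```

For example, verify_chain /var/backups/myapp-2026-06-22 checks the manifest first, then the tarball; the per-file manifest check still has to run after extraction.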

Why Not GPG Signatures?

GPG signatures are stronger than checksums (they prove origin, not just integrity), but they are operationally heavy: key management, expiry, revocation. For most internal backups, sha256 + immutable storage is sufficient. Reserve GPG for cross-org backup transfer (e.g., sending backups to a partner) where you need non-repudiation.

Pillar 2: Retention — GFS (Grandfather-Father-Son)

Naive retention is “keep N days.” The problem: if silent corruption began 30 days ago and is only discovered today, a 30-day window gives you 30 useless backups and zero good ones.

GFS solves this by keeping backups at multiple time horizons:

| Tier | Frequency | Retention | Purpose |
|---|---|---|---|
| Son (daily) | Every day | 7 days | Recent operational rollback |
| Father (weekly) | Sunday of every week | 4 weeks | Last-month rollback |
| Grandfather (monthly) | First Sunday of month | 12 months | Audit, long-tail corruption discovery |
| Yearly (optional) | First Sunday of January | 7 years | Compliance (tax, HIPAA, etc.) |

Total backups kept: ~7 + 4 + 12 = 23 backups, vs. naive “keep 365 days” = 365 backups. Storage cost is ~6% of naive while preserving discovery windows of 1 year.
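The tier rules can be written as a small classifier that reports the highest tier a backup date qualifies for. This is a sketch, not part of the lesson's library: the function name gfs_tier is hypothetical, and it needs GNU date for the -d option.

```shell
#!/usr/bin/env bash
# gfs_tier YYYY-MM-DD
# Prints the highest tier the date qualifies for: yearly, grandfather, father, or son.
gfs_tier() {
  local d="$1" dow dom mon
  dow=$(date -d "$d" +%u) || return 1   # 1-7, Monday-Sunday
  dom=$(date -d "$d" +%d)
  mon=$(date -d "$d" +%m)
  if [[ "$dow" != 7 ]]; then
    echo son                 # not a Sunday: daily tier only
  elif (( 10#$dom > 7 )); then
    echo father              # a Sunday, but not the first of the month
  elif [[ "$mon" == 01 ]]; then
    echo yearly              # first Sunday of January
  else
    echo grandfather         # first Sunday of any other month
  fi
}

gfs_tier 2026-06-22   # prints: son (a Monday)
gfs_tier 2026-06-14   # prints: father
gfs_tier 2026-06-07   # prints: grandfather
gfs_tier 2026-01-04   # prints: yearly
```

A backup in a higher tier also satisfies every lower tier, which is why the retained count is roughly, not exactly, 7 + 4 + 12.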

Implementing GFS Pruning in Bash

# Prune backups in a directory according to GFS policy.
# Assumes backups are named: myapp-YYYY-MM-DD.tar.zst
gfs_prune() {
  local dir="$1"
  local now day_of_week day_of_month month
  now=$(date +%s)

  for f in "$dir"/myapp-*.tar.zst; do
    [[ -f "$f" ]] || continue
    local base date_str ts age_days
    base=$(basename "$f" .tar.zst)
    date_str=${base#myapp-}
    ts=$(date -d "$date_str" +%s 2>/dev/null) || continue
    age_days=$(( (now - ts) / 86400 ))

    day_of_week=$(date -d "$date_str" +%u)   # 1-7, Mon-Sun
    day_of_month=$(date -d "$date_str" +%d)
    month=$(date -d "$date_str" +%m)

    local keep=false
    # Son: keep last 7 days
    (( age_days <= 7 )) && keep=true
    # Father: Sundays in last 28 days
    (( age_days <= 28 )) && [[ "$day_of_week" == "7" ]] && keep=true
    # Grandfather: first Sunday of month in last 365 days
    (( age_days <= 365 )) && [[ "$day_of_week" == "7" ]] && (( 10#$day_of_month <= 7 )) && keep=true
    # Yearly: first Sunday of January in last 7 years
    (( age_days <= 365*7 )) && [[ "$day_of_week" == "7" ]] && (( 10#$day_of_month <= 7 )) && [[ "$month" == "01" ]] && keep=true

    if ! $keep; then
      echo "PRUNE: $f (age=${age_days}d)"
      # rm "$f" "$f.sha256" "${f%.tar.zst}.manifest" "${f%.tar.zst}.manifest.sha256"
    fi
  done
}

Note 10#$day_of_month: bash treats numbers with leading zeros as octal, so 08 and 09 would be parse errors. The 10# prefix forces base-10. This is a classic shell footgun in any date arithmetic.

The rm is commented out for safety — always run with echo first, eyeball the list, then enable deletes. A bad prune script is indistinguishable from ransomware.
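Instead of commenting and uncommenting the rm line, the dry-run default can live in code. The wrapper below is a sketch: prune_rm and the DRY_RUN variable are hypothetical names, and deletion only happens when DRY_RUN is explicitly set to 0.

```shell
#!/usr/bin/env bash
# prune_rm FILE...
# Prints what would be deleted unless DRY_RUN=0 is set explicitly.
prune_rm() {
  local f
  for f in "$@"; do
    if [[ "${DRY_RUN:-1}" == 0 ]]; then
      rm -f -- "$f" && echo "DELETED: $f"
    else
      echo "WOULD DELETE: $f"
    fi
  done
}
```

Run the prune once with the default, review the WOULD DELETE lines, then rerun with DRY_RUN=0 in the environment.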

Why Not Just Use Restic’s Built-In Retention?

restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 7 does GFS for you and is the right answer for restic-native workflows. Use it when you can. The bash version above exists for tar-based backups, S3 prefixes, ZFS snapshots, and any other store where you don’t have a built-in retention engine.

Pillar 3: Immutability — Object Lock, WORM, ZFS Snapshots

Ransomware operators specifically target backup systems. If the backup credential is on the prod box, the ransomware uses it to delete or encrypt the backup. Mutable backups are not backups against ransomware; they are just slow disks.

The defense is immutability at the storage layer: even an attacker with valid credentials cannot delete the data until a retention period has passed. Three patterns:

Pattern A: S3 Object Lock (Compliance Mode)

S3 Object Lock in compliance mode means not even the AWS root account can delete the object before the retention date. This is the gold standard.

# Create the bucket with Object Lock enabled (this also turns on versioning, which Object Lock requires)
aws s3api create-bucket \
  --bucket myapp-backups-prod \
  --object-lock-enabled-for-bucket \
  --region us-east-1

# Default 30-day compliance lock on every uploaded object
aws s3api put-object-lock-configuration \
  --bucket myapp-backups-prod \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Days": 30
      }
    }
  }'

# Upload — object is locked for 30 days from this moment
aws s3 cp myapp-2026-06-22.tar.zst s3://myapp-backups-prod/daily/

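Once uploaded, that object version cannot be deleted, overwritten, or have its retention shortened by anyone, including the AWS root user, until the 30-day retention period expires. Because Object Lock requires versioning, a plain delete only adds a delete marker and an overwrite only adds a new version; the locked version stays retrievable. Governance mode is similar but allows IAM principals holding the s3:BypassGovernanceRetention permission to override the lock; for true ransomware defense use compliance mode.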

Pair Object Lock with MFA Delete on the bucket and a separate AWS account for backups (so a compromise of the prod account cannot pivot to delete the backup account’s resources).

Pattern B: ZFS Snapshots With zfs hold

For on-prem or self-managed storage, ZFS snapshots are atomic, copy-on-write, and can be marked as held against deletion:

# Take snapshot
zfs snapshot tank/myapp@daily-2026-06-22

# Place a hold (named tag) — snapshot cannot be destroyed while hold exists
zfs hold compliance-30d tank/myapp@daily-2026-06-22

# Try to destroy: fails because the hold is in place
zfs destroy tank/myapp@daily-2026-06-22
# cannot destroy 'tank/myapp@daily-2026-06-22': dataset is busy

# Cron job 30 days later releases hold
zfs release compliance-30d tank/myapp@daily-2026-06-22

The key property: the script that takes the snapshot has create+hold privilege but not release privilege. The release script runs from a separate account/host. Even if the backup-creator account is compromised, the attacker cannot release the hold.

Pattern C: Append-Only Filesystem (chattr +a)

For local backups, chattr +a on Linux ext4/xfs sets the append-only attribute. On a file, even root cannot truncate or delete it without first removing the attribute, which itself requires CAP_LINUX_IMMUTABLE. On a directory, entries can be created but not removed or renamed. Combined with a tightly-scoped capability set, this gives partial immutability against compromised prod credentials:

# Run as root once at backup directory creation
chattr +a /var/backups/myapp/
# Now backups can be added to the directory, but existing entries cannot be deleted or renamed.
# The directory flag does not protect file contents: also run chattr +i on each finished archive.

This is weaker than S3 Object Lock or ZFS holds because root with CAP_LINUX_IMMUTABLE can override it; it’s a defense-in-depth layer, not the primary control.

What “Air-Gapped” Really Means in 2026

A truly air-gapped backup is one that no online credential can reach.

Cloud Object Lock is “air-gapped enough” for most threat models because the lock duration outlives the time-to-detect a compromise. Tape is the gold standard for nation-state-level threats but operationally heavy.

Pillar 4: Drill Testing — The Restore Verifier

A backup that has never been restored is a wish, not a plan. The single most important script in this lesson is the one that automatically restores the latest backup to a sandbox VM and verifies it works.

The Weekly Restore-Drill Script

#!/usr/bin/env bash
# weekly-restore-drill.sh: run from CI every Sunday at 02:00.
# Restores latest backup to a clean sandbox, verifies the manifest,
# and runs application smoke tests. Posts results to monitoring.

set -euo pipefail

readonly SANDBOX=/srv/restore-sandbox
readonly BACKUP_BUCKET=s3://myapp-backups-prod
readonly REPORT_FILE=/var/log/restore-drill/$(date +%Y-%m-%d).log
readonly METRICS_DIR=/var/lib/node_exporter/textfile_collector

mkdir -p "$(dirname "$REPORT_FILE")" "$SANDBOX"

# Log to stdout and to the dated report file
log() { printf '[%s] %s\n' "$(date -Iseconds)" "$*" | tee -a "$REPORT_FILE"; }

# One .prom file per metric, written atomically, so that a later metric
# does not overwrite an earlier one
metric() {
  printf '%s %s\n' "$1" "$2" > "$METRICS_DIR/$1.prom.tmp"
  mv "$METRICS_DIR/$1.prom.tmp" "$METRICS_DIR/$1.prom"
}

fail() { log "ERROR: $*"; metric restore_drill_status 0; exit 1; }

# Compare a file's hash with the first field of its .sha256 sidecar.
# Comparing hash values works whatever path the sidecar recorded.
sidecar_ok() {
  [[ "$(sha256sum < "$1" | awk '{print $1}')" == "$(awk 'NR==1 {print $1}' "$1.sha256")" ]]
}

# 1. Find latest backup (archives only, not the sidecar files)
log "Finding latest backup"
LATEST=$(aws s3 ls "$BACKUP_BUCKET/daily/" \
  | awk '{print $4}' | grep '\.tar\.zst$' | sort | tail -n 1) \
  || fail "no archive found in $BACKUP_BUCKET/daily/"
BASE=${LATEST%.tar.zst}
log "Latest: $LATEST"

# 2. Wipe sandbox, dotfiles included (defense: ensure we're testing the backup, not stale data)
log "Wiping sandbox $SANDBOX"
find "${SANDBOX:?}" -mindepth 1 -delete

# 3. Download backup + all sidecars, keeping their original names
log "Downloading backup"
for name in "$LATEST" "$LATEST.sha256" "$BASE.manifest" "$BASE.manifest.sha256"; do
  aws s3 cp "$BACKUP_BUCKET/daily/$name" "$SANDBOX/$name" --no-progress
done

# 4. Verify the manifest and the tarball against their sidecars
log "Verifying manifest and tarball integrity"
sidecar_ok "$SANDBOX/$BASE.manifest" || fail "manifest is corrupted"
sidecar_ok "$SANDBOX/$LATEST"        || fail "tarball is corrupted"

# 5. Extract
log "Extracting backup"
mkdir -p "$SANDBOX/data"
tar -xf "$SANDBOX/$LATEST" -C "$SANDBOX/data"

# 6. Verify file-level manifest
log "Verifying file manifest"
( cd "$SANDBOX/data" && sha256sum -c --quiet "$SANDBOX/$BASE.manifest" ) \
  || fail "file manifest mismatch"

# 7. Application smoke test (app-specific!)
log "Running smoke test"
if /usr/local/bin/myapp-smoke-test "$SANDBOX/data"; then
  log "Smoke test passed"
  metric restore_drill_status 1
  metric restore_drill_last_success "$(date +%s)"
else
  fail "smoke test failed"
fi

Two non-obvious decisions in this script:

  1. Wipe before restore. If you don’t, a subtle bug where tar fails to extract a critical file gets masked by a leftover from the previous drill. Always start from empty.
  2. Application smoke test. The manifest only proves bytes are correct; it doesn’t prove the application can start. The smoke test for a Postgres backup is pg_isready && SELECT count(*) FROM critical_table. For a stateful app it’s app --self-check. An untested smoke test is half a drill.
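What a smoke test contains is application-specific, and the lesson does not show the contents of myapp-smoke-test. As a purely illustrative baseline, a file-level check might assert that a handful of must-have paths exist and are non-empty. The function name smoke_test and the paths in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# smoke_test DIR PATH...
# Fails if any required PATH (relative to DIR) is missing or empty.
# Reports every problem before returning, rather than stopping at the first.
smoke_test() {
  local dir="$1" rc=0 p
  shift
  for p in "$@"; do
    if [[ ! -s "$dir/$p" ]]; then
      echo "SMOKE FAIL: $p missing or empty" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A call such as smoke_test /srv/restore-sandbox/data config/app.conf db/schema.sql would be a starting point; a real test should go further and actually start the application against the restored data.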

What “Drill-Tested” Means Audit-Side

For SOC 2, ISO 27001, and HIPAA, “drill-tested” means:

Wire restore_drill_status and restore_drill_last_success to Prometheus alerts:

# prometheus/rules/backup.yml
groups:
- name: backup
  rules:
  - alert: RestoreDrillFailing
    expr: restore_drill_status == 0
    for: 1h
    annotations:
      summary: "Last weekly restore drill failed"

  - alert: RestoreDrillStale
    expr: time() - restore_drill_last_success > 86400 * 14
    annotations:
      summary: "No successful restore drill in 14 days"

RestoreDrillStale catches the “the drill script itself broke and nobody noticed” failure mode.
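Both alerts depend on the two metrics being visible to the node_exporter textfile collector at the same time. One way to guarantee that is to write all metrics to a single file in one atomic rename. The helper below is a sketch; the name write_metrics and its NAME=VALUE argument convention are assumptions.

```shell
#!/usr/bin/env bash
# write_metrics FILE NAME=VALUE...
# Writes every metric to a temp file, then renames it over FILE, so a scrape
# sees either the old file or the complete new one, never a partial write.
write_metrics() {
  local file="$1" tmp kv
  shift
  tmp=$(mktemp "$file.XXXXXX") || return 1
  for kv in "$@"; do
    printf '%s %s\n' "${kv%%=*}" "${kv#*=}"
  done > "$tmp"
  chmod 0644 "$tmp"   # mktemp creates the file 0600; the exporter must be able to read it
  mv -f "$tmp" "$file"
}
```

The temp file name does not end in .prom, so the collector ignores it until the rename. A drill would call it once at the end, for example write_metrics "$dir/restore_drill.prom" restore_drill_status=1 restore_drill_last_success="$(date +%s)".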

The Drop-In lib/backup.sh

# lib/backup.sh — sourced helpers for backup scripts.
#
# Usage:
#   source /usr/local/lib/backup.sh
#   backup_create_tar /var/lib/myapp /var/backups myapp
#   backup_upload_s3 /var/backups/myapp-2026-06-22.tar.zst s3://bkp/daily/
#   backup_verify_remote s3://bkp/daily/myapp-2026-06-22.tar.zst

set -o errexit -o nounset -o pipefail

readonly BACKUP_LOG="${BACKUP_LOG:-/var/log/backup.log}"

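backup_log() {
  # Log to stderr (and the log file) so that stdout stays clean for functions
  # such as backup_create_tar, whose output is captured with $(...)
  printf '[%s] %s\n' "$(date -Iseconds)" "$*" | tee -a "$BACKUP_LOG" >&2
}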

# Create a tar.zst archive + manifest + sidecar checksums.
# Args: src_dir, dest_dir, name_prefix
backup_create_tar() {
  local src="$1" dest="$2" name="$3"
  local stamp out manifest
  stamp=$(date +%Y-%m-%d-%H%M%S)
  out="$dest/${name}-${stamp}.tar.zst"
  manifest="$dest/${name}-${stamp}.manifest"

  backup_log "Creating manifest for $src"
  ( cd "$src" && find . -type f -print0 | xargs -0 sha256sum | sort -k 2 ) > "$manifest"
  sha256sum "$manifest" > "$manifest.sha256"

  backup_log "Creating tar archive $out"
  tar --create --zstd --file="$out" -C "$src" .
  sha256sum "$out" > "$out.sha256"

  backup_log "Created: $out (size=$(stat -c %s "$out")B)"
  printf '%s\n' "$out"
}

# Upload tar + sidecars to S3.
# Args: tar_path, s3_prefix
backup_upload_s3() {
  local tar="$1" prefix="$2"
  local base="${tar%.tar.zst}"
  for f in "$tar" "$tar.sha256" "$base.manifest" "$base.manifest.sha256"; do
    [[ -f "$f" ]] || { backup_log "WARN: $f missing, skipping"; continue; }
    backup_log "Uploading $f to $prefix"
    aws s3 cp "$f" "$prefix" --no-progress
  done
}

# Verify a remote tar.zst by downloading sidecars and re-hashing the tar.
# Args: s3_url
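backup_verify_remote() {
  local url="$1"
  local tmp expected actual rc=0
  tmp=$(mktemp -d)

  # Compare hash values rather than running sha256sum -c: the sidecar records
  # the archive's original local path, not the name of the downloaded copy.
  if aws s3 cp "$url" "$tmp/archive.tar.zst" --no-progress \
     && aws s3 cp "$url.sha256" "$tmp/archive.tar.zst.sha256" --no-progress; then
    expected=$(awk 'NR==1 {print $1}' "$tmp/archive.tar.zst.sha256")
    actual=$(sha256sum < "$tmp/archive.tar.zst" | awk '{print $1}')
    [[ -n "$expected" && "$expected" == "$actual" ]] || rc=1
  else
    rc=1
  fi
  rm -rf "$tmp"   # explicit cleanup; an EXIT trap would replace the caller's trap

  if (( rc == 0 )); then
    backup_log "OK: $url integrity verified"
  else
    backup_log "FAIL: $url integrity check failed"
  fi
  return "$rc"
}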

# Local GFS prune. Args: dir, name_prefix
backup_gfs_prune() {
  local dir="$1" prefix="$2"
  local now
  now=$(date +%s)

  # NUL-delimited so that paths containing spaces or newlines are handled safely
  find "$dir" -type f -name "${prefix}-*.tar.zst" -print0 | while IFS= read -r -d '' f; do
    local base date_str ts age_days dow dom mon keep
    base=$(basename "$f" .tar.zst)
    date_str=${base#${prefix}-}
    date_str=${date_str%-*}  # strip HHMMSS suffix
    ts=$(date -d "$date_str" +%s 2>/dev/null) || continue
    age_days=$(( (now - ts) / 86400 ))
    dow=$(date -d "$date_str" +%u)
    dom=$(date -d "$date_str" +%d)
    mon=$(date -d "$date_str" +%m)

    keep=false
    (( age_days <= 7 )) && keep=true
    (( age_days <= 28 )) && [[ "$dow" == "7" ]] && keep=true
    (( age_days <= 365 )) && [[ "$dow" == "7" ]] && (( 10#$dom <= 7 )) && keep=true
    (( age_days <= 365*7 )) && [[ "$dow" == "7" ]] && (( 10#$dom <= 7 )) && [[ "$mon" == "01" ]] && keep=true

    if ! $keep; then
      backup_log "PRUNE: $f"
      rm -f "$f" "$f.sha256" "${f%.tar.zst}.manifest" "${f%.tar.zst}.manifest.sha256"
    fi
  done
}

# Restore + verify. Args: tar_path, dest_dir
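backup_restore_verify() {
  local tar="$1" dest="$2"
  local base manifest
  tar=$(realpath "$tar") || return 1   # absolute path, so the cd below cannot break it
  base="${tar%.tar.zst}"
  manifest="$base.manifest"

  [[ -f "$manifest" ]] || { backup_log "FAIL: manifest missing for $tar"; return 1; }

  backup_log "Verifying tarball integrity"
  # Compare hash values so the check works wherever the archive now lives
  [[ "$(sha256sum < "$tar" | awk '{print $1}')" == "$(awk 'NR==1 {print $1}' "$tar.sha256")" ]] \
    || { backup_log "FAIL: tarball checksum mismatch"; return 1; }

  backup_log "Extracting to $dest"
  mkdir -p "$dest"
  tar -xf "$tar" -C "$dest"

  backup_log "Verifying file manifest"
  ( cd "$dest" && sha256sum -c --quiet "$manifest" ) || return 1

  backup_log "OK: restore verified at $dest"
}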

Using the Library

#!/usr/bin/env bash
# nightly-backup.sh — runs from cron or systemd timer at 01:00
source /usr/local/lib/backup.sh

readonly SRC=/var/lib/myapp
readonly LOCAL_DEST=/var/backups
readonly S3_PREFIX=s3://myapp-backups-prod/daily/

backup_log "===== nightly backup starting ====="
tar=$(backup_create_tar "$SRC" "$LOCAL_DEST" myapp)
backup_upload_s3 "$tar" "$S3_PREFIX"
backup_verify_remote "$S3_PREFIX$(basename "$tar")"
backup_gfs_prune "$LOCAL_DEST" myapp
backup_log "===== nightly backup complete ====="

The whole nightly orchestration is ~10 lines of glue because the library encodes the discipline.

Restic: When You Want Dedup + Encryption Out of the Box

For datasets where dedup and encryption matter (tens of GB+, or where backups travel over untrusted networks), restic does it natively. Wrap it in shell:

#!/usr/bin/env bash
set -euo pipefail
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/myapp-restic
export RESTIC_PASSWORD_FILE=/etc/restic/password
export AWS_ACCESS_KEY_ID="$(cat /etc/restic/aws-key)"
export AWS_SECRET_ACCESS_KEY="$(cat /etc/restic/aws-secret)"

# Backup
restic backup /var/lib/myapp --tag daily --tag "$(hostname)"

# Verify (re-hashes a 5% random sample of pack files)
restic check --read-data-subset=5%

# GFS prune
restic forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --keep-yearly 7 \
  --prune

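restic check --read-data-subset=5% is the integrity verifier; run it weekly. Each run re-reads a random 5% of pack files, so over a year most of the repository gets re-read, but random sampling gives no guarantee: the chance that a given pack is never picked in 52 weekly runs is 0.95^52, or about 7%. For guaranteed full coverage, use the n/t form (--read-data-subset=1/20, then 2/20, and so on), which splits the repository into fixed slices.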
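A rotating n/t schedule is one way to make coverage deterministic. The helper below is a sketch that maps the ISO week number to a slice; the function name subset_for_week and the choice of 20 slices are assumptions. The restic call is left as a comment because it needs a configured repository.

```shell
#!/usr/bin/env bash
# subset_for_week WEEK TOTAL
# Maps an ISO week number (01-53) to "n/TOTAL", cycling through 1..TOTAL.
# The 10# prefix keeps zero-padded weeks such as 08 from being read as octal.
subset_for_week() {
  local week="$1" total="$2"
  echo "$(( (10#$week - 1) % total + 1 ))/$total"
}

# Example wiring (not run here): with 20 slices, every pack is read once per 20 weeks.
# restic check --read-data-subset="$(subset_for_week "$(date +%V)" 20)"
```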

Pair restic’s S3 backend with Object Lock for immutability — restic will operate normally, but the backing objects are still locked against deletion until retention expires. This is the gold-standard combo for mid-size shops.

ZFS Snapshot Replication for Local-Plus-Remote

If your dataset is on ZFS, you get snapshot + send/recv built-in. The pattern:

#!/usr/bin/env bash
set -euo pipefail

readonly DATASET=tank/myapp
readonly REMOTE_HOST=backup.internal
readonly REMOTE_DATASET=backup-pool/myapp

# Local snapshot
SNAP="$DATASET@$(date +%Y-%m-%d-%H%M)"
zfs snapshot "$SNAP"
zfs hold compliance-30d "$SNAP"

# Find newest snapshot on remote (incremental basis); stays empty if the dataset does not exist yet
LAST_REMOTE=$(ssh "$REMOTE_HOST" "zfs list -t snapshot -H -o name -s creation -d 1 $REMOTE_DATASET" 2>/dev/null \
  | tail -n 1 | awk -F@ '{print $2}') || true

# Send incremental
if [[ -n "$LAST_REMOTE" ]]; then
  zfs send -i "@$LAST_REMOTE" "$SNAP" | ssh "$REMOTE_HOST" zfs recv "$REMOTE_DATASET"
else
  zfs send "$SNAP" | ssh "$REMOTE_HOST" zfs recv "$REMOTE_DATASET"
fi

The zfs hold is critical: without it, a bug or attacker that runs zfs destroy -r tank/myapp deletes everything including snapshots. With holds, the destroy fails for held snapshots.

The 8 Footguns

1. Backing Up Open Database Files Without Quiescing

Copying /var/lib/postgresql while postgres is running gives you a backup that is internally inconsistent — pages that were partially written when the copy crossed them are corrupt. The result restores but corrupts on first read.

Fix: Use the database’s own backup tool (pg_dump, pg_basebackup, mysqldump --single-transaction, wal-g) which creates a transactionally consistent snapshot. Never tar a live datadir.

2. The “Backup Succeeded” That Wasn’t

Depending on flags, tar can exit 0 even when files were skipped; GNU tar’s --ignore-failed-read, for example, suppresses the non-zero exit for unreadable files. A successful exit code is not proof of a successful backup. Fix: Always cross-verify with the manifest count: if find | wc -l ≠ manifest line count, fail loudly.

src_count=$(find "$src" -type f | wc -l)
mfst_count=$(wc -l < "$manifest")
if (( src_count != mfst_count )); then
  backup_log "FAIL: file count mismatch (src=$src_count, manifest=$mfst_count)"
  exit 1
fi
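The count check compares the source tree with the manifest. A complementary check compares the archive's own listing with the manifest, which catches files that tar silently left out. This is a sketch with stated limits: the function name archive_matches_manifest is hypothetical, it assumes paths without newlines or backslashes (sha256sum escapes those), and it treats every non-directory archive entry as a file, so symlinks would show up as differences.

```shell
#!/usr/bin/env bash
# archive_matches_manifest ARCHIVE MANIFEST
# Succeeds when the non-directory paths in the archive are exactly the
# paths listed in the manifest; prints a diff of the two lists otherwise.
archive_matches_manifest() {
  local archive="$1" manifest="$2"
  diff \
    <(tar -tf "$archive" | grep -v '/$' | LC_ALL=C sort) \
    <(cut -c67- "$manifest" | LC_ALL=C sort)
}
```

The cut offset of 67 skips the 64-character hash and the two-character separator. Both lists use the ./relative form because the archive and the manifest were created from inside the source directory.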

3. Storing Backup Credentials On The Box Being Backed Up

Ransomware on the prod host reads /root/.aws/credentials and uses it to delete the S3 backup. Fix: Use IAM instance roles with s3:PutObject and s3:GetObject only — not s3:DeleteObject. Deletion is performed by a separate retention-orchestrator account that the prod host cannot impersonate.
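A policy document along these lines expresses the write-and-read-only role. Treat it as a sketch: the function name backup_writer_policy is hypothetical, and s3:ListBucket is an addition made on the assumption that the host also runs aws s3 ls; drop it if listing is not needed.

```shell
#!/usr/bin/env bash
# backup_writer_policy BUCKET
# Prints an IAM policy that can write and read objects but grants no
# delete permission of any kind.
backup_writer_policy() {
  local bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::${bucket}/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}"
    }
  ]
}
EOF
}

backup_writer_policy myapp-backups-prod
```

The output can be saved to a file and attached to the instance role. Note that s3:PutObject still permits overwriting a key, so versioning or Object Lock is what protects earlier copies.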

4. Forgetting to Test the Sidecar Files

You verify the tar.zst, but never the manifest or its checksum. Months later you discover the manifest is the corrupted file. Fix: The drill script checksums every sidecar.

5. GFS Math With Leading-Zero Octal Bug

(( 08 <= 7 )) errors out with value too great for base. (( 10#$day <= 7 )) fixes it. Affects every script that does date arithmetic. Always use 10# for date numerics in bash.
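The difference is easy to reproduce. In this small demonstration the helper name to_dec is hypothetical; the child bash process is used only to show the failure without aborting the demo.

```shell
#!/usr/bin/env bash
# to_dec NN
# Prints a zero-padded number as a plain decimal integer.
to_dec() { echo "$(( 10#$1 ))"; }

to_dec 07   # prints 7
to_dec 08   # prints 8
to_dec 09   # prints 9

# Without the prefix, 08 is rejected as invalid octal and the test fails.
if ! bash -c '(( 08 >= 0 ))' 2>/dev/null; then
  echo "08 without 10# is rejected"
fi
```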

6. The Backup-And-Restore-To-Same-Host Anti-Pattern

Drill-restoring on the production host means a bad restore can corrupt prod. Fix: Drill on a separate VM or container. Use cloud-init or Vagrant to spin up a clean target every time.

7. Compression Format Lock-In

You backed up everything as .tar.bz2 5 years ago. Today restoring on a stripped-down container that doesn’t have bzip2 is a 30-minute-into-an-incident discovery. Fix: Standardize on widely-available formats (.tar.gz, .tar.zst) and include the decompression binary alongside long-term archives (“backup the tools, not just the data”).

8. Missing Retention Stop on Compliance-Sensitive Data

You implemented GFS retention, but for a customer-data dataset under GDPR right-to-erasure, you cannot keep yearlies forever. Fix: GFS retention rules must encode both a maximum keep horizon and a deletion guarantee. For compliance buckets, the yearly tier might be capped at 7 years; for ephemeral dev data it might be capped at 30 days.

Quick-Reference Card

INTEGRITY
  - sha256sum manifest beside every archive
  - sha256sum sidecar for the manifest itself (chain of trust)
  - verify monthly even if no restore (catch silent rot)

RETENTION (GFS)
  - 7 daily + 4 weekly + 12 monthly + 7 yearly = ~30 backups for 7 years
  - Use restic forget --keep-* for restic; bash math for tar.zst
  - 10#$day to avoid octal parse errors in date math

IMMUTABILITY
  - S3 Object Lock COMPLIANCE mode (not even root can delete)
  - ZFS hold + separate release account
  - chattr +a as defense in depth (not primary)
  - Air-gap = credential cannot reach the data

DRILL TESTING
  - Weekly automated restore to clean sandbox
  - Manifest verification + application smoke test
  - Prometheus metric restore_drill_status; alert on stale

THREAT MODELS
  - Disk failure → integrity manifests catch it
  - Ransomware → immutability blocks the delete
  - Bug in retention → MFA-delete + separate account
  - Untested → only drills prove restorability

What’s Next

You now have a backup pipeline that produces integrity-verified, retention-bounded, immutable, drill-tested archives. But for databases specifically there’s a deeper layer: PITR (point-in-time recovery), online schema migrations that don’t lock production for 4 hours, and the orchestration of pg_dump / pg_basebackup / wal-g pipelines.

In the next lesson — Database Admin Scripting: pg_dump Pipelines, MySQL Backup Orchestration & Online Schema Patterns — we’ll build on lib/backup.sh with lib/db.sh, covering Postgres logical+physical backups with WAL archiving, MySQL mysqldump --single-transaction discipline, online schema change tooling (gh-ost, pt-online-schema-change) wrapped in shell, and PITR drills that prove you can restore to any second in the last 7 days.
