Why a systemd Unit Around Your Script Matters
Your script runs fine when you SSH in and run it. You want it to run on reboot, restart on failure, log to journald, time out after an hour, run with restricted privileges, and pull secrets from /etc/myapp/env only readable by root. You write a unit file:
[Service]
ExecStart=/opt/myapp/bin/run.sh
It works. Six months later you discover:
- The script ran with Type=simple (the default), which means systemd considered it “started” the moment fork returned. Health checks based on “is the service active?” return success even when the script silently exits in the first second.
- It restarts on crash, but with Restart=always and no RestartSec, a buggy script that crashes immediately is respawned every 100 ms (the default delay) until the start rate limit trips or you manually systemctl stop it.
- It writes to /var/log/myapp.log, but systemd-journald is also collecting stdout/stderr, so logs are duplicated.
- It runs as root because no User= was set. Anything root can do, the script can do, so a shell-injection bug means root compromise.
- The EnvironmentFile=/etc/myapp/env you added contains a secret in plain text, and you didn’t restrict the file mode, so any local user can read it.
This lesson covers the unit file as a contract between your script and systemd: how to declare what the script promises (will exit normally, will signal readiness, won’t fork into the background), how to harden the runtime, and how to test that the unit does what you think it does. It’s the capstone of Tier 4 — every script you write at this level should ship with a hardened unit.
The Anatomy of a Service Unit File
A systemd unit lives at /etc/systemd/system/myapp.service (system-wide) or ~/.config/systemd/user/myapp.service (per-user). Reload after editing:
sudo systemctl daemon-reload
sudo systemctl enable --now myapp.service
sudo systemctl status myapp.service
Unit file structure:
# systemd treats '#' as a comment only at the start of a line,
# so each note sits on its own line above the directive it describes.
# [Unit]: metadata, ordering, dependencies
[Unit]
Description=My App Service
# ordering: start after this
After=network-online.target
# weak dep: pull it in, but keep going if it fails
Wants=network-online.target
# strong dep: fail if this can't start
Requires=postgresql.service
# don't start if this is missing
ConditionPathExists=/etc/myapp/config.json
# [Service]: how to run
[Service]
# how systemd knows when "started"
Type=simple
# don't run as root
User=myapp
Group=myapp
# load env vars from a file
EnvironmentFile=/etc/myapp/env
# run BEFORE main; failure aborts service
ExecStartPre=/opt/myapp/bin/preflight.sh
# the main process
ExecStart=/opt/myapp/bin/run.sh
# systemctl reload runs this
ExecReload=/bin/kill -HUP $MAINPID
# run AFTER stop, success or fail
ExecStopPost=/opt/myapp/bin/cleanup.sh
Restart=on-failure
RestartSec=5
TimeoutStartSec=60
TimeoutStopSec=30
# [Install]: what `systemctl enable` hooks into
[Install]
# start at multi-user (normal boot)
WantedBy=multi-user.target
Three sections. [Unit] is the metadata; [Service] is the contract; [Install] is what systemctl enable activates.
Type=: How systemd Knows When You’re “Ready”
Type= is the most-confused field in unit files. Pick wrong and your dependencies start before your service is actually ready, or systemd thinks your service crashed when it didn’t.
| Type | Meaning | Use for |
|---|---|---|
| simple (default) | Service is “active” the moment ExecStart’s first process is forked | Daemons that don’t fork; wrong for scripts that exit |
| forking | Service is “active” once the parent process exits (the child becomes the daemon) | Old-style daemons that double-fork (Apache 2.2, dhcpd) |
| oneshot | Run ExecStart, wait for it to exit, mark service as completed (or failed) | Setup scripts, one-shot jobs, anything that’s not long-running |
| notify | Service must call sd_notify(READY=1) to signal it’s ready | Long-running daemons that have a non-trivial init phase |
| notify-reload | Same as notify, but supports reload via sd_notify | Services with reload handling |
| dbus | Service is ready when it claims a D-Bus name | D-Bus services |
| idle | Like simple, but delay execution until other jobs finish | Avoid log-spam at boot |
Type=simple is wrong for shell scripts that exit
# WRONG: a script that runs to completion, with Type=simple.
[Service]
Type=simple
ExecStart=/opt/myapp/bin/sync-data.sh
This unit “succeeds” the moment fork returns, even if the script exits 200ms later. systemctl is-active returns “active” briefly, then “inactive” — confusing for monitoring. Worse, dependents that say After=myapp.service will start concurrently with sync-data.sh, not after.
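One way to make monitoring less fragile is to look at the unit's Result= property as well as ActiveState=. The sketch below is only an illustration: classify_unit is a hypothetical helper name, and it assumes the KEY=VALUE output that systemctl show prints for the ActiveState and Result properties.

```shell
#!/bin/sh
# classify_unit (hypothetical helper): read `systemctl show` output
# (KEY=VALUE lines) on stdin and print a coarse state.
classify_unit() {
    active_state=""
    result=""
    while IFS='=' read -r key value; do
        case $key in
            ActiveState) active_state=$value ;;
            Result)      result=$value ;;
        esac
    done
    if [ "$active_state" = "failed" ] ||
       { [ -n "$result" ] && [ "$result" != "success" ]; }; then
        echo failed
    elif [ "$active_state" = "active" ]; then
        echo active
    elif [ "$active_state" = "inactive" ]; then
        # Also what a unit that has never been started looks like.
        echo exited-clean
    else
        echo "other:$active_state"
    fi
}
```

Typical use would be `systemctl show -p ActiveState -p Result myapp.service | classify_unit`. A check that only asks "is it active?" cannot tell the last two branches apart.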
Use Type=oneshot for “run once and exit”
[Service]
Type=oneshot
ExecStart=/opt/myapp/bin/sync-data.sh
# report active even after exit (so dependents see "completed")
RemainAfterExit=yes
oneshot waits for the script to exit. RemainAfterExit=yes keeps is-active returning “active” so dependents like backup-completion services can chain off of it.
Use Type=notify for daemons with a real init phase
[Service]
Type=notify
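# main: only the main process may signal. systemd-notify runs as a child of the
# script, so a script under a non-root User= may need NotifyAccess=all instead.
NotifyAccess=main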
ExecStart=/opt/myapp/bin/run-daemon.sh
WatchdogSec=30
Inside the script:
#!/usr/bin/env bash
set -Eeuo pipefail
# ... initialization that takes a while ...
load_config
warm_caches
open_database
# Tell systemd we're ready.
systemd-notify --ready --status="Listening on :8080"
# Main loop.
while :; do
    # ... do work ...
    systemd-notify WATCHDOG=1 # reset the watchdog timer
    sleep 10
done
systemd-notify is the shell-friendly wrapper around sd_notify(3). --ready tells systemd “I’m done initializing.” After that, dependents start.
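WatchdogSec=30 says: if the service doesn’t send WATCHDOG=1 within 30 seconds, systemd assumes it’s hung, kills it (SIGABRT by default), and marks it failed. It is restarted only if Restart= covers watchdog timeouts (on-failure, on-abnormal, on-watchdog, or always). This is the canonical way to detect a daemon that’s running but stuck.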
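A script that calls systemd-notify unconditionally is awkward to run by hand, and a hard-coded sleep can drift out of step with WatchdogSec=. The sketch below assumes two hypothetical helpers, notify and watchdog_ping_interval. It relies on the NOTIFY_SOCKET and WATCHDOG_USEC environment variables that systemd sets for services using notification and the watchdog, and on the usual advice to ping at about half the watchdog interval.

```shell
#!/bin/sh
# notify (hypothetical wrapper): forward to systemd-notify only when systemd
# gave us a notification socket, so the script also runs from a plain shell.
notify() {
    [ -n "${NOTIFY_SOCKET:-}" ] || return 0
    systemd-notify "$@"
}

# watchdog_ping_interval (hypothetical helper): print half of WATCHDOG_USEC
# in whole seconds, at least 1, or 0 when no watchdog is configured.
watchdog_ping_interval() {
    usec=${WATCHDOG_USEC:-0}
    if [ "$usec" -le 0 ]; then
        echo 0
        return 0
    fi
    secs=$(( usec / 2000000 ))
    [ "$secs" -ge 1 ] || secs=1
    echo "$secs"
}
```

In the main loop you might call `notify WATCHDOG=1` and then sleep for the printed interval, falling back to a fixed delay when the helper prints 0.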
Restart Policy: Don’t Death-Loop
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=5
[Service]
Restart=on-failure
RestartSec=5
Restart=:
| Value | Restart on |
|---|---|
| no | Never (the default) |
| on-success | Clean exit (rare; useful for retry-loops) |
| on-failure | Non-zero exit, signal kill, timeout, watchdog timeout |
| on-abnormal | Signal kill, timeout, or watchdog (not non-zero exit) |
| on-watchdog | Watchdog timeout only |
| on-abort | Uncaught signal only |
| always | Every exit, success or fail |
RestartSec=5: wait 5 seconds before restarting. The default is 100 ms, so without this a script that crashes immediately is respawned almost instantly.
StartLimitIntervalSec=60 + StartLimitBurst=5 (both belong in [Unit]): once the unit has been started 5 times within 60 seconds, systemd refuses further starts. This is the kill-switch that prevents infinite respawn loops.
Combined: a buggy script gets 5 starts, then systemd gives up, marks the unit failed, and stops trying. You see systemctl status say “start request repeated too quickly”, which is exactly the diagnosis you want, not a CPU-burning host.
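The three numbers interact: if RestartSec is long enough that the allowed starts no longer fit inside the interval, the limit never trips and the unit restarts forever. The sketch below is a deliberately simplified model, not systemd's exact accounting. start_limit_can_trip is a hypothetical helper, and it assumes the script crashes immediately, so consecutive starts are spaced RestartSec apart.

```shell
#!/bin/sh
# start_limit_can_trip RESTART_SEC INTERVAL_SEC BURST
# Simplified model: start number (BURST + 1) is attempted roughly
# BURST * RESTART_SEC seconds after the first start. The limit trips only
# if that attempt still falls inside the interval.
start_limit_can_trip() {
    restart_sec=$1
    interval_sec=$2
    burst=$3
    if [ $(( burst * restart_sec )) -lt "$interval_sec" ]; then
        echo yes
    else
        echo no
    fi
}
```

With the values above, `start_limit_can_trip 5 60 5` prints yes. With RestartSec=15 and the same interval and burst, it prints no, which would mean an endless (if slow) restart loop.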
Sandboxing: Defense in Depth From the Service File
Hardening directives let you restrict the script’s privileges from outside the script. Even if the script has bugs (or is compromised), the impact is contained.
[Service]
# ─── Identity ────────────────────────────────────────────────────────
User=myapp
Group=myapp
# if yes: systemd creates a transient user; great for one-shots
DynamicUser=no
# ─── Filesystem isolation ───────────────────────────────────────────
# whole file system read-only, except /dev, /proc, /sys
ProtectSystem=strict
# opt-in writable paths
ReadWritePaths=/var/lib/myapp /var/log/myapp
# /home, /root, /run/user inaccessible
ProtectHome=true
# private /tmp, /var/tmp; cleared on stop
PrivateTmp=true
# only /dev/null, /dev/zero, /dev/random, etc.
PrivateDevices=true
# /proc/sys, /sys read-only
ProtectKernelTunables=true
# cannot load modules
ProtectKernelModules=true
# /sys/fs/cgroup read-only
ProtectControlGroups=true
# cannot change system time
ProtectClock=true
# ─── Capabilities & privilege ───────────────────────────────────────
# PR_SET_NO_NEW_PRIVS; cannot gain privileges via setuid
NoNewPrivileges=true
# drop ALL capabilities (empty = none)
CapabilityBoundingSet=
# no ambient caps either
AmbientCapabilities=
# cannot create SUID/SGID files
RestrictSUIDSGID=true
# ─── Network ─────────────────────────────────────────────────────────
# block AF_PACKET, AF_NETLINK
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# set true for no network at all (offline-only scripts)
PrivateNetwork=false
# deny all, then allow specific ranges
IPAddressDeny=any
IPAddressAllow=10.0.0.0/8 127.0.0.0/8
# ─── System call filtering ──────────────────────────────────────────
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources @debug @mount @raw-io
# block 32-bit ABI on 64-bit kernels
SystemCallArchitectures=native
# ─── Resource limits ────────────────────────────────────────────────
MemoryMax=512M
CPUQuota=50%
TasksMax=128
LimitNOFILE=4096
Hardening rationale
- ProtectSystem=strict + ReadWritePaths= is the cleanest way to enforce “the script can write here and nowhere else.” A write anywhere else fails with EROFS.
- PrivateTmp=true gives the service its own /tmp, eliminating an entire class of /tmp race conditions and information leaks.
- NoNewPrivileges=true is the hard-block: even if the script execs a setuid binary, the new process gains no extra privileges.
- CapabilityBoundingSet= (empty) drops every Linux capability. If the script doesn’t need to bind to a port < 1024 or open raw sockets, it needs zero special capabilities.
- SystemCallFilter uses systemd’s groups: @system-service is a well-known set of “things a service typically does.” A leading ~ removes groups from the set.
- MemoryMax=512M triggers an OOM-kill for that one service when it exceeds the limit, preventing a runaway script from consuming all host memory.
Test what hardening actually applies
# Show effective security settings on a running service.
systemd-analyze security myapp.service
# Outputs a score 0-10 (lower = more hardened) and a per-directive breakdown.
# Show what each setting expanded to.
systemctl show myapp.service | grep -E '^(Protect|Restrict|Cap|System|Memory)'
systemd-analyze security is the audit tool. Score under 3 is excellent; under 5 is acceptable; over 7 means you’re running with way too much privilege.
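For a quick regression check in CI or a deploy script, the systemctl show output can be filtered for settings that came out disabled. This is a rough sketch: weak_hardening is a hypothetical helper, and it assumes systemctl show prints KEY=VALUE lines with "no" (or an empty value) for a directive that is off.

```shell
#!/bin/sh
# weak_hardening (hypothetical helper): read `systemctl show` KEY=VALUE lines
# on stdin and print every key whose value is "no" or empty.
weak_hardening() {
    while IFS='=' read -r key value; do
        case $value in
            ''|no) echo "$key" ;;
        esac
    done
}
```

A possible invocation is `systemctl show -p NoNewPrivileges -p PrivateTmp -p ProtectSystem -p ProtectHome myapp.service | weak_hardening`. Empty output would mean every listed directive is switched on.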
Logging: Just Use Journald
[Service]
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp
journal is the default. SyslogIdentifier sets the tag in journalctl output (otherwise journald uses the executable name).
In your script: write to stdout/stderr. Don’t open /var/log/myapp.log yourself. journald captures everything, indexes by unit, retains structured fields.
journalctl -u myapp.service # all logs for this unit
journalctl -u myapp.service -f # follow
journalctl -u myapp.service --since '1h ago' # last hour
journalctl -u myapp.service -p err # errors and worse
journalctl -u myapp.service -o json # structured output
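To set a per-message priority from shell, prefix each line with a syslog level in angle brackets. journald parses this prefix by default (SyslogLevelPrefix=yes):
# In your script:
log() {
    # $1 = syslog priority (0-7), $2 = message.
    # SyslogIdentifier= already supplies the tag, so it is not repeated here.
    printf '<%s>%s\n' "$1" "$2"
}
log 6 "service starting" # priority 6 = info (RFC 5424 numeric)
log 3 "database unreachable" # priority 3 = error
systemd-journald reads the leading <N> and stores it as the entry's PRIORITY field. Combined with journalctl -p err, you can filter precisely.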
EnvironmentFile: Secrets and Config Without Hard-Coding
[Service]
EnvironmentFile=-/etc/myapp/env
EnvironmentFile=/etc/myapp/local-env
The leading - means “okay if missing.” Files are loaded in order; later files override earlier.
The file:
# /etc/myapp/env
DATABASE_URL=postgres://app:hidden@db.internal/app
API_KEY=secret-value
LOG_LEVEL=info
Restrict access:
chown root:myapp /etc/myapp/env
chmod 0640 /etc/myapp/env
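A preflight script can refuse to start when the file's mode has drifted. The sketch below is one possible check, not the only one. env_file_is_private is a hypothetical helper, and it uses find's -perm -004 test to detect the world-readable bit.

```shell
#!/bin/sh
# env_file_is_private FILE (hypothetical helper)
# Returns 0 if FILE exists and is not world-readable,
# 1 if it is world-readable, 2 if it is missing.
env_file_is_private() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "missing: $file" >&2
        return 2
    fi
    # find prints the path only when the "other: read" bit is set.
    if [ -n "$(find "$file" -perm -004)" ]; then
        echo "world-readable: $file" >&2
        return 1
    fi
    return 0
}
```

Called from ExecStartPre= (for example as `env_file_is_private /etc/myapp/env || exit 1` inside preflight.sh), a non-zero exit aborts the start before the main process ever sees the secret.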
Better: LoadCredential (systemd 247+)
For modern systemd, LoadCredential= reads a secret into the service’s ${CREDENTIALS_DIRECTORY} and exposes it without leaking via env or /proc/$pid/environ:
[Service]
LoadCredential=db-password:/etc/myapp/db-password
ExecStart=/opt/myapp/bin/run.sh
In the script:
DB_PASSWORD=$(< "${CREDENTIALS_DIRECTORY}/db-password")
CREDENTIALS_DIRECTORY is a tmpfs mount only the service can see, never visible to other processes. Strictly preferable to env-var secrets if your systemd is new enough.
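When the script is run by hand, CREDENTIALS_DIRECTORY is unset and a bare $(< ...) fails with a confusing message. A small wrapper gives a clearer error. This is a sketch under that assumption, and read_credential is a hypothetical helper name.

```shell
#!/bin/sh
# read_credential NAME (hypothetical helper): print the credential that
# LoadCredential=NAME:... placed in $CREDENTIALS_DIRECTORY, or fail loudly.
read_credential() {
    name=$1
    dir=${CREDENTIALS_DIRECTORY:-}
    if [ -z "$dir" ]; then
        echo "CREDENTIALS_DIRECTORY is not set; was the unit started with LoadCredential=?" >&2
        return 1
    fi
    if [ ! -r "$dir/$name" ]; then
        echo "credential not readable: $name" >&2
        return 1
    fi
    cat "$dir/$name"
}
```

Usage would look like `DB_PASSWORD=$(read_credential db-password) || exit 1`, which stops the script before it tries to connect with an empty password.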
Timer Units: Replacing cron
A timer unit triggers a service unit on a schedule. Two files: the timer and its corresponding service.
# /etc/systemd/system/backup.service
[Unit]
Description=Nightly backup
# don't run on battery (laptops)
ConditionACPower=true
[Service]
Type=oneshot
User=backup
ExecStart=/opt/backup/bin/run.sh
# lowest CPU priority
Nice=19
# only run when no other I/O
IOSchedulingClass=idle
# /etc/systemd/system/backup.timer
[Unit]
Description=Run backup nightly
[Timer]
# every day at 3 AM
OnCalendar=*-*-* 03:00:00
# spread load: actual run is 03:00–03:15
RandomizedDelaySec=900
# if missed (host was off), run on next boot
Persistent=true
# what to start
Unit=backup.service
[Install]
WantedBy=timers.target
Enable:
sudo systemctl enable --now backup.timer
systemctl list-timers --all
# NEXT LEFT LAST PASSED UNIT ACTIVATES
# Tue 2025-01-14 03:11:23 UTC 14h left - - backup.timer backup.service
OnCalendar syntax
# 00:00:00 every day
OnCalendar=daily
# minute 0 of every hour
OnCalendar=hourly
# weekdays at 9 AM
OnCalendar=Mon..Fri 09:00
# 1st of every month at 4 AM
OnCalendar=*-*-01 04:00:00
# one specific time
OnCalendar=2025-12-31 23:59:00
# every 15 minutes
OnCalendar=*:0/15
# every day at 3 AM
OnCalendar=*-*-* 03:00:00
systemd-analyze calendar 'Mon..Fri 09:00' validates and shows the next firing.
Cron equivalents
| cron line | systemd OnCalendar |
|---|---|
| 0 3 * * * | *-*-* 03:00:00 |
| */15 * * * * | *:0/15 |
| 0 9 * * 1-5 | Mon..Fri 09:00 |
| 0 0 1 * * | *-*-01 00:00:00 |
| @reboot | OnBootSec=2min (Timer) or After=multi-user.target |
Why timers beat cron
- Logs go to journald, not /var/log/cron.
- Failed runs surface via systemctl status backup.service with exit code and recent logs.
- Persistent=true catches up on missed runs after downtime; cron silently skips.
- Resource limits (MemoryMax, CPUQuota) apply per-run.
- Hardening directives apply (cron jobs run with full user privileges).
- Dependencies work: timer can require network-online.target, etc.
A Hardened Production Template
# /etc/systemd/system/myapp.service
[Unit]
Description=My App
After=network-online.target postgresql.service
Wants=network-online.target
Requires=postgresql.service
StartLimitIntervalSec=60
StartLimitBurst=5
[Service]
Type=notify
NotifyAccess=main
WatchdogSec=30
User=myapp
Group=myapp
SupplementaryGroups=
EnvironmentFile=-/etc/myapp/env
LoadCredential=db-password:/etc/myapp/db-password
WorkingDirectory=/opt/myapp
ExecStartPre=/opt/myapp/bin/preflight.sh
ExecStart=/opt/myapp/bin/run.sh
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
TimeoutStartSec=120
TimeoutStopSec=30
KillSignal=SIGTERM
KillMode=mixed
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp
LogRateLimitIntervalSec=10
LogRateLimitBurst=200
# Filesystem
ProtectSystem=strict
ReadWritePaths=/var/lib/myapp /var/log/myapp
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
ProtectProc=invisible
ProcSubset=pid
# Privilege
NoNewPrivileges=true
CapabilityBoundingSet=
AmbientCapabilities=
RestrictSUIDSGID=true
RestrictRealtime=true
LockPersonality=true
MemoryDenyWriteExecute=true
# Network
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
RestrictNamespaces=true
# Syscalls
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources @debug @mount @raw-io @reboot @swap
SystemCallArchitectures=native
SystemCallErrorNumber=EPERM
# Resources
MemoryMax=1G
MemoryHigh=768M
CPUQuota=200%
TasksMax=256
LimitNOFILE=8192
LimitNPROC=128
LimitCORE=0
# Misc
UMask=0027
[Install]
WantedBy=multi-user.target
This template scores well on systemd-analyze security (typically 1–2 out of 10, “OK” range). Adjust by removing the restrictions your script cannot run under (e.g., remove MemoryDenyWriteExecute=true for JIT languages).
Real-World Recipes
Recipe 1: One-shot setup script with idempotent guard
[Unit]
Description=One-time database initialization
ConditionPathExists=!/var/lib/myapp/initialized
After=postgresql.service
Requires=postgresql.service
[Service]
Type=oneshot
User=myapp
ExecStart=/opt/myapp/bin/init-db.sh
ExecStartPost=/usr/bin/touch /var/lib/myapp/initialized
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
The ConditionPathExists=!/var/lib/myapp/initialized means: don’t run if the marker file exists. After the script succeeds, ExecStartPost creates the marker. On later boots systemd skips the unit because the condition is not met, and the journal records the skip. Idempotent across reboots.
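The condition only protects starts that go through systemd. If someone runs init-db.sh by hand, the same marker can be honored inside the script. The sketch below shows one way to do that. run_once is a hypothetical helper, and the marker path in the usage line is the one from the unit above.

```shell
#!/bin/sh
# run_once MARKER COMMAND [ARGS...] (hypothetical helper)
# Run COMMAND only if MARKER does not exist. Create MARKER only on success,
# so a failed run is retried next time.
run_once() {
    marker=$1
    shift
    if [ -e "$marker" ]; then
        echo "already done: $marker"
        return 0
    fi
    "$@" || return $?
    : > "$marker"
}
```

For example, `run_once /var/lib/myapp/initialized create_schema`, where create_schema stands in for whatever the script actually does. With this in place the ExecStartPost touch becomes redundant but harmless.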
Recipe 2: Long-running daemon with watchdog
/opt/myapp/bin/run-daemon.sh:
#!/usr/bin/env bash
set -Eeuo pipefail
# Initialization phase.
load_config
warm_caches
open_database
# Tell systemd we're ready.
systemd-notify --ready --status="Listening on :8080"
# Main loop with periodic watchdog ping.
while :; do
    if ! main_iteration; then
        systemd-notify --status="Iteration failed; exiting"
        exit 1
    fi
    systemd-notify --status="Iteration completed at $(date -u +%FT%TZ)" WATCHDOG=1
    sleep 5
done
The unit sets Type=notify, WatchdogSec=30, and Restart=on-failure. If main_iteration hangs for more than 30 seconds, the watchdog fires and systemd kills and restarts the service.
Recipe 3: Timer-driven backup with offline persistence
/etc/systemd/system/myapp-backup.service:
[Unit]
Description=MyApp backup
[Service]
Type=oneshot
User=backup
ExecStart=/opt/myapp/bin/backup.sh
Nice=19
IOSchedulingClass=idle
TimeoutStartSec=2h
StandardOutput=journal
SyslogIdentifier=myapp-backup
ProtectSystem=strict
ReadWritePaths=/var/lib/backup /var/lib/myapp
PrivateTmp=true
NoNewPrivileges=true
/etc/systemd/system/myapp-backup.timer:
[Unit]
Description=Run myapp backup daily
[Timer]
OnCalendar=*-*-* 03:00:00
RandomizedDelaySec=15min
Persistent=true
Unit=myapp-backup.service
[Install]
WantedBy=timers.target
Persistent=true means: if the host was off at 03:00, run the backup as soon as the host is up. This is what cron’s @reboot should be — guaranteed catch-up.
Recipe 4: Service with reload that re-reads config
[Service]
Type=notify
ExecStart=/opt/myapp/bin/run.sh
ExecReload=/bin/kill -HUP $MAINPID
In the script:
reload_config() {
    echo "received SIGHUP; reloading config"
    # Announce the reload before doing any of the work.
    systemd-notify --reloading
    load_config
    warm_caches
    systemd-notify --ready --status="Reloaded at $(date -u +%FT%TZ)"
}
trap reload_config HUP
# main loop
systemctl reload myapp sends SIGHUP, and the script re-reads its config without exiting. Prefer systemctl reload over systemctl restart when the change is config-only: there is no downtime.
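One shell detail affects how quickly the reload happens: the shell runs a trap only after the current foreground command finishes, so a SIGHUP that arrives during a long sleep is handled late. Sleeping in the background and waiting on it lets the signal interrupt the wait. The sketch below illustrates the pattern. interruptible_sleep, on_hup, and reload_requested are hypothetical names, not part of the original script.

```shell
#!/bin/sh
reload_requested=0

# Keep the handler tiny: just record that a reload was asked for.
on_hup() { reload_requested=1; }
trap on_hup HUP

# interruptible_sleep SECONDS (hypothetical helper)
# `wait` returns as soon as a trapped signal arrives, unlike a foreground sleep.
interruptible_sleep() {
    sleep "$1" &
    sleep_pid=$!
    wait "$sleep_pid" || true
    # Clean up the background sleep if a signal cut the wait short.
    kill "$sleep_pid" 2>/dev/null || true
}
```

The main loop would call `interruptible_sleep 10`, then check `reload_requested` and run the reload steps when it is 1, resetting it to 0 afterwards.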
Recipe 5: Per-instance template units
You have 5 worker queues, identical config except for the queue name. Use a template unit:
/etc/systemd/system/myapp-worker@.service:
[Unit]
Description=MyApp worker for queue %i
[Service]
Type=notify
User=myapp
Environment=QUEUE_NAME=%i
ExecStart=/opt/myapp/bin/worker.sh
Restart=on-failure
Enable with the instance name after @:
sudo systemctl enable --now myapp-worker@orders.service
sudo systemctl enable --now myapp-worker@billing.service
sudo systemctl enable --now myapp-worker@notifications.service
%i in the unit file is replaced with the part after @. Five instances, one unit file, individual control: systemctl restart myapp-worker@orders.
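With many queues, the instance names can be generated rather than typed. This is a small sketch: worker_units is a hypothetical helper, and it assumes queue names need no escaping (names with slashes or other special characters would have to go through systemd-escape first).

```shell
#!/bin/sh
# worker_units QUEUE... (hypothetical helper)
# Print one myapp-worker@QUEUE.service name per argument.
worker_units() {
    for queue in "$@"; do
        printf 'myapp-worker@%s.service\n' "$queue"
    done
}
```

Because systemctl accepts several units at once, something like `sudo systemctl enable --now $(worker_units orders billing notifications)` would bring up all three instances in one call.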
Footgun List
- Type=simple for a script that exits. Use oneshot, not simple. Service will appear to “succeed” then immediately become inactive.
- Restart=always without RestartSec. The default delay is only 100 ms, so a crash-loop hammers the host. Always set RestartSec= and StartLimitBurst=.
- Running as root because you didn’t think about it. With no User=, the default is root. Always set User= to a service account; for one-shots, consider DynamicUser=true.
- After=network.target instead of network-online.target. network.target only guarantees the network stack is initialized, not that the network is up. For network-dependent services, use After=network-online.target and Wants=network-online.target.
- EnvironmentFile= with permissive mode. Use chmod 0640 with group read for the service account; never world-readable.
- Logging to a file you also tee to journald. Pick one. Logs in two places means a 2x storage bill and grep confusion.
- ExecReload=systemctl reload-or-try-restart (recursive). Don’t do this. ExecReload is the implementation of reload, usually kill -HUP $MAINPID.
- ProtectSystem=strict with no ReadWritePaths=. Service has nowhere to write. Add ReadWritePaths=/var/lib/myapp etc. for the legitimate writable paths.
- WatchdogSec= without watchdog pings. The watchdog only works when the service sends WATCHDOG=1 over the notify socket, which in practice means Type=notify plus a regular systemd-notify WATCHDOG=1 from the script.
- ConditionPathExists= confused with RequiresMountsFor=. ConditionPathExists is checked once at start; if false, the unit is skipped. RequiresMountsFor= ensures a path’s mount is up. Different semantics.
- Editing the unit and forgetting daemon-reload. systemd caches unit files; without daemon-reload, your changes don’t apply.
- Forgetting [Install] means systemctl enable does nothing. The WantedBy= is what creates the symlink that triggers auto-start.
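Several of these footguns can be caught mechanically before a unit ships. The sketch below is a deliberately crude text lint, not a replacement for systemd-analyze verify. lint_unit is a hypothetical helper, its patterns are assumptions about how the directives are usually written, and it ignores drop-ins and [Section] boundaries.

```shell
#!/bin/sh
# lint_unit FILE (hypothetical helper): print one line per suspected footgun.
# Returns 0 when nothing was found, 1 otherwise.
lint_unit() {
    unit=$1
    findings=0
    # No User= or DynamicUser= means the service runs as root.
    if ! grep -Eq '^(User|DynamicUser)=' "$unit"; then
        echo "no User= or DynamicUser= (runs as root)"
        findings=$((findings + 1))
    fi
    # A restart policy without an explicit delay uses the 100 ms default.
    if grep -Eq '^Restart=(always|on-[a-z]+)' "$unit" &&
       ! grep -q '^RestartSec=' "$unit"; then
        echo "Restart= set without RestartSec="
        findings=$((findings + 1))
    fi
    # network.target does not wait for connectivity.
    if grep -Eq '^After=.*network\.target' "$unit"; then
        echo "After=network.target (did you mean network-online.target?)"
        findings=$((findings + 1))
    fi
    # '#' starts a comment only at the beginning of a line.
    if grep -Eq '^[A-Za-z]+=.*[[:space:]]#' "$unit"; then
        echo "trailing '#' after a value is not a comment"
        findings=$((findings + 1))
    fi
    [ "$findings" -eq 0 ]
}
```

Run as `lint_unit /etc/systemd/system/myapp.service` in a pre-deploy step. Any printed line is worth a second look, though a value that legitimately contains " #" will produce a false positive.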
Quick-Reference Card
┌─ Type SELECTION ──────────────────────────────────────────────────────┐
│ Type=oneshot scripts that run-and-exit (with RemainAfterExit) │
│ Type=simple daemons that don't fork; trivial init │
│ Type=notify daemons with non-trivial init (call sd_notify) │
│ Type=forking legacy double-forking daemons (rare today) │
└────────────────────────────────────────────────────────────────────────┘
┌─ RESTART POLICY ──────────────────────────────────────────────────────┐
│ Restart=on-failure most services │
│ RestartSec=5 back-off between restarts │
│ StartLimitIntervalSec=60 │
│ StartLimitBurst=5 max 5 starts in 60s; then "failed" │
└────────────────────────────────────────────────────────────────────────┘
┌─ HARDENING (DEFAULT-ON FOR NEW SERVICES) ─────────────────────────────┐
│ ProtectSystem=strict + ReadWritePaths=... │
│ ProtectHome=true │
│ PrivateTmp=true │
│ NoNewPrivileges=true │
│ CapabilityBoundingSet= │
│ SystemCallFilter=@system-service │
│ RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 │
│ MemoryMax=N TasksMax=N CPUQuota=X% │
└────────────────────────────────────────────────────────────────────────┘
┌─ sd_notify FROM SHELL ────────────────────────────────────────────────┐
│ systemd-notify --ready service is started │
│ systemd-notify WATCHDOG=1 kick the watchdog timer │
│ systemd-notify --status="..." set status field for status │
│ systemd-notify --stopping shutting down │
│ systemd-notify --reloading reloading config │
└────────────────────────────────────────────────────────────────────────┘
┌─ TIMER ESSENTIALS ────────────────────────────────────────────────────┐
│ OnCalendar=*-*-* 03:00:00 daily 3 AM │
│ RandomizedDelaySec=15min jitter to avoid thundering herd │
│ Persistent=true run on boot if missed │
│ systemd-analyze calendar EXP validate the expression │
└────────────────────────────────────────────────────────────────────────┘
┌─ AUDIT COMMANDS ──────────────────────────────────────────────────────┐
│ systemd-analyze security UNIT hardening score │
│ systemctl show UNIT all expanded settings │
│ systemctl status UNIT current state + recent log │
│ journalctl -u UNIT [-f] [--since] log access │
│ systemctl list-timers --all all configured timers │
│ systemd-analyze calendar 'EXPR' validate timer schedule │
└────────────────────────────────────────────────────────────────────────┘
Tier 4 Capstone
This lesson closes Tier 4. You now have the toolset:
- POSIX portability (L23) and performance discipline (L24)
- Security hardening (L25), secret hygiene (L26), idempotency (L27)
- Filesystem semantics (L28), kernel introspection (L29)
- Container automation (L30), cloud-CLI mastery (L31), and now systemd integration (L32)
What ties them together: every script you write at this level is inspectable, reversible, and bounded. You can trace what it did, you can roll it back, and you can put a wall around what it can do (Linux capabilities, systemd hardening, IAM scope). These are the skills that separate scripts that survive five years from scripts that break next quarter.
The next tier (Wave 4: Tier 5 Specialist) takes these foundations and applies them to specific operator domains: bootstrap and cloud-init, monitoring and watchdogs, backup/restore, database admin, log analysis at scale, self-healing systems, migrations, compliance, and forensics. Each lesson treats shell as the integration glue between disciplined script craft and the operational realities of running production systems.