cp -r and for f in * work for tens of files. They break around the time you have:
- A directory with filenames containing spaces, newlines, or dashes (yes, that happens — Windows shares, user uploads).
- Thousands of files in one dir (* expansion can blow the argv limit).
- Gigabytes of data where a 5-minute interrupted copy means you have to start over.
- A process that needs to resume if killed mid-flight.
- Multiple machines you’re keeping in sync.
- A target where partial writes are unacceptable (atomic config swaps, container images).
This lesson covers the tools that handle all those cases:
- rsync: the Swiss army knife of file copying. Resumable, incremental, network-aware, snapshot-friendly.
- find -print0 / xargs -0 / mapfile -d '': filename-safe iteration (reviewed from Wave 1, with new patterns).
- Atomic writes: mktemp + mv to ensure readers never see a half-written file.
- Parallel-safe operations: combining concurrency from L16 with file ops.
By the end you’ll handle 1TB volumes confidently and never lose data to “the script died half-way through.”
1. rsync — the Unix copy tool you should be using
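rsync is “smart cp”: it figures out what’s changed since last time and only transfers the differences. For local copies it’s competitive with cp. For remote copies, an incremental run sends only new and changed files (and, within a changed file, only the changed blocks), so repeat transfers of a mostly-unchanged tree are often 10-100x faster than copying everything again.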
The canonical local copy
rsync -aP /source/ /dest/
- -a = archive (recursive, preserve permissions, times, symlinks, etc.; almost always what you want)
- -P = --partial --progress (show progress, keep partially-transferred files for resume)
The trailing / on source matters. rsync -a /src/ /dst/ copies the contents of /src into /dst. rsync -a /src /dst/ copies /src itself into /dst (creating /dst/src).
rsync -aP /src/ /dst/ # /src/foo → /dst/foo
rsync -aP /src /dst/ # /src/foo → /dst/src/foo
This is the most common rsync mistake. Always think about whether you mean “copy contents” or “copy the dir itself.”
The canonical remote copy
rsync -azP /source/ user@host:/dest/
- -z = compress during transfer (helpful for slow networks; harmful for fast ones)
For very fast LAN transfers, omit -z (CPU is the bottleneck, not bandwidth).
--delete — make destination match source
rsync -aP --delete /source/ /dest/
Files in /dest/ that aren’t in /source/ are deleted. Use carefully — you can wipe the target if you fat-finger arguments. Always test with --dry-run first:
rsync -aPn --delete /source/ /dest/ # -n = --dry-run; just print what would happen
Always do this on first run with --delete.
--exclude and --include
rsync -aP --exclude='*.log' --exclude='node_modules/' /src/ /dst/
Patterns are checked against the path relative to the source root. --exclude='*.log' matches any .log file at any depth. --exclude='/cache/' matches only top-level cache/.
For complex rule sets, use --exclude-from=FILE:
# .rsync-excludes
node_modules/
.git/
*.log
*.tmp
__pycache__/
rsync -aP --exclude-from=.rsync-excludes /src/ /dst/
--link-dest — incremental snapshots
This is rsync’s killer feature. Hard-link files unchanged from a previous backup, only copying changed ones. Result: each “snapshot” appears full but actually shares disk with the previous.
# Yesterday's snapshot is at /backups/2026-06-21/
# Today, build /backups/2026-06-22/ that hard-links to yesterday's unchanged files
rsync -aP --link-dest=/backups/2026-06-21/ /source/ /backups/2026-06-22/
Now /backups/2026-06-22/ is a complete tree, but identical files share inodes with yesterday. Disk usage is roughly the size of new + modified files, not the full source.
Tools such as rsnapshot, BackupPC, and (on HFS+) Time Machine are built on the same hard-link idea, as are most “rolling N-day backup” systems. With --link-dest it is trivial to implement yourself.
--partial and --inplace
If a transfer is interrupted, by default rsync deletes the partial file and starts over. With --partial (-P includes this), it keeps the partial. Re-running rsync resumes from the partial.
--inplace writes directly to the destination file rather than to a temp + rename. Faster, but readers may see partial data. Use only when readers won’t trip over half-files.
--bwlimit — rate limiting
rsync -aP --bwlimit=10000 /src/ user@host:/dst/ # 10,000 KB/s = 10 MB/s
Use during business hours so the rsync doesn’t saturate the link.
Verbosity and dry-run
rsync -aPv /src/ /dst/ # verbose: list every file copied
rsync -aPvv /src/ /dst/ # extra verbose
rsync -aPn /src/ /dst/ # dry run; show what WOULD be transferred
Always -n first when using --delete or aggressive exclude patterns.
--checksum vs default
By default, rsync skips files where size and mtime match. With --checksum it ignores mtime and instead compares size plus a full-file checksum computed on both sides. Slow (every file is read in full) but bulletproof when you suspect mtimes are lying.
rsync over SSH with custom config
rsync -aP -e 'ssh -i ~/.ssh/backup_key -p 2222' /src/ user@host:/dst/
-e overrides the remote-shell command. Use to specify alternate keys, ports, or even different transports.
Combined real-world example: nightly backup
#!/usr/bin/env bash
set -Eeuo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/lib/log.sh"
readonly SOURCE=/var/www
readonly DEST=/backups
readonly TODAY=$(date -u +%Y-%m-%d)
readonly YESTERDAY=$(date -u -d 'yesterday' +%Y-%m-%d 2>/dev/null || date -u -v-1d +%Y-%m-%d)
readonly TODAY_DIR="$DEST/$TODAY"
readonly YEST_DIR="$DEST/$YESTERDAY"
mkdir -p "$DEST"
OPTS=( -aP --delete --exclude-from=/etc/backup/excludes )
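if [[ -d "$YEST_DIR" ]]; then
  OPTS+=( --link-dest="$YEST_DIR" )
fi
info "starting backup" date=$TODAY
rsync "${OPTS[@]}" "$SOURCE/" "$TODAY_DIR/"
# rsync -a copies the source directory's mtime onto $TODAY_DIR; reset it so
# the -mtime prune below measures the age of the backup, not of $SOURCE.
touch "$TODAY_DIR"
# Prune backups older than 30 days
find "$DEST" -maxdepth 1 -type d -name '????-??-??' -mtime +30 -print0 \
  | xargs -0r rm -rf --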
info "backup complete"
This is the foundation of every “rolling daily backup” script. Adapt for your data.
2. Filename-safe iteration (recap + new patterns)
We covered this in L4 and L6. Recap:
# WRONG — breaks on spaces/newlines in filenames
for f in $(find . -name '*.log'); do …; done
# RIGHT — NUL-separated, mapfile collects safely
mapfile -d '' -t FILES < <(find . -name '*.log' -print0)
for f in "${FILES[@]}"; do …; done
# RIGHT — for direct piping
find . -name '*.log' -print0 | xargs -0 -n 1 process
# RIGHT — read loop with explicit IFS=
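find . -name '*.log' -print0 | while IFS= read -r -d '' f; do
  process "$f"
done
IFS= (empty) stops read from trimming leading and trailing whitespace from the name, -r keeps backslashes literal, and -d '' makes the delimiter NUL. Because the loop sits on the right of a pipe it runs in a subshell, so variables it sets are gone afterwards; use the mapfile form when you need results back in the main shell.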
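A quick way to convince yourself is to create a few awkward names and count what each loop sees. The sketch below uses a scratch directory and invented file names; mapfile -d needs bash 4.4 or newer.

```shell
#!/usr/bin/env bash
# Sketch: three files, one with a space and one with a newline in its name.
work=$(mktemp -d)
touch "$work/plain.log" "$work/with space.log" "$work/"$'new\nline.log'

# Unsafe: the unquoted command substitution is word-split on spaces and newlines.
unsafe=0
for f in $(find "$work" -name '*.log'); do
  unsafe=$((unsafe + 1))
done

# Safe: NUL-delimited names collected into an array.
mapfile -d '' -t files < <(find "$work" -name '*.log' -print0)

echo "unsafe loop saw $unsafe items; mapfile saw ${#files[@]} files"
```

The safe count matches the number of files created; the unsafe count is larger because the two awkward names are each split into pieces.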
find ... -exec ...
For simple ops, skip xargs entirely:
# Per-file: forks once per file (slow if many files)
find . -name '*.log' -exec gzip {} \;
# Batched: forks once per batch (fast)
find . -name '*.log' -exec gzip {} +
Always prefer + over \; when the command supports multiple args. We covered this in L11.
find -execdir
find . -name '*.tmp' -execdir rm -- {} \;
-execdir runs the command in the directory containing the file. Useful for git/svn operations that act on cwd.
When **/ (globstar) is enough
For in-shell iteration where you don’t need find’s full power:
shopt -s globstar nullglob
for f in **/*.log; do
  process "$f"
done
Quote "$f" even when the glob match is “safe” — it’s a habit you don’t want to break.
parallel over find
find . -name '*.log' -print0 | parallel -0 -j 8 gzip {}
parallel -0 reads NUL-separated input. Combine with -j 8 for 8-way parallelism. We covered this in L16.
3. Atomic writes — never leave a half-finished file
If a process is killed (Ctrl-C, OOM, kernel panic) while writing a file, readers may see truncated content. For configs, manifests, anything important, write atomically: write to a temp file, fsync, then rename.
The basic pattern
TMP=$(mktemp /var/lib/myapp/data.XXXXXX)
trap 'rm -f "$TMP"' EXIT
generate_data > "$TMP"
mv -- "$TMP" /var/lib/myapp/data.json
trap - EXIT
mv on the same filesystem is atomic at the kernel level — readers either see the old version or the new one, never partial.
Why same filesystem?
mv across filesystems is cp + unlink, which is not atomic. If the destination is on /data (a separate FS from /tmp), create the temp file in the destination’s directory instead of /tmp:
TMP=$(mktemp -p "$(dirname "$DEST")" data.XXXXXX)
-p DIR puts the temp file in DIR — same filesystem as DEST.
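If you are unsure whether two paths share a filesystem, comparing device numbers is a quick heuristic. The helper name same_filesystem below is hypothetical (it is not part of this lesson's library), and stat -c %d is GNU stat syntax. Bind mounts of one filesystem can report the same device while still refusing a rename between them, so treat this as a sanity check rather than a guarantee.

```shell
#!/usr/bin/env bash
# same_filesystem PATH1 PATH2: succeeds when both paths report the same device.
# (Hypothetical helper for illustration.)
same_filesystem() {
  # GNU stat: %d prints the device number of the filesystem that holds the path.
  [[ "$(stat -c %d -- "$1")" == "$(stat -c %d -- "$2")" ]]
}

work=$(mktemp -d)
DEST="$work/data.json"                              # stand-in for the real destination
TMP=$(mktemp -p "$(dirname "$DEST")" .data.XXXXXX)  # temp file created next to DEST

if same_filesystem "$TMP" "$(dirname "$DEST")"; then
  echo "same filesystem: mv will be a plain rename"
else
  echo "different filesystems: mv would fall back to copy + unlink" >&2
fi
```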
Adding fsync
mv is atomic from the kernel’s view, but the data may still be in page cache. To guarantee durability, sync first:
TMP=$(mktemp -p "$(dirname "$DEST")" .data.XXXXXX)
generate_data > "$TMP"
sync # flush ALL pending writes (heavyweight)
# Or, more targeted (GNU coreutils 8.24+ accepts file arguments and flushes only those):
# sync "$TMP"
mv -- "$TMP" "$DEST"
For most use cases, the implicit kernel flushing is fine. Add explicit sync only when crashes are a real concern (e.g. database snapshots).
Reusable helper
atomic_write() {
  local target=$1
  local tmpdir; tmpdir=$(dirname "$target")
  local tmp; tmp=$(mktemp -p "$tmpdir" ".$(basename "$target").XXXXXX")
  trap 'rm -f "$tmp"' EXIT
  cat > "$tmp" # read stdin → temp file
  mv -- "$tmp" "$target"
  trap - EXIT
}
# Use:
generate_config | atomic_write /etc/myapp/config.yaml
cat > "$tmp" reads stdin to the temp. The function is generic.
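One caveat with the stdin form: in generate_config | atomic_write, the helper cannot tell whether the generator failed, so it installs whatever arrived on stdin, even if that is empty or truncated. A possible variant, sketched below under the hypothetical name atomic_write_from, runs the command itself and installs the result only on success. It also copies the old file's mode, since mktemp creates files as 0600; chmod --reference is a GNU coreutils option.

```shell
#!/usr/bin/env bash
# atomic_write_from TARGET CMD [ARGS...]
# Hypothetical helper: run CMD, and replace TARGET with its output only if CMD succeeds.
atomic_write_from() {
  local target=$1; shift
  local tmp
  tmp=$(mktemp -p "$(dirname -- "$target")" ".$(basename -- "$target").XXXXXX") || return 1

  # Run the generator ourselves so a failure is seen before anything is installed.
  if ! "$@" > "$tmp"; then
    rm -f -- "$tmp"
    return 1
  fi

  # mktemp creates the file with mode 0600; keep the existing file's mode if there is one.
  if [[ -e "$target" ]]; then
    chmod --reference="$target" -- "$tmp"
  fi

  mv -- "$tmp" "$target" || { rm -f -- "$tmp"; return 1; }
}

# Demonstration in a scratch directory.
work=$(mktemp -d)
echo "old" > "$work/config.yaml"

atomic_write_from "$work/config.yaml" false || echo "generator failed, old file kept" >&2
after_fail=$(cat "$work/config.yaml")     # still "old"

atomic_write_from "$work/config.yaml" printf 'new\n'
after_ok=$(cat "$work/config.yaml")       # now "new"
```

Passing the command as arguments, rather than piping into the helper, is what lets the exit status gate the rename.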
Atomic directory replace
For atomically replacing a directory (e.g. blue/green static content):
NEW=/var/www/site.new
LIVE=/var/www/site
OLD=/var/www/site.old
# Build the new tree
rsync -aP /source/ "$NEW/"
# Atomic flip via rename — actually two-step on most filesystems
mv -- "$LIVE" "$OLD" # not atomic in the strict sense
mv -- "$NEW" "$LIVE" # but very fast — sub-millisecond gap
# Cleanup
rm -rf "$OLD"
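For truly atomic dir-swap, make the live path a symlink and replace the symlink with a rename (this assumes /var/www/site is already a symlink, or does not exist yet):
# Build the new dir at /var/www/sites/v2/
rsync -aP /source/ /var/www/sites/v2/
# Create the new link under a temporary name, then rename it over the live one.
# mv -T (GNU) treats the destination as a plain name instead of moving into the directory it points at.
ln -s /var/www/sites/v2 /var/www/site.tmp
mv -T /var/www/site.tmp /var/www/site
The rename is a single syscall, so readers see either the old target or the new one. ln -sfn /var/www/sites/v2 /var/www/site is the common shorthand, but whether it is atomic depends on the ln implementation: some remove the old link and then create the new one, which leaves a brief gap. This symlink flip is how most blue/green static-content deployments work.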
4. Parallel-safe directory operations
When fanning out file ops across cores, watch for races and contention.
Per-file work in parallel
# Compress every .log file with 8 cores
find /var/log -type f -name '*.log' -print0 \
| xargs -0 -P 8 -n 1 gzip
This is safe — each gzip operates on its own file. No coordination needed.
Aggregating from parallel jobs
# Counting words across many files in parallel: DON'T append to one shared file
mkdir -p /tmp/results
find . -name '*.txt' -print0 | xargs -0 -P 8 -I {} \
  bash -c 'wc -w < "$1" > "$(mktemp /tmp/results/wc.XXXXXX)"' _ {}
# Aggregate after
cat /tmp/results/wc.* | awk '{ s += $1 } END { print s }'
Each worker writes to its own uniquely named file (mktemp rather than the input's basename, because basenames can repeat across directories). xargs returns only when every worker has finished, so you can aggregate right after it. We saw this pattern in L16; here it’s specialised for file ops.
Walking very large trees efficiently
For directories with millions of files, a single find walks the whole tree serially. Split the walk at the top level (-maxdepth 1) and run one find per subtree:
# One find per top-level dir, 4 at a time; each walks its own subtree in full
find /data -mindepth 1 -maxdepth 1 -type d -print0 \
| xargs -0 -P 4 -I {} bash -c 'find "$1" -name "*.log" | wc -l' _ {}
This breaks the tree into independent subtrees, processes each in parallel.
rsync from many sources to one dest
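rsync doesn’t natively parallelise. Workaround: rsync each top-level subdir in parallel.
find /source -mindepth 1 -maxdepth 1 -type d -print0 | parallel -0 -j 4 rsync -a {}/ /dest/{/}/
{/} is GNU parallel’s basename of the input. Unlike parsing ls, this handles odd directory names and skips plain files; files sitting directly in /source are not covered, so copy them in a separate pass. Be careful with --delete: each parallel rsync only sees its own subdir, so it cleans up inside /dest/<subdir>/ correctly, but a top-level directory that has disappeared from /source is never visited and stays in /dest.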
5. Common patterns
Disk-usage one-liners
# Top 20 largest files in a tree
find /var -type f -printf '%s %p\n' | sort -rn | head -n 20
# Top 20 largest entries directly under /var (each directory counted with everything below it)
du -sh /var/* 2>/dev/null | sort -hr | head -n 20
# Total size of files matching a pattern
find . -name '*.log' -printf '%s\n' | awk '{s+=$1} END {print s}'
# Files older than 7 days, total size
find . -mtime +7 -type f -printf '%s\n' | awk '{s+=$1} END {print s/1024/1024 " MB"}'
Safely deleting many files
# DON'T — argv overflow risk
rm /var/cache/*.tmp
# DO — find + xargs handles arbitrary file counts
find /var/cache -name '*.tmp' -print0 | xargs -0 rm --
# Or with -delete (no fork, but no pre-list)
find /var/cache -name '*.tmp' -delete
Mirroring with hard links (zero-copy “branching”)
# Make a snapshot that shares storage with the original
cp -al /source /snapshot # -a archive; -l hard-link instead of copy
# OR with rsync
rsync -a --link-dest=/source /source/ /snapshot/
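Both create /snapshot/ where every file is a hard link to the matching file in /source/, so both trees point at the same blocks. Replacing a file (writing a new copy and renaming it over the old name, as rsync and the atomic-write pattern do) breaks the link for that file only. Editing a file in place (appending, or truncating it with >) changes it in both trees.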
This is how container layer snapshots work conceptually.
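The difference between replacing a file and editing it in place is worth seeing once. A small sketch in a scratch directory, with invented file names, assuming GNU cp for -al:

```shell
#!/usr/bin/env bash
# Sketch: hard-link snapshot, then one in-place edit and one rename-style replace.
work=$(mktemp -d)
mkdir "$work/source"
echo "original" > "$work/source/a.txt"
echo "original" > "$work/source/b.txt"

cp -al "$work/source" "$work/snapshot"

# In-place edit: the write goes to the shared inode, so the snapshot changes too.
echo "appended" >> "$work/source/a.txt"

# Replace via new file + rename: the name in source/ now points at a new inode,
# and the snapshot keeps the old content.
echo "replaced" > "$work/source/b.txt.new"
mv -- "$work/source/b.txt.new" "$work/source/b.txt"

snap_a=$(cat "$work/snapshot/a.txt")   # "original" plus the appended line
snap_b=$(cat "$work/snapshot/b.txt")   # still just "original"
```

If the snapshot has to stay frozen, the source must only ever be updated by replacement, or the snapshot should be made read-only.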
Find files NOT matching a pattern
find . -type f -not -name '*.tmp'
find . -type f \! -name '*.tmp' # ! escaped for shell
# Multiple patterns
find . -type f -not \( -name '*.tmp' -o -name '*.bak' \)
Atomic config reload without restart
# Pattern: write atomically, then signal the daemon
atomic_write /etc/myapp/config.yaml < new-config.yaml
systemctl reload myapp.service # or kill -HUP $(cat myapp.pid)
The daemon re-reads on SIGHUP. Atomic-write means the daemon never reads a partial config.
6. Common pitfalls
cp losing perms / xattrs
cp file dest # may not preserve perms, ownership, ACLs
cp -a file dest # archive mode (preserves)
cp -p file dest # preserve mode/owner/timestamps only
Use -a (or cp --preserve=all) when fidelity matters.
rm -rf with empty variable
The classic disaster:
DIR=""
rm -rf "$DIR/cache" # if DIR is empty: rm -rf "/cache" — DELETES /cache !
Always validate:
[[ -n "$DIR" ]] || die "DIR is empty"
rm -rf -- "$DIR/cache"
The -- ends option processing — protects against $DIR accidentally starting with a dash.
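Bash can also enforce the check at the point of use: ${VAR:?message} aborts the expansion when VAR is unset or empty. The sketch below demonstrates this in a scratch directory; the guard is run inside a subshell only so that the demonstration itself does not terminate the script.

```shell
#!/usr/bin/env bash
# Sketch: ${DIR:?...} refuses to expand an empty DIR, so rm never sees "/cache".
work=$(mktemp -d)
mkdir -p "$work/cache"

DIR=""
if ( : "${DIR:?DIR is empty}" ) 2>/dev/null; then
  empty_guard="passed"
else
  empty_guard="blocked"    # the subshell exited non-zero before doing anything
fi

DIR=$work
rm -rf -- "${DIR:?DIR is empty}/cache"   # DIR is set, so this removes only $work/cache
```

In a real script you would write just the last line; with an empty DIR a non-interactive shell prints the message and exits instead of running rm.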
mv across filesystems
mv /var/data/big.tar /tmp/ # if /tmp is a separate FS, this is cp + unlink
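A cross-FS mv that is interrupted can leave a partial copy at the destination and, for directory trees, a source that is only partly removed. For huge files, prefer cp -av "$src" "$dst" && rm -- "$src" so you control the timing and the source is removed only after the copy succeeded.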
find -exec with ; vs +
find . -name '*.log' -exec gzip {} \; # forks gzip per file — slow
find . -name '*.log' -exec gzip {} + # batched — fast
Always use + unless your command really only takes one file.
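cp -r dir1/ dir2/ vs cp -r dir1 dir2/
cp -r dir1 dir2/ # copies dir1 ITSELF into dir2 (creates dir2/dir1)
cp -r dir1/ dir2/ # GNU cp: same as above; BSD/macOS cp: copies CONTENTS of dir1 into dir2
cp -r dir1/. dir2/ # copies CONTENTS of dir1 into dir2 on both
Unlike rsync, cp has no portable trailing-slash rule: GNU cp ignores the slash on the source, BSD cp honours it. Write dir1/. when you mean “contents”.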
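Since cp implementations differ on what a trailing slash on the source means, it can help to check the dir1/. form once in a scratch directory. A minimal sketch with invented names:

```shell
#!/usr/bin/env bash
# Sketch: copy the contents of dir1 (dotfiles included) into an existing dir2.
work=$(mktemp -d)
mkdir -p "$work/dir1" "$work/dir2"
echo data   > "$work/dir1/file.txt"
echo hidden > "$work/dir1/.dotfile"

# "dir1/." names the directory's contents, so no dir2/dir1 level is created.
cp -R "$work/dir1/." "$work/dir2/"

ls -A "$work/dir2"
```

A plain dir1/* glob would miss .dotfile, which is another reason to prefer the /. form.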
Filesystem quirks
- NFS: locking, atime, sometimes mv not atomic, definitely no flock. Use lockfile-create.
- SMB/CIFS: case insensitivity (Windows-style); mv FOO foo may be a no-op.
- FAT/exFAT: no permissions, no symlinks. Operations may silently lose metadata.
- tmpfs (RAM): fast but bounded; large copies fail with “no space” sooner than you’d think.
du vs ls -l size
ls -l file # logical size (what you'd read)
du -h file # disk usage (rounded to block; sparse files lower)
For sparse files (VM disk images, database files), du can be much smaller than ls -l. For atomic-write planning (where blocks matter), use du.
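You can reproduce the gap with a throwaway sparse file. This sketch assumes GNU coreutils (truncate, stat -c) and a filesystem that supports sparse files, which most Linux filesystems do.

```shell
#!/usr/bin/env bash
# Sketch: a 100 MiB file that occupies (almost) no disk blocks.
work=$(mktemp -d)

# truncate extends the file to the requested size without writing any data.
truncate -s 100M "$work/sparse.img"

apparent=$(stat -c %s "$work/sparse.img")            # logical size in bytes, what ls -l shows
allocated_kb=$(du -k "$work/sparse.img" | cut -f1)   # space actually allocated, in KiB

echo "apparent: $apparent bytes, allocated: $allocated_kb KiB"
```

GNU du --apparent-size reports the logical size instead, which is handy when you want both numbers from one tool.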
7. The lib/files.sh framework
# lib/files.sh — file-operation helpers
atomic_write() {
  local target=$1
  local tmp
  tmp=$(mktemp -p "$(dirname "$target")" ".$(basename "$target").XXXXXX")
  trap 'rm -f "$tmp"' EXIT
  cat > "$tmp"
  mv -- "$tmp" "$target"
  trap - EXIT
}
mirror_safely() {
  local src=$1 dst=$2
  [[ -d "$src" ]] || die "source not found: $src"
  rsync -aP --delete --exclude-from='.rsync-excludes' "$src/" "$dst/"
}
snapshot_with_link_dest() {
  local src=$1 dest_root=$2
  local today
  today=$(date -u +%Y-%m-%d)
  local target="$dest_root/$today"
  local prev
  # Newest dated snapshot other than today's. "|| true" keeps set -e / pipefail
  # from aborting on the first run, when grep finds nothing.
  prev=$(ls -1 "$dest_root" 2>/dev/null | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' \
    | grep -vx "$today" | sort -r | head -1) || true
  local opts=( -aP --delete )
  [[ -n "$prev" && -d "$dest_root/$prev" ]] && opts+=( --link-dest="$dest_root/$prev" )
  rsync "${opts[@]}" "$src/" "$target/"
  # rsync -a copied $src's mtime onto $target; reset it so age-based pruning works.
  touch "$target"
}
count_files() {
  find "$1" -type f -print0 | tr -cd '\0' | wc -c
}
total_size_mb() {
  find "$1" -type f -printf '%s\n' 2>/dev/null | awk '{s+=$1} END {print s/1024/1024}'
}
prune_older_than_days() {
  local dir=$1 days=$2
  find "$dir" -mindepth 1 -maxdepth 1 -type d -mtime "+$days" -print0 \
    | xargs -0r rm -rf --
}
Use:
source "$(dirname "${BASH_SOURCE[0]}")/lib/files.sh"
generate_config | atomic_write /etc/myapp/config.yaml
snapshot_with_link_dest /var/www /backups
prune_older_than_days /backups 30
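Pruning by -mtime depends on the snapshot directory's timestamp, and rsync -a sets that to the source directory's mtime unless something resets it. When snapshots are named YYYY-MM-DD, an alternative is to prune on the name. The helper below is a hypothetical sketch (prune_snapshots_by_name is not part of the library above) and uses GNU date's -d syntax.

```shell
#!/usr/bin/env bash
# prune_snapshots_by_name DIR DAYS
# Hypothetical helper: remove DIR/YYYY-MM-DD directories whose NAME is older than DAYS days.
prune_snapshots_by_name() {
  local dir=$1 days=$2 cutoff path name
  cutoff=$(date -u -d "$days days ago" +%Y-%m-%d) || return 1   # GNU date syntax
  for path in "$dir"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
    [[ -d "$path" ]] || continue    # also skips the unexpanded pattern when nothing matches
    name=${path##*/}
    # ISO dates sort lexicographically, so a string comparison is a date comparison.
    if [[ "$name" < "$cutoff" ]]; then
      rm -rf -- "$path"
    fi
  done
}

# Demonstration in a scratch directory: one 45-day-old name, one from today.
work=$(mktemp -d)
old=$(date -u -d '45 days ago' +%Y-%m-%d)
new=$(date -u +%Y-%m-%d)
mkdir "$work/$old" "$work/$new"

prune_snapshots_by_name "$work" 30
ls "$work"    # only today's directory remains
```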
8. Twelve idioms for daily use
# 1. rsync local copy (idiomatic)
rsync -aP /src/ /dst/
# 2. rsync remote
rsync -azP /src/ user@host:/dst/
# 3. rsync with delete + dry-run first
rsync -aPn --delete /src/ /dst/ # confirm
rsync -aP --delete /src/ /dst/ # apply
# 4. rsync incremental snapshots
rsync -aP --link-dest="$LAST_BACKUP" /src/ "/backups/$TODAY/"
# 5. NUL-safe iteration
mapfile -d '' -t FILES < <(find . -type f -print0)
for f in "${FILES[@]}"; do …; done
# 6. find + parallel
find . -name '*.log' -print0 | xargs -0 -P 8 -n 1 gzip
# 7. Atomic write
TMP=$(mktemp -p "$(dirname "$TARGET")" ".$(basename "$TARGET").XXXXXX")
generate > "$TMP" && mv -- "$TMP" "$TARGET"
# 8. Atomic dir-replace via symlink
rsync -aP /src/ /dest/v2/ && ln -s /dest/v2 /dest/live.tmp && mv -T /dest/live.tmp /dest/live
# 9. find -delete
find /tmp -mindepth 1 -mmin +60 -delete
# 10. Top 20 largest files
find /var -type f -printf '%s %p\n' | sort -rn | head -n 20
# 11. Hard-link snapshot (zero-copy clone)
cp -al /source /snapshot
# 12. Check before destructive op
[[ -n "$DIR" ]] || die "DIR empty"; rm -rf -- "$DIR/cache"
9. What you must internalise before lesson 19
- What’s the difference between rsync /src/ /dst/ and rsync /src /dst/? (Trailing slash on source: copy contents vs copy directory itself.)
- What’s --link-dest for? (Hard-link unchanged files from a previous backup: incremental snapshots that look full.)
- Why use mv tmp target instead of writing directly to target? (Atomic: readers see old or new, never partial.)
- Why must tmp and target be on the same filesystem for atomic mv? (Cross-FS mv is cp+unlink, which is not atomic.)
- What’s ln -sfn /new /link for? (Repointing a symlink in one command. For a guaranteed-atomic swap, create the link under a temporary name and rename it over the old one with mv -T.)
- What’s find -exec cmd {} +? (Batched exec: single fork per batch instead of per file.)
- What does cp -al do? (Archive copy with hard links: zero-copy clone of a directory tree.)
- What’s the canonical NUL-safe pipeline? (find ... -print0 | xargs -0 ....)
- What’s the safest “rm -rf $X” pattern? (Validate $X is non-empty first, then rm -rf -- "$X/subdir" with --.)
- What’s du vs ls -l for sparse files? (du shows actual disk usage; ls -l shows logical size. Sparse files have logical >> actual.)
What’s next
Lesson 19: Date & Time Arithmetic — ISO 8601, Time Zones, Locale Hazards & Reliable Cron-Time Math. Working with dates in shell is full of traps: date -d is GNU only; macOS BSD date uses different flags; the %N format isn’t portable; cron uses local time; UTC is mandatory in production. We cover the GNU/BSD difference comprehensively, the canonical ISO 8601 patterns, time-zone handling in scripts, computing yesterday/last-week/last-month dates, and the standard “cron-safe” date math. After L19 you’ll never have a “this script broke at DST” bug again.
See you there.