A large share of real shell work boils down to three steps: find some files, look through their contents, rewrite parts of them. The tools for this (globs, find, grep, sed) date back to the Unix of the 1970s, and they have lost none of their relevance in 2026. Cloud-native engineers, DevOps platform owners, SREs at hyperscalers: they all reach for these tools every day.
But most engineers use 5% of their power. This lesson covers them in real depth: globs beyond *.txt, regex carefully (because shell mixes three different regex flavours and the differences matter), find as the most powerful filesystem-traversal language ever shipped, grep with all the flags that turn it from “search for word” into “extract every error since 9am with 3 lines of context”, and sed for stream editing without losing files.
If you’ve been doing this work in Python because shell felt too primitive — read this lesson and re-evaluate. For a million tasks, find | grep | sed is one line, fast, and on every machine you’ll ever touch.
1. Globs revisited: nullglob, dotglob, globstar, extglob
We covered the basic globs in lesson 4. Bash has shell options that change globbing behaviour. Enable them with shopt -s NAME and disable them with shopt -u NAME.
nullglob — empty match expands to nothing
By default:
ls /nonexistent/*.log
ls: cannot access '/nonexistent/*.log': No such file or directory
Bash leaves the literal /nonexistent/*.log in place when nothing matches. Inside a for loop, this means you iterate once with the literal pattern as the value. With nullglob:
shopt -s nullglob
ls /nonexistent/*.log # careful: the glob expands to nothing, so this runs a bare ls and lists the current directory
for f in /nonexistent/*.log; do
echo "$f" # body never runs
done
This is almost always what you want for scripts. Set it at the top of any script that iterates over globs.
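To see the loop behaviour described above, here is a minimal sketch that counts iterations over a glob matching nothing, first with default globbing and then with nullglob. The scratch directory and the variable names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: count loop iterations over a glob that matches nothing.
demo_dir=$(mktemp -d)          # hypothetical scratch directory, empty on purpose

shopt -u nullglob
count_default=0
for f in "$demo_dir"/*.log; do
  count_default=$((count_default + 1))    # runs once: $f is the literal pattern
done

shopt -s nullglob
count_nullglob=0
for f in "$demo_dir"/*.log; do
  count_nullglob=$((count_nullglob + 1))  # never runs: the glob expanded to nothing
done

echo "default: $count_default iteration(s), nullglob: $count_nullglob iteration(s)"
rm -rf "$demo_dir"
```

The first loop reports one iteration and the second reports zero, which is exactly the difference a script iterating over possibly-empty directories cares about.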
dotglob — include hidden files
By default, globs do not match files starting with . (the “hidden” convention):
ls *
# non-hidden entries only
shopt -s dotglob
ls *
# now also includes .config .ssh .git etc.
Use dotglob when you actually need to process all files. Otherwise leave it off.
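A small sketch of the difference, assuming a scratch directory with one visible and one hidden file (both names are invented for the example). Note that even with dotglob on, * never matches the special entries . and ..

```shell
#!/usr/bin/env bash
# Sketch: the same glob with dotglob off and on.
demo_dir=$(mktemp -d)                       # hypothetical scratch directory
touch "$demo_dir/visible.txt" "$demo_dir/.hidden"

shopt -s nullglob
shopt -u dotglob
without_dotglob=( "$demo_dir"/* )           # only visible.txt

shopt -s dotglob
with_dotglob=( "$demo_dir"/* )              # visible.txt and .hidden
shopt -u dotglob                            # switch it back off when done

echo "without dotglob: ${#without_dotglob[@]}, with dotglob: ${#with_dotglob[@]}"
rm -rf "$demo_dir"
```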
globstar — recursive **
shopt -s globstar
ls **/*.log
# matches *.log in current dir AND recursively in all subdirectories
Without globstar, ** is just two *s (no special meaning). With it, ** matches zero or more path components. This is bash 4+ only.
# Find all .py files in the project
shopt -s globstar nullglob
for f in src/**/*.py; do
process "$f"
done
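As a quick check of what ** actually picks up, the sketch below builds a throwaway tree (the directory layout and file names are assumptions for illustration) and collects the matches into an array.

```shell
#!/usr/bin/env bash
# Sketch: src/**/*.py matches at every depth, including directly under src/.
shopt -s globstar nullglob
demo_dir=$(mktemp -d)                       # hypothetical project root
mkdir -p "$demo_dir/src/pkg/sub"
touch "$demo_dir/src/top.py" \
      "$demo_dir/src/pkg/mid.py" \
      "$demo_dir/src/pkg/sub/deep.py" \
      "$demo_dir/src/pkg/notes.txt"

py_files=( "$demo_dir"/src/**/*.py )        # ** matches zero or more directories
printf '%s\n' "${py_files[@]#"$demo_dir"/}" # print paths relative to the root
py_count=${#py_files[@]}
rm -rf "$demo_dir"
```

Collecting into an array first (rather than looping directly) makes it easy to check the count before doing anything destructive.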
extglob — extended pattern matching
shopt -s extglob
# now you can use:
?(pat) # 0 or 1 occurrence of pat
*(pat) # 0 or more occurrences
+(pat) # 1 or more
@(pat) # exactly one
!(pat) # NOT pat
# Examples:
ls !(*.log) # everything except .log files
ls *.@(jpg|png|gif) # any of three extensions
ls ?(README|LICENSE) # match either, or empty
Extglob is bash-specific but extremely useful. Especially !(...) for “everything except”:
# Remove everything in /tmp/cache except the lockfile
shopt -s extglob
rm -rf /tmp/cache/!(lockfile)
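Because !(...) combined with rm -rf is unforgiving, it can help to expand the pattern into an array and look at it first. This is a sketch on a scratch directory standing in for /tmp/cache; the file names are invented. It assumes the script is run as a file, so that shopt -s extglob executes before bash parses the line containing !(...).

```shell
#!/usr/bin/env bash
# Sketch: preview what !(lockfile) expands to before deleting.
shopt -s extglob nullglob      # must take effect before the !(...) line is parsed
cache_dir=$(mktemp -d)         # hypothetical stand-in for /tmp/cache
touch "$cache_dir/lockfile" "$cache_dir/a.bin" "$cache_dir/b.bin"

doomed=( "$cache_dir"/!(lockfile) )          # everything except lockfile (hidden files need dotglob)
printf 'would remove: %s\n' "${doomed[@]##*/}"

rm -rf -- "${doomed[@]}"
remaining=( "$cache_dir"/* )
echo "left behind: ${remaining[*]##*/}"
rm -rf "$cache_dir"
```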
Glob options together
The standard “I want my globs to behave sensibly” preamble:
shopt -s nullglob globstar extglob
For most modern bash scripts, this is the right baseline. Add dotglob only when you specifically need it.
2. The three regex flavours in shell
Shell tools use different regex dialects. This trips up everyone. The three flavours:
BRE — Basic Regular Expression (POSIX, the oldest)
Default for grep and sed without flags. Special characters: . * ^ $ [ ]. Other “metacharacters” must be backslash-escaped to be special: \{n,m\}, \(, \) in POSIX, plus \?, \+, \| as GNU extensions (not guaranteed on BSD/macOS).
echo "hello123" | grep '[0-9]\+' # BRE — backslash-escape +
echo "hello123" | sed 's/[0-9]\+/X/' # BRE — same
This is the most surprising flavour for people coming from other regex languages. It’s also the default. Learn it (or always use -E).
ERE — Extended Regular Expression
grep -E (egrep is a deprecated alias), sed -E (or sed -r on GNU), awk. Special characters: . * ? + { } | ( ) ^ $ [ ]. No backslash-escaping for ?, +, |, {, (.
echo "hello123" | grep -E '[0-9]+' # ERE — natural +
echo "hello123" | sed -E 's/[0-9]+/X/' # ERE — same
ERE is what most people think of as “regex.” If you have -E available, use it.
PCRE — Perl-Compatible Regular Expression
grep -P (GNU grep), pcregrep, ripgrep with -P, most modern languages. The richest dialect: lookahead, lookbehind, named groups, non-greedy *?, \d, \w, \s, etc.
echo "hello123" | grep -P '\d+' # PCRE — \d for digits
echo "hello123world" | grep -P '(?<=hello)\d+' # lookbehind: digits AFTER "hello"
PCRE is not in plain sed or awk. For PCRE in stream editing you use perl -pe:
echo "hello123" | perl -pe 's/\d+/X/' # stream edit with Perl regex (add -i to edit a file in place)
Picking a flavour
Default to ERE for clarity (grep -E, sed -E). Drop to PCRE only when you need lookahead/lookbehind/named groups. Avoid plain BRE for new code.
A handy rule of thumb: always use grep -E or grep -P. Never plain grep. The mental tax of remembering BRE backslash-escaping is too high.
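The BRE/ERE difference is easiest to feel with the + character. The sketch below runs the same inputs through both flavours; the sample strings are arbitrary, and the BRE interval \{1,\} is used because it is POSIX rather than a GNU extension.

```shell
#!/usr/bin/env bash
# Sketch: how + is treated in BRE versus ERE.
sample='hello123'

bre_match=$(printf '%s\n' "$sample" | grep -o  '[0-9]\{1,\}')   # BRE: interval needs backslashes
ere_match=$(printf '%s\n' "$sample" | grep -Eo '[0-9]+')        # ERE: bare + means "one or more"

# The literal text a+b: in BRE a bare + is an ordinary character, in ERE it is a quantifier.
bre_plus=$(printf '%s\n' 'a+b' | grep -c  'a+b')                # 1 matching line
ere_plus=$(printf '%s\n' 'a+b' | grep -Ec 'a+b' || true)        # 0 matching lines (grep exits 1)

echo "bre_match=$bre_match ere_match=$ere_match bre_plus=$bre_plus ere_plus=$ere_plus"
```

Both digit matches come out as 123, while the a+b pattern matches in BRE and fails in ERE: same characters, different meaning.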
Common regex character classes (work in ERE/PCRE)
[A-Za-z] # letters
[0-9] # digits
[A-Za-z0-9] # alphanumeric
[[:alpha:]] # POSIX letter class — locale-aware
[[:digit:]] # POSIX digit class
[[:space:]] # whitespace (space, tab, newline)
[[:punct:]] # punctuation
[[:xdigit:]] # hex digit
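\d # digit (PCRE only)
\w # word char: [A-Za-z0-9_] (PCRE; also a GNU extension in grep and sed)
\s # whitespace (PCRE; also a GNU extension in grep and sed)
\b # word boundary (PCRE; also a GNU extension in grep and sed)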
The [[:class:]] POSIX classes work in BRE, ERE, and PCRE.
3. find — the filesystem-traversal language
find is its own little Turing-incomplete language for “give me files matching these criteria, do these things to them.” Most people use find . -name '*.log' and stop. The full power is staggering.
Basic structure
find [PATHS] [TESTS] [ACTIONS]
PATHS are starting points. TESTS are filters that decide whether each file matches. ACTIONS are what to do with matched files. Default action is -print if you don’t specify.
The most useful tests
-name 'PATTERN' # filename (with shell glob); case-sensitive
-iname 'PATTERN' # case-insensitive filename
-type TYPE # f=file, d=dir, l=symlink, b=block, c=char, p=fifo, s=socket
-size N[bckMG] # size: +1M is "more than 1 megabyte"; -100k is "less than 100k"
-mtime N # modified N days ago: -7 = within 7 days, +30 = over 30 days
-atime N # accessed N days ago
-ctime N # ctime (inode change time)
-newer FILE # newer than FILE (handy: -newer .last-run)
-mmin N # modified N minutes ago
-perm MODE # permission bits: -perm -u+x means "user-executable"
-user NAME # owned by user
-group NAME # owned by group
-empty # empty file or dir
-readable / -writable / -executable # by current user
-regex 'PATTERN' # match full path with a regex (GNU default dialect is Emacs-style); put -regextype posix-extended before it for ERE
-path 'PATTERN' # match full path with shell glob (different from -name)
-not / ! # negate
-and / -or # combine (default is -and)
The most useful actions
-print # print path (default if no action given)
-print0 # NUL-terminated; use for piping (lesson 4)
-printf 'FORMAT\n' # printf-style; %p path, %f filename, %s size, %T@ mtime, etc.
-delete # delete the file
-exec CMD {} \; # run CMD once per match, replacing {} with path
-exec CMD {} + # run CMD once with ALL matches batched as args (faster!)
-execdir CMD {} \; # same but cd to file's directory first
-prune # don't recurse into this directory (the SKIP action)
-ls # ls-style output
-quit # stop after this match (find at most 1)
Combining tests
# Files larger than 100MB, ending in .log
find /var/log -type f -size +100M -name '*.log'
# Empty directories
find . -type d -empty
# Modified in last 7 days, not in .git
find . -type f -mtime -7 -not -path '*/.git/*'
# Files whose owning UID has no matching account (often security cleanup)
find / -nouser -print
The default combination is “and.” Use -or for “or”; parentheses (escaped or quoted) for grouping:
find . \( -name '*.log' -o -name '*.tmp' \) -delete
Note the escaped parens \( \) — required because parens are shell syntax otherwise.
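Why the grouping matters becomes obvious once an action is attached. In this sketch (scratch directory and file names are assumptions), the ungrouped form binds -print only to the last test, because the implicit “and” binds tighter than -o.

```shell
#!/usr/bin/env bash
# Sketch: grouped versus ungrouped -o with an explicit action.
demo_dir=$(mktemp -d)                       # hypothetical scratch directory
touch "$demo_dir/a.log" "$demo_dir/b.tmp" "$demo_dir/c.txt"

# Grouped: (log OR tmp) AND print  -> both files are printed.
grouped=$(find "$demo_dir" -type f \( -name '*.log' -o -name '*.tmp' \) -print | wc -l)

# Ungrouped: (file AND log) OR (tmp AND print)  -> only b.tmp is printed.
ungrouped=$(find "$demo_dir" -type f -name '*.log' -o -name '*.tmp' -print | wc -l)

echo "grouped: $grouped, ungrouped: $ungrouped"
rm -rf "$demo_dir"
```

With -delete instead of -print, the ungrouped form would silently skip the .log files, which is a common source of “why didn’t it clean up?” bugs.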
-exec vs -exec +
This is the optimization most people miss:
# SLOW: forks gzip once per file
find /var/log -name '*.log' -exec gzip {} \;
# FAST: batches filenames and runs gzip as few times as possible
find /var/log -name '*.log' -exec gzip {} +
The trailing + (instead of \;) tells find to batch matches. find accumulates filenames until argv-length limits are reached, then exec’s gzip with as many as fit, repeats until done. Dramatically faster for “many files, simple op.”
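You can observe the batching directly by replacing gzip with a tiny shell that reports how many arguments it received. This is only a sketch: the five files are invented, and with so few of them everything fits in one batch.

```shell
#!/usr/bin/env bash
# Sketch: count how many times the command is invoked with \; versus +.
demo_dir=$(mktemp -d)                       # hypothetical scratch directory
touch "$demo_dir"/file{1..5}.log

# sh -c 'echo "$#"' prints the number of file arguments per invocation,
# so the number of output lines equals the number of invocations.
per_file=$(find "$demo_dir" -name '*.log' -exec sh -c 'echo "$#"' sh {} \; | wc -l)
batched=$(find "$demo_dir" -name '*.log' -exec sh -c 'echo "$#"' sh {} + | wc -l)

echo "invocations, one per file: $per_file; batched: $batched"
rm -rf "$demo_dir"
```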
The prune trick — skip directories
# Find all .py files, skipping .git and node_modules
find . \( -name .git -o -name node_modules \) -prune -o -type f -name '*.py' -print
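Read this as: “for each entry, if it’s named .git or node_modules, prune (don’t recurse); otherwise, if it’s a file ending in .py, print it.” The -o is “or”: -prune always returns true, so for a pruned directory the left side succeeds and find never evaluates the right side, which means nothing is printed. Everything else fails the name test and falls through to the second branch, which handles real matches. The explicit -print matters: without it, find wraps the whole expression in an implicit -print and lists the pruned directories too.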
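A minimal sketch of the prune expression on a throwaway tree. The layout below is an assumption chosen so that .py files exist inside both skipped directories.

```shell
#!/usr/bin/env bash
# Sketch: .py files under .git and node_modules are never visited.
demo_dir=$(mktemp -d)                       # hypothetical project root
mkdir -p "$demo_dir/.git/hooks" "$demo_dir/node_modules/pkg" "$demo_dir/src"
touch "$demo_dir/.git/hooks/hook.py" \
      "$demo_dir/node_modules/pkg/setup.py" \
      "$demo_dir/src/app.py"

found=$(cd "$demo_dir" &&
  find . \( -name .git -o -name node_modules \) -prune -o -type f -name '*.py' -print)

echo "$found"        # prints ./src/app.py
rm -rf "$demo_dir"
```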
-print0 and the pipeline pattern
For piping find output safely, always use -print0 and pair it with NUL-aware tools:
find /var/log -name '*.log' -print0 | xargs -0 gzip
find /tmp -type f -mtime +30 -print0 | xargs -0 rm --
mapfile -d '' -t FILES < <(find . -type f -print0)
We covered this in lessons 4 and 6. It’s the only completely-robust file-collection pattern.
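To make the robustness claim concrete, the sketch below creates three files, one with a space and one with a newline in its name (all names invented), and compares NUL-delimited collection with naive line counting. It assumes bash 4.4 or later for mapfile -d ''.

```shell
#!/usr/bin/env bash
# Sketch: NUL-delimited collection survives awkward filenames; line counting does not.
demo_dir=$(mktemp -d)                       # hypothetical scratch directory
touch "$demo_dir/plain.log" \
      "$demo_dir/with space.log" \
      "$demo_dir/with"$'\n'"newline.log"

mapfile -d '' -t files < <(find "$demo_dir" -type f -name '*.log' -print0)
file_count=${#files[@]}                     # 3: one array element per file

line_count=$(find "$demo_dir" -type f -name '*.log' -print | wc -l)   # 4: the newline splits one name

echo "NUL-delimited: $file_count files; newline-delimited: $line_count lines"
rm -rf "$demo_dir"
```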
4. grep — text search with all the flags
Plain grep PATTERN FILE is rarely enough. The flags are essential.
Pattern flavour flags
grep PATTERN FILE # BRE — escape special chars
grep -E PATTERN FILE # ERE — natural regex
grep -F PATTERN FILE # FIXED string — no regex, fastest
grep -P PATTERN FILE # PCRE — full Perl regex
-F (fixed string) skips regex interpretation entirely, so characters such as . or $ need no escaping, and it is often faster when you just need a literal substring. Use it when applicable:
grep -F 'ERROR: connection refused' app.log
Output mode flags
grep -l PATTERN *.log # print only filenames that match (no matching lines)
grep -L PATTERN *.log # print only filenames that DON'T match
grep -c PATTERN file # print only count of matching lines
grep -q PATTERN file # quiet — no output, just exit code (for if conditions)
grep -o PATTERN file # print only the matched part of each line
grep -n PATTERN file # prefix each line with line number
Examples:
# Files in /etc that contain "deprecated"
grep -lF deprecated /etc/*.conf
# Count of error lines
grep -c '^ERROR' app.log
# Test whether a file contains a marker (in a script)
if grep -q 'STARTED' /var/log/app.log; then
echo "App started"
fi
# Extract just the matching emails
grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' file.txt
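The output-mode flags compose well with shell variables. Here is a small sketch against a made-up four-line log (the log contents and the IP pattern are assumptions for the demo, not a general-purpose IP validator).

```shell
#!/usr/bin/env bash
# Sketch: -c, -o and -q on a tiny sample log.
log_file=$(mktemp)                          # hypothetical log file
printf '%s\n' \
  'INFO STARTED' \
  'ERROR disk full' \
  'INFO ok' \
  'ERROR timeout on 10.0.0.7' > "$log_file"

error_count=$(grep -c '^ERROR' "$log_file")               # number of matching lines
ip=$(grep -oE '[0-9]+(\.[0-9]+){3}' "$log_file")          # only the matched text
if grep -q 'STARTED' "$log_file"; then started=yes; else started=no; fi

echo "errors=$error_count ip=$ip started=$started"        # errors=2 ip=10.0.0.7 started=yes
rm -f "$log_file"
```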
Context flags
grep -A 3 PATTERN file # print 3 lines AFTER each match
grep -B 3 PATTERN file # 3 lines BEFORE
grep -C 3 PATTERN file # 3 lines BEFORE and AFTER
Wonderful for log analysis:
# Show 5 lines of context around any ERROR in the last 1000 lines
tail -n 1000 app.log | grep -C 5 ERROR
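A sketch of what the context flags return, using an invented five-line log so the line counts are easy to verify by eye.

```shell
#!/usr/bin/env bash
# Sketch: -A and -C around a single match.
log_file=$(mktemp)                          # hypothetical log file
printf '%s\n' 'line1' 'line2' 'ERROR boom' 'line4' 'line5' > "$log_file"

after=$(grep -A 1 ERROR "$log_file")                 # the match plus 1 line after it
around_count=$(grep -C 1 ERROR "$log_file" | wc -l)  # 1 before + match + 1 after = 3 lines

printf '%s\n' "$after"
echo "lines with -C 1: $around_count"
rm -f "$log_file"
```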
Recursion flags
grep -r PATTERN dir # recurse into dir
grep -R PATTERN dir # also follow symlinks (rare; usually -r is what you want)
grep --include='*.py' -r 'TODO' . # only search .py files
grep --exclude-dir=.git -r 'TODO' . # skip .git
grep --exclude='*.lock' -r 'TODO' .
Word and line matching
grep -w foo file # match "foo" only as a whole word (not "foobar")
grep -x EXACT file # entire line must equal EXACT
grep -v PATTERN file # invert: print lines that do NOT match
grep -i PATTERN file # case-insensitive
-w is invaluable for symbol search:
grep -wn 'username' src/**/*.py # find every "username" as a whole word (** needs shopt -s globstar)
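The effect of -w is clearest next to a plain substring search. In this sketch the four source lines are invented; word characters are letters, digits and underscore, so usernames and old_username do not count as whole-word matches.

```shell
#!/usr/bin/env bash
# Sketch: substring search versus whole-word search.
src=$(mktemp)                               # hypothetical source file
printf '%s\n' \
  'username = "bob"' \
  'usernames = []' \
  'old_username = None' \
  'print(username)' > "$src"

substring_hits=$(grep -c  'username' "$src")   # 4: every line contains the substring
word_hits=$(grep -cw 'username' "$src")        # 2: only lines 1 and 4

echo "substring: $substring_hits, whole word: $word_hits"
rm -f "$src"
```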
Multi-pattern and pattern files
grep -e foo -e bar file # match either "foo" OR "bar"
grep -E '(foo|bar)' file # same with ERE
grep -f patterns.txt file # patterns in a file, one per line
ripgrep (rg) — the modern alternative
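ripgrep (rg) is a grep-like search tool written from scratch in Rust. It’s:
- Faster: often several times faster than grep on large repositories.
- Smarter defaults: respects .gitignore automatically, recurses by default, searches in parallel.
- Perl-style regex: Rust’s regex engine by default, with PCRE2 (lookaround, backreferences) available via -P when ripgrep is built with PCRE2 support.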
rg PATTERN # recurses current dir, ignoring .git/.gitignore'd files
rg -tpy 'def main' # only Python files
rg -A 3 -B 3 PATTERN # context
rg -c PATTERN # counts per file
If you write a lot of search-heavy shell, install ripgrep (brew install ripgrep / apt install ripgrep) and prefer it. For maximum portability of scripts, stick with grep.
5. sed — the stream editor
sed reads input line by line, applies a script, writes output. For one-shot edits to a stream or file, it’s irreplaceable.
Substitution — the 80% use case
sed 's/OLD/NEW/' file # replace FIRST occurrence per line
sed 's/OLD/NEW/g' file # replace ALL occurrences (g = global)
sed 's/OLD/NEW/2' file # replace the SECOND occurrence per line
sed 's/OLD/NEW/gI' file # global, case-insensitive (GNU sed only)
Delimiters
The default delimiter is /, but you can use any character. Especially helpful for paths:
sed 's|/usr/local/bin|/opt/local/bin|g' file
sed 's#OLD#NEW#g' file
Anchors and groups (ERE form with -E)
sed -E 's/^foo/bar/' file # only at start of line
sed -E 's/foo$/bar/' file # only at end of line
sed -E 's/(\w+) (\w+)/\2 \1/' file # swap two words; capture groups
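In default BRE, you’d need to backslash-escape the groups and the +: s/\(\w\+\) \(\w\+\)/\2 \1/. Both \w and \+ are GNU extensions; for portable scripts use [[:alnum:]_] and \{1,\} instead.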
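Here is the word swap written both ways with POSIX character classes instead of \w, as a sketch that should behave the same on GNU and BSD sed. The input string is arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: swapping two words with capture groups, ERE and BRE forms.
input='hello world'

# ERE: bare parentheses and +.
swapped_ere=$(printf '%s\n' "$input" | sed -E 's/([[:alnum:]_]+) ([[:alnum:]_]+)/\2 \1/')

# BRE: escaped parentheses and a POSIX interval instead of +.
swapped_bre=$(printf '%s\n' "$input" | sed 's/\([[:alnum:]_]\{1,\}\) \([[:alnum:]_]\{1,\}\)/\2 \1/')

echo "$swapped_ere"   # world hello
echo "$swapped_bre"   # world hello
```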
Address ranges — apply only to certain lines
sed '5d' file # delete line 5
sed '5,10d' file # delete lines 5-10
sed '/^#/d' file # delete lines starting with #
sed '/PATTERN/,/END_PATTERN/d' file # delete from PATTERN through END_PATTERN
sed -n '5,10p' file # print only lines 5-10 (-n suppresses default print)
sed -n '/foo/,/bar/p' file # print from "foo" through "bar"
The address can be a line number, $ (last line), or /PATTERN/.
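The three address forms can be tried on a small invented file. This is a sketch; the BEGIN/END markers are just sample text, not sed keywords.

```shell
#!/usr/bin/env bash
# Sketch: line-number, pattern, range and $ addresses on a six-line file.
input=$(mktemp)                             # hypothetical input file
printf '%s\n' '# comment' 'alpha' 'BEGIN' 'beta' 'END' 'gamma' > "$input"

no_comments=$(sed '/^#/d' "$input" | wc -l)     # 5 lines remain
lines_2_3=$(sed -n '2,3p' "$input")             # alpha, BEGIN
block=$(sed -n '/BEGIN/,/END/p' "$input")       # BEGIN, beta, END (inclusive)
last_line=$(sed -n '$p' "$input")               # gamma

printf '%s\n' "$block"
rm -f "$input"
```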
Multiple commands
sed -e 's/a/A/' -e 's/b/B/' file # two substitutions
sed 's/a/A/; s/b/B/' file # same, ; separator
In-place editing
GNU sed (Linux):
sed -i 's/OLD/NEW/g' file # modify file in place
sed -i.bak 's/OLD/NEW/g' file # also save a backup as file.bak
BSD sed (macOS) requires an empty backup extension explicitly:
sed -i '' 's/OLD/NEW/g' file # macOS — empty extension means "no backup"
For portable scripts that work on both:
sed -i.bak 's/OLD/NEW/g' file && rm -f file.bak
Or:
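TMP=$(mktemp ./file.XXXXXX) # temp file next to the target, so mv stays on one filesystem
sed 's/OLD/NEW/g' file > "$TMP" && mv "$TMP" file
The mv-temp pattern is also atomic as long as the temp file lives on the same filesystem as the target (a rename within one filesystem is atomic), so readers see either the old or the new version, never a partial one. A plain mktemp usually creates the file under /tmp, which may be a different filesystem and turns the mv into a copy. Note that the new file carries mktemp’s restrictive permissions, so restore the mode with chmod if it matters. Often the right choice for production scripts.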
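Wrapped as a function, the temp-file pattern might look like the sketch below. replace_in_file is a hypothetical helper name, not a standard command; it assumes OLD and NEW contain no | characters and that OLD is meant as a regex. The temp file is created beside the target so that the final mv is a rename on one filesystem.

```shell
#!/usr/bin/env bash
# Sketch: replace_in_file OLD NEW FILE (hypothetical helper).
replace_in_file() {
  local old=$1 new=$2 file=$3 tmp
  tmp=$(mktemp "${file}.XXXXXX") || return 1   # same directory as the target
  if sed "s|${old}|${new}|g" "$file" > "$tmp"; then
    mv -- "$tmp" "$file"                       # rename swaps in the new content
  else
    rm -f -- "$tmp"                            # leave the original untouched on failure
    return 1
  fi
}

# Usage on a scratch file (path and contents are made up for the demo).
demo_dir=$(mktemp -d)
printf 'OLD value\nkeep me\n' > "$demo_dir/app.conf"
replace_in_file OLD NEW "$demo_dir/app.conf"
result=$(cat "$demo_dir/app.conf")
leftovers=$(find "$demo_dir" -type f | wc -l)  # 1: no stray temp files
printf '%s\n' "$result"
rm -rf "$demo_dir"
```

The result file has mktemp’s default permissions; a production version would likely copy the original mode back before the mv.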
Sed example: rewrite config
# Update hostname in nginx.conf
sudo sed -i.bak -E 's/^(\s*server_name)\s+\S+;/\1 example.com;/' /etc/nginx/nginx.conf
The \S+ matches any non-whitespace (the old hostname); we replace with our new one. The capture group (\s*server_name) preserves indentation.
Sed example: extract a section
# Print the [database] section of an INI file
sed -n '/^\[database\]/,/^\[/p' config.ini
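/PATTERN/,/PATTERN/ is a range; -n + p prints just those lines. The range is inclusive at both ends, so the output also contains the header line of the next section (or runs to the end of the file if [database] is the last section).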
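If you want the section without the following header line, one possible refinement is to filter inside the range. This is a sketch on an invented INI file; the section and key names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: print [database] and its keys, but not the next section header.
ini=$(mktemp)                               # hypothetical INI file
printf '%s\n' '[server]' 'port=80' '[database]' 'host=db1' 'user=app' '[cache]' 'ttl=60' > "$ini"

# Plain range: inclusive, so it also prints the [cache] header (4 lines).
plain_count=$(sed -n '/^\[database\]/,/^\[/p' "$ini" | wc -l)

# Inside the range: print the [database] header itself, and any line that is not a header.
section=$(sed -n '/^\[database\]/,/^\[/{ /^\[database\]/p; /^\[/!p; }' "$ini")

printf '%s\n' "$section"
rm -f "$ini"
```

The filtered form prints [database], host=db1 and user=app, and also behaves sensibly when [database] is the last section in the file.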
6. Combined examples
Find all TODOs in code, with file and line
grep -rEn --include='*.'{js,ts,py,go} '\b(TODO|FIXME|XXX)\b' . # braces sit outside the quotes so bash expands them into four --include options
Count lines of source per language
for ext in py js ts go; do
COUNT=$(find . -name "*.${ext}" -not -path '*/node_modules/*' \
-not -path '*/.git/*' -print0 | xargs -0 cat 2>/dev/null | wc -l)
printf '%-5s %d\n' "$ext" "$COUNT"
done
Rename batch of files
shopt -s nullglob
for f in *.JPG; do
mv -- "$f" "${f%.JPG}.jpg"
done
Update copyright year in all source files
find . -type f -name '*.py' -print0 \
| xargs -0 sed -i.bak -E 's/Copyright \(c\) [0-9]{4}/Copyright (c) 2026/'
find . -type f -name '*.py.bak' -delete
Strip trailing whitespace from all source files
find . -type f \( -name '*.py' -o -name '*.js' \) -print0 \
| xargs -0 sed -i -E 's/[[:blank:]]+$//'
Find files larger than 100MB and delete after confirmation
find /var/log -type f -size +100M -print0 | while IFS= read -r -d '' f; do
read -p "Delete $f? [y/N] " -n 1 -r REPLY </dev/tty # read the answer from the terminal, not from find's pipe
echo
if [[ "$REPLY" =~ ^[Yy]$ ]]; then
rm -- "$f"
fi
done
Show files modified in last 24 hours, sorted by size
find . -type f -mtime -1 -printf '%s\t%p\n' | sort -n
7. Common pitfalls
Forgetting nullglob
for f in *.log; do process "$f"; done runs once with f="*.log" if no files match. Always either set shopt -s nullglob or guard with [ -e "$f" ] || continue.
Mixing regex flavours
Writing grep '\d+' file and getting nothing — \d is PCRE. Use grep -P '\d+' or grep -E '[0-9]+'.
find -name is a glob, not a regex
find . -name '*.py' # glob — works
find . -name '.*\.py' # regex syntax — gives nothing
For regex on names, use -regex:
find . -regextype posix-extended -regex '.*\.(py|js)'
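The three variants side by side, as a sketch on a scratch directory with invented file names. It assumes GNU find, since -regextype is not available in BSD find.

```shell
#!/usr/bin/env bash
# Sketch: glob in -name, regex mistakenly passed to -name, and a real -regex.
demo_dir=$(mktemp -d)                       # hypothetical scratch directory
touch "$demo_dir/app.py" "$demo_dir/ui.js" "$demo_dir/notes.txt"

by_glob=$(find "$demo_dir" -type f -name '*.py' | wc -l)            # 1
regex_in_name=$(find "$demo_dir" -type f -name '.*\.py' | wc -l)    # 0: as a glob this needs a leading dot
by_regex=$(find "$demo_dir" -type f -regextype posix-extended \
             -regex '.*\.(py|js)' | wc -l)                          # 2: regex matches the whole path

echo "glob: $by_glob, regex in -name: $regex_in_name, -regex: $by_regex"
rm -rf "$demo_dir"
```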
Sed in-place differences
GNU vs BSD differ on the -i flag’s argument requirement. Always test on both, or use the temp-file pattern.
grep without --
If a filename starts with -, grep thinks it’s a flag:
grep PATTERN -strange-file # ERROR — "-strange-file" looks like a flag
grep PATTERN -- -strange-file # CORRECT
The -- separator says “no more flags.” Use it for any user-supplied filenames.
Putting a regex in a grep -F pattern
grep -F doesn’t interpret regex, so grep -F '$10' finds literal $10. But people sometimes write grep -F for performance and then put regex in the pattern, getting empty results. Use -F only for fixed strings.
8. Twelve idioms for daily use
# 1. Recursive grep with file type filter and ignore patterns
grep -rEn --include='*.'{py,js,ts} --exclude-dir={.git,node_modules} 'PATTERN' .
# 2. Recursive find with NUL-safe iteration
find /path -type f -name '*.log' -print0 | xargs -0 cmd
# 3. Count files of a type
find . -type f -name '*.py' | wc -l
# 4. Total size of files matching a pattern
find /var/log -name '*.log' -printf '%s\n' | awk '{s += $1} END {print s}'
# 5. Find empty files / dirs
find . -empty -print
# 6. Files modified in last N minutes
find . -type f -mmin -30
# 7. Files larger than 100MB
find . -type f -size +100M
# 8. Replace text in all matching files (keeps .bak backups until the cleanup step)
find . -type f -name '*.txt' -print0 | xargs -0 sed -i.bak 's/OLD/NEW/g'
find . -type f -name '*.txt.bak' -delete
# 9. Strip trailing whitespace
find . -type f -name '*.py' -print0 | xargs -0 sed -i 's/[ \t]*$//'
# 10. Top 10 largest files
find . -type f -printf '%s\t%p\n' | sort -rn | head -n 10
# 11. Find duplicate files by size (first pass); size is padded to 11 columns so uniq -w 11 compares only the size
find . -type f -printf '%11s %p\n' | sort -n | uniq -D -w 11
# 12. Count matching lines per log file (one count per file, not a grand total)
grep -c PATTERN /var/log/*.log
9. What you must internalise before lesson 12
- What’s nullglob and why use it? (Empty globs expand to nothing instead of staying literal; safer for loops.)
- What’s globstar? (Bash 4+ option; ** recurses into subdirectories.)
- What’s extglob? (Extended pattern matching: ?(pat), !(pat), etc.)
- What’s the difference between BRE, ERE, and PCRE? (Backslash-escaping for + ? | ( ) differs; PCRE adds lookahead/lookbehind/named groups.)
- Which grep flag enables ERE? (-E. PCRE is -P. Fixed string is -F.)
- What does find -exec cmd {} + do? (Batches arguments; much faster than \;.)
- What’s find -prune for? (Skip a directory without recursing into it.)
- What’s the canonical NUL-safe pipe pattern? (find ... -print0 | xargs -0 cmd.)
- What’s the difference between sed -i on GNU and BSD? (GNU accepts -i alone; BSD requires -i '' for “no backup.”)
- What’s the atomic in-place edit pattern? (sed ... > tmp && mv tmp file.)
If any felt fuzzy, re-read. Lesson 12 (awk, jq, yq, csvkit) covers the structured-data toolkit — for when grep/sed runs out of expressive power.
What’s next
Lesson 12 is the closer for Tier 2: text processing at the structured-data level. We cover awk deeply (its data model, BEGIN/END, FS/OFS, arrays, multi-file processing), jq for JSON (filters, transformations, the standard idioms), yq for YAML, csvkit for proper CSV handling, and the locale and UTF-8 pitfalls that trip up everyone working with international data. Bring everything from lessons 1-11. After L12, Tier 1 + Tier 2 (Wave 1) of this course is complete and we move into the advanced material in Wave 2.