
Mastering Multi-Stage Dockerfiles: BuildKit Cache Mounts, Slim Images & Reproducible Builds

Most “slow, fat” container images are not a Docker problem — they are a Dockerfile problem. This is a practical walk through the techniques that take a 1.2 GB image with a five-minute cold build down to a sub-100 MB image that rebuilds in seconds, without sacrificing reproducibility.

Everything here assumes BuildKit, which is the default builder as of Docker Engine 23.0 and is always used by docker buildx. If you are on an older daemon, export DOCKER_BUILDKIT=1 before building.

1. Why your image is 1.2 GB: anatomy of layers and the build context

Two things bloat images: what you copy in, and what each instruction leaves behind.

A Docker image is an ordered stack of read-only layers. Every RUN, COPY, and ADD adds a layer, and each layer is a diff over the previous one. Critically, deleting a file in a later layer does not reclaim the space — the bytes still live in the earlier layer. This is why the classic anti-pattern below ships the entire apt cache forever:

# Anti-pattern: the cache deletion is a separate layer, so nothing shrinks
RUN apt-get update && apt-get install -y build-essential
RUN rm -rf /var/lib/apt/lists/*

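The build context is the second culprit. When you run docker build ., the directory you pass becomes the build context. The legacy builder tars up all of it and sends it to the daemon; BuildKit transfers only what the Dockerfile references, but a COPY . . still references everything that is not ignored. Drag in .git, node_modules, and local target/ directories and you are shipping hundreds of megabytes into the build. Fix it with a .dockerignore: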

.git
node_modules
**/*.log
dist
target
.env
*.md
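
Before writing the ignore file, it can help to see which top-level entries dominate the directory. The following is a rough sketch using only du and sort; the function name context_hogs is made up for illustration, and it reports raw on-disk size rather than what BuildKit would actually transfer.

```shell
# Hypothetical helper: rank the top-level entries of a directory by disk usage (KiB).
# It measures raw on-disk size only; it does not evaluate .dockerignore rules.
context_hogs() {
  dir="${1:-.}"
  # -s: one total per entry, -k: sizes in KiB; the second glob picks up dotfiles such as .git
  ( cd "$dir" && du -sk -- * .[!.]* 2>/dev/null | sort -rn | head -n 10 )
}

# Usage: context_hogs .    # largest entries first
```

Anything large near the top of that list that the build does not need is a candidate for .dockerignore.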

Inspect what is actually in a layer with docker history:

docker history --no-trunc --format '{{.Size}}\t{{.CreatedBy}}' myapp:latest

A useful rule: every layer should add only what the image functionally needs. Cleanup must happen in the same RUN that created the mess, or it does nothing.
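
One way to fix the apt anti-pattern is to chain the install and the cleanup in a single RUN. The sketch below writes such a Dockerfile into a scratch directory so you can inspect or build it; the debian:bookworm-slim base and the apt-demo tag are assumptions for illustration, not part of the original example.

```shell
# Write a corrected Dockerfile into a throwaway directory (location is illustrative).
workdir="$(mktemp -d)"
cat > "$workdir/Dockerfile" <<'EOF'
FROM debian:bookworm-slim
# Install and clean up in ONE instruction so the apt lists never land in a layer
RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential \
 && rm -rf /var/lib/apt/lists/*
EOF

# Count RUN instructions: a single one means install and cleanup share a layer
run_count="$(grep -c '^RUN ' "$workdir/Dockerfile")"
echo "RUN instructions: $run_count"

# To build it (requires a Docker daemon):
#   docker build -t apt-demo "$workdir"
```

Comparing docker history for this image against the two-RUN version should show the difference in the install layer's size.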

2. Multi-stage builds: separating build, test, and runtime stages cleanly

The core idea: use a heavy image with compilers and toolchains to produce artifacts, then copy only the artifacts into a clean runtime image. Build tools never reach production.

Here is a Go service. The first stage compiles; the final stage is a near-empty runtime that receives a single static binary:

# syntax=docker/dockerfile:1
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o /out/app ./cmd/app

# A dedicated test stage that CI can target explicitly
FROM build AS test
RUN go vet ./... && go test ./...

# Tiny runtime: only the binary, nothing else
FROM gcr.io/distroless/static-debian12:nonroot AS runtime
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]

Two details that matter:

The -ldflags="-s -w" strips the symbol table and DWARF debug info; -trimpath removes absolute build paths from the binary, which also helps with reproducibility (see step 6).
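
The test stage only runs when something asks for it, because the runtime stage never references it. A minimal sketch of how CI might drive the two targets follows; the wrapper names and the myapp:dev tag are hypothetical.

```shell
# Hypothetical wrapper names; IMAGE is a placeholder tag.
IMAGE="${IMAGE:-myapp:dev}"

# Build up to and including the test stage; a failing go vet or go test fails the build.
run_test_stage() {
  docker buildx build --target test .
}

# Build only what the runtime stage depends on. BuildKit skips the test stage here,
# because runtime copies from build and never references test.
build_runtime_stage() {
  docker buildx build --target runtime -t "$IMAGE" --load .
}

# Usage in a pipeline: run_test_stage && build_runtime_stage
```

Running the test target first keeps a broken build from ever producing a runtime image, while both targets share the cached build stage.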

3. BuildKit cache mounts for package managers and compilers

A multi-stage build still starts cold more often than it should. Any change to go.mod or go.sum invalidates the RUN go mod download layer, so every dependency is downloaded again, and any source change invalidates the go build layer, so every package is recompiled from scratch. Cache mounts solve this: --mount=type=cache attaches a persistent directory that survives across builds but is not part of the final image.

Go modules and build cache:

# syntax=docker/dockerfile:1
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
    go mod download
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o /out/app ./cmd/app

The same pattern transforms apt and Node builds. For apt, you must disable the default cache-cleaning behavior so the downloaded .deb files persist in the mount:

RUN rm -f /etc/apt/apt.conf.d/docker-clean && \
    echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
      > /etc/apt/apt.conf.d/keep-cache
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends curl ca-certificates

For Node with a frozen lockfile:

RUN --mount=type=cache,target=/root/.npm \
    npm ci --prefer-offline

The sharing=locked option serializes concurrent builds that touch the same cache (apt’s dpkg database is not safe for parallel writes); the default sharing=shared is fine for append-mostly caches like npm or pip.

Cache mounts live on the builder, not in the image. They make the second build fast; they do nothing for image size. This is the single highest-leverage change for local iteration speed.
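
Because cache mounts live on the builder, you inspect and reset them with builder commands rather than image commands. The sketch below wraps two of them; the helper names are hypothetical, and both commands act on whichever builder is currently selected.

```shell
# Hypothetical helper names. Both commands act on the currently selected builder.

# Show what the builder's cache holds, including cache-mount records.
show_build_cache() {
  docker buildx du --verbose
}

# Drop only cache-mount contents (for example a corrupted module cache),
# leaving the regular layer cache in place.
reset_cache_mounts() {
  docker buildx prune --force --filter type=exec.cachemount
}
```

Resetting the cache mounts is a useful first step when a build behaves differently on one machine than on a clean runner.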

4. Secret mounts and SSH forwarding without baking credentials into layers

Never use ARG or ENV for tokens — they are recoverable from image history and metadata. BuildKit provides ephemeral secret and SSH mounts that exist only for the duration of one RUN and never land in any layer.

A registry token for a private package feed:

# syntax=docker/dockerfile:1
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci

Supply it at build time from a file or an environment variable:

docker buildx build --secret id=npmrc,src=$HOME/.npmrc .
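# or straight from an env var (its value becomes the file content,
# so the variable must hold the full .npmrc text, not just the token):
docker buildx build --secret id=npmrc,env=NPM_CONFIG_TOKEN .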

For cloning private Git repos over SSH, forward the agent rather than copying a key:

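# Trust GitHub's host key first; otherwise the clone fails host key verification
RUN mkdir -p -m 0700 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
RUN --mount=type=ssh \
    git clone git@github.com:org/private-repo.git /src

Then expose your local agent to the build:

docker buildx build --ssh default .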

The secret is mounted as a tmpfs file for that command only. After the RUN completes it is gone — docker history and a docker save tarball both come up empty.

5. Choosing a minimal base: alpine vs distroless vs scratch

The runtime base is your biggest size lever after multi-stage separation. The trade-off is always size and attack surface versus debuggability and libc compatibility.

| Base | Approx. size | Shell / package manager | libc | Best for |
|---|---|---|---|---|
| debian:slim | ~75 MB | yes (apt) | glibc | General apps needing OS tooling |
| alpine | ~8 MB | yes (apk) | musl | Small images where musl is acceptable |
| gcr.io/distroless/* | ~2-25 MB | no | glibc | Compiled apps, hardened runtime |
| scratch | 0 bytes | no | none | Fully static binaries only |

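Key gotchas: scratch contains no files at all, not even CA certificates, so a binary that makes TLS connections needs the certificate bundle copied in from the build stage: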

FROM scratch
COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]

For most teams, distroless nonroot is the sweet spot: tiny, no shell, glibc compatibility, and a non-root user already configured.

6. Reproducible, multi-arch images with buildx, SOURCE_DATE_EPOCH, and pinned digests

“Reproducible” means the same source produces a bit-for-bit identical image digest. Two things break this: embedded timestamps and floating base tags.

Pin base images by digest. A tag like golang:1.22 moves; a digest does not:

FROM golang:1.22@sha256:<digest> AS build
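
To find the digest a tag currently points to, you can ask the registry through buildx. This is a small sketch, assuming the plain-text output of imagetools inspect contains a "Digest:" line; the helper name resolve_digest is hypothetical.

```shell
# Hypothetical helper: print the digest a tag currently resolves to.
# Assumes the plain-text output of imagetools inspect has a "Digest:" line.
resolve_digest() {
  docker buildx imagetools inspect "$1" | awk '$1 == "Digest:" { print $2; exit }'
}

# Usage (needs registry access):
#   digest="$(resolve_digest golang:1.22)"
#   echo "FROM golang:1.22@${digest} AS build"
```

Keeping the tag next to the digest in the FROM line documents which release the digest came from, even though only the digest is used for resolution.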

Normalize timestamps with SOURCE_DATE_EPOCH. BuildKit honors this to rewrite layer and image timestamps to a fixed value:

export SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)
docker buildx build \
  --build-arg SOURCE_DATE_EPOCH=$SOURCE_DATE_EPOCH \
  --output type=image,name=registry.example.com/myapp:1.4.0,rewrite-timestamp=true \
  .

Build multi-arch in one shot with a buildx builder backed by the docker-container driver (the default docker driver cannot produce multi-platform manifests unless the containerd image store is enabled):

docker buildx create --name multi --driver docker-container --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/myapp:1.4.0 \
  --push .

This produces a single tag backed by a manifest list; clients automatically pull the variant matching their architecture. Note that emulated cross-builds (amd64 host building arm64 via QEMU) are correct but slow — for hot paths, use native arm64 runners and let buildx merge the manifests.
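
If each architecture is built and pushed natively under its own tag, the merge step can be sketched as below. The helper name and the -amd64 / -arm64 suffix convention are assumptions for illustration; the per-arch tags must already exist in the registry.

```shell
# Hypothetical helper: combine per-architecture tags, pushed by native runners,
# into one multi-arch tag. The -amd64 / -arm64 suffix convention is an assumption.
merge_arch_tags() {
  tag="$1"
  docker buildx imagetools create -t "$tag" "${tag}-amd64" "${tag}-arm64"
}

# Usage: merge_arch_tags registry.example.com/myapp:1.4.0
```

The merge only writes a new manifest list in the registry, so it is fast and does not rebuild or re-upload any layers.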

7. Wiring layer caching into CI

CI runners are usually ephemeral, so the local build cache is empty on every run. Export the cache to a durable location and import it next time. BuildKit supports several backends; here are the two you will reach for most.

Registry cache (portable across any CI):

docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:1.4.0 --push .

mode=max exports cache for all stages including intermediate ones — essential so the build and test stages stay cached, not just the final layers.

GitHub Actions cache via the official action:

- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    tags: registry.example.com/myapp:1.4.0
    cache-from: type=gha
    cache-to: type=gha,mode=max
    platforms: linux/amd64,linux/arm64
    secrets: |
      npmrc=${{ secrets.NPMRC }}

There is also type=inline, which embeds cache metadata directly in the image you push. It is the simplest to set up but only caches the final stage (it cannot carry intermediate stages), so prefer registry or gha with mode=max for multi-stage builds.

Enterprise scenario

A fintech platform team ran ~140 microservices through self-hosted GitLab Runners on EKS. Builds used type=registry cache against ECR, and warm rebuilds were still 6-9 minutes. The smoking gun: every pipeline started with Cache miss. The cause was the ECR lifecycle policy — it expired untagged images after 7 days, and BuildKit cache manifests pushed with mode=max are untagged blobs. Low-traffic services rebuilt less than weekly, so their cache was garbage-collected before the next run, guaranteeing a cold build every time.

Two things fixed it. First, they moved the cache off the artifact registry entirely onto an S3 backend, which decoupled cache retention from image GC and removed the per-layer ECR API throttling they were also hitting:

docker buildx build \
  --cache-to   type=s3,region=us-east-1,bucket=ci-buildkit-cache,name=$SERVICE,mode=max \
  --cache-from type=s3,region=us-east-1,bucket=ci-buildkit-cache,name=$SERVICE \
  -t $ECR/$SERVICE:$CI_COMMIT_SHA --push .

Second, they discovered the runner’s docker-container builder was recreated per job, so even the local cache was cold. Pinning a persistent builder backed by a node-local PVC kept hot caches resident across jobs on the same node:

docker buildx create --name ci --driver docker-container \
  --driver-opt env.BUILDKIT_STEP_LOG_MAX_SIZE=10485760 --use --bootstrap

Median warm rebuild dropped to ~70 seconds. The lesson: a registry-backed cache is only as durable as that registry’s retention policy, and “cache miss” in CI is far more often an eviction problem than a Dockerfile problem.

Verify

Confirm each property actually holds before trusting it.

# Image size and per-layer breakdown
docker image ls myapp:1.4.0
docker history myapp:1.4.0

# No secrets leaked into any layer (should print "clean")
mkdir -p /tmp/img && docker save myapp:1.4.0 -o /tmp/img.tar \
  && tar -xf /tmp/img.tar -C /tmp/img \
  && { grep -rlE "NPM_TOKEN|BEGIN OPENSSH PRIVATE KEY" /tmp/img || echo "clean"; }

# Multi-arch manifest contains both platforms
docker buildx imagetools inspect registry.example.com/myapp:1.4.0

# Deep layer analysis: find wasted space and duplicate files
dive myapp:1.4.0

dive reports an “efficiency score” and flags files that are added then later modified or deleted — the exact waste multi-stage builds are meant to eliminate. For a before/after audit, record three numbers: final image size, cold build time (--no-cache), and warm rebuild time after a one-line source change. A well-optimized build typically shows a 5-15x size reduction and a warm rebuild dominated by your compiler, not by dependency downloads.
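
The three audit numbers can be collected with a couple of small helpers. This is a sketch under stated assumptions: the function names are hypothetical, timing is in whole seconds, and myapp:old stands in for whatever tag holds your unoptimized image.

```shell
# Hypothetical helpers for the before/after audit.

# Time one build in whole seconds; extra arguments go to docker buildx build.
time_build() {
  start="$(date +%s)"
  docker buildx build "$@" . >/dev/null 2>&1
  end="$(date +%s)"
  echo "$((end - start))"
}

# Size reduction factor between two byte counts, to one decimal place.
size_ratio() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

# Usage:
#   cold="$(time_build --no-cache)"
#   warm="$(time_build)"              # run after a one-line source change
#   size_ratio "$(docker image inspect --format '{{.Size}}' myapp:old)" \
#              "$(docker image inspect --format '{{.Size}}' myapp:1.4.0)"
```

For example, size_ratio 1200000000 96000000 prints 12.5, which is the kind of reduction the 1.2 GB to sub-100 MB target implies.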

Optimization checklist

Pitfalls and next steps

A few traps that bite even experienced teams:

Next, generate an SBOM and provenance attestation at build time (docker buildx build --sbom=true --provenance=true), and scan the slimmed image with trivy or grype. A small base does not just build faster — it gives a scanner far less to flag, which is the quiet payoff of every gram you strip out.
