Operating Harbor as an Enterprise Artifact Registry: Projects, Replication, and Vulnerability Gating

A container registry is not a dumb blob store. It is the choke point every artifact passes through on the way to production, which makes it the cheapest place to enforce supply-chain policy. Harbor (a CNCF graduated project) turns that choke point into a control plane: project-scoped RBAC, built-in Trivy scanning, signature verification, replication, and admission-style deploy gates. This guide walks through the operational decisions that matter when you run it for multiple teams across regions, not the five-minute demo.

1. Harbor architecture: the components you actually operate

Harbor is a set of cooperating services in front of a standard OCI distribution registry. Knowing which component owns which failure mode is what lets you debug a stuck push at 2 a.m.

| Component | Responsibility | Failure symptom when it dies |
|---|---|---|
| core | API, auth, RBAC, policy decisions, web UI backend | UI 500s, logins fail, robot tokens rejected |
| registry (distribution) | Stores and serves OCI blobs/manifests | docker push/pull hang or 503 |
| registryctl | Triggers garbage collection, manages registry config | GC jobs never start |
| jobservice | Async worker pool: scans, replication, retention, GC | Scans/replications queue forever |
| trivy-adapter | Vulnerability scanning via Trivy | “Not Scanned” stuck, scan jobs error |
| core DB (PostgreSQL) | Projects, users, policies, scan results metadata | Total outage; this is the source of truth |
| Redis | Job queues, UI session cache, registry blob cache | Jobs lost, sessions drop, slow pulls |
| portal | Static web UI (nginx) | UI unreachable, API still works |

Two rules follow directly from this table. First, PostgreSQL is your single source of truth for everything except blobs; back it up like a production database, not a cache. Second, in any real deployment blobs live in object storage, not in the database, so the registry’s storage backend (S3, Azure Blob, GCS) is a separate durability concern. Never run a production Harbor with filesystem storage on a single PVC.

For storage backend config (the registry component reads this), point it at object storage explicitly. In a Helm values file:

persistence:
  imageChartStorage:
    type: s3
    s3:
      region: us-east-1
      bucket: harbor-prod-registry
      # Prefer IRSA / workload identity over static keys.
      # Leave accesskey/secretkey unset to use the instance/pod role.
      encrypt: true
      secure: true
      v4auth: true

2. Designing projects, robot accounts, and quota for multi-team isolation

A project is Harbor’s unit of isolation: it scopes RBAC, quota, scanning policy, retention, and immutability. Model one project per team-or-application boundary, not one giant library project everyone pushes to. The default public library project should be deleted or locked on day one.

Roles within a project are fixed and well-defined: Limited Guest (pull only, no catalog), Guest (pull), Developer (push + pull), Maintainer (push, pull, scan, manage), and Project Admin (everything including members and policies). Map these to your IdP groups via OIDC so membership is managed in one place.

Create a project with a storage quota and let CVEs through only via explicit allowlist (covered in step 5). The Harbor API is the scriptable path:

# Create a project with a 200 GiB quota and auto-scan on push.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X POST "https://harbor.example.com/api/v2.0/projects" \
  -H "Content-Type: application/json" \
  -d '{
    "project_name": "payments",
    "metadata": {
      "public": "false",
      "auto_scan": "true",
      "reuse_sys_cve_allowlist": "true"
    },
    "storage_limit": 214748364800
  }'
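
The storage_limit value is in bytes, which is easy to get wrong by hand. The following is a minimal Python sketch of building the same request body; the helper name project_payload is hypothetical, and it assumes the binary GiB convention used in the curl example above.

```python
import json

GIB = 1024 ** 3  # storage_limit is expressed in bytes


def project_payload(name: str, quota_gib: int, auto_scan: bool = True) -> dict:
    """Build a JSON body for POST /api/v2.0/projects (hypothetical helper)."""
    return {
        "project_name": name,
        "metadata": {
            # Metadata values are strings, as in the curl example above.
            "public": "false",
            "auto_scan": str(auto_scan).lower(),
        },
        "storage_limit": quota_gib * GIB,
    }


if __name__ == "__main__":
    print(json.dumps(project_payload("payments", 200), indent=2))
```

Generating the body from a GiB figure keeps quota changes reviewable in code review instead of hiding them in a twelve-digit literal.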

CI systems must never authenticate as a human or as admin. Use a robot account, which is a scoped, optionally expiring credential bound to specific actions on specific resources. Project-level robots cover one project; system-level robots can span projects (useful for a shared CI runner).

# Project-scoped robot: push + pull on repositories in one project, 90-day expiry.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X POST "https://harbor.example.com/api/v2.0/robots" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ci-pusher",
    "duration": 90,
    "level": "project",
    "permissions": [{
      "kind": "project",
      "namespace": "payments",
      "access": [
        {"resource": "repository", "action": "push"},
        {"resource": "repository", "action": "pull"}
      ]
    }]
  }'

The response returns the secret exactly once. The robot name is prefixed (e.g. robot$payments+ci-pusher); feed both name and secret to docker login. Rotate them with PATCH .../robots/{id} and a refresh secret call, and keep duration short for anything that touches production projects.
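
As a rough sketch of how a provisioning script might assemble the robot request and the resulting login name, the Python below mirrors the JSON above. The function names are hypothetical, and the "robot$" prefix is assumed to be the default; check your Harbor system settings before relying on it.

```python
def robot_login_name(project: str, robot: str, prefix: str = "robot$") -> str:
    """Name to pass to `docker login -u` for a project-level robot.

    Assumes the default prefix and the <prefix><project>+<robot> form
    shown in the text above.
    """
    return f"{prefix}{project}+{robot}"


def robot_payload(project: str, name: str, duration_days: int,
                  actions=("push", "pull")) -> dict:
    """Body for POST /api/v2.0/robots, scoped to repositories in one project."""
    return {
        "name": name,
        "duration": duration_days,
        "level": "project",
        "permissions": [{
            "kind": "project",
            "namespace": project,
            "access": [{"resource": "repository", "action": a} for a in actions],
        }],
    }


if __name__ == "__main__":
    print(robot_login_name("payments", "ci-pusher"))
```

Quote the login name in single quotes when passing it to a shell, because the $ would otherwise be treated as a variable expansion.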

Callout: Quota is enforced at push time against unique blob usage, not the naive sum of tag sizes. Because layers are deduplicated, two tags sharing a base image count that base once. This is why a project can hold far more “image-equivalents” than its quota suggests, and why deleting a tag rarely frees space until garbage collection runs (step 8).
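
To make the deduplication arithmetic concrete, here is a small Python model. It is an illustration of the accounting idea only, not Harbor's implementation; the digests and sizes (in MiB) are made up.

```python
def naive_usage(images: dict) -> int:
    """Sum of every tag's full size, counting shared layers repeatedly."""
    return sum(sum(layers.values()) for layers in images.values())


def unique_blob_usage(images: dict) -> int:
    """Sum of distinct blobs only, which is what the callout describes."""
    blobs = {}
    for layers in images.values():
        blobs.update(layers)  # same digest -> same blob, counted once
    return sum(blobs.values())


# Two tags that share an 80 MiB base layer (illustrative numbers).
IMAGES = {
    "api:v1": {"sha256:base": 80, "sha256:app1": 20},
    "api:v2": {"sha256:base": 80, "sha256:app2": 25},
}

if __name__ == "__main__":
    print(naive_usage(IMAGES), unique_blob_usage(IMAGES))
```

In this model the two tags look like 205 MiB when summed naively but consume 125 MiB of unique blobs.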

3. Proxy cache projects to front Docker Hub and survive rate limits

Docker Hub’s anonymous and free-tier pull limits will eventually break a busy CI fleet, and pulling public base images directly couples your builds to an external registry’s availability. A Harbor proxy cache project fixes both: it transparently fronts an upstream registry, caches pulled artifacts locally, and serves subsequent pulls from Harbor.

First register the upstream as a registry endpoint, then create a project of type proxy bound to it:

# 1. Register Docker Hub as an endpoint (store credentials to lift anon limits).
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X POST "https://harbor.example.com/api/v2.0/registries" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "dockerhub",
    "type": "docker-hub",
    "url": "https://hub.docker.com",
    "credential": {"access_key": "'"${DH_USER}"'", "access_secret": "'"${DH_TOKEN}"'"}
  }'

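# 2. Create the proxy cache project. registry_id is the ID Harbor assigned to the endpoint
#    above (see the Location header of that response or GET /api/v2.0/registries); 5 is a placeholder.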
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X POST "https://harbor.example.com/api/v2.0/projects" \
  -H "Content-Type: application/json" \
  -d '{"project_name":"dockerhub-proxy","registry_id": 5, "metadata":{"public":"true"}}'

Developers and CI then pull through Harbor instead of Docker Hub:

# Was: docker pull nginx:1.27
docker pull harbor.example.com/dockerhub-proxy/library/nginx:1.27

A few operational truths about proxy caches. The cache is populated lazily on first pull, so the first request still hits upstream. Cached artifacts are subject to the project’s retention policy, so set a sensible TTL or they accumulate forever. And critically, scanning still applies to proxied images, so you get CVE visibility on third-party base images you previously trusted blind. Point your golden-base-image policy at the proxy and you have a single audited path for all external pulls.
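
Migrating existing Dockerfiles and manifests means rewriting image references to the proxy path. The Python sketch below shows one way to do that mapping; the function name and default host/project are assumptions taken from the examples in this section, and it only handles Docker Hub style references.

```python
def via_proxy(image: str,
              harbor: str = "harbor.example.com",
              project: str = "dockerhub-proxy") -> str:
    """Rewrite a Docker Hub image reference to pull through a proxy cache project."""
    # Drop an explicit Docker Hub host if the reference carries one.
    for host in ("index.docker.io/", "docker.io/"):
        if image.startswith(host):
            image = image[len(host):]
            break
    # Official images have no namespace on Docker Hub; they live under library/.
    repository = image.split("@")[0].split(":")[0]
    if "/" not in repository:
        image = "library/" + image
    return f"{harbor}/{project}/{image}"


if __name__ == "__main__":
    print(via_proxy("nginx:1.27"))
```

Running a rewrite like this over a repository in CI, and failing the build when a bare Docker Hub reference remains, is one way to keep every external pull on the audited path.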

4. Replication: pull/push policies across regions and to DR

Replication moves artifacts between Harbor and a remote registry endpoint (another Harbor, ECR, ACR, GCR, Quay, or Docker Hub). You will use it for three patterns: geo-distribution (push releases to a registry near each region’s clusters), disaster recovery (continuous replication to a cold-standby Harbor), and promotion (replicate from a staging registry to production on a trigger).

A policy has a direction, a filter set, a trigger, and a destination. This push-based policy mirrors only release-tagged images in the payments project to a DR registry on every push event:

{
  "name": "payments-to-dr",
  "src_registry": null,
  "dest_registry": {"id": 7},
  "dest_namespace": "payments",
  "dest_namespace_replace_count": 1,
  "trigger": {"type": "event_based"},
  "filters": [
    {"type": "name", "value": "payments/**"},
    {"type": "tag", "value": "v*"}
  ],
  "deletion": false,
  "override": true,
  "enabled": true
}

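Key decisions encoded here: src_registry is null because the local Harbor is the source, which is what makes this a push policy; the event_based trigger replicates on each push instead of on a schedule; the name and tag filters limit replication to v* tags under payments; deletion: false means removing an artifact at the source does not remove the DR copy; and override: true lets a replicated artifact replace one with the same name at the destination.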

Callout: Choose pull-based replication when the source cannot open connections to the destination but the destination-side Harbor can reach the source (common with locked-down DR networks where only the DR site initiates connections). The policy then lives on the destination Harbor, which pulls from the source registered as a remote endpoint. Push-based requires the source to open a connection to the destination. The right direction is dictated by your firewall rules, not by preference.
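
The direction decision reduces to which side is allowed to open the connection. A minimal Python sketch of that reasoning follows; the function name and return strings are illustrative, not Harbor terminology.

```python
def replication_mode(source_reaches_dest: bool, dest_reaches_source: bool) -> str:
    """Pick a replication direction from firewall reachability alone."""
    if source_reaches_dest and dest_reaches_source:
        return "either"  # no network constraint; choose on operational grounds
    if source_reaches_dest:
        return "push"    # policy is defined on the source Harbor
    if dest_reaches_source:
        return "pull"    # policy is defined on the destination Harbor
    return "none"        # no path in either direction; fix the network first


if __name__ == "__main__":
    # Locked-down DR site: only the DR (destination) side may initiate connections.
    print(replication_mode(source_reaches_dest=False, dest_reaches_source=True))
```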

5. Vulnerability scanning with Trivy and CVE allowlists

Harbor ships Trivy as the default scanner. Enable auto_scan per project (step 2) so every push triggers a scan, and schedule a system-wide rescan so newly disclosed CVEs are re-evaluated against existing images without a re-push, because a CVE published today against an image you built last month will only appear after a rescan.

# Nightly full rescan so new CVE data is applied to existing artifacts.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X PUT "https://harbor.example.com/api/v2.0/system/scanAll/schedule" \
  -H "Content-Type: application/json" \
  -d '{"schedule": {"type": "Custom", "cron": "0 0 2 * * *"}}'
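
Harbor schedules use a six-field cron expression with seconds first, which trips up anyone used to five-field crontab syntax. The Python sketch below labels the fields and rejects five-field input; the helper name is hypothetical and it does no validation beyond the field count.

```python
FIELDS = ("second", "minute", "hour", "day_of_month", "month", "day_of_week")


def parse_harbor_cron(expr: str) -> dict:
    """Label the six fields of a Harbor-style cron expression."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields (seconds first), got {len(parts)}: {expr!r}"
        )
    return dict(zip(FIELDS, parts))


if __name__ == "__main__":
    print(parse_harbor_cron("0 0 2 * * *"))
```

Read this way, the nightly rescan above fires at second 0, minute 0, hour 2, every day.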

Trivy’s database refresh matters in air-gapped or rate-limited environments. The adapter pulls vulnerability data periodically; in restricted networks, mirror the DB and point the adapter at your mirror via TRIVY_DB_REPOSITORY rather than letting it fail open with stale data.

False positives are inevitable, and the wrong response is to lower the severity gate globally. Instead, use a CVE allowlist: a scoped, expiring waiver for a specific CVE ID. Maintain a system allowlist for truly universal cases and per-project allowlists for the rest. Always set an expiry so waivers are re-justified rather than living forever.

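# Project-level allowlist with an expiry (Unix epoch seconds), set through the project update call.
# reuse_sys_cve_allowlist must be "false", otherwise the project keeps using the system allowlist.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X PUT "https://harbor.example.com/api/v2.0/projects/payments" \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {"reuse_sys_cve_allowlist": "false"},
    "cve_allowlist": {
      "items": [{"cve_id": "CVE-2024-12345"}],
      "expires_at": 1767225600
    }
  }'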
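
The expires_at value above is a Unix timestamp in seconds, and computing it by hand invites off-by-a-timezone mistakes. A small Python sketch, with hypothetical helper names, that derives it from a calendar date or from a waiver length in days:

```python
from datetime import datetime, timedelta, timezone


def expires_at(year: int, month: int, day: int) -> int:
    """Epoch seconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def expires_in(days: int, now: datetime = None) -> int:
    """Epoch seconds `days` from now (or from an explicit UTC `now`)."""
    now = now or datetime.now(timezone.utc)
    return int((now + timedelta(days=days)).timestamp())


if __name__ == "__main__":
    print(expires_at(2026, 1, 1))
```

The value 1767225600 used in the curl example corresponds to 2026-01-01 00:00:00 UTC.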

6. Deployment security: prevent vulnerable images from running and enforce signatures

This is where Harbor stops being a passive store. Two project-level switches turn it into an enforcement point at pull time.

Prevent vulnerable images from running blocks pulls of any artifact whose scan found a vulnerability at or above a chosen severity. An allowlisted CVE (step 5) is excluded from the decision, which is exactly how a justified waiver lets a known-but-accepted finding through while still blocking everything else.

# Block pulls of images with High+ vulnerabilities (allowlist still applies).
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X PUT "https://harbor.example.com/api/v2.0/projects/payments" \
  -H "Content-Type: application/json" \
  -d '{"metadata": {"prevent_vul": "true", "severity": "high"}}'

The behavior to internalize: a blocked pull returns an error to the client, including your Kubernetes nodes. A kubelet pulling a freshly-flagged image will get a denial and the Pod will fail to start with ErrImagePull. That is the gate working as designed, but it means a new CVE can stop a previously-passing deployment, so wire scan results into your alerting, not just the registry.
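
The interaction between the severity threshold and the allowlist can be modeled in a few lines. The Python below is a simplified illustration of the decision described above, not Harbor's code; the severity ordering is reduced to five levels and the CVE IDs are examples.

```python
SEVERITY_ORDER = ["none", "low", "medium", "high", "critical"]  # simplified scale


def pull_blocked(findings, threshold: str, allowlist=frozenset()) -> bool:
    """True if any non-allowlisted finding is at or above the threshold.

    findings: iterable of (cve_id, severity) pairs from a scan report.
    """
    limit = SEVERITY_ORDER.index(threshold.lower())
    return any(
        SEVERITY_ORDER.index(severity.lower()) >= limit
        for cve_id, severity in findings
        if cve_id not in allowlist
    )


FINDINGS = [("CVE-2024-12345", "High"), ("CVE-2023-0001", "Low")]

if __name__ == "__main__":
    print(pull_blocked(FINDINGS, "high"))
    print(pull_blocked(FINDINGS, "high", allowlist={"CVE-2024-12345"}))
```

In this model the waiver removes only the named CVE from consideration; any other finding at or above the threshold still blocks the pull.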

For provenance, Harbor integrates Cosign signature verification. Enabling content trust / signature enforcement at the project level means Harbor only serves artifacts that carry a valid Cosign signature stored alongside them.

# Require a valid Cosign signature on every artifact served from this project.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X PUT "https://harbor.example.com/api/v2.0/projects/payments" \
  -H "Content-Type: application/json" \
  -d '{"metadata": {"enable_content_trust_cosign": "true"}}'

Your CI signs after push, before promotion:

# Keyless signing in CI (OIDC identity from the runner); once signed, Harbor will serve it.
# COSIGN_EXPERIMENTAL=1 is only needed on Cosign 1.x; Cosign 2.x signs keyless by default.
IMAGE="harbor.example.com/payments/api@${DIGEST}"
COSIGN_EXPERIMENTAL=1 cosign sign --yes "${IMAGE}"

Callout: Harbor’s prevent_vul and signature enforcement act at the registry boundary, so they only catch pulls that go through Harbor. They are a strong first layer, but defense in depth means also enforcing at the cluster boundary with an admission controller like Kyverno or the Sigstore policy controller, verifying the same Cosign signature against your trusted identity. Registry-side and admission-side checks fail closed independently; relying on only one leaves a gap (e.g. a workload that references an image in a registry you do not control never passes through Harbor’s checks).

7. Tag retention and immutability: control storage, protect releases

Two policies, two different jobs, frequently confused.

Tag retention decides which tags to keep (everything else becomes eligible for deletion). Model it as “retain the last N of each, keep recent pushes” and scope it with repository and tag filters. This policy has two rules: one keeps the 10 most recently pushed sha-* dev images, the other keeps the 20 most recently pushed release tags:

{
  "algorithm": "or",
  "rules": [
    {
      "template": "latestPushedK",
      "params": {"latestPushedK": 10},
      "scope_selectors": {"repository": [{"kind": "doublestar", "decoration": "repoMatches", "pattern": "**"}]},
      "tag_selectors": [{"kind": "doublestar", "decoration": "matches", "pattern": "sha-*"}]
    },
    {
      "template": "latestPushedK",
      "params": {"latestPushedK": 20},
      "scope_selectors": {"repository": [{"kind": "doublestar", "decoration": "repoMatches", "pattern": "**"}]},
      "tag_selectors": [{"kind": "doublestar", "decoration": "matches", "pattern": "v*"}]
    }
  ],
  "trigger": {"kind": "Schedule", "settings": {"cron": "0 0 1 * * 0"}}
}

Run retention in dry-run first (the API and UI both support it) and read the “what would be deleted” report before you ever let it delete. A mistyped tag selector that fails to match your release tags leaves them unretained, and retention will happily mark them for cleanup.

Immutability is the opposite guarantee: it prevents matching tags from being overwritten or deleted at all, regardless of retention. Protect your release tags so a re-push of v1.4.2 can never silently change what that tag points to:

{
  "disabled": false,
  "scope_selectors": {"repository": [{"kind": "doublestar", "decoration": "repoMatches", "pattern": "**"}]},
  "tag_selectors": [{"kind": "doublestar", "decoration": "matches", "pattern": "v*"}]
}

The correct combination for most teams: immutability on v* release tags so they are tamper-proof and pinned, plus retention on sha-* and dev-* tags so the long tail of build artifacts gets pruned. Immutability wins conflicts: a retention rule cannot delete an immutable tag.
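
A toy model helps when reasoning about which tags survive a retention run. The Python sketch below approximates the "keep the K most recently pushed" rules for a single repository and applies immutability on top; it uses fnmatch as a stand-in for Harbor's doublestar matching, which is only equivalent for simple patterns like the ones here.

```python
from fnmatch import fnmatchcase


def retained(tags, rules, immutable_patterns=()):
    """Return the set of tag names that survive retention in one repository.

    tags: list of (name, push_time) pairs; larger push_time means newer.
    rules: list of (tag_pattern, k) pairs combined with OR semantics.
    immutable_patterns: tag patterns that retention is not allowed to delete.
    """
    keep = set()
    for pattern, k in rules:
        matching = sorted(
            (t for t in tags if fnmatchcase(t[0], pattern)),
            key=lambda t: t[1],
            reverse=True,
        )
        keep.update(name for name, _ in matching[:k])
    # Immutability wins: matching tags are kept even if no rule retained them.
    for name, _ in tags:
        if any(fnmatchcase(name, p) for p in immutable_patterns):
            keep.add(name)
    return keep


TAGS = [("sha-a", 1), ("sha-b", 2), ("sha-c", 3),
        ("v1.0.0", 4), ("v1.1.0", 5), ("dev-x", 6)]

if __name__ == "__main__":
    print(sorted(retained(TAGS, [("sha-*", 2), ("v*", 1)], immutable_patterns=("v*",))))
```

Note that dev-x is dropped in this model because no rule matches it: a tag that no retain rule selects is eligible for deletion, which is exactly why the dry run matters.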

8. Garbage collection, HA topology, and backup/restore

Deleting a tag only removes the manifest reference. The underlying blobs stay on disk until garbage collection runs and removes layers no longer referenced by any manifest. Until GC runs, your storage bill does not drop.

# Trigger GC; dry_run first to see reclaimable space, then run for real.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X POST "https://harbor.example.com/api/v2.0/system/gc/schedule" \
  -H "Content-Type: application/json" \
  -d '{"schedule": {"type": "Manual"}, "parameters": {"dry_run": true, "delete_untagged": true}}'
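
Why deleting a tag frees nothing until GC runs is easiest to see as a reachability problem. The Python below is a conceptual sketch, not Harbor's GC algorithm: a blob is reclaimable only when no remaining manifest references it. Digests and sizes (MiB) are invented.

```python
def reclaimable(blob_sizes: dict, manifests: dict) -> dict:
    """Blobs (digest -> size) not referenced by any remaining manifest."""
    referenced = set().union(*manifests.values())
    return {d: size for d, size in blob_sizes.items() if d not in referenced}


BLOBS = {"sha256:base": 80, "sha256:app1": 20, "sha256:app2": 25}
MANIFESTS = {
    "api:v1": {"sha256:base", "sha256:app1"},
    "api:v2": {"sha256:base", "sha256:app2"},
}

if __name__ == "__main__":
    # Delete the v1 tag: only its unshared layer becomes reclaimable.
    remaining = {tag: layers for tag, layers in MANIFESTS.items() if tag != "api:v1"}
    print(reclaimable(BLOBS, remaining))
```

The shared base layer stays because api:v2 still references it, so deleting one of two similar tags typically reclaims far less than that tag's displayed size.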

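GC operational rules: run with dry_run: true first and read the reclaimable-space report, then repeat the call with dry_run: false to actually delete; delete_untagged: true also removes artifacts that no longer carry any tag; and schedule GC after tag retention (step 7) has run, so the blobs that retention left unreferenced are reclaimed in the same cycle.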

For HA, the stateless Harbor services (core, jobservice, portal, registry, adapters) scale horizontally behind a load balancer with replicas > 1. The hard requirements are the stateful dependencies: an external, HA PostgreSQL (managed RDS/Cloud SQL/Flexible Server, not the in-chart single Postgres), an external Redis (cluster or Sentinel), and shared object storage for blobs so any registry replica serves any blob. The embedded chart database and Redis are fine for a lab and disqualifying for production.

# Production Helm values: external state, multiple stateless replicas.
core:
  replicas: 3
jobservice:
  replicas: 2
registry:
  replicas: 3
database:
  type: external
  external:
    host: harbor-pg.internal
    port: "5432"
    username: harbor
    coreDatabase: registry
    sslmode: require
redis:
  type: external
  external:
    addr: harbor-redis.internal:6379
    sentinelMasterSet: ""   # set when using Redis Sentinel

A backup/restore runbook has exactly three concerns, in priority order:

  1. PostgreSQL is the crown jewel. Take continuous WAL-archived backups or frequent pg_dump snapshots. Everything else can be rebuilt; lose this and you lose projects, RBAC, policies, robots, and scan history.
  2. Object storage holds the blobs. Enable bucket versioning and cross-region replication at the storage layer; restoring blobs without the matching database leaves dangling manifests.
  3. Secrets (the Harbor encryption keys / core secret, registry HTTP secret). These encrypt sensitive data at rest in the database. Restoring a database backup against a Harbor with different secrets will fail to decrypt stored credentials. Back up the secret material with the database and keep their versions aligned.

Restore order follows the dependency chain: provision object storage and secrets first, restore PostgreSQL, then bring up the stateless services pointed at all three. Test the restore on a non-prod Harbor quarterly; a backup you have never restored is a hypothesis, not a recovery plan.
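
If the runbook is automated, the ordering can be derived instead of hard-coded. The Python sketch below uses the standard library's graphlib (Python 3.9+); the dependency edges simply mirror the order stated above and are an illustrative assumption, not something Harbor publishes.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of things that must be in place before it.
RESTORE_DEPENDENCIES = {
    "object_storage": set(),
    "secrets": set(),
    "postgresql": {"object_storage", "secrets"},
    "harbor_services": {"object_storage", "secrets", "postgresql"},
}


def restore_order(deps=RESTORE_DEPENDENCIES) -> list:
    """Return a restore sequence in which every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(restore_order())
```

The relative order of object_storage and secrets is not constrained here, since neither depends on the other.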

Verify

Walk these end-to-end after standing up or changing the platform.

# 1. Health: overall status plus the status of each component.
curl -sS "https://harbor.example.com/api/v2.0/health" | jq -r '.status, (.components[] | "\(.name): \(.status)")'

# 2. Robot login + push works; CI never pushes as a human or as admin.
echo "${ROBOT_SECRET}" | docker login harbor.example.com -u 'robot$payments+ci-pusher' --password-stdin
docker pull alpine:3.20   # make sure the probe image exists locally before tagging
docker tag alpine:3.20 harbor.example.com/payments/test:probe && docker push harbor.example.com/payments/test:probe

# 3. Scan ran and results are queryable.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  "https://harbor.example.com/api/v2.0/projects/payments/repositories/test/artifacts/probe/additions/vulnerabilities" \
  | jq '."application/vnd.security.vulnerability.report; version=1.1".severity'

# 4. Gate works: push a known-vulnerable image and confirm the pull is blocked.
docker pull harbor.example.com/payments/test:probe   # should be denied if prevent_vul triggers

# 5. Proxy cache serves an upstream image through Harbor.
docker pull harbor.example.com/dockerhub-proxy/library/busybox:1.36

# 6. Replication policy executed at least once with success.
curl -sS -u "admin:${HARBOR_ADMIN_PASS}" \
  "https://harbor.example.com/api/v2.0/replication/executions?policy_id=1" | jq '.[0].status'

Expected results: health is healthy for every component; the robot push succeeds and is the only credential CI uses; the vulnerability report returns severities; a High+ image is denied on pull with the gate on; the proxied pull succeeds through Harbor; and the latest replication execution reports Succeeded.
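
The health check is the easiest of these to automate as a gate in a deploy pipeline. A minimal Python sketch, assuming the response has the status and components fields queried in step 1 and that each component carries its own status; the function names and sample payload are illustrative.

```python
def unhealthy_components(health: dict) -> list:
    """Names of components whose status is anything other than 'healthy'."""
    return [
        c.get("name", "<unnamed>")
        for c in health.get("components", [])
        if c.get("status") != "healthy"
    ]


def platform_healthy(health: dict) -> bool:
    """True only if the overall status and every component are healthy."""
    return health.get("status") == "healthy" and not unhealthy_components(health)


SAMPLE = {
    "status": "unhealthy",
    "components": [
        {"name": "core", "status": "healthy"},
        {"name": "jobservice", "status": "unhealthy"},
        {"name": "database", "status": "healthy"},
    ],
}

if __name__ == "__main__":
    print(platform_healthy(SAMPLE), unhealthy_components(SAMPLE))
```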

Enterprise scenario

A payments platform team ran a single Harbor in their primary region. Their EKS clusters in a second region pulled base and app images cross-region, and during a regional network degradation, every Pod restart in the second region failed with ErrImagePull because the only registry was unreachable. Worse, their CI pulled from Docker Hub directly, so when the incident coincided with Docker Hub tightening anonymous pull limits, even unaffected pipelines started failing with toomanyrequests.

The constraint: the second region’s DR network only permitted connections that it initiated toward the primary (a hard firewall rule they could not change quickly), so a naive push-based replication from the primary to the DR Harbor was blocked at the network layer.

They solved it with two changes. First, a proxy cache project in each regional Harbor fronting Docker Hub with stored credentials, so all external base images came through an audited, rate-limit-resilient local path. Second, pull-based replication initiated from the DR-region Harbor (which could reach the primary), so release tags landed in the regional registry without violating the firewall direction. They scoped replication to release tags only to keep bandwidth bounded:

{
  "name": "pull-releases-into-region-b",
  "src_registry": {"id": 3},
  "dest_registry": null,
  "dest_namespace_replace_count": 1,
  "trigger": {"type": "scheduled", "trigger_settings": {"cron": "0 */15 * * * *"}},
  "filters": [
    {"type": "name", "value": "payments/**"},
    {"type": "tag", "value": "v*"}
  ],
  "deletion": false,
  "override": false,
  "enabled": true
}

After the change, a primary-region registry outage no longer stopped second-region deployments, because nodes pulled from their local Harbor, and the 15-minute pull schedule kept release tags current with bounded, predictable cross-region traffic. The proxy cache absorbed the Docker Hub rate-limit problem entirely. The lesson the team took away: replication direction is a network-topology decision, not a preference, and the registry is where you absorb upstream-dependency risk before it reaches your clusters.
