
Deploy GitLab Self-Managed on Kubernetes with the Official Helm Chart and Object Storage

A 600-engineer fintech has outgrown its single GitLab Omnibus box. It runs on one large VM, the Postgres data directory is 400 GB and growing, CI artifacts have filled the disk twice this quarter, and the last upgrade meant a two-hour maintenance window because everything — web, Sidekiq, Gitaly, Postgres, Redis, the registry — lives on one host that cannot be scaled or patched without taking the whole platform down. The mandate from the VP of Engineering is blunt: “I want GitLab to be a platform service, not a pet. It should survive a node dying, scale Runners for the Monday-morning pipeline storm, and let us patch app pods without a maintenance window.” This guide is the concrete path to that: GitLab self-managed on Kubernetes via the official Helm chart, with stateful services externalized — managed PostgreSQL, managed Redis, and S3-compatible object storage for everything large — so the in-cluster footprint is stateless, horizontally scalable, and upgradeable in place.

The architectural rule that makes this work, and the one teams get wrong: GitLab on Kubernetes is only operable when the data lives outside the cluster. The chart will happily run a bundled Postgres, Redis, and MinIO so a demo comes up in ten minutes — but those bundled components are explicitly not for production. You externalize Postgres and Redis to managed services with backups and failover, you push every large blob (LFS, artifacts, uploads, the container registry, packages, Terraform state, backups) to object storage, and what remains in-cluster — Webservice (Puma), Sidekiq, Gitaly, Toolbox, Registry, Shell — becomes the part you can scale and upgrade freely. Gitaly is the one deliberate exception: it is stateful by design and stays on a PersistentVolume, because Git repositories are not object-storage-shaped.

Prerequisites

Target topology

[Figure: target topology for GitLab Self-Managed on Kubernetes with the official Helm chart and object storage]

The cluster holds the stateless and scalable GitLab tier: Webservice (the Puma web/API workers behind the GitLab UI and the Git HTTPS API), Sidekiq (background jobs — repository housekeeping, CI bookkeeping, email, webhooks), the container Registry, GitLab Shell (SSH Git access), Toolbox (backups and rake tasks), and the NGINX Ingress that fronts them. The one stateful in-cluster service is Gitaly, which owns the Git repositories on a PersistentVolume.

Everything durable and heavy lives outside the cluster: PostgreSQL (Aurora) holds the relational state, Redis (ElastiCache) handles caching, sessions, and the Sidekiq queue, and a set of S3 buckets absorbs LFS objects, CI artifacts, uploads, packages, the registry’s image layers, Terraform state, and backups. GitLab Runners register against this control plane and run as ephemeral pods (or on a separate node pool) so the Monday pipeline storm scales independently of the GitLab app itself. Identity flows from Entra ID (federated via SAML/OIDC) for SSO; secrets are injected from HashiCorp Vault; Akamai fronts the public ingress for global TLS, caching, and WAF/bot protection.

1. Provision the external data plane

Create the managed PostgreSQL and Redis instances first — the chart needs to point at them on day one, and standing them up after the fact means a migration. With Terraform (the platform team manages all cloud infra as code; Ansible handles the few host-level Runner customizations later):

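resource "aws_rds_cluster" "gitlab" {
  cluster_identifier          = "gitlab-prod"
  engine                      = "aurora-postgresql"
  engine_version              = "14.12"
  database_name               = "gitlabhq_production"
  master_username             = "gitlab"
  manage_master_user_password = true            # secret lands in AWS Secrets Manager
  backup_retention_period     = 14
  storage_encrypted           = true
  vpc_security_group_ids      = [aws_security_group.gitlab_db.id]
}

# An Aurora cluster has no compute until instances join it: writer + reader for failover
resource "aws_rds_cluster_instance" "gitlab" {
  count              = 2
  identifier         = "gitlab-prod-${count.index}"
  cluster_identifier = aws_rds_cluster.gitlab.id
  instance_class     = "db.r6g.xlarge"            # placeholder size; match your load
  engine             = aws_rds_cluster.gitlab.engine
  engine_version     = aws_rds_cluster.gitlab.engine_version
}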

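resource "aws_elasticache_replication_group" "gitlab" {
  replication_group_id       = "gitlab-prod"
  description                = "GitLab cache, sessions and Sidekiq queues"
  engine                     = "redis"
  engine_version             = "6.2"
  node_type                  = "cache.r6g.large"
  num_cache_clusters         = 2                # primary + replica for failover
  automatic_failover_enabled = true
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true             # clients must connect over TLS (rediss)
  auth_token                 = var.redis_auth_token  # same value as the gitlab-redis secret
}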

GitLab requires the pg_trgm and btree_gist extensions. Connect to the new database and enable them:

psql "host=gitlab-prod.cluster-xxxx.ap-south-1.rds.amazonaws.com \
      user=gitlab dbname=gitlabhq_production sslmode=require" <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS btree_gist;
SQL
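To make the extension step repeatable in a pipeline, a small check can compare what `SELECT extname FROM pg_extension;` returns against the two required names. This is a minimal Python sketch, assuming you capture the query output with `psql -At` (one name per line); `missing_extensions` is a hypothetical helper, not part of any GitLab tooling.

```python
# Minimal sketch: which GitLab-required extensions are still missing?
# Feed it the rows of: SELECT extname FROM pg_extension;
REQUIRED_EXTENSIONS = frozenset({"pg_trgm", "btree_gist"})


def missing_extensions(installed: list[str]) -> list[str]:
    """Return the required extensions absent from `installed`, sorted."""
    present = {name.strip().lower() for name in installed}
    return sorted(REQUIRED_EXTENSIONS - present)


if __name__ == "__main__":
    # Example rows as `psql -At` would print them, one extension per line
    rows = ["plpgsql", "pg_trgm"]
    print(missing_extensions(rows))  # ['btree_gist']
```

An empty list means the database is ready for the migrations Job; anything else should fail the provisioning step before Helm runs.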

2. Create the object-storage buckets and access policy

GitLab uses distinct buckets per data type — keep them separate so lifecycle rules, sizing, and access can differ. Create the full set:

for b in artifacts lfs uploads packages registry mr-diffs \
         terraform-state ci-secure-files dependency-proxy backups tmp; do
  aws s3api create-bucket --bucket "kloudvin-gitlab-${b}" \
    --region ap-south-1 \
    --create-bucket-configuration LocationConstraint=ap-south-1
  aws s3api put-bucket-encryption --bucket "kloudvin-gitlab-${b}" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'
done

Prefer IRSA (IAM Roles for Service Accounts) over static keys so pods assume a role with no long-lived credentials on disk. Create an IAM policy scoped to exactly these buckets, granting s3:ListBucket on the bucket ARNs (arn:aws:s3:::kloudvin-gitlab-*) and s3:GetObject, s3:PutObject, and s3:DeleteObject on the object ARNs (arn:aws:s3:::kloudvin-gitlab-*/*), then bind it to a service-account role you will reference in the Helm values.
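The split between bucket-level and object-level actions is easy to get wrong by hand, so it can help to generate the policy document. This Python sketch assumes the `kloudvin-gitlab-` prefix from the bucket loop above and grants only the four actions named in the text; the Sid values are illustrative.

```python
import json

BUCKET_PREFIX = "kloudvin-gitlab-"  # matches the bucket loop above


def gitlab_s3_policy(prefix: str = BUCKET_PREFIX) -> dict:
    """Build an IAM policy document scoped to the GitLab buckets only."""
    bucket_arn = f"arn:aws:s3:::{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListGitLabBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object reads/writes/deletes target object ARNs (bucket/*).
                "Sid": "ReadWriteGitLabObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(gitlab_s3_policy(), indent=2))
```

Save the output as policy.json and create it with `aws iam create-policy --policy-name gitlab-s3 --policy-document file://policy.json` before attaching it to the IRSA role.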

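Either way, the chart reads the S3 connection from a Rails (fog) format secret; never inline it in values. With IRSA the connection only needs use_iam_profile: true, as below. If you must use static keys instead, replace that line with aws_access_key_id and aws_secret_access_key: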

cat > /tmp/s3-connection.yaml <<'YAML'
provider: AWS
region: ap-south-1
use_iam_profile: true
YAML
kubectl create secret generic gitlab-object-storage \
  --namespace gitlab --from-file=connection=/tmp/s3-connection.yaml
rm -f /tmp/s3-connection.yaml
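Because the IRSA and static-key forms must never be mixed, rendering the connection from one function keeps them apart. This is a hedged Python sketch; `fog_connection` is a hypothetical helper, and it quotes key values with `json.dumps` so characters such as `/` or `+` in an AWS secret stay valid YAML.

```python
import json
from typing import Optional


def fog_connection(region: str, access_key: Optional[str] = None,
                   secret_key: Optional[str] = None) -> str:
    """Render the Rails (fog) S3 connection the chart reads from a secret.

    No keys -> IRSA form (use_iam_profile). Both keys -> static-key form.
    """
    lines = ["provider: AWS", f"region: {region}"]
    if access_key is None and secret_key is None:
        lines.append("use_iam_profile: true")
    elif access_key and secret_key:
        # JSON strings are valid YAML double-quoted scalars
        lines.append(f"aws_access_key_id: {json.dumps(access_key)}")
        lines.append(f"aws_secret_access_key: {json.dumps(secret_key)}")
    else:
        raise ValueError("pass both access_key and secret_key, or neither")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(fog_connection("ap-south-1"), end="")
```

Write the output to /tmp/s3-connection.yaml and the kubectl create secret command above works unchanged.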

3. Pre-create secrets in the namespace (sourced from Vault)

Do not let Helm auto-generate secrets you need to control. The database password, the Redis password, the registry storage credentials, and the SSO client secret all originate in HashiCorp Vault and are synced into the namespace by the Vault Secrets Operator (or the vault-agent injector). That keeps the source of truth in Vault, gives you rotation and lease auditing, and means nothing sensitive sits in your Git-tracked values file.

kubectl create namespace gitlab

# Postgres password (pulled from Vault, here shown via a synced env var)
kubectl create secret generic gitlab-postgres \
  --namespace gitlab --from-literal=password="${PG_PASSWORD}"

# Redis password (must match the ElastiCache AUTH token)
kubectl create secret generic gitlab-redis \
  --namespace gitlab --from-literal=password="${REDIS_PASSWORD}"

# Initial root password for the GitLab UI
kubectl create secret generic gitlab-root-password \
  --namespace gitlab --from-literal=password="$(openssl rand -base64 24)"

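A VaultStaticSecret makes the Vault → Kubernetes sync declarative so these never drift from the source of truth. Use it instead of the imperative kubectl create for a given secret, not alongside it: with create: true the operator will not overwrite an existing Secret it did not create unless you also set destination.overwrite: true.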

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: gitlab-postgres
  namespace: gitlab
spec:
  mount: kv
  type: kv-v2
  path: platform/gitlab/postgres
  destination:
    name: gitlab-postgres
    create: true
  refreshAfter: 1h
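With several synced secrets, stamping the manifest out from one table avoids copy-paste drift. This Python sketch emits a Kubernetes `List` as JSON, which `kubectl apply -f -` accepts. Only `platform/gitlab/postgres` appears in the manifest above; the Redis and SSO Vault paths in `SECRETS` are hypothetical placeholders for wherever your team stores them.

```python
import json

# Hypothetical mapping: Kubernetes Secret name -> Vault KV v2 path.
SECRETS = {
    "gitlab-postgres": "platform/gitlab/postgres",
    "gitlab-redis": "platform/gitlab/redis",  # assumed path
    "gitlab-sso": "platform/gitlab/sso",      # assumed path
}


def vault_static_secret(name: str, path: str, namespace: str = "gitlab") -> dict:
    """One VaultStaticSecret per synced Secret, same shape as the manifest above."""
    return {
        "apiVersion": "secrets.hashicorp.com/v1beta1",
        "kind": "VaultStaticSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mount": "kv",
            "type": "kv-v2",
            "path": path,
            "destination": {"name": name, "create": True},
            "refreshAfter": "1h",
        },
    }


def render_list(secrets: dict[str, str]) -> str:
    """Wrap all manifests in a v1 List so a single apply creates them all."""
    items = [vault_static_secret(n, p) for n, p in sorted(secrets.items())]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)


if __name__ == "__main__":
    print(render_list(SECRETS))
```

Pipe it straight in with `python render_vss.py | kubectl apply -f -`, where render_vss.py is whatever you name the script.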

4. Author the Helm values

This is the heart of the deployment. The values file disables every bundled stateful component and points GitLab at the external services. Save as values-prod.yaml:

global:
  hosts:
    domain: kloudvin.com
    https: true
    gitlab:   { name: gitlab.kloudvin.com }
    registry: { name: registry.kloudvin.com }

  edition: ce                      # or 'ee' (licensed) for paid features such as Geo

  # --- External PostgreSQL: disable the bundled chart ---
  psql:
    host: gitlab-prod.cluster-xxxx.ap-south-1.rds.amazonaws.com
    port: 5432
    database: gitlabhq_production
    username: gitlab
    password:
      secret: gitlab-postgres
      key: password

  # --- External Redis (TLS, because ElastiCache transit encryption is on) ---
  redis:
    host: gitlab-prod.xxxx.cache.amazonaws.com
    port: 6379
    scheme: rediss
    auth:
      enabled: true
      secret: gitlab-redis
      key: password

  # --- Object storage for every large blob ---
  appConfig:
    object_store:
      enabled: true
      connection:
        secret: gitlab-object-storage
        key: connection
    lfs:        { bucket: kloudvin-gitlab-lfs }
    artifacts:  { bucket: kloudvin-gitlab-artifacts }
    uploads:    { bucket: kloudvin-gitlab-uploads }
    packages:   { bucket: kloudvin-gitlab-packages }
    externalDiffs: { bucket: kloudvin-gitlab-mr-diffs }
    terraformState: { bucket: kloudvin-gitlab-terraform-state }
    ciSecureFiles:  { bucket: kloudvin-gitlab-ci-secure-files }
    dependencyProxy: { bucket: kloudvin-gitlab-dependency-proxy }
    backups:    { bucket: kloudvin-gitlab-backups, tmpBucket: kloudvin-gitlab-tmp }

  ingress:
    configureCertmanager: false    # using a pre-issued wildcard cert
    class: nginx
  serviceAccount:
    enabled: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/gitlab-s3
  minio:
    enabled: false                 # bundled MinIO is switched off here, under global
  registry:
    bucket: kloudvin-gitlab-registry   # lets the backup job find registry images

# Turn OFF the bundled stateful charts; this is the whole point
postgresql:    { install: false }
redis:         { install: false }
gitlab-runner: { install: false }  # Runners installed separately (step 8)

installCertmanager: false          # replaces the older certmanager.install key

# The registry must also use object storage, configured separately
registry:
  storage:
    secret: gitlab-registry-storage
    key: config                    # Distribution-format s3: block, not the Rails connection

# In-cluster stateless tiers: size and scale them
gitlab:
  webservice:
    minReplicas: 3
    maxReplicas: 10
    resources: { requests: { cpu: "1", memory: 2.5Gi } }
  sidekiq:
    minReplicas: 2
    maxReplicas: 8
  gitaly:
    persistence:
      size: 200Gi
      storageClass: gp3        # the ONE stateful in-cluster service

Two things worth calling out. The registry needs its own S3 config in a separate secret (gitlab-registry-storage), written in the container registry's own (Docker Distribution) storage-driver format rather than the Rails format, because it does not read global.appConfig.object_store; create that secret before installing. And Gitaly keeps a PersistentVolume: Git repos are not object-storage-shaped, so this is the deliberate stateful exception. Use a fast block class (gp3/pd-ssd) and size for repo growth plus housekeeping headroom.
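The registry secret is the one piece of object-storage config the values file only references, so here is a hedged sketch of rendering it. It assumes the Distribution S3 driver keys `bucket`, `region`, `v4auth`, `accesskey`, and `secretkey`, and that omitting both keys leaves credentials to the pod's IAM role via IRSA; check both against the chart's registry storage documentation for your version.

```python
import json
from typing import Optional


def registry_storage_config(bucket: str, region: str,
                            access_key: Optional[str] = None,
                            secret_key: Optional[str] = None) -> str:
    """Render the Distribution-format `s3:` block for the registry secret.

    Omitting both keys leaves credentials to the pod's IAM role (IRSA).
    """
    lines = ["s3:", f"  bucket: {bucket}", f"  region: {region}", "  v4auth: true"]
    if access_key and secret_key:
        # JSON strings double as safely quoted YAML scalars
        lines += [f"  accesskey: {json.dumps(access_key)}",
                  f"  secretkey: {json.dumps(secret_key)}"]
    elif access_key or secret_key:
        raise ValueError("pass both keys or neither")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(registry_storage_config("kloudvin-gitlab-registry", "ap-south-1"), end="")
```

Save the output as registry-s3.yaml and load it under the key the values file expects: `kubectl -n gitlab create secret generic gitlab-registry-storage --from-file=config=registry-s3.yaml`.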

5. Configure TLS and the ingress

Create the wildcard TLS secret the ingress will serve, then ensure the chart references it:

kubectl create secret tls gitlab-wildcard-tls \
  --namespace gitlab \
  --cert=/tmp/kloudvin-wildcard.crt \
  --key=/tmp/kloudvin-wildcard.key

Add to values-prod.yaml so both hosts use it:

global:
  ingress:
    tls:
      secretName: gitlab-wildcard-tls
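Before pointing both hosts at one secret, it is worth confirming the certificate's names actually cover them. A minimal Python sketch, assuming you list the certificate's names with `openssl x509 -in /tmp/kloudvin-wildcard.crt -noout -ext subjectAltName` (OpenSSL 1.1.1 or later) and pass them in; it applies the usual rule that `*` matches exactly one left-most label.

```python
def wildcard_covers(pattern: str, host: str) -> bool:
    """True if a certificate name (possibly '*.domain') covers `host`.

    '*' stands for exactly one left-most label, so '*.kloudvin.com' covers
    'gitlab.kloudvin.com' but not 'kloudvin.com' or 'a.b.kloudvin.com'.
    """
    pattern, host = pattern.lower().rstrip("."), host.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == host
    suffix = pattern[1:]  # e.g. ".kloudvin.com"
    label, dot, rest = host.partition(".")
    return bool(label) and dot == "." and "." + rest == suffix


def uncovered_hosts(cert_names: list[str], hosts: list[str]) -> list[str]:
    """Hosts that no name on the certificate covers."""
    return [h for h in hosts if not any(wildcard_covers(n, h) for n in cert_names)]


if __name__ == "__main__":
    names = ["*.kloudvin.com"]
    print(uncovered_hosts(names, ["gitlab.kloudvin.com", "registry.kloudvin.com"]))
```

An empty list means the single wildcard secret is enough for both the gitlab and registry hosts.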

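In production, terminate public TLS at Akamai at the edge (global anycast, caching of static assets, and WAF/bot mitigation in front of the GitLab login and Git-over-HTTPS endpoints), with Akamai’s origin pointed at the NGINX ingress LoadBalancer. Git over SSH is raw TCP on port 22 and will not pass through an HTTP edge property, so expose GitLab Shell on its own TCP load balancer path. Internal cluster traffic still runs TLS end-to-end.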

6. Add the Helm repo and install

helm repo add gitlab https://charts.gitlab.io/
helm repo update

# Always pin the chart version; map it to the GitLab version you want
helm search repo gitlab/gitlab --versions | head

helm upgrade --install gitlab gitlab/gitlab \
  --namespace gitlab \
  --version 8.2.3 \
  --values values-prod.yaml \
  --set global.initialRootPassword.secret=gitlab-root-password \
  --timeout 900s
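Pinning is easier when the lookup is scripted: `helm search repo gitlab/gitlab --versions -o json` returns rows with `name`, `version`, and `app_version`. This Python sketch picks the newest chart for a given GitLab minor release; `latest_chart_for` is a hypothetical helper, and it compares versions numerically because a plain string sort would rank 1.9.0 above 1.10.0.

```python
import json
from typing import Optional


def _semver(v: str) -> tuple[int, ...]:
    """'8.2.3' -> (8, 2, 3); any pre-release suffix after '-' is ignored."""
    return tuple(int(part) for part in v.split("-")[0].split("."))


def latest_chart_for(search_json: str, gitlab_minor: str) -> Optional[str]:
    """Highest chart version whose app_version falls in `gitlab_minor` (e.g. '17.2').

    `search_json` is the output of:
        helm search repo gitlab/gitlab --versions -o json
    A leading 'v' on app_version is tolerated.
    """
    rows = json.loads(search_json)
    matches = [
        r["version"] for r in rows
        if r.get("name") == "gitlab/gitlab"
        and r.get("app_version", "").lstrip("v").startswith(gitlab_minor + ".")
    ]
    return max(matches, key=_semver) if matches else None
```

Feed the result to `--version` so the chart pin and the GitLab version you intend to run are derived from the same source.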

The migrations Job runs the Rails schema migrations against your external Postgres before the app pods become ready — watch it first:

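kubectl -n gitlab get jobs -w        # Ctrl-C once the migrations Job completes
# The Job name carries the release revision, so look it up rather than hardcoding it
MIGRATIONS_JOB=$(kubectl -n gitlab get jobs -o name | grep migrations | tail -n 1)
kubectl -n gitlab logs -f "$MIGRATIONS_JOB"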

7. Wire SSO through Entra ID

Engineers should not have GitLab-local passwords. Federate authentication to Microsoft Entra ID (or Okta if that is your workforce IdP) using OmniAuth. Register an Entra app, then provide the provider block as a secret the chart mounts (keeping the client secret in Vault, synced as gitlab-sso):

# Content of the "provider" key in secret gitlab-sso (synced from Vault):
# one provider per secret, as a single mapping rather than a list
name: azure_activedirectory_v2
label: "Entra ID"
args:
  client_id: "<app-client-id>"
  client_secret: "<from-vault>"
  tenant_id: "<tenant-id>"

Then reference that secret in values-prod.yaml:

global:
  appConfig:
    omniauth:
      enabled: true
      allowSingleSignOn: ["azure_activedirectory_v2"]
      blockAutoCreatedUsers: false
      providers:
        - secret: gitlab-sso
          key: provider
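Re-run the same helm upgrade --install to apply. With the Entra enterprise app set to require user assignment, group assignment in Entra ID governs who can log in, and offboarding a leaver in the IdP stops new sign-ins. Existing sessions, personal access tokens, and SSH keys stay valid until the GitLab account itself is blocked, so add that step to the offboarding runbook.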
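If Vault or a CI job renders the provider secret, generating it from code avoids hand-quoting the client secret. This Python sketch emits the provider as JSON, which in practice also parses as YAML, so the chart can read it under the `provider` key; `entra_provider` is a hypothetical helper and the three IDs are the same placeholders as above.

```python
import json


def entra_provider(client_id: str, client_secret: str, tenant_id: str,
                   label: str = "Entra ID") -> str:
    """Render one OmniAuth provider mapping for the gitlab-sso secret.

    json.dumps escapes any quotes or colons in the client secret, and the
    result is a single mapping (not a list), as the chart expects per secret.
    """
    provider = {
        "name": "azure_activedirectory_v2",
        "label": label,
        "args": {
            "client_id": client_id,
            "client_secret": client_secret,
            "tenant_id": tenant_id,
        },
    }
    return json.dumps(provider, indent=2)


if __name__ == "__main__":
    print(entra_provider("<app-client-id>", "<from-vault>", "<tenant-id>"))
```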

8. Deploy GitLab Runners

Runners are installed and scaled separately from the GitLab app so CI load never starves the control plane. Create an instance runner in the GitLab UI (Admin → CI/CD → Runners → New instance runner), copy the runner authentication token it shows (it starts with glrt-), store it in Vault, then install the runner chart pointed at your GitLab URL:

kubectl create namespace gitlab-runner
kubectl -n gitlab-runner create secret generic runner-token \
  --from-literal=runner-registration-token="" \
  --from-literal=runner-token="${RUNNER_AUTH_TOKEN}"

# Multi-line TOML survives a values file far better than a --set string
cat > runner-values.yaml <<'YAML'
gitlabUrl: https://gitlab.kloudvin.com
rbac:
  create: true                # the executor needs RBAC to create job pods
runners:
  secret: runner-token
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner"
        image = "alpine:3.20"
        cpu_request = "500m"
        memory_request = "1Gi"
      [runners.cache]
        Type = "s3"
        Shared = true
        [runners.cache.s3]
          BucketName = "kloudvin-gitlab-tmp"
          BucketLocation = "ap-south-1"
          AuthenticationType = "iam"   # use the pod's IAM role (e.g. IRSA), not static keys
YAML

helm upgrade --install runner gitlab/gitlab-runner \
  --namespace gitlab-runner --version 0.66.0 \
  --values runner-values.yaml

The Kubernetes executor spawns an ephemeral pod per CI job and tears it down after, and a shared S3 cache means dependency caches survive across jobs and nodes. Run Runners on a dedicated, autoscaling node pool so the Monday-morning pipeline storm scales horizontally without touching the GitLab pods. CrowdStrike Falcon sensors run on every node — including the Runner pool — for runtime threat detection on the build workloads, which are the highest-risk surface (they execute arbitrary repo code) and feed detections to the SOC.
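Sizing that dedicated pool comes down to how many job pods fit per node by their requests. This Python sketch does the arithmetic with the runner config's cpu_request = "500m" and memory_request = "1Gi" as defaults; the node figures must be allocatable capacity (after kubelet reservations and DaemonSets such as the node security sensor), and any helper or service-container requests you configure should be added to the per-job numbers.

```python
import math


def runner_nodes_needed(concurrent_jobs: int, node_cpu_m: int, node_mem_mi: int,
                        job_cpu_m: int = 500, job_mem_mi: int = 1024) -> int:
    """Nodes needed so `concurrent_jobs` job pods fit by their requests.

    CPU in millicores, memory in MiB; defaults mirror 500m / 1Gi above.
    """
    per_node = min(node_cpu_m // job_cpu_m, node_mem_mi // job_mem_mi)
    if per_node < 1:
        raise ValueError("a single job pod does not fit on this node shape")
    return math.ceil(concurrent_jobs / per_node)


if __name__ == "__main__":
    # Hypothetical Monday peak: 120 concurrent jobs on nodes with
    # 7600m CPU and 28000 MiB allocatable -> 15 pods/node (CPU-bound)
    print(runner_nodes_needed(120, 7600, 28000))  # 8
```

Whichever resource runs out first sets pods per node, so use the result as the autoscaler's max node count and let it fall back toward zero overnight.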

Validation

Verify each layer in order — pods, then external connectivity, then object storage, then a real Git+CI round trip.

# 1. All app pods Running/Ready, no CrashLoopBackOff
kubectl -n gitlab get pods

# 2. GitLab's own readiness probe (external DB + Redis must be green)
kubectl -n gitlab exec deploy/gitlab-webservice-default -c webservice -- \
  curl -sf 'http://localhost:8080/-/readiness?all=1' | jq .

# 3. Confirm Rails sees external Postgres and Redis, not bundled
kubectl -n gitlab exec deploy/gitlab-toolbox -- gitlab-rails runner \
  'puts ActiveRecord::Base.connection_db_config.host; puts Gitlab::Redis::Cache.url'

# 4. Object storage write path: push an LFS object or upload a CI artifact,
#    then confirm it landed in S3 (not on a PVC)
aws s3 ls s3://kloudvin-gitlab-artifacts/ --recursive | head
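To turn check 2 into a pass/fail gate, the readiness JSON can be parsed for any check that is not "ok". This Python sketch assumes the `?all=1` response has a top-level `status` plus one `*_check` key per dependency, each holding a list of objects with a `status` field; verify that shape against your GitLab version before relying on it.

```python
import json


def failing_checks(readiness_body: str) -> list[str]:
    """Names of readiness checks that did not report status "ok".

    Assumed shape: {"status": ..., "db_check": [{"status": ...}], ...}.
    Keys not ending in "_check" are ignored.
    """
    body = json.loads(readiness_body)
    failed = []
    for name, results in body.items():
        if name.endswith("_check") and isinstance(results, list):
            if any(r.get("status") != "ok" for r in results):
                failed.append(name)
    # Overall failure with no individual culprit still counts
    if body.get("status") != "ok" and not failed:
        failed.append("status")
    return sorted(failed)
```

Redirect the curl output from check 2 to a file (drop the `| jq .`), read it, and fail the cut-over if the returned list is non-empty.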

Then do the human-facing smoke test: log in via the Entra ID SSO button, create a project, git push over both HTTPS and SSH (exercises Shell + Gitaly), push a container image to registry.kloudvin.com (exercises the registry’s S3 backend), and run a one-line .gitlab-ci.yml to confirm a Runner pod spawns, executes, and uploads its artifact to S3. Point Dynatrace (or Datadog) at the namespace via its Kubernetes operator to confirm the golden signals — Puma request latency, Sidekiq queue depth and job latency, Gitaly RPC latency, and pod saturation — are flowing before you cut traffic over.

Rollback and teardown

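Helm makes app-tier rollback a one-liner for chart and configuration changes; the data is safe regardless because it lives in external services with their own backups. Rolling back across a GitLab version upgrade is different: GitLab does not support downgrades once the migrations Job has changed the schema, so a version rollback means restoring the Aurora snapshot taken before the upgrade as well.

# Roll back the GitLab release to the previous revision (omitting the number does this)
helm -n gitlab history gitlab
helm -n gitlab rollback gitlab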

# Full teardown of the in-cluster footprint (data survives in RDS/ElastiCache/S3)
helm -n gitlab uninstall gitlab
helm -n gitlab-runner uninstall runner
kubectl delete namespace gitlab gitlab-runner

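Because Postgres, Redis, and every blob are external, uninstalling the chart destroys only the stateless tier: point a fresh install at the same RDS endpoint and S3 buckets and GitLab comes back with all its data. Two things do not survive a namespace delete. The first is the Gitaly PersistentVolume: set persistentVolumeReclaimPolicy: Retain on that PV and take a tested backup with the Toolbox backup-utility (which streams the repositories and a database dump to the backups bucket) before any destructive action. The second is the chart-generated gitlab-rails-secret, which holds the keys that decrypt CI variables and other encrypted columns in Postgres; export it and store it in Vault, or the restored database is only partly usable. Restore is the inverse: run backup-utility --restore -t <backup ID> from the Toolbox pod against a fresh Gitaly PVC, with the original rails secret in place.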
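Since a Helm rollback is only safe while the GitLab version stays the same, it helps to choose the target revision mechanically. This Python sketch reads `helm -n gitlab history gitlab -o json` (rows with `revision`, `status`, `chart`, `app_version`) and returns the newest earlier revision running the same app_version as the deployed one; `safe_rollback_revision` is a hypothetical helper.

```python
import json
from typing import Optional


def safe_rollback_revision(history_json: str) -> Optional[int]:
    """Newest earlier revision that ran the same GitLab app_version as now.

    Rolling back within one app_version only reverts chart/config changes.
    Returns None when every earlier revision is a different GitLab version,
    i.e. when a Helm rollback would be a downgrade past schema migrations.
    """
    history = sorted(json.loads(history_json), key=lambda r: r["revision"])
    current = next((r for r in reversed(history) if r["status"] == "deployed"), None)
    if current is None:
        return None
    for rev in reversed(history):
        if (rev["revision"] < current["revision"]
                and rev["status"] == "superseded"
                and rev["app_version"] == current["app_version"]):
            return rev["revision"]
    return None
```

A number goes straight to `helm -n gitlab rollback gitlab <revision>`; None means the rollback path is the pre-upgrade database snapshot, not Helm.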

Common pitfalls

Security notes

Authentication is Entra ID-only via OmniAuth, with password sign-in for the web UI disabled in the admin sign-in restrictions, so access follows IdP assignment and a leaver can no longer sign in once offboarded; block the GitLab account too so their tokens and SSH keys stop working. Every secret (DB and Redis passwords, the SSO client secret, registry storage creds, runner tokens) originates in HashiCorp Vault and is synced in by the Vault Secrets Operator with rotation and lease auditing, never committed to the values file. Pods reach S3 through IRSA with a role scoped to exactly the GitLab buckets, so there are no long-lived keys on disk. Wiz runs continuous CSPM across the cluster, the RDS instance, and the S3 buckets, alerting the moment a bucket drifts to public, encryption is disabled, or an over-broad IAM policy appears; Wiz Code scans the GitLab repositories and pipeline definitions for hardcoded secrets and vulnerable dependencies as code is pushed, shifting that check left into the merge request. CrowdStrike Falcon provides runtime protection on every node, with the Runner pool, which executes untrusted repository code, as the priority surface. Edge ingress sits behind Akamai for WAF and bot mitigation. A security finding from Wiz or a Falcon detection auto-raises a ServiceNow incident, and infrastructure changes flow through a ServiceNow change gate so there is a documented approval, not just a helm upgrade.

Cost notes

The big levers are object storage lifecycle and Runner scheduling. Apply S3 lifecycle rules to transition cold LFS objects to infrequent-access tiers and to expire scratch data, and cap CI artifacts, since artifact buckets balloon fastest and most of it is never read after a few days. For artifacts, prefer GitLab's own default artifacts expiration (and expire_in in pipelines) over a raw S3 expiry rule, so database records and bucket contents stay in sync instead of GitLab pointing at deleted objects. Run Runners on spot/preemptible nodes with the Kubernetes executor’s ephemeral pods, so you pay for CI capacity only during the pipeline storm and the node pool scales to near-zero overnight. Size Aurora and ElastiCache to steady-state with autoscaling read replicas rather than provisioning for peak. Track all of it in Dynatrace (or Datadog) with cost dashboards per namespace so the platform team can chargeback CI spend to each product group. The net of externalizing state is not just operability: a stateless app tier on spot-friendly nodes with tiered object storage is materially cheaper than one oversized always-on Omnibus VM, while finally being something you can patch without a maintenance window.
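The lifecycle rules can be generated rather than hand-written. This Python sketch builds documents for `aws s3api put-bucket-lifecycle-configuration`; the choice of buckets and day counts (LFS to STANDARD_IA at 30 days, the tmp bucket expired at 7) is an assumption to adjust, and the artifacts bucket is deliberately left to GitLab's own expiry. S3 rejects STANDARD_IA transitions earlier than 30 days after creation, which the sketch enforces.

```python
import json


def transition_rule(days: int = 30, storage_class: str = "STANDARD_IA") -> dict:
    """Move every object in the bucket to a colder class after `days`."""
    if storage_class == "STANDARD_IA" and days < 30:
        raise ValueError("STANDARD_IA transitions need Days >= 30")
    return {
        "ID": f"to-{storage_class.lower()}-after-{days}d",
        "Filter": {"Prefix": ""},  # empty prefix: whole bucket
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }


def expiry_rule(days: int) -> dict:
    """Delete every object in the bucket `days` after creation."""
    return {
        "ID": f"expire-after-{days}d",
        "Filter": {"Prefix": ""},
        "Status": "Enabled",
        "Expiration": {"Days": days},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }


def lifecycle_plan() -> dict:
    """Bucket name -> lifecycle configuration document (hypothetical choices)."""
    return {
        "kloudvin-gitlab-lfs": {"Rules": [transition_rule(30)]},
        "kloudvin-gitlab-tmp": {"Rules": [expiry_rule(7)]},
    }


if __name__ == "__main__":
    for bucket, config in lifecycle_plan().items():
        print(bucket)
        print(json.dumps(config, indent=2))
```

Save each document to a file and apply it with `aws s3api put-bucket-lifecycle-configuration --bucket kloudvin-gitlab-lfs --lifecycle-configuration file://lfs-lifecycle.json`.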

CI/CD and IaC integration

Although this is GitLab itself, the platform that runs it is managed like any other service. Terraform provisions the cluster, RDS, ElastiCache, S3, and IAM; Ansible handles the few host-level tweaks on the Runner node pool. The Helm release is promoted through a pipeline, and here the team’s existing tooling matters: Argo CD watches the values-prod.yaml in Git and reconciles the GitLab release declaratively (GitOps), so the live state always matches the repo, while Jenkins or GitHub Actions runs the pre-deploy checks (helm lint, a helm template render, a helm upgrade --dry-run, and a smoke test against a staging GitLab) as a required gate before Argo CD syncs to production. The same GitLab instance, once live, becomes the SCM and CI engine for the rest of the organization, including downstream platforms like the company’s Moodle LMS and a fleet of network virtual appliances whose configs are version-controlled and deployed straight from these pipelines. That is the payoff: GitLab stops being a pet on one VM and becomes the platform service the VP asked for. It survives a node loss, scales Runners on demand, and upgrades with a single pinned chart-version change instead of a maintenance window.
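One cheap pre-deploy check is asserting that values-prod.yaml keeps the bundled stateful components off, since a dropped key silently falls back to the chart default. This Python sketch works on the values file converted to JSON (for example with mikefarah yq v4: `yq -o=json values-prod.yaml > values-prod.json`); it assumes chart defaults that install PostgreSQL, Redis, and MinIO and leave the consolidated object store disabled, so a missing key counts as a failure.

```python
import json
import sys


def bundled_services_left_on(values: dict) -> list[str]:
    """Settings in a parsed values file that break the external-data rule."""
    g = values.get("global", {})
    problems = []
    for chart in ("postgresql", "redis"):
        if values.get(chart, {}).get("install", True):
            problems.append(f"{chart}.install should be false")
    if g.get("minio", {}).get("enabled", True):
        problems.append("global.minio.enabled should be false")
    if not g.get("appConfig", {}).get("object_store", {}).get("enabled", False):
        problems.append("global.appConfig.object_store.enabled should be true")
    return problems


def check_file(path: str) -> int:
    """CI entry point: 0 when the JSON-converted values file is clean, else 1."""
    with open(path) as fh:
        problems = bundled_services_left_on(json.load(fh))
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(check_file(sys.argv[1]))
```

Run it as a pipeline step with the JSON path as its argument; a non-zero exit blocks the Argo CD sync the same way a failed helm lint would.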


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
