
Deploy Databricks Asset Bundles for Job and DLT Pipeline CI/CD

A media-analytics team ships a Delta Live Tables (DLT) pipeline and three jobs that feed the executive revenue dashboard. For a year, the only way changes reached production was an engineer opening the prod workspace UI on a Friday, exporting a notebook, re-pasting cluster JSON, and clicking Run — twice that quarter a stale cluster policy or a wrong catalog name silently routed test data into the prod table, and nobody could say which commit caused it because there was no commit. The mandate from the new data-platform lead is blunt: every job and pipeline is code, every change is a pull request, and a human never clicks Deploy in the prod workspace again. This guide stands up exactly that with Databricks Asset Bundles (DABs) — the native, declarative way to package jobs, DLT pipelines, and their resources as versioned YAML — promoted through GitHub Actions across dev, staging, and prod workspaces, with identity, secrets, security scanning, and change control bolted on the way a regulated enterprise actually runs it.

DABs matter because they collapse three things that used to drift apart: the pipeline/notebook source, the job and DLT definitions (schedule, cluster spec, catalog, dependencies), and the target workspace configuration. One databricks.yml, one CLI, one set of variables per environment. The same artifact that an engineer validates on their laptop is the artifact GitHub Actions deploys to prod — no UI export, no copy-paste, no “works on my workspace.”

Prerequisites

Target topology

[Figure: topology diagram for "Deploy Databricks Asset Bundles for Job and DLT Pipeline CI/CD"]

The flow is a straight promotion ladder. An engineer works against the dev workspace from their laptop using their own user identity, deploying a personal copy of the bundle under their workspace home folder (by default /Workspace/Users/<user>/.bundle/<bundle name>/<target>). A pull request triggers GitHub Actions, which runs databricks bundle validate and a unit-test gate on every PR. Merge to main deploys the bundle to staging as the staging service principal and runs the jobs as an integration test. A published release tag, gated by a ServiceNow change approval, deploys the identical bundle to prod as the prod service principal. Microsoft Entra ID issues the workload identity each runner assumes (Okta is the upstream workforce IdP for the humans); HashiCorp Vault supplies any residual secrets; Wiz Code scans the repository and IaC in the pipeline; Dynatrace (or Datadog) watches the deployed jobs at runtime; and Terraform, which DABs uses under the hood, is also what the platform team used to provision the workspaces and catalogs this bundle targets.

Two rules make the whole thing safe and are worth holding in your head. First, the bundle artifact is identical across environments and only variables change. Second, prod is deployed only by CI as a service principal, never by a person.

1. Scaffold the bundle and pin the CLI

Initialize a bundle from the default Python template, which gives you a job, a DLT pipeline, and the directory layout DABs expects.

# Install / upgrade the CLI (Homebrew shown; curl installer also works)
brew tap databricks/tap && brew install databricks
databricks --version            # expect v0.2xx.x

# Scaffold in an empty repo dir
databricks bundle init default-python \
  --output-dir . \
  --config-file /tmp/dab-init.json

A minimal /tmp/dab-init.json answers the template prompts non-interactively:

{
  "project_name": "analytics_pipelines",
  "include_notebook": "yes",
  "include_dlt": "yes",
  "include_python": "yes"
}

This produces databricks.yml, a resources/ directory for job and pipeline YAML, a src/ directory for notebooks and Python, and a tests/ directory. Pin the CLI version for the team in the repo so local and CI behave identically:

# databricks.yml (top of file)
bundle:
  name: analytics_pipelines
  # Fail loudly if someone runs an incompatible CLI.
  # Use an exact version such as "0.221.0" for a hard pin.
  databricks_cli_version: ">= 0.221.0"
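To keep laptops and CI on the same CLI, a small check can compare the installed version with the version the workflow installs. The sketch below is an illustration only: the function names are hypothetical, and it assumes `databricks --version` prints a string containing a semantic version such as `Databricks CLI v0.221.0`.

```python
import re


def parse_cli_version(version_output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from text such as 'Databricks CLI v0.221.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no semantic version found in {version_output!r}")
    return tuple(int(part) for part in match.groups())


def cli_matches_pin(version_output: str, pinned: str) -> bool:
    """True when the installed CLI is exactly the pinned version."""
    return parse_cli_version(version_output) == parse_cli_version(pinned)


if __name__ == "__main__":
    # The first argument would normally come from running `databricks --version`.
    print(cli_matches_pin("Databricks CLI v0.221.0", "0.221.0"))
```

Comparing parsed integer tuples rather than raw strings avoids false mismatches caused by a leading `v` or surrounding text.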

2. Define the job and the DLT pipeline as resources

Keep each resource in its own file under resources/. Here is a DLT pipeline and a job that triggers it, written so that all environment-specific values are variables — catalog, schema, and a node count.

# resources/revenue_pipeline.yml
resources:
  pipelines:
    revenue_dlt:
      name: "revenue-dlt-${bundle.target}"
      catalog: ${var.catalog}
      target: ${var.schema}
      serverless: true
      continuous: false
      development: ${var.dlt_development}
      libraries:
        - notebook:
            path: ../src/dlt/revenue_transforms.py
      configuration:
        source.path: ${var.landing_path}

# resources/revenue_job.yml
resources:
  jobs:
    revenue_refresh:
      name: "revenue-refresh-${bundle.target}"
      tags:
        team: media-analytics
        managed_by: dabs
      tasks:
        - task_key: run_dlt
          pipeline_task:
            pipeline_id: ${resources.pipelines.revenue_dlt.id}
        - task_key: publish_marts
          depends_on:
            - task_key: run_dlt
          notebook_task:
            notebook_path: ../src/jobs/publish_marts.py
          job_cluster_key: marts_cluster
      job_clusters:
        - job_cluster_key: marts_cluster
          new_cluster:
            spark_version: "15.4.x-scala2.12"
            node_type_id: ${var.node_type}
            num_workers: ${var.marts_workers}
            policy_id: ${var.cluster_policy_id}
            data_security_mode: SINGLE_USER
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "Asia/Kolkata"
        pause_status: ${var.schedule_pause}

Note ${resources.pipelines.revenue_dlt.id} — DABs wires the job’s pipeline task to the pipeline it deploys in the same bundle, so you never hardcode a pipeline ID that differs per workspace.
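Because every environment-specific value in these resource files is a `${var.*}` reference, a misspelled variable name is an easy mistake to make. `databricks bundle validate` resolves variables itself; the sketch below is only a lightweight lint that could run in the pytest gate without a workspace. The function name and the sample text (including the deliberate typo `node_typ`) are hypothetical.

```python
import re

# Matches ${var.<name>} references as used in the resource YAML above.
VAR_REFERENCE = re.compile(r"\$\{var\.([A-Za-z_][A-Za-z0-9_]*)\}")


def undeclared_variables(resource_text: str, declared: set[str]) -> set[str]:
    """Return variable names that are referenced but not declared."""
    referenced = set(VAR_REFERENCE.findall(resource_text))
    return referenced - declared


SAMPLE_RESOURCE = """
catalog: ${var.catalog}
target: ${var.schema}
development: ${var.dlt_development}
node_type_id: ${var.node_typ}
"""

DECLARED = {"catalog", "schema", "dlt_development", "node_type"}

if __name__ == "__main__":
    print(sorted(undeclared_variables(SAMPLE_RESOURCE, DECLARED)))  # ['node_typ']
```

The lint works on raw text, so it needs no YAML parser and does not try to interpret `${resources.*}` or `${bundle.*}` references.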

3. Declare per-environment targets

Targets are where dev/staging/prod diverge. Each sets its workspace host, its variable values, and the identity mode. Put this in databricks.yml.

variables:
  catalog:           { description: "Unity Catalog name" }
  schema:            { description: "Target schema/database" }
  node_type:         { default: "Standard_D4ds_v5" }      # Azure; use i3.xlarge on AWS
  marts_workers:     { default: 2 }
  cluster_policy_id: { description: "Shared cluster policy id" }
  landing_path:      { description: "Source landing location" }
  dlt_development:   { default: true }
  schedule_pause:    { default: "PAUSED" }

targets:
  dev:
    mode: development          # prefixes names with the user, pauses schedules, tags as dev
    default: true
    workspace:
      host: https://adb-1111111111111111.7.azuredatabricks.net
    variables:
      catalog: analytics_dev
      schema:  revenue
      cluster_policy_id: "A1B2C3D4E5F60001"
      landing_path: "/Volumes/analytics_dev/raw/landing"

  staging:
    mode: production
    workspace:
      host: https://adb-2222222222222222.7.azuredatabricks.net
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}
    run_as:
      # Must be the application (client) ID of sp-analytics-staging, not its display name
      service_principal_name: "<STAGING_SP_APPLICATION_ID>"
    variables:
      catalog: analytics_staging
      schema:  revenue
      cluster_policy_id: "A1B2C3D4E5F60002"
      landing_path: "/Volumes/analytics_staging/raw/landing"
      dlt_development: false
      schedule_pause: "PAUSED"     # staging runs on demand from CI, not on a clock

  prod:
    mode: production
    workspace:
      host: https://adb-3333333333333333.7.azuredatabricks.net
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}
    run_as:
      # Must be the application (client) ID of sp-analytics-prod, not its display name
      service_principal_name: "<PROD_SP_APPLICATION_ID>"
    variables:
      catalog: analytics_prod
      schema:  revenue
      cluster_policy_id: "A1B2C3D4E5F60003"
      landing_path: "/Volumes/analytics_prod/raw/landing"
      dlt_development: false
      schedule_pause: "UNPAUSED"   # prod runs on its cron

mode: development is doing real safety work: it prefixes every resource name with the deploying user, pauses schedules and triggers, tags jobs and pipelines as dev, and marks DLT pipelines as development, so two engineers in the same dev workspace never collide and a half-finished pipeline never fires on a timer. mode: production with run_as is what makes staging and prod deterministic and owned by a service principal.
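The "same artifact, different variables" idea can be modeled in a few lines. The sketch below is a simplified illustration of default-plus-override resolution, not the CLI's actual implementation; the values are a subset copied from the variables and targets blocks above, and the function name is hypothetical.

```python
import re

# Subset of the defaults and per-target overrides declared in databricks.yml.
DEFAULTS = {"marts_workers": 2, "schedule_pause": "PAUSED"}
TARGET_OVERRIDES = {
    "dev": {"catalog": "analytics_dev", "schema": "revenue"},
    "staging": {"catalog": "analytics_staging", "schema": "revenue"},
    "prod": {
        "catalog": "analytics_prod",
        "schema": "revenue",
        "schedule_pause": "UNPAUSED",
    },
}

REFERENCE = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def resolve(template: str, target: str) -> str:
    """Substitute ${var.*} and ${bundle.target} for one target."""
    merged = {**DEFAULTS, **TARGET_OVERRIDES[target]}  # target values win
    values = {f"var.{name}": value for name, value in merged.items()}
    values["bundle.target"] = target

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"unresolved reference: {key}")
        return str(values[key])

    return REFERENCE.sub(substitute, template)


if __name__ == "__main__":
    template = "revenue-refresh-${bundle.target} writes ${var.catalog}.${var.schema}"
    for name in ("dev", "staging", "prod"):
        print(resolve(template, name))
```

The template string never changes between targets; only the lookup table does, which is the property that makes a staging run meaningful evidence for prod.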

4. Validate and deploy to dev from your laptop

Authenticate as yourself (OAuth U2M opens a browser; the human SSO is Okta → Entra ID, so this is your normal corporate login), then validate and deploy.

# One-time per host: OAuth user-to-machine login
databricks auth login --host https://adb-1111111111111111.7.azuredatabricks.net

# Validate: parses YAML, resolves variables, checks the workspace API — no changes made
databricks bundle validate -t dev

# Deploy your personal copy to the dev workspace
databricks bundle deploy -t dev

# Trigger the job once to smoke-test it end to end
databricks bundle run revenue_refresh -t dev

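validate is your fastest feedback loop: it catches malformed YAML, unknown fields, unresolved variables, and authentication problems before anything is created. It does not confirm that a catalog or cluster policy ID actually exists in the workspace; those errors surface at deploy time or on the first run, which is why the smoke-test run above matters. Because the dev target is mode: development, what you deploy is named revenue-refresh-dev with a [dev <your username>] prefix, so it never collides with a teammate's copy.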
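A cheap local check can complement the CLI for the schedule field. The sketch below tests only the shape of a Quartz cron expression (six or seven fields, with exactly one of day-of-month and day-of-week set to `?`); it is not a full Quartz parser, and the function name is hypothetical.

```python
def quartz_shape_errors(expression: str) -> list[str]:
    """Return a list of shape problems; an empty list means the shape looks valid."""
    fields = expression.split()
    if len(fields) not in (6, 7):
        return [f"expected 6 or 7 fields, got {len(fields)}"]

    errors = []
    day_of_month, day_of_week = fields[3], fields[5]
    # Quartz requires '?' in exactly one of the two day fields.
    if (day_of_month == "?") == (day_of_week == "?"):
        errors.append("exactly one of day-of-month and day-of-week must be '?'")
    return errors


if __name__ == "__main__":
    print(quartz_shape_errors("0 0 6 * * ?"))  # the schedule used in this guide
    print(quartz_shape_errors("0 6 * * *"))    # a 5-field Unix cron pasted by mistake
```

The second call shows the most common slip: a five-field Unix cron expression dropped into a field that expects Quartz syntax.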

5. Wire identity: service principals via Entra ID and Vault

CI must never deploy as a human. Create one Entra ID service principal per non-dev environment, add it to the matching workspace, and let GitHub Actions assume it. The cleanest path is OIDC: GitHub mints a short-lived token, Entra trusts the GitHub issuer via a federated credential, and no static secret is stored.

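# Per environment, the platform/identity team runs (Azure CLI):
az ad app create --display-name "sp-analytics-prod"   # note the appId in the output
az ad sp create --id <APP_ID>                         # service principal for that app; no client secret is created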

# Add a GitHub-issuer federated credential so Actions can assume it with no secret
az ad app federated-credential create --id <APP_ID> --parameters '{
  "name": "github-prod",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:my-org/analytics_pipelines:environment:prod",
  "audiences": ["api://AzureADTokenExchange"]
}'

Then add each service principal to its Databricks workspace with the Workspace access entitlement, give it permission to use the cluster policy referenced by cluster_policy_id, and grant Unity Catalog privileges on the target catalog (typically USE CATALOG on the catalog; USE SCHEMA, CREATE TABLE, and MODIFY on the schema; CREATE SCHEMA if the pipeline must create the schema; and READ VOLUME on the landing volume). Where OIDC is not available (some self-hosted runners, or a third-party token a job needs at runtime), pull a short-lived secret from HashiCorp Vault in the workflow rather than storing a PAT in GitHub:

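export VAULT_ADDR=https://vault.internal:8200
# Export the token so the Databricks CLI, which runs as a child process, can read it
DATABRICKS_TOKEN=$(vault kv get -field=token secret/databricks/prod-sp)
export DATABRICKS_TOKEN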

The principle mirrors the rest of the platform: OIDC/managed identity first, Vault for the residual short-lived secrets, no long-lived PAT in a repo.
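With one service principal per environment, the federated-credential JSON differs only in its name and subject, so it can be generated rather than hand-edited. The sketch below builds the same parameters passed to `az ad app federated-credential create` above; the function name is hypothetical, and the org and repo values are the ones used in this guide.

```python
import json

GITHUB_ISSUER = "https://token.actions.githubusercontent.com"
ENTRA_AUDIENCE = "api://AzureADTokenExchange"


def federated_credential(org: str, repo: str, environment: str) -> dict:
    """Build the --parameters payload for one GitHub Environment."""
    return {
        "name": f"github-{environment}",
        "issuer": GITHUB_ISSUER,
        # Only jobs that declare `environment: <environment>` present this subject.
        "subject": f"repo:{org}/{repo}:environment:{environment}",
        "audiences": [ENTRA_AUDIENCE],
    }


if __name__ == "__main__":
    for env in ("staging", "prod"):
        print(json.dumps(federated_credential("my-org", "analytics_pipelines", env), indent=2))
```

Generating the payload keeps the subject format consistent, which matters because Entra ID rejects the token exchange when the subject does not match exactly.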

6. Build the GitHub Actions pipeline

Three triggers, three behaviors: PR validates, merge deploys to staging, tag deploys to prod behind an approval gate. The databricks/setup-cli action installs the pinned CLI; databricks bundle deploy reads the target from -t.

# .github/workflows/deploy-bundle.yml
name: deploy-bundle
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  release:
    types: [published]

permissions:
  id-token: write        # required for OIDC to Entra ID
  contents: read

env:
  ARM_USE_OIDC: "true"
  ARM_TENANT_ID: ${{ vars.AZURE_TENANT_ID }}          # variable name is a placeholder

jobs:
  validate:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    # No environment on this job, so its OIDC subject is repo:<org>/<repo>:pull_request;
    # the staging SP needs a federated credential for that subject as well.
    env:
      DATABRICKS_HOST: ${{ vars.STAGING_HOST }}
      ARM_CLIENT_ID: ${{ vars.STAGING_SP_CLIENT_ID }}  # placeholder name: the SP's application ID
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
        with:
          version: 0.221.0
      - uses: astral-sh/setup-uv@v3
      - run: uv pip install -r requirements-dev.txt --system
      - run: pytest tests/unit -q                      # unit + transform tests
      - name: Wiz Code IaC + secret scan               # assumes wizcli is installed and authenticated
        run: wizcli dir scan --path . --policy default # fail PR on critical findings
      - run: databricks bundle validate -t staging     # parse + resolve against staging API

  deploy-staging:
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: staging
    env:                                               # job-level, so the run step is authenticated too
      DATABRICKS_HOST: ${{ vars.STAGING_HOST }}
      ARM_CLIENT_ID: ${{ vars.STAGING_SP_CLIENT_ID }}  # placeholder name
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
        with:
          version: 0.221.0
      - run: databricks bundle deploy -t staging
      - run: databricks bundle run revenue_refresh -t staging   # integration run

  deploy-prod:
    if: github.event_name == 'release'
    runs-on: ubuntu-latest
    environment: prod          # GitHub Environment with required reviewers + ServiceNow check
    env:
      DATABRICKS_HOST: ${{ vars.PROD_HOST }}
      ARM_CLIENT_ID: ${{ vars.PROD_SP_CLIENT_ID }}     # placeholder name
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
        with:
          version: 0.221.0
      - run: databricks bundle deploy -t prod

The environment: prod block is the control point: configure the GitHub Environment with required reviewers and a ServiceNow change-request check so a tagged release pauses until the CAB-approved change ticket is in an implementable state. That is the documented gate that replaces the old Friday click — change management has a record, and the deploy is automated and identical to what staging ran.
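The three `if:` conditions in the workflow amount to a small routing table, and writing it out makes it easy to confirm that no event can reach prod by accident. The sketch below mirrors those conditions in plain Python; it is a model for reasoning and testing, not something GitHub Actions runs, and the function name is hypothetical.

```python
def jobs_for_event(event_name: str, ref: str) -> list[str]:
    """Return the workflow jobs whose `if:` condition is true for this event."""
    jobs = []
    if event_name == "pull_request":
        jobs.append("validate")
    if event_name == "push" and ref == "refs/heads/main":
        jobs.append("deploy-staging")
    if event_name == "release":
        jobs.append("deploy-prod")
    return jobs


if __name__ == "__main__":
    print(jobs_for_event("pull_request", "refs/pull/42/merge"))
    print(jobs_for_event("push", "refs/heads/main"))
    print(jobs_for_event("release", "refs/tags/v1.4.2"))
```

Each event maps to exactly one job, so a merge can never trigger a prod deploy and a release never redeploys staging.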

Validation

Confirm the deploy actually landed and is wired correctly — do not trust a green pipeline alone.

# What the bundle thinks is deployed for a target (resolved YAML + IDs)
databricks bundle summary -t staging

# Confirm the job and pipeline exist with the expected target-suffixed names
databricks jobs list  --output json | jq '.[] | select(.settings.name|test("revenue-refresh"))'
databricks pipelines list-pipelines | grep revenue-dlt

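# Re-validate prod without deploying: confirms the repo's prod config still parses and resolves.
# It does not compare against what is live, so it will not reveal hand edits made in the workspace.
databricks bundle validate -t prod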

Then check the behavior: trigger the staging job, wait for the run to succeed, and assert the DLT pipeline wrote to the staging catalog (analytics_staging.revenue), not prod. In Dynatrace (or Datadog, via the Databricks job/Spark integration), confirm the new job appears with its run duration and that no task failed — runtime observability is where a “deployed fine but runs wrong” problem surfaces, exactly the failure that started this project. A passing validation gate plus a green integration run plus the metric showing up in Dynatrace is the three-part definition of done.
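The name check above can also be scripted so CI fails when the expected job is missing or duplicated. The sketch below assumes the JSON printed by `databricks jobs list --output json` is an array of objects with a `settings.name` field, which is the shape the jq filter above relies on; the function names and sample data are hypothetical.

```python
import json


def find_jobs(jobs_json: str, name: str) -> list[dict]:
    """Return the jobs whose settings.name equals `name`."""
    jobs = json.loads(jobs_json)
    return [job for job in jobs if job.get("settings", {}).get("name") == name]


def deployed_job_id(jobs_json: str, target: str) -> int:
    """Return the job_id of revenue-refresh-<target>, or raise if not exactly one exists."""
    expected = f"revenue-refresh-{target}"
    matches = find_jobs(jobs_json, expected)
    if len(matches) != 1:
        raise LookupError(f"expected exactly one job named {expected}, found {len(matches)}")
    return matches[0]["job_id"]


# Invented sample standing in for real CLI output.
SAMPLE_OUTPUT = json.dumps([
    {"job_id": 101, "settings": {"name": "revenue-refresh-staging"}},
    {"job_id": 102, "settings": {"name": "unrelated-job"}},
])

if __name__ == "__main__":
    print(deployed_job_id(SAMPLE_OUTPUT, "staging"))  # 101
```

Requiring exactly one match also catches the case where a hand-created job with the same name sits beside the bundle-managed one.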

Rollback / teardown

Because the bundle is the artifact, rolling back means redeploying the previous commit; there is no manual un-clicking.

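# Roll prod back to the last good release tag.
# Run this through CI as the prod service principal (for example, re-run deploy-prod for that release), not from a laptop.
git checkout v1.4.2
databricks bundle deploy -t prod        # redeploys the prior definitions exactly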

# Tear down everything a bundle created in a target (used for ephemeral/dev cleanup)
databricks bundle destroy -t dev --auto-approve

databricks bundle destroy removes the jobs, pipelines, and workspace files that this bundle created in the target — and only those, because DABs tracks its own resource state (Terraform under the hood). Never hand-delete a DABs-managed job in the UI; that orphans the state and the next deploy will try to recreate it. For a true emergency stop, pause the prod schedule (schedule_pause: "PAUSED", redeploy) rather than deleting, so you keep the definition while stopping new runs.
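Choosing which tag to roll back to is easy to get wrong when tags are sorted as strings (v1.10.0 sorts before v1.4.2). The sketch below picks the closest older release by numeric comparison. It assumes tags follow the vMAJOR.MINOR.PATCH pattern seen in this guide and that the previous release was healthy, which a human still has to confirm; the function names and the tag list are hypothetical.

```python
def parse_tag(tag: str) -> tuple[int, ...]:
    """Turn 'v1.4.2' into (1, 4, 2) so tags compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def previous_release(tags: list[str], current: str) -> str:
    """Return the highest tag that is strictly older than `current`."""
    older = [tag for tag in tags if parse_tag(tag) < parse_tag(current)]
    if not older:
        raise ValueError(f"no release older than {current}")
    return max(older, key=parse_tag)


if __name__ == "__main__":
    tags = ["v1.4.0", "v1.4.2", "v1.5.0", "v1.10.0"]
    print(previous_release(tags, "v1.5.0"))   # v1.4.2
    print(previous_release(tags, "v1.10.0"))  # v1.5.0
```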

Common pitfalls

Security notes

Treat the pipeline as the privileged path it is. CI deploys as a least-privilege Entra ID service principal scoped to exactly its environment: staging's SP is a member of the staging workspace only, and its federated credential subject (environment:staging) means only the staging job can assume it. Humans authenticate Okta → Entra ID with conditional access; they get dev, not prod. Any secret that is not OIDC-issued lives in HashiCorp Vault, leased short-lived into the runner, never committed. Wiz Code runs in the PR job to scan the repo and the DABs/Terraform IaC for misconfigurations and committed secrets, failing the build on critical findings before a bad change merges. Enforce Unity Catalog grants so the job's identity can only read its landing volume and write its own schema. And require a ServiceNow change ticket on the prod environment so every production deploy is auditable change management, not tribal memory.

Cost notes

DABs itself costs nothing — the spend is the compute it deploys, and the bundle is where you control it. Prefer serverless DLT (set serverless: true on the pipeline) so you pay only for processing time with no idle cluster, and keep continuous: false for batch pipelines that do not need a 24/7 stream. Pin job clusters to a cluster policy (policy_id) that caps node type and autoscaling so a stray edit cannot launch a 50-node cluster in prod. Keep marts_workers small in dev and size it only in staging/prod via the variable. Because staging schedules are PAUSED and run on demand from CI, you pay for staging compute only during integration runs, not around the clock. Pipe job-cost and DBU metrics into Dynatrace/Datadog so a regression in run time — the thing that quietly doubles the monthly bill — is visible the day it lands, not at month-end.
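The run-time regression alert mentioned above reduces to a simple comparison against recent history. The sketch below flags a run that takes more than 1.5 times the median of earlier runs; the threshold, the function name, and the sample durations are all illustrative assumptions, and a real alert would be configured in Dynatrace or Datadog rather than in ad hoc code.

```python
from statistics import median


def is_runtime_regression(
    history_seconds: list[float],
    latest_seconds: float,
    tolerance: float = 1.5,
) -> bool:
    """True when the latest run exceeds `tolerance` times the historical median."""
    if not history_seconds:
        return False  # nothing to compare against yet
    return latest_seconds > tolerance * median(history_seconds)


if __name__ == "__main__":
    history = [600.0, 620.0, 580.0]  # invented durations for three earlier runs
    print(is_runtime_regression(history, 1300.0))  # True: more than 1.5 x 600
    print(is_runtime_regression(history, 700.0))   # False
```

The median is used instead of the mean so that one unusually slow historical run does not raise the baseline and hide a real regression.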

The shape of the win

The payoff is that the revenue dashboard’s pipeline now changes the same way the rest of the company’s software does: an engineer opens a PR, CI validates and unit-tests it, staging proves it on real-but-isolated data, a CAB-approved tag promotes the identical artifact to prod as a service principal, and a Dynatrace dashboard shows it running clean — with a git SHA you can point to for every byte in production. No Friday UI export, no copy-pasted cluster JSON, no “which change broke it.” Start with one job and one DLT pipeline through this ladder; the second and the fiftieth cost almost nothing more, because the machinery — DABs, GitHub Actions, the Entra service principals, Vault, Wiz Code, the ServiceNow gate — is the same for all of them.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
