AI/ML Multi-cloud

Batch ML Pipelines with Airflow, dbt and a Warehouse

There is a quiet category of machine learning that never trends on conference stages but quietly runs the business: the model that scores every loan application overnight, recalculates churn risk for forty million subscribers before the marketing team logs in, or refreshes a demand forecast for every SKU in every warehouse by 6 a.m. None of it is real-time. All of it is batch — and that is exactly why it is tractable, auditable, and cheap enough to run at enterprise scale. This article is a reference architecture for building batch ML properly: Apache Airflow to orchestrate, dbt to transform features inside the warehouse, scheduled training jobs, a model registry to version what ships, and lineage threaded through the whole thing so that when a regulator asks “why did this customer get this score on the 14th of March,” you have an answer that is a query, not an archaeology project.

The business scenario

Consider a consumer lender — a mid-size finance company issuing personal loans and credit lines across several states, roughly four million active accounts. Every night it must do three things that are not optional. It re-scores the probability of default on the existing book so the risk team can set provisions. It scores the day’s new applications so underwriters arrive to a queue that is already triaged. And it produces the regulatory reporting features that feed model-risk-management (SR 11-7 in the US, or the equivalent under the EU’s CRD framework) — every input, every transformation, every model version, reproducible on demand.

The pressure is not latency; nobody needs a default score in 50 milliseconds. The pressure is correctness, reproducibility, and auditability at volume. A score that is silently computed from yesterday’s stale feature, or from a feature definition that changed without anyone versioning it, is not a performance bug — it is a finding in the next regulatory exam and potentially a misstated provision on the balance sheet. The naive approaches fail in instructive ways. A pile of cron-driven Python scripts runs, but has no dependency graph, no retries, no record of what ran against what; when step 4 fails at 3 a.m. nobody knows steps 1–3 already half-wrote the output. A single monolithic notebook that an analyst runs manually is unreproducible by construction and dies with the analyst. Feature logic copy-pasted into both the training script and the scoring script drifts within a quarter, and now your model is trained on one definition of “30-day delinquency” and scored on another — the infamous train/serve skew, except in batch it is silent and you find out at audit.

The architecture below threads the needle. Airflow gives you a real dependency graph with retries, SLAs, and a complete run history. dbt makes the feature transformations one version-controlled, tested, documented definition that runs in the warehouse and is reused identically for training and scoring — killing skew at the source. Scheduled training jobs retrain on a cadence and register every candidate to a model registry, which becomes the single source of truth for what is approved to score in production. And lineage — captured automatically as the pipeline runs — ties a production score back through the exact model version, the exact dbt model, and the exact source rows that produced it. The scenario scales cleanly: the small shop runs one daily DAG over a few hundred gigabytes; the large enterprise runs dozens of DAGs across business units over tens of terabytes, with the same shape — more parallelism, bigger warehouse, more region pairs, not a different diagram.

Architecture overview

Batch ML Pipelines with Airflow, dbt and a Warehouse — architecture

The pipeline is best understood as four stages that Airflow chains into one Directed Acyclic Graph (DAG), with a clean separation between the control plane (Airflow deciding what runs when and in what order) and the data plane (the warehouse and compute where the actual work happens). Airflow should orchestrate and observe; it should almost never be the thing crunching the data.

Stage 1 — Ingestion (1). Raw operational data lands in the warehouse’s landing zone. In this lender’s case, nightly extracts from the loan-servicing system, the bureau pull, and the payments ledger are loaded into Snowflake (the example warehouse here; BigQuery, Databricks SQL, or Redshift fit the same slot) by a tool such as Fivetran or Airbyte, or by a vendor’s managed connector. Airflow does not move the bytes — it triggers the load and waits on a completion signal, then runs a freshness and row-count check so the rest of the DAG never builds on a half-loaded or empty table.

Stage 2 — Transform with dbt (2). Airflow invokes dbt to run the transformation graph that turns raw tables into clean, tested feature tables. This is the heart of the design: every feature — dpd_30_rollup, utilization_3m_avg, bureau_score_delta — is a SELECT in version control, with tests (not_null, accepted_values, custom assertions on distribution) that fail the run if the data is wrong. dbt compiles to SQL and executes inside the warehouse, so terabytes are transformed where they live, with no data egress. Critically, the same feature models feed both training and scoring, which is what structurally eliminates train/serve skew.

Stage 3 — Train (3). On its cadence (weekly here, not nightly), Airflow launches a training job on dedicated compute — a Databricks job cluster, SageMaker / Vertex AI training job, or a Kubernetes pod via Airflow’s KubernetesPodOperator. The job reads the feature tables, trains the candidate model, evaluates it against a held-out set, and — this is the gate — only registers it to the model registry if it beats the incumbent on the agreed metrics (AUC, KS statistic, calibration). Nightly the DAG does not retrain; it pulls the currently approved model version from the registry and scores.

Stage 4 — Score & publish (4). The approved model scores the day’s records in batch, writing predictions back to a scores table in the warehouse. Downstream, those scores feed the underwriting queue, the risk-provisioning dashboards in a BI tool, and a feature-store table for any downstream model. Akamai fronts the internal analyst portal and any API that exposes scores, providing TLS termination, caching of the heavy dashboards, and edge WAF — keeping the warehouse and app tier off the public internet.

Wrapping all four stages, lineage is captured continuously (via OpenLineage emitting from Airflow and dbt) into a catalog, so the column-level path from source row → dbt model → feature → score → model version is queryable. That backbone is what turns “trust us” into “here is the graph.”

Component breakdown

Stage Tooling (this design) Role in the pipeline Key configuration choices
Orchestration Apache Airflow (MWAA / Cloud Composer / Astronomer) Schedules and sequences the DAG; retries, SLAs, backfills, alerting Deferrable operators for long warehouse calls; pool limits on warehouse concurrency; data-aware (Dataset) scheduling
Ingestion Fivetran / Airbyte → warehouse landing zone Lands raw operational data; Airflow triggers + waits Freshness + row-count sensors gate the DAG; idempotent, partition-scoped loads
Transform dbt Core / dbt Cloud, run in-warehouse Single versioned definition of every feature; tests as quality gates not_null/accepted_values/custom tests; incremental models; dbt build fails fast on test errors
Warehouse Snowflake / BigQuery / Databricks / Redshift Stores raw, features, and scores; executes the transforms Separate virtual warehouses for transform vs. scoring; auto-suspend; result caching
Training Databricks / SageMaker / Vertex AI / K8s pod Retrains on cadence; evaluates; conditionally registers Pinned environment image; reads feature tables by partition; gated registration on metric uplift
Model registry MLflow Registry / SageMaker / Vertex Model Registry Versions models; holds the “approved-for-prod” pointer Stage transitions (Staging→Production) behind ServiceNow approval; immutable artifacts
Scoring Warehouse UDF / batch job over feature tables Applies approved model to the day’s rows Reads model version from registry, never hard-coded; writes timestamped scores
Lineage & catalog OpenLineage + Marquez / DataHub / Unity Catalog Column-level lineage from source to score Airflow + dbt emit OpenLineage events automatically; lineage queried for audit
Secrets HashiCorp Vault Warehouse creds, registry tokens, connector keys Dynamic, short-TTL warehouse credentials; Airflow connections backed by the Vault backend
Identity / SSO Okta or Entra ID SSO into Airflow, dbt, registry, BI; group-based RBAC SAML/OIDC; analysts read, only the pipeline service identity writes prod scores

A few choices deserve the why, because they are the ones teams get wrong.

Why dbt owns the features, not the training script. The single most expensive bug in batch ML is feature drift between training and scoring. If utilization_3m_avg is computed one way in a pandas training notebook and another way in the nightly scoring SQL, the model is being asked questions in a different language than it learned. Putting the definition in one dbt model that both paths consume makes skew structurally impossible, and the dbt tests turn data-quality failures into a red DAG at 3 a.m. instead of a misstated provision at quarter-end. As a bonus, dbt’s auto-generated docs and lineage graph are exactly the artifacts a model-risk reviewer asks for.

Why Airflow orchestrates but does not compute. A recurring anti-pattern is doing the heavy lifting inside Airflow workers — pulling a billion rows into a PythonOperator to transform in pandas. The workers fall over, retries replay enormous work, and you have rebuilt the warehouse badly. Airflow’s job is the control plane: trigger the warehouse, trigger dbt, trigger the training cluster, wait, check, branch, alert. Push the data work down to the warehouse (for transforms) and to elastic compute (for training). This also makes the DAG cheap — the Airflow tier can be small because it is mostly waiting, which is what deferrable operators are for: a deferred sensor releases its worker slot while a six-hour training job runs, instead of pinning a worker the whole time.

Why a registry, and why scoring reads from it. The registry is the contract between “data science experimented” and “production scores customers.” Training proposes (registers a candidate); a human approves (a stage transition gated through ServiceNow change management, which the model-risk committee signs); scoring consumes whatever version is marked Production. Because the scoring step resolves the model version from the registry at runtime rather than from a hard-coded path, promoting a new model is a registry transition — no code deploy, instantly auditable, and instantly revertable by flipping the pointer back. Hard-coding a model file into the scoring job is the anti-pattern that makes every model change a risky release.

Why lineage is a first-class component, not a nice-to-have. In an unregulated shop, lineage is convenience. In a lender under model-risk rules, it is the deliverable. When an examiner asks how a specific score was produced, OpenLineage lets you answer with a graph: this score came from model version pd_xgb:31, which scored features from features.account_monthly, built by dbt model int_account_rollup, from source tables loaded at 02:14 on the date in question. Reconstructing that by hand from scattered scripts is days of work and a credibility risk; querying it is minutes.

Implementation guidance

Express the schedule as a DAG, and gate every stage. The nightly DAG should sense fresh data, build and test with dbt, score with the approved model, and publish — failing loudly and early rather than producing a silently-wrong output. A condensed Airflow shape communicates the intent:

with DAG(
    "nightly_pd_scoring",
    schedule="0 2 * * *",        # 02:00 daily
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    sla_miss_callback=page_oncall,
) as dag:

    wait_raw = SnowflakeSqlSensor(            # gate: don't build on half-loaded data
        task_id="wait_for_raw_load",
        sql="select count(*) > 0 from raw.loan_servicing where load_date = current_date",
        deferrable=True,                      # release the worker slot while waiting
    )

    transform = DbtCloudRunJobOperator(       # dbt build = run models + run tests
        task_id="dbt_build_features",
        job_id=FEATURES_JOB,
        # a failed test here fails the task, and the DAG stops before scoring
    )

    score = KubernetesPodOperator(
        task_id="score_book",
        image="registry.internal/pd-scorer:pinned",
        env_vars={"MODEL_STAGE": "Production"},  # resolves version from the registry
    )

    publish = DbtCloudRunJobOperator(task_id="publish_marts", job_id=MARTS_JOB)

    wait_raw >> transform >> score >> publish

The retraining DAG is a separate schedule (0 3 * * 0, weekly) that ends in a conditional registration — a BranchPythonOperator that registers the candidate only if it beats the incumbent, and otherwise logs and exits without touching production.

Resolve the model from the registry, never from a path. The scoring container should ask the registry for “the version currently in the Production stage” and load that. With MLflow:

import mlflow
model = mlflow.pyfunc.load_model("models:/pd_default/Production")
predictions = model.predict(feature_frame)

Promotion is then a governed registry transition, not a deploy. Hard-coding s3://.../model_v7.pkl is the thing to avoid.

Make every step idempotent and partition-scoped. Airflow will retry, and you will backfill. Each task must be safe to run twice for the same logical date: scoring writes to a partition keyed by run date with a delete-then-insert (or MERGE) so a retry overwrites rather than duplicates; dbt incremental models key on the same partition. Idempotency is what makes retries and backfills trustworthy instead of a source of double-counted scores.

Provision with IaC and wire identity in from day one. Use Terraform for the warehouse objects, the Airflow environment (MWAA, Cloud Composer, or Astronomer), the training compute, the registry, and all RBAC — the pipeline is then reproducible and reviewable, not a hand-clicked snowflake (lower-case). CI runs in GitHub Actions or Jenkins: on every pull request it runs dbt build against a CI schema with a seeded sample, runs unit tests on the training code, and runs dbt test and SQL linting as required checks, so a broken feature definition or a failing assertion never merges. Okta (or Entra ID) provides SSO and group-based RBAC across Airflow, dbt, the registry, and the BI tool — analysts get read access; only the pipeline’s service identity may write production scores or transition a model to Production.

Enterprise considerations

Security & secrets. No long-lived warehouse password should sit in an Airflow connection or a .env. Back Airflow’s connections with the HashiCorp Vault secrets backend and issue dynamic, short-TTL credentials to the warehouse, so a credential is minted for the run and expires after — there is nothing static to leak or rotate by hand. (This matters doubly here: the team has been burned before by credentials committed to git, and dynamic secrets remove the standing target entirely.) Layer cloud-native posture on top: Wiz continuously scans the warehouse and storage for data-posture problems — a publicly exposed bucket holding raw bureau data, an over-permissive role, PII landing where it should not — and CrowdStrike Falcon covers runtime threat detection on the Airflow workers and training nodes. Access to PII feature columns is governed by warehouse RBAC and column masking; the scoring identity can read features and write scores but cannot read raw application narratives.

Cost optimization. Batch is the cheap quadrant of ML and you should keep it that way. (1) Separate the transform and scoring warehouses and let them auto-suspend the moment a step finishes — you pay for the warehouse only while a query runs, and a nightly DAG leaves it idle 22 hours a day. (2) Right-size, don’t over-provision — a transform that runs in eight minutes on a Medium does not need an X-Large; measure and step down. (3) Schedule training to the actual decay of the model, not reflexively nightly — if the model is stable week-to-week, weekly retraining is a large compute saving over daily with no accuracy loss. (4) Use spot/preemptible instances for training clusters, since a batch job that fails on a reclaimed node just retries — there is no user waiting. (5) Keep Airflow small: because it orchestrates rather than computes, a modest environment suffices, and deferrable operators stop you from paying for workers that are merely waiting. The table below frames the common decision.

Decision Cheaper / simpler More robust / costly When to step up
Retraining cadence Weekly or monthly Daily Model decays fast (volatile population, concept drift detected)
Warehouse sizing One auto-suspending warehouse Dedicated transform + scoring warehouses Concurrency contention or SLA misses appear
Training compute Spot / preemptible On-demand reserved Job is on the critical path with a hard SLA
dbt dbt Core + your own CI dbt Cloud (hosted scheduler, docs, CI) Team wants managed lineage/docs and less ops
Airflow Self-managed MWAA / Composer / Astronomer (managed) You don’t want to run the scheduler/metadata DB yourself

Reliability & failure modes. Enumerate what breaks and what the DAG does about it. Upstream load is late or empty → the freshness/row-count sensor blocks, the SLA fires, on-call is paged via Datadog or Dynatrace, and crucially nothing downstream runs on stale data. A dbt test fails (a feature’s null rate spikes, a category appears that the model never saw) → dbt build fails the task, the DAG halts before scoring, and you have caught a data-quality regression before it became a wrong score. A training job fails or produces a worse model → the conditional-registration branch simply does not promote; last night’s approved model keeps scoring, so a bad training run is a non-event, not an outage. A scoring run is interrupted → idempotent, partition-scoped writes mean the retry overwrites the partition cleanly with no duplicates. The most dangerous failure is the silent one — a pipeline that succeeds while producing subtly wrong numbers — which is precisely why dbt tests and freshness gates exist: they convert silent correctness failures into loud red tasks.

Observability. Treat the pipeline as a product with SLOs. Operational telemetry — DAG success rate, task durations, SLA misses, warehouse credits burned — flows to Datadog or Dynatrace, with the on-call alert wired to the batch landing SLA (e.g., “scores must be in the warehouse by 06:00”) rather than to individual task failures, because a task can fail-and-retry within budget and nobody needs to wake. Data-quality telemetry — dbt test pass/fail, freshness, row counts — is its own dashboard. And ML telemetry is the one batch teams forget: track feature drift (population stability index on key features, run as a dbt test or a monitoring task) and prediction drift (is the score distribution shifting?), because a model that silently degrades as the population changes is the batch equivalent of an outage — slow, quiet, and expensive. Lineage from OpenLineage closes the loop: when a metric looks wrong, the lineage graph tells you instantly which upstream model or source moved.

Scalability. Each tier scales independently. The warehouse scales by sizing up the virtual warehouse (more compute per query) or out (more concurrent warehouses for parallel feature groups); dbt parallelizes model execution by threads. Training scales by moving to a bigger cluster or distributed training (Spark/Databricks, or distributed frameworks on K8s) when the feature matrix outgrows one node. Airflow scales by adding workers, but because it orchestrates rather than computes, this is rarely the bottleneck — the warehouse is. The honest ceiling on a single nightly window is the batch window itself: everything must finish before the 06:00 SLA, so as data grows you parallelize the dbt graph, partition the scoring job, and, if needed, split one giant DAG into per-segment DAGs that run concurrently. Most enterprises hit the window limit long before they hit any Airflow limit.

Governance & lineage. For a regulated lender this is the point of the whole exercise. Pin everything: the dbt project is git-versioned and tagged per run; the model artifact is immutable in the registry; the training environment is a pinned container image — so any past run is reproducible bit-for-bit. Gate promotions through ServiceNow: moving a model to Production raises a change request that the model-risk committee approves, creating the paper trail examiners expect. Capture lineage automatically via OpenLineage from Airflow and dbt into a catalog (Marquez, DataHub, or Unity Catalog), giving column-level traceability from a production score back to source rows and the exact model version. Keep an immutable audit log of every run, every model promotion, and every approver, with appropriate retention — this is the artifact that turns a model-risk exam from a fire drill into a query.

Reference enterprise example

Cascade Lending, a fictional consumer lender (~1,400 employees, ~4.1 million active accounts across nine states), rebuilt its overnight risk pipeline on this pattern after a regulatory exam flagged that it could not reliably reproduce historical default scores — the scoring logic lived in a tangle of cron’d Python on a single analyst’s schedule, with feature definitions duplicated between training and scoring.

Decisions they made. They put Snowflake at the center, with separate auto-suspending warehouses for transforms and scoring. All ~140 features were rewritten as a dbt project with 300+ tests; dbt build ran inside the nightly DAG and twice caught real upstream regressions (a bureau-field format change, a duplicated-row load) before they reached a score. Airflow on Astronomer ran two DAGs — a nightly scoring DAG (sense → dbt build → score → publish, finishing by 05:40 against a 06:00 SLA) and a weekly retraining DAG that registered a candidate to the MLflow Model Registry only on AUC/KS uplift. Scoring resolved the model from the Production stage at runtime; promotion was a registry transition gated through ServiceNow and signed by model risk. HashiCorp Vault issued short-TTL Snowflake credentials to Airflow; Okta provided SSO and RBAC (analysts read-only; only the pipeline identity wrote prod scores). OpenLineage fed DataHub for column-level lineage. Wiz watched data posture, CrowdStrike Falcon the runtime, Datadog the SLAs, and Terraform + GitHub Actions made the whole stack reproducible with dbt build/dbt test as required PR checks.

The numbers. ~4.1 million accounts re-scored nightly plus ~9,000 new applications triaged; the nightly DAG ran ~28 minutes of warehouse compute. Monthly run cost landed near ₹6.1 lakh (~$7,300): Snowflake credits ~$3,800 (kept low by auto-suspend and weekly — not nightly — retraining on spot training clusters), Astronomer ~$1,500, training compute ~$900, lineage/observability/the rest ~$1,100. Moving retraining from a reflexive nightly cadence to weekly, once drift monitoring showed the population was stable, cut training compute roughly fourfold with no measurable accuracy loss.

The outcome. The exam finding closed: any historical score is now reproducible from the pinned dbt tag, the immutable registry artifact, and the lineage graph — a query, not a project. Train/serve skew disappeared the day features moved into shared dbt models. And the metric that won over the CRO: the time to answer an examiner’s “how was this score produced” question dropped from roughly two days of manual reconstruction to under ten minutes of querying lineage and the registry.

When to use it

Use this architecture when predictions are needed on a schedule rather than in real time; correctness, reproducibility, and auditability matter more than millisecond latency; your features and data already live in (or can land in) a warehouse; and you need one governed path from raw data to scored output that a regulator, an auditor, or your own model-risk function can trace. That covers a large share of enterprise ML: credit and fraud scoring, churn and propensity, demand and inventory forecasting, lead scoring, and any “score everything overnight” workload.

Trade-offs to accept. Batch means latency by design — a score is as fresh as the last run, so this is wrong for anything that must react within seconds (fraud at the point of swipe, real-time bidding); those need a streaming or online-serving architecture. The pipeline has real moving parts — an orchestrator, a transform layer, training, a registry, lineage — which is more to operate than a single script, and that cost is only justified when reproducibility and scale actually matter. And the batch window is a hard ceiling: as data grows, finishing before the SLA becomes an engineering problem you must stay ahead of with parallelism and partitioning.

Anti-patterns. (1) Duplicating feature logic between training and scoring — the silent train/serve skew; put it in one dbt definition. (2) Doing heavy compute inside Airflow workers — push data work to the warehouse and elastic compute; Airflow orchestrates. (3) Hard-coding the model path in the scoring job — resolve from the registry so promotion is a governed transition, not a deploy. (4) No data-quality gates — without dbt tests and freshness sensors, the pipeline happily produces confidently-wrong scores. (5) Non-idempotent writes — retries and backfills double-count; key every write to a partition. (6) Treating lineage as optional — in a regulated context it is the deliverable, and bolting it on after an exam finding is the expensive way to learn that.

Alternatives, and when they win. If you need predictions in real time, you want an online feature store plus a low-latency model-serving tier (the streaming counterpart to this design), with batch retraining still feeding it. If your transformations are genuinely massive and unstructured — not SQL-shaped — a Spark/Databricks-native pipeline may fit better than dbt-in-warehouse, though dbt now runs on Databricks SQL too, so the line is blurring. If you are a small team that values speed over governance, a managed end-to-end MLOps platform (Vertex AI Pipelines, SageMaker Pipelines, or Databricks Workflows) will stand up batch ML with less assembly — graduate to this explicitly-composed Airflow/dbt/registry architecture when audit, lineage, and multi-team scale demand the control. The architecture here is the destination for batch ML that has to answer for itself, not always the starting line.

AirflowdbtMLOpsData WarehouseArchitectureEnterprise
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading