Architecture Azure

Building a Self-Service Data Platform with Snowflake and dbt on Azure

A national health-insurer’s analytics function gets a blunt message from its chief actuary: the team that prices policies and watches for claims fraud waits eleven days for every new dataset, because a single overloaded central data-engineering squad owns every pipeline, every table, and every change. When the fraud-analytics lead needs a joined view of claims, provider networks, and member eligibility, she files a ticket, waits in a queue behind forty other tickets, and by the time the table lands the fraud pattern has moved on. Meanwhile a different team has quietly built the same join in a spreadsheet, with a subtly different definition of “active member,” and now two parts of the business disagree on the membership number in a board deck. The ask from the CDO is “let the teams who understand the data build their own datasets — safely, governed, without waiting on us — and make sure there is one definition of every number.” This is a HIPAA-regulated environment handling PHI, so “safely” is not a slogan: member data cannot leak, every transformation must be auditable, and access has to be provable to an auditor. This article is the reference architecture for that self-service data platform on Azure, built on Snowflake and dbt, that an insurer’s CISO and compliance officer will actually sign.

The pressures stack the way they always do in a regulated data org. Throughput means dozens of analytics teams shipping datasets in parallel, not one squad serializing every request. Governance means a single trusted definition of each metric, lineage you can show an auditor, and PHI access scoped to exactly who is cleared. Cost means compute that scales to a quarter-end actuarial run and then gets out of the way, charged back to the team that spent it. And trust means every published dataset is tested, documented, and discoverable — so nobody rebuilds it in a spreadsheet. The pattern that satisfies all four is a self-service data platform: a central team owns the paved road — the warehouse, the transformation framework, the orchestrator, the catalog, the guardrails — and the analytics teams own their own data products on top of it, within rails they cannot accidentally step outside.

Why not the obvious shortcuts

The naive fixes each fail predictably, and naming why matters because someone on the project will propose all three.

Keep the central team and just hire more engineers scales linearly at best and the queue never empties — every new analytics team adds load to the same bottleneck, and the people who actually understand claims data are not the people writing the pipelines. Give every team raw warehouse access and let them do what they want produces exactly the spreadsheet chaos the CDO is trying to kill: forty definitions of “active member,” no tests, no lineage, and PHI sprayed across ungoverned tables. Buy a no-code ELT tool and point analysts at it moves fast for a quarter and then becomes an unversioned, untestable pile of point-and-click jobs nobody can review, audit, or reproduce — the opposite of what a HIPAA auditor wants to see.

The self-service platform threads the needle. The central data-platform team builds the rails: a Snowflake account with the network, identity, and RBAC locked down; a dbt project structure and CI that every team’s transformations flow through; an Airflow deployment that orchestrates the schedules; a Purview catalog that makes every dataset discoverable with its lineage; and policy guardrails that make the insecure path impossible. The analytics teams build the data products: their own dbt models, tested and documented, published into a curated zone other teams can trust and reuse. Self-service is not “no governance” — it is governance encoded into the platform so teams move fast without being able to do the dangerous thing.

Architecture overview

Building a Self-Service Data Platform with Snowflake and dbt on Azure — architecture

The platform runs along three layers that share infrastructure but live on different schedules: an ingestion layer that lands raw source data, a transformation layer where dbt turns raw data into trusted models on a schedule Airflow controls, and a consumption layer where analytics teams and BI tools read curated, cataloged data products. Keeping the three in your head — and the medallion zones (raw → staging → curated) that run through them — is the first step to operating this well.

The defining property of the topology is the one the security team cares about most: Snowflake’s data plane is reached only over Azure Private Link, and its public endpoint is locked down with a network policy. No query, no result set, and no PHI ever traverses the public internet between Azure and Snowflake — which is what makes a HIPAA story defensible. Snowflake runs on Azure (the same region as the data) precisely so the warehouse sits inside the residency and network boundary the auditor cares about.

Ingestion path, following the data flow:

  1. Source systems — the claims platform, the policy-admin system, the member-eligibility service — drop files and change-data-capture extracts into Azure Data Lake Storage Gen2 (ADLS), the durable landing zone and source of truth. Streaming sources (clickstream from the member portal, real-time eligibility checks) flow through Event Hubs into the same lake.
  2. Snowflake reads the lake through an external stage backed by a storage integration — a Snowflake object bound to an Azure managed identity, so Snowflake assumes a least-privilege identity to read ADLS with no storage account key ever stored in Snowflake. Snowpipe auto-ingests new files as they land, or scheduled COPY INTO loads them in batches into the raw zone tables. The raw zone is append-only and immutable: it is the replayable record of what arrived.

Transformation path, the heart of the platform:

  1. dbt (data build tool) is where every transformation lives — as version-controlled SQL SELECT statements that dbt compiles into tables and views inside Snowflake. Analytics teams write staging models (light cleaning, renaming, type-casting, one model per source table) and then curated/mart models (business logic — the one definition of “active member,” the joined claims-fraud view) on top. dbt handles the dependency graph, materialization, and — critically — tests: not_null, unique, accepted-value, and relationship tests that run as part of every build, plus the dbt docs site that turns the project into browsable, lineage-aware documentation.
  2. Apache Airflow (Azure-hosted, on AKS) orchestrates when dbt runs and what runs around it. A DAG sequences: land/verify source freshness → dbt run the staging layer → dbt testdbt run the marts → dbt test again → publish catalog metadata → trigger downstream refreshes. Airflow owns scheduling, retries, backfills, SLAs, and the cross-team dependencies dbt alone cannot express (e.g., the actuarial mart waits on both the claims and eligibility pipelines).
  3. Each team’s dbt models execute on a dedicated Snowflake virtual warehouse — a separate, independently sized compute cluster — so the fraud team’s quarter-end run cannot starve the actuarial team’s dashboards, and each warehouse’s cost is attributable to one team.

Consumption path: curated models land in a curated zone that other teams and BI tools (Power BI, served over Private Link) read. Every curated dataset is registered in Microsoft Purview, which scans Snowflake and ADLS, captures column-level lineage from the lake through dbt to the mart, classifies PHI columns, and gives analysts a searchable catalog — so a data product is discovered and reused, not rebuilt in a spreadsheet.

Component breakdown

Component Service / tool Role in the platform Key configuration choices
Landing zone Azure Data Lake Storage Gen2 Durable raw-file landing and source of truth Hierarchical namespace; lifecycle tiering; private endpoint
Warehouse Snowflake on Azure Elastic compute + storage; medallion zones; RBAC Private Link; network policy; per-team virtual warehouses
Lake access Snowflake storage integration + external stage Key-less, least-privilege read of ADLS Managed-identity binding; Snowpipe auto-ingest
Transformation dbt Versioned SQL models, tests, docs, lineage Staging→mart layering; CI on every PR; dbt test gate
Orchestration Apache Airflow (on AKS) Scheduling, retries, backfills, cross-team DAGs Source-freshness checks; SLA miss alerts; KubernetesPodOperator
Identity / SSO Okta + SCIM Workforce SSO into Snowflake; group→role provisioning SAML SSO; SCIM auto-provisioning of users and roles
Secrets HashiCorp Vault dbt/Airflow connection creds, key-pair, API tokens Dynamic Snowflake creds; short leases; Agent injection
Catalog & lineage Microsoft Purview Discovery, classification, end-to-end lineage Scheduled scans of Snowflake + ADLS; PHI auto-classification
CSPM / data posture Wiz + Wiz Code Cloud + IaC posture, sensitive-data exposure Agentless scan; Wiz Code scans Terraform/dbt repos in PR
Runtime security CrowdStrike Falcon Runtime protection on the Airflow/AKS nodes Sensor on node pool; detections to the SOC
Observability Datadog Pipeline/freshness telemetry, cost + query metrics Airflow + Snowflake integrations; freshness & cost monitors
ITSM / approvals ServiceNow Onboarding approvals, change gates, incident records Change gate before prod promotion; auto-ticket on SLA breach
CI/CD GitHub Actions + Argo CD Build/test dbt in CI; GitOps-deploy Airflow DAGs OIDC to Azure; dbt build gate; Argo CD syncs DAGs to AKS
IaC Terraform + Ansible Provision Snowflake/Azure; configure Airflow nodes Snowflake provider for RBAC/warehouses; Ansible node config

A few of these choices deserve the why, because they are the ones teams get wrong.

Why dbt is the contract, not just a tool. The reason self-service does not collapse into spreadsheet chaos is that every transformation is a dbt model in git, and therefore reviewable, testable, and reproducible. A team cannot publish a curated dataset without a pull request, passing tests, and column documentation — the platform enforces it. dbt’s tests are the mechanism that gives the chief actuary her “one definition”: the active_member model has a unique test on member-id and an accepted-values test on status, so a model that would produce duplicate or impossible members fails CI and never ships. dbt’s compiled lineage also feeds Purview, so discovery and impact analysis come for free.

Why per-team virtual warehouses, not one big one. Snowflake separates storage from compute, and a virtual warehouse is just a compute cluster you size and pause independently. Give each analytics team its own warehouse and three things become true at once: their workloads are isolated (the fraud team’s heavy join cannot slow the actuarial dashboards), their spend is attributable (one warehouse, one cost line, one chargeback), and their scaling is independent (size the actuarial warehouse LARGE for quarter-end, keep the dashboard warehouse XSMALL and auto-suspended). One shared warehouse throws all of that away.

Why Okta with SCIM, not just SSO. SSO alone authenticates the human; SCIM keeps the authorization in sync automatically. When HR moves an analyst into the fraud team in Okta, SCIM provisions her Snowflake user and assigns the FRAUD_ANALYST role — and when she leaves, SCIM deprovisions her, instantly revoking PHI access. For a HIPAA shop, that automatic join-mover-leaver flow is the difference between an access review that passes and one that finds a terminated employee who could still query member data.

Implementation guidance

Provision with Terraform, and treat identity and network as the first deliverables. The deployment order matters: get the Private Link and storage integration wrong and either queries hang or Snowflake silently cannot read the lake.

  1. The Azure foundation — VNet with subnets, ADLS Gen2 with a private endpoint, and the managed identity Snowflake’s storage integration will assume.
  2. Snowflake Private Link from the VNet to the Snowflake account, plus a Snowflake network policy restricting logins to the corporate/Private Link ranges.
  3. The Snowflake RBAC model and warehouses via the Terraform Snowflake provider — databases per zone, per-team roles, and a virtual warehouse per team, all as code.
  4. The storage integration binding Snowflake to the Azure managed identity, the external stage, and Snowpipe.
  5. The Airflow deployment on AKS, with Ansible configuring node-level concerns, and Argo CD wired to GitOps-deploy DAGs.

A minimal Terraform shape for a team’s warehouse and role communicates the intent — isolated compute, auto-suspend, least privilege:

resource "snowflake_warehouse" "fraud" {
  name           = "WH_FRAUD"
  warehouse_size = "MEDIUM"
  auto_suspend   = 60          # seconds idle before it stops billing
  auto_resume    = true
  initially_suspended = true
}

resource "snowflake_role" "fraud_analyst" {
  name = "FRAUD_ANALYST"
}

# Read curated, write only the team's own schema — no raw-zone PHI by default
resource "snowflake_grant_privilege_to_role" "fraud_curated_read" {
  role_name  = snowflake_role.fraud_analyst.name
  privileges = ["USAGE"]
  on_schema { schema_name = "ANALYTICS.CURATED" }
}

The pipeline that applies this runs in GitHub Actions, authenticating to Azure via OIDC federation so there is no stored service-principal secret to leak. The same pipeline runs dbt build (run + test in one) against a CI Snowflake schema as a required gate, and Argo CD then syncs the validated Airflow DAGs onto AKS so deployment is GitOps, not a manual copy.

Identity: federate the humans, provision the roles. Analysts log in to Snowflake through Okta with SAML SSO and conditional access; SCIM keeps Snowflake users and role membership in lock-step with Okta groups, automatically provisioning on join and deprovisioning on leave. Service identities — the dbt and Airflow connections into Snowflake — do not use stored passwords; they pull short-lived key-pair credentials from HashiCorp Vault, leased dynamically and injected by the Vault Agent sidecar, so a CI runner or Airflow worker never holds a long-lived Snowflake secret. PHI columns are protected in Snowflake with dynamic data masking and row-access policies keyed to the caller’s role, so an analyst sees a masked SSN unless their role is explicitly cleared.

dbt project wiring. Structure the project in layers other teams can reason about: staging/ (one model per source, light cleaning), intermediate/ (reusable joins), and marts/ (the business-facing curated models). Make tests non-negotiable — a unique + not_null test on every primary key, relationship tests on foreign keys, and dbt source freshness checks so a stale upstream extract fails the run instead of silently producing yesterday’s numbers. Carry an effective_date and a data-classification tag on PHI-bearing models so Purview can classify them and downstream consumers can filter correctly.

Enterprise considerations

Security & Zero Trust. The architecture is Zero Trust by construction: Private Link only, network policy on Snowflake, least-privilege RBAC per team, key-less lake access, and no long-lived service passwords. Layer on top: (a) dynamic masking and row-access policies so PHI is protected at query time by the caller’s role, not by hoping the query is written correctly; (b) Wiz running continuous CSPM and sensitive-data-exposure scanning across ADLS and the Snowflake-adjacent Azure footprint, alerting the moment a storage account drifts to public or an ACL widens — and Wiz Code scanning the Terraform and dbt repos in the pull request, so an insecure grant or a public-bucket change is blocked before merge, not found in production; © CrowdStrike Falcon sensors on the AKS node pool running Airflow for runtime threat detection, feeding the insurer’s SOC; (d) an SLA breach or a failed PHI-access control auto-raises a ServiceNow incident so security has a ticket, not just a log line. Azure Policy denies any storage account created with public access, and Wiz independently verifies the policy is actually holding.

Cost optimization. Snowflake compute dominates and is the easiest place to leak money, so engineer for it from day one.

Lever Mechanism Typical effect
Per-team warehouses Isolate + attribute spend to one cost line each Makes each team own its compute, enables chargeback
Auto-suspend / resume Suspend idle warehouses in 60s; resume on demand You pay only for seconds of actual query time
Right-sizing Size each warehouse to its workload, not the peak Dashboard warehouse XSMALL, actuarial LARGE only at quarter-end
Incremental dbt models Process only new/changed rows, not full rebuilds Cuts compute on large fact tables dramatically
Result/metadata caching Snowflake serves repeat queries from cache for free Deflects re-runs of unchanged dashboards
Resource monitors Hard credit quotas per warehouse, alert/suspend at threshold Caps runaway spend before the invoice does

Meter credits per warehouse and pipe the metric to Datadog, which the platform team uses for the chargeback dashboard the CFO sees and for the alert when a team’s warehouse blows past its monthly budget.

Scalability. Each layer scales independently. Snowflake compute scales up (a bigger warehouse for a heavier query) and out (multi-cluster warehouses that add clusters under concurrency — perfect for many analysts hitting dashboards at 9am Monday), entirely separate from storage. Ingestion scales with Snowpipe’s serverless auto-load. Airflow scales workers on AKS via the cluster autoscaler and runs each task as a KubernetesPod, so a quarter-end backfill of hundreds of DAG runs gets compute and then releases it. The natural ceiling is account-level credit budget, not a hardware limit — which is why resource monitors and per-team chargeback go in early.

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Decide the numbers per layer. ADLS is geo-redundant (GZRS) and is the durable source of truth — the entire warehouse can be rebuilt from it by replaying ingestion and dbt build. Snowflake’s Time Travel recovers dropped or corrupted tables within the retention window, and replication to a secondary Snowflake account in a paired region gives cross-region DR for the warehouse itself. Because every model is code in git and every input is in geo-redundant ADLS, the platform is rebuildable by design — the strongest recovery guarantee there is. A pragmatic target: RTO 4 hours, RPO 1 hour for the curated layer, with raw data effectively never lost (it is in GZRS Blob), and the dbt project re-runnable into a fresh Snowflake account from version control.

Observability. Instrument the pipeline end to end in Datadog: the Airflow integration for DAG run status, durations, retries, and SLA misses; the Snowflake integration for query performance, credit consumption, and warehouse queue depth. Emit the metrics the business actually cares about — data freshness per product (how old is the newest row), dbt test pass-rate, pipeline success-rate and lateness, and credits/cost per team. Run dbt build inside the GitHub Actions pipeline so a model change is tested before it ships, and surface the dbt docs lineage site so any analyst can trace a number from mart back to source. A new data product passes through a ServiceNow change approval before promotion to the curated zone, giving compliance a documented gate.

Governance. Pin dbt and Snowflake artifacts to version control — models, tests, and the RBAC defined in Terraform are all reviewable and instantly revertable. Microsoft Purview is the governance backbone: scheduled scans classify PHI columns, capture column-level lineage from ADLS through dbt to the mart, and give every dataset an owner and a description so it is discoverable and trusted instead of rebuilt. Apply Azure Policy to deny public storage and require diagnostic logging, with Wiz as the independent check that the controls are real. Log every query against PHI for audit, and let SCIM-driven deprovisioning prove to an auditor that access ends the day employment does.

Explicit tradeoffs

Accept these or do not build it. A self-service platform adds real moving parts — a dbt project and CI to maintain, an Airflow deployment to operate, a catalog to keep populated, and RBAC that someone must curate. You are trading a single team’s bottleneck for a platform team’s operating burden: the queue goes away, but the paved road needs an owner who keeps the guardrails, the templates, and the onboarding sharp. Self-service also demands discipline from the analytics teams — they must write tests and docs, and the platform has to enforce that in CI or you are back to spreadsheet chaos with extra steps. The per-team warehouse model that gives you isolation and chargeback also means more objects to manage and a real RBAC design up front. And the Private Link, masking, and SCIM posture that makes the security team sign costs setup complexity a quick POC would skip — and absolutely cannot skip for PHI at scale.

The alternatives, and when they win. If you are a small shop with one analytics team and a modest data volume, a single warehouse with a lightweight ELT tool is simpler and you do not need this. If your transformations are genuinely real-time and event-driven rather than batch/micro-batch, a streaming-first stack (Flink/ksqlDB into the lake) fits better than dbt-on-a-schedule — though the two compose, with streaming feeding the raw zone. If you are all-in on Azure-native and want to avoid a second vendor, Microsoft Fabric / Synapse with dbt covers much of this ground, trading Snowflake’s compute-isolation and cross-cloud portability for tighter Azure integration. And if you only need to hand a few analysts governed access to existing tables, a warehouse plus Purview and RBAC — without the full dbt-and-Airflow platform — is a smaller first step. Graduate to this full self-service platform when the queue, the duplicate definitions, the governance load, or the team count demand it.

The shape of the win

For the insurer’s analytics function, the payoff is not “another warehouse.” It is that the fraud-analytics lead, on the day she spots a new pattern, opens a pull request with a dbt model, watches CI run her tests and the platform publish a curated, documented, lineage-tracked dataset into a zone the actuarial team can discover in Purview and reuse the same afternoon — no ticket, no eleven-day queue, and exactly one definition of “active member” that every board deck now agrees on. And because the data never left the Private Link boundary, PHI is masked by role, and SCIM proves access ends with employment, compliance approved the teams to build their own data products — which a free-for-all of raw access could never have cleared. That last sentence is the one that funds the platform. Everything upstream — the storage integration, the per-team warehouses, the dbt test gate, the Vault-held credentials, the Wiz Code scans, the Purview lineage, the Datadog freshness monitors — exists to let dozens of teams move fast while a CISO, an auditor, and a CFO each say yes. The architecture here is the destination; start narrower if you must, but this is where governed, at-scale self-service data has to land.

AzureSnowflakedbtAirflowData PlatformEnterprise
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading