Data Multi-cloud

A Modern Data Warehouse: Snowflake, dbt and BI

A modern data warehouse looks deceptively simple from the outside: load the data, transform it, point a dashboard at it. The reason it goes wrong is that every one of those verbs hides a decision that compounds. Where transformation logic lives decides whether five teams report five different revenue numbers. How compute is sized decides whether your cloud bill grows linearly with usage or exponentially. Who can see which rows decides whether a self-service analytics rollout passes audit or gets shut down by the regulator. This article is a reference architecture for an enterprise analytics platform built on Snowflake for storage and compute, dbt for transformation, a semantic layer to make metrics consistent, and BI tools on top — with the identity, secrets, observability, and especially the cost-governance controls that separate a platform from a science project. It is written for the intermediate practitioner who has built a warehouse before and watched one or more of those decisions bite.

The business scenario

The driver here is a mid-size payments company — call it a fintech processing card and ACH transactions for a few thousand merchants. The business has three pressures pulling in different directions, and they are the pressures that make this architecture necessary rather than optional.

The first is regulatory. As a money-movement business they sit under PCI-DSS for cardholder data, and their merchant agreements and banking partners impose SOC 2 controls on top. Auditors do not ask “do you have a data warehouse” — they ask “who queried cardholder data last Tuesday, what could they see, and can you prove a marketing analyst never saw a PAN.” That question has to have a clean answer or the platform is a liability.

The second is scale and latency. Transaction volume is spiky — Black Friday, payroll Fridays, the end of the month — and the analytics that matter most (fraud scoring backtests, settlement reconciliation, merchant-facing dashboards) are needed during those spikes, not the morning after. A warehouse that is sized for the average will fall over at the peak; one sized for the peak will burn money for the other 27 days of the month.

The third is cost. The previous generation of this platform was a fixed-size cluster the finance team understood: a known monthly number, painful to scale, but predictable. Snowflake’s consumption model is the opposite — infinitely elastic, billed by the second of compute — and the first month a few analysts discovered SELECT * on a billion-row table with auto-suspend misconfigured, the bill tripled and the CFO asked, reasonably, why moving to the cloud made costs less predictable. Cost governance is not a footnote in this architecture; it is a first-class design pillar, because a consumption warehouse with no guardrails is a credit card with no limit handed to every team that can write SQL.

The naive approaches fail predictably. Transformations in stored procedures and hand-written SQL scripts leave you with no lineage, no tests, and logic nobody can safely change six months later. Letting each BI tool define its own metrics gives you the classic disaster where Finance’s “active merchant” and Product’s “active merchant” differ by 8% and nobody can say which is right in the board meeting. One giant shared warehouse for everything means a runaway analyst query starves the executive dashboard, and you cannot attribute a single dollar of spend to a single team. The architecture below threads all three needles.

Architecture overview

A Modern Data Warehouse: Snowflake, dbt and BI — architecture

The platform separates cleanly into five planes that you should keep mentally distinct: ingestion (getting raw data in), transformation (turning raw into trustworthy), serving (the semantic layer and BI), identity & security (who can do what), and operations (CI, secrets, observability, cost). Conflating any two of them is how these projects rot.

Ingestion. Source data lands in cloud object storage first — the transaction database changes flow through a CDC pipeline (Fivetran or Debezium into a Kafka topic, then to Amazon S3 / Azure ADLS / Google Cloud Storage depending on your cloud), and event streams (clickstream, app telemetry) land the same way. Snowflake ingests from that storage in two modes: Snowpipe for continuous, file-arrival-triggered micro-batches (settlement files, hourly extracts) and bulk COPY INTO for large historical backfills. Critically, raw data lands in a RAW database untouched — schema-on-read, no transformation — so the warehouse is the single source of truth and any transformation bug is replayable from raw without re-pulling from sources.

Transformation. This is dbt’s domain, and the architecture’s center of gravity. dbt runs inside Snowflake — it compiles models to SQL and executes them on Snowflake compute, so there is no separate transformation cluster to operate. Models are organized into the medallion-style layers most teams converge on: staging (one model per source table, light cleaning and renaming), intermediate (business logic, joins, the hard CTEs), and marts (the wide, conformed tables BI actually queries — fct_transactions, dim_merchant, fct_settlements). dbt’s killer features here are not the SQL templating; they are lineage (a DAG of every model’s dependencies, auto-generated), tests (not_null, unique, relationships, plus custom assertions like “settlement amount must reconcile to transaction sum within tolerance”), and documentation generated from the code itself.

Serving. On top of the marts sits the semantic layer — the contract that makes “revenue” mean one thing. Rather than each dashboard re-implementing the metric in its own SQL, metrics are defined once (in dbt’s MetricFlow semantic models, or a dedicated layer like Cube) as revenue = sum(amount) where status = 'settled', with their dimensions and time grains. BI tools — Tableau, Power BI, Looker, or Sigma — query through the semantic layer (or against the conformed marts when the tool has its own modeling layer, as Looker does with LookML). The result: the number on the executive dashboard and the number in the analyst’s ad-hoc exploration are computed from the same definition.

The control flow, end to end: CDC and batch loads land raw in object storage → Snowpipe/COPY ingests to RAWGitHub Actions (or Jenkins) triggers a dbt build on merge to main, which runs staging → intermediate → marts on a dedicated transform warehouse, running tests at each layer → the semantic layer exposes governed metrics over the marts → BI tools, authenticated via Okta SSO, query their own dedicated BI warehouses. Identity, secrets, posture, and cost controls wrap the whole thing.

Component breakdown

Concern Tool / Snowflake feature Role in the platform Key configuration choices
Raw landing S3 / ADLS / GCS + Snowpipe / COPY INTO Durable raw zone, single source of truth, replayable External stage with storage integration (no static keys); Snowpipe auto-ingest on object events
Compute Snowflake virtual warehouses Isolated, per-workload elastic compute Separate warehouses per team/workload; auto-suspend 60s; resource monitors; multi-cluster for concurrency
Storage Snowflake (columnar, micro-partitioned) Compressed, time-travel-enabled table storage Time Travel retention by tier; Fail-safe for recovery; clustering keys only where measured
Transformation dbt (Core or Cloud) Versioned, tested, lineage-tracked SQL transforms Staging/intermediate/marts layering; tests at each layer; dbt build in CI; slim CI on PRs
Semantic layer dbt MetricFlow / Cube / LookML Single definition of every business metric Metrics defined once; dimensions + grains declared; BI queries through it
BI Tableau / Power BI / Looker / Sigma Dashboards, self-service exploration SSO via Okta; row-level security mapped to Snowflake roles; dedicated BI warehouse
Identity Okta (SAML/SCIM) → Snowflake; Entra ID alt. SSO, MFA, lifecycle provisioning SCIM provisions users/roles; SAML for login; SSO into BI tools too
Secrets HashiCorp Vault Service-account creds, storage keys, dbt profiles Dynamic short-lived DB creds; no secrets in CI or repos
Data posture Wiz (DSPM) Find unprotected sensitive data, misconfig Scans Snowflake for ungoverned PII, public shares, over-grants
Runtime security CrowdStrike Falcon Endpoint/workload protection on CI runners, analyst machines Detects credential theft, anomalous access from compromised hosts
Observability Datadog / Dynatrace Query telemetry, freshness, cost dashboards Snowflake ACCOUNT_USAGE + dbt artifacts ingested; SLO alerts
CI/CD GitHub Actions / Jenkins Run/test dbt, gate merges, deploy dbt build + tests required to merge; environments per branch
IaC Terraform Provision Snowflake objects, roles, monitors Warehouses, databases, roles, resource monitors as code
ITSM / approval ServiceNow Access requests, change approval, audit trail Role-grant requests routed for approval; ties grant to a ticket

A few choices deserve the why, because they are the ones teams get wrong.

Why separate virtual warehouses per workload, not one big one. Snowflake decouples storage from compute, and a “virtual warehouse” is just a compute cluster you can spin up, size, and suspend independently. The instinct is to run everything on one large warehouse for simplicity. Do not. Give the dbt transform job its own warehouse, each BI tool its own, ad-hoc analysts their own, and data science theirs. This buys you three things at once: isolation (a runaway analyst query cannot starve the executive dashboard), right-sizing (the transform job might want an L for an hour while BI runs fine on S), and — the one finance cares about — cost attribution, because every credit Snowflake bills is tagged to the warehouse that burned it, so you know exactly which team spent what.

Why the semantic layer is non-negotiable at this scale. Without it, a metric like “active merchant” gets re-implemented in Tableau, in Power BI, and in three different analysts’ notebooks, and they drift. The semantic layer makes the definition code: active_merchant = count(distinct merchant_id) where last_transaction_at > current_date - 30, defined once, versioned in git, queried by every tool. When Finance and Product disagree in a meeting, the answer is “open the semantic model” rather than a week of reconciliation. It is the difference between self-service analytics that scales trust and self-service analytics that scales chaos.

Why dbt runs transformations in Snowflake. dbt does not have its own compute — it generates SQL and pushes it down to the warehouse. This is a feature, not a limitation: there is no second cluster to operate, scale, or secure; transformation runs on the same governed, audited engine as everything else; and you get Snowflake’s elasticity (size the transform warehouse up for the nightly heavy build, suspend it the rest of the day) for free.

Implementation guidance

Provision everything as code. Use Terraform with the Snowflake provider so warehouses, databases, schemas, roles, and — critically — resource monitors are version-controlled and reviewable, not clicked into existence by whoever had ACCOUNTADMIN that week. A minimal shape communicates the intent: a dedicated transform warehouse with aggressive auto-suspend and a hard monthly credit ceiling.

resource "snowflake_warehouse" "transform" {
  name                = "WH_TRANSFORM_PROD"
  warehouse_size      = "LARGE"
  auto_suspend        = 60      # seconds idle before it stops billing
  auto_resume         = true
  initially_suspended = true
}

resource "snowflake_resource_monitor" "transform_budget" {
  name            = "RM_TRANSFORM"
  credit_quota    = 4000        # monthly credit cap
  frequency       = "MONTHLY"
  warehouse {
    warehouse_name = snowflake_warehouse.transform.name
  }
  notify_triggers = [50, 75, 90]   # alert thresholds (% of quota)
  suspend_trigger = 100            # stop the warehouse at 100%
}

That suspend_trigger = 100 is the single most important line in the file: it is the hard stop that turns “the bill tripled overnight” into “the transform warehouse paused and paged us at 90%.”

Layer the dbt project, and test at every layer. Staging models are thin (stg_payments__transactions), one per source, doing only renames, type casts, and light cleaning. Intermediate models hold the genuinely hard business logic. Marts are the wide conformed tables BI queries. Put tests everywhere — not just not_null/unique on keys, but business assertions:

models:
  - name: fct_settlements
    columns:
      - name: settlement_id
        tests: [not_null, unique]
    tests:
      - dbt_utils.expression_is_true:
          expression: "settled_amount <= gross_transaction_amount"

A test failure should fail the build in CI, so bad data never reaches a mart and never reaches a dashboard. Use dbt’s slim CI on pull requests — build and test only the models that changed and their downstream dependents, against a CI schema, so a PR check takes minutes not the full nightly hour.

Kill the static credentials. dbt, Snowpipe, and the BI tools all need to authenticate to Snowflake; none of them should hold a long-lived password in a config file. Use HashiCorp Vault to issue dynamic, short-lived database credentials to the CI runner at build time — Vault’s database secrets engine creates a Snowflake user scoped to the transform role, valid for the length of the job, then revoked. Storage access uses Snowflake storage integrations (an IAM role trust relationship) rather than access keys, so there is no S3/ADLS key to leak. The dbt profiles.yml in CI references a Vault-injected credential, never a hardcoded one.

Wire identity through Okta. Configure Okta as the SAML identity provider for Snowflake login and as the SCIM provisioning source, so when HR onboards an analyst in Okta, the Snowflake user and their role assignments are created automatically — and, just as importantly, deprovisioned automatically when they leave (the access that auditors most often find dangling). Map Okta groups to Snowflake roles: an analytics-marketing Okta group maps to a Snowflake role that can read marketing marts but cannot see the RAW.PAYMENTS schema where cardholder data lives. Federate the BI tools to Okta too, so a single SSO login with MFA flows from the dashboard down to the warehouse, and the audit trail is continuous.

Enterprise considerations

Security & data governance. This is where the regulated-fintech scenario earns its keep. Cardholder data (PAN) is tokenized at ingestion so the raw PAN never lands in the warehouse at all; what does land is governed with Snowflake’s native controls: dynamic data masking (a masking policy returns ****-****-****-1234 to anyone without the pci_reader role and the full value to those with it), row access policies (a merchant-facing analyst sees only their assigned merchants’ rows), and object tagging to label every column carrying PII so the controls are discoverable. Wiz runs continuous data security posture management (DSPM) over the Snowflake account — it scans for the failure modes humans miss: a table of PII with no masking policy, a Snowflake share accidentally exposed to another account, a role granted SELECT on RAW it should never have. CrowdStrike Falcon protects the runtime around the warehouse — the CI runners that hold transient credentials and the analyst workstations — catching credential theft or anomalous access from a compromised host before it becomes a breach. Every access request to a sensitive role routes through ServiceNow, so a grant is tied to an approved ticket and the auditor’s “who could see cardholder data and why” question has a clean paper trail.

Cost optimization — the pillar this architecture is built around. Consumption billing rewards engineering discipline and punishes its absence. The controls, in priority order:

  1. Auto-suspend aggressively. Set auto_suspend = 60 on every warehouse. An idle warehouse bills nothing; the default looser settings quietly bill for the gap between queries. This one setting is often a double-digit-percent saving.
  2. Resource monitors with hard caps. Every warehouse gets a monitor with notify_triggers and a suspend_trigger (the Terraform above). This is the circuit breaker that makes a runaway query a page instead of a quarter-end surprise.
  3. Right-size per workload. A warehouse is sized XS to 4XL, and each step roughly doubles the credit burn. Bigger is faster but linearly more expensive — and a query that finishes in 8 s on an L costing twice an M that runs it in 12 s is usually the wrong trade for BI. Size to the workload, not to “fast.”
  4. Multi-cluster for concurrency, not size. When 200 analysts hit a BI warehouse at 9 a.m., the fix is multi-cluster mode (Snowflake adds identical clusters to absorb concurrent queries, then removes them) — not a bigger single warehouse, which makes each query cheaper-per-second but does nothing for the queue.
  5. Make spend observable and attributable. Stream Snowflake’s ACCOUNT_USAGE views into Datadog (or Dynatrace) for per-warehouse, per-query cost dashboards, and tag warehouses by team so finance can do real chargeback. The most expensive queries are almost always a handful of un-tuned SELECT * joins a dashboard reruns hourly; you cannot fix what you cannot see.

The economics of a “transform warehouse that runs heavy for one hour nightly, suspended otherwise” versus “an always-on cluster sized for that one hour” is the whole reason Snowflake’s model wins here — if you actually suspend it.

Scalability. Each plane scales independently. Compute scales two orthogonal ways and you must not confuse them: warehouse size (XS→4XL) scales single-query speed and the size of data one query can chew through; multi-cluster scales concurrency, the number of simultaneous queries. Storage scales transparently — Snowflake’s micro-partitioned columnar store grows without your involvement and you pay only for compressed bytes. The transformation plane scales by sizing the transform warehouse to the nightly build and by dbt’s incremental models (re-process only new/changed rows, not the full history) on the billion-row fact tables. BI concurrency scales via multi-cluster on the BI warehouses. The architecture has no single bottleneck because the planes do not share compute.

Reliability & recovery (RTO/RPO). Snowflake’s Time Travel lets you query or restore a table as of any point in a retention window (1 day on standard, up to 90 on Enterprise) — so an analyst who DELETEs the wrong rows, or a bad dbt deploy that corrupts a mart, is a CREATE TABLE ... CLONE ... AT(timestamp) away from recovery, with effectively zero RPO inside the window. Fail-safe adds a further 7-day Snowflake-managed recovery for disaster cases. Because raw data is preserved untouched and dbt transformations are deterministic code in git, the entire transformed layer is reproducible: lose every mart and you rebuild them with dbt build from raw. For region-level DR, Snowflake database replication and failover maintains a synchronized secondary in another region/cloud; a pragmatic enterprise target is RTO under 1 hour, RPO under 15 minutes for the serving layer, with marts fully rebuildable from raw if needed. The combination — immutable raw + versioned transforms + Time Travel — is a stronger recovery story than most traditional warehouses ever had.

Failure modes to design for. (1) Source schema drift — an upstream table adds or renames a column and staging breaks; dbt’s source freshness checks and schema tests catch it at build, before a dashboard shows nulls. (2) The runaway query — handled by resource monitors and per-warehouse isolation, so blast radius is one team’s warehouse, not the platform. (3) Stale data — a Snowpipe or dbt failure leaves marts hours behind; dbt source freshness thresholds plus a Datadog freshness SLO alert on the marts catch it before a user reports a wrong number. (4) Semantic drift — someone bypasses the semantic layer and writes a metric by hand; guard this by making the conformed marts the only sanctioned BI source and reviewing metric changes in PRs. (5) Credential leak — mitigated by Vault’s short-lived creds (a leaked credential expires in minutes) and Wiz/Falcon detection.

Observability. Instrument three things, because they are the three questions you will be asked. Freshness — how current is each mart, fed from dbt source-freshness and surfaced as a Datadog/Dynatrace SLO. Quality — dbt test pass/fail per build, with a failed test paging and blocking the deploy. Cost — per-warehouse, per-team, per-query credit burn from ACCOUNT_USAGE, with anomaly alerts so a 3x day pages on day one, not at the month-end invoice. dbt’s run artifacts (run_results.json, manifest.json) feed model timing and lineage into the same observability stack, so a slow nightly build is diagnosable to the exact model.

Reference enterprise example

Cobalt Pay, a fictional payments processor (~1,400 employees, ~6,000 active merchants), rebuilt its analytics platform on this architecture after its legacy fixed-cluster warehouse could neither keep up with month-end settlement load nor satisfy a SOC 2 auditor’s access questions.

Decisions they made. Raw card and ACH transaction CDC flowed via Fivetran to S3 and into a Snowflake RAW database, with PAN tokenized at the ingestion boundary so no clear cardholder number ever entered the warehouse. They ran four warehouse tiers: WH_TRANSFORM (L, auto-suspend 60s, the nightly and hourly dbt builds), WH_BI (M, multi-cluster 1–4 for the 9 a.m. dashboard rush), WH_ADHOC (S, analyst exploration), and WH_DATASCI (L, fraud-model backtesting). Every warehouse carried a resource monitor with a hard suspend_trigger at 100% and alerts at 75/90. dbt organized ~340 models across staging/intermediate/marts, with business tests (settlement-to-transaction reconciliation chief among them) failing the GitHub Actions build on breach. Metrics — “net revenue,” “active merchant,” “settlement lag” — were defined once in the dbt semantic layer and consumed by Tableau (exec and merchant-facing dashboards) and Sigma (analyst self-service). Okta drove SSO and SCIM, mapping groups to roles so marketing analysts physically could not query the RAW.PAYMENTS schema; dynamic masking and row access policies enforced the rest. HashiCorp Vault issued short-lived Snowflake creds to CI; Wiz scanned for ungoverned PII and stray shares; CrowdStrike Falcon covered the runners and analyst laptops; sensitive role grants routed through ServiceNow.

The numbers. Compute settled around 9,500 credits/month across the four warehouses, roughly ₹19 lakh (~$22,800) including storage and Snowpipe — against a legacy cluster that had been costing more and falling over at peak. The decisive move was auto-suspend plus right-sizing: the transform warehouse, which ran heavy for about 90 minutes nightly and was suspended the rest of the day, would have cost ~3x as an always-on L. Resource monitors caught two incidents in the first quarter — an unbounded ad-hoc join and a mis-scheduled backfill — at the 90% alert, before either reached the cap. The semantic layer ended an 8% discrepancy between Finance’s and Product’s “active merchant” count that had previously consumed a recurring meeting.

The outcome. Month-end settlement reporting that used to slip into the next business day now completed inside the nightly window. The SOC 2 audit passed on the access-control evidence: Okta SCIM proved provisioning and deprovisioning, masking policies proved no marketing analyst could see a PAN, and ServiceNow tied every sensitive grant to an approved ticket — the clean answer to “who could see cardholder data” that the legacy platform never had. And finance got its predictability back in a better form than before: not a fixed number, but a governed one, with per-team chargeback and a hard cap that made a runaway query a page rather than an invoice surprise.

When to use it

Use this architecture when you have multiple teams consuming shared data, you need a single trusted definition of your key metrics, your data carries regulatory or sensitivity obligations, and your workload is spiky enough that elastic consumption beats a fixed cluster. That covers most mid-to-large enterprise analytics: fintech and payments, retail and e-commerce, SaaS product analytics, logistics, anywhere “self-service BI that the auditor and the CFO both trust” is the goal.

Trade-offs to accept. Consumption billing means cost is a continuous engineering responsibility, not a fixed line — the governance controls here are mandatory, not optional, and a team that ignores them will be surprised by the bill. The dbt + semantic-layer discipline adds upfront modeling work before the first dashboard ships; you trade speed-to-first-chart for trust-at-scale, which is the right trade for an enterprise and the wrong one for a weekend prototype. And the multi-tool stack (Snowflake, dbt, BI, Okta, Vault, Wiz, observability) is more moving parts to operate than a single bundled platform — justified by the governance and isolation, but real.

Anti-patterns. (1) One giant shared warehouse — kills cost attribution and lets one query starve everyone; isolate per workload. (2) No resource monitors — a consumption warehouse with no hard cap is an unbounded liability. (3) Metrics defined in each BI tool — guarantees the “whose number is right” board-meeting disaster; centralize in the semantic layer. (4) Transformations in stored procedures and scripts — no lineage, no tests, no safe change; that is what dbt exists to replace. (5) Static credentials in dbt profiles or CI — leakable and unrotated; use Vault dynamic creds and storage integrations. (6) Skipping data tests — bad data reaches dashboards silently and erodes the trust the platform exists to build.

Alternatives, and when they win. If your data is small, your team is two analysts, and governance is not yet a concern, a single DuckDB or Postgres warehouse with dbt is dramatically simpler — graduate to this when scale, concurrency, or audit demand it. If you are all-in on one cloud and want tighter native integration, BigQuery or Databricks SQL / lakehouse occupy the same role as Snowflake here and the rest of the pattern (dbt, semantic layer, Okta, cost governance) ports almost unchanged — the choice is about your cloud, your lakehouse-vs-warehouse preference, and your team’s skills, not about the shape of the architecture. And if your dominant need is streaming sub-second analytics rather than the minutes-to-hours batch this serves, a real-time engine (Materialize, ClickHouse, Snowflake’s own Dynamic Tables for the near-real-time middle ground) belongs alongside, not instead of, this warehouse. The architecture here is the governed analytical backbone; specialized engines hang off it where latency demands.

SnowflakedbtData WarehouseAnalyticsCost GovernanceReference Architecture
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading