Architecture AWS

Snowflake and dbt on AWS: A Governed Analytics Engineering Stack

A national health-insurance carrier’s data team gets a blunt directive from the new Chief Data Officer: the actuarial, claims, and fraud teams each maintain their own copy of “the truth,” every quarterly loss-ratio number is argued over in a meeting before it reaches the board, and the last regulatory filing slipped because three teams could not reconcile member-month counts. Worse, an auditor flagged that analysts were running ad-hoc SQL with broad table grants against a replica of the claims database, with no record of who transformed what. The ask is “one governed source of truth, with every number reproducible from raw to report, and an audit trail a regulator will accept.” The constraint is the one health-insurance always carries: this is PHI under HIPAA, the data cannot traverse the public internet, and access has to be least-privilege and provable. A pile of stored procedures and a shared analyst login is not the answer. This article is the reference architecture for building that platform properly: Snowflake as the governed warehouse on AWS, dbt as the transformation and contract layer, and a delivery pipeline disciplined enough that the CDO, the CISO, and the auditor each sign.

The pressures here are the ones every regulated analytics shop hits. Governance means every metric must trace back through a versioned transformation to a raw source, with who-changed-what recorded. Trust means one definition of “active member” and “incurred claim,” not one per team. Compliance means PHI never leaves the carrier’s network boundary and access is role-scoped and auditable. And cost means warehouse spend that quietly triples when someone leaves an analysis running over the weekend has to be capped before finance notices. The pattern that satisfies all four is analytics engineering: ingest raw data once, transform it with version-controlled, tested SQL that ships through CI, and serve governed data products that everyone queries instead of reinventing. The raw data lands once; the transformations are code; the definitions live in one place.

Why not the obvious shortcuts

The naive fixes each fail predictably, and naming why matters because someone on the project will propose all three.

Stored procedures in the database are untested, unversioned, and undiscoverable — nobody can tell you what a 600-line proc actually computes, or reproduce last quarter’s number after it has been edited in place. A pile of scheduled SQL scripts has no dependency graph, so a change to an upstream table silently breaks six downstream reports and you find out from the board deck. Each team transforming in its own BI tool is exactly the “three versions of the truth” problem the CDO was hired to kill — the logic for “loss ratio” gets re-implemented, slightly differently, in Tableau, in a notebook, and in a spreadsheet.

Analytics engineering threads the needle. Transformations are dbt models — SQL SELECT statements that dbt turns into tables and views, with an explicit dependency graph (ref()), built-in tests (uniqueness, not-null, referential, accepted-values), and documentation generated from the code. Everything lives in Git, ships through CI, and produces a single governed layer. The warehouse, Snowflake, is the governed store where access is role-based and every query is logged — which turns “who transformed what” from an unanswerable question into a row in query_history.

Architecture overview

Snowflake and dbt on AWS: A Governed Analytics Engineering Stack — architecture

The platform runs three distinct flows that share governance but live on different schedules: an ingestion flow that lands raw source data, a transformation flow that turns raw into governed data products, and a delivery flow — the CI/CD pipeline — that ships and tests the transformation code. Keeping them separate in your head is the first step to operating this well.

The defining property the security team cares about most: all data movement stays on private networking. Snowflake is reached over AWS PrivateLink, so the carrier’s VPCs and the ingestion service talk to the warehouse over a private endpoint, never the public internet. Source databases sit in private subnets. The medallion layers (raw → staging → marts) all live inside one Snowflake account where access is RBAC-controlled. No PHI traverses a public endpoint, which is what makes the HIPAA story defensible.

Ingestion flow, following the data:

  1. The operational systems — a claims platform on Amazon RDS for PostgreSQL and a member-management system on RDS MySQL — are the sources of truth for live data. They sit in private subnets and are never queried directly by analysts.
  2. Fivetran runs managed, log-based change-data-capture connectors against the RDS instances, reaching them over AWS PrivateLink so replication traffic never leaves AWS’s private network. Fivetran handles schema drift, incremental syncs, and historical re-syncs, landing the data into Snowflake’s RAW database with each source in its own schema. This is the “Fivetran-style ingestion from RDS” the platform standardizes on, so analysts stop building bespoke extract scripts.
  3. Batch and third-party files (enrollment files from brokers, provider rosters) land in an S3 bucket and are loaded into RAW via Snowpipe, Snowflake’s auto-ingest service, triggered by S3 event notifications. PHI files are KMS-encrypted at rest in S3 and loaded over the Snowflake-S3 private path.

Transformation flow, the heart of analytics engineering:

  1. dbt Cloud connects to Snowflake over PrivateLink and runs the transformation project on a schedule (and on every merge). dbt builds the medallion layers as a DAG: staging models clean and rename raw columns one-to-one; intermediate models join and reshape; marts models are the governed business entities — dim_member, fct_claim, fct_premium, and the metric models behind loss ratio and member-months.
  2. dbt tests run as part of every build — uniqueness on claim_id, not-null on member_id, referential integrity between claims and members, accepted-values on claim status. A failed test fails the build, so a broken assumption never silently propagates to a report.
  3. dbt snapshots capture slowly-changing dimensions (a member’s plan tier over time) using Type-2 history, so “what was true on the filing date” is reproducible — exactly the audit requirement that started the project. dbt also generates documentation and lineage, published so any analyst can trace a number from mart back to raw column.

Delivery flow (CI/CD), independent and event-driven:

  1. Analytics engineers develop models in a Git repo. On every pull request, GitHub Actions runs the CI pipeline: dbt build against a transient CI schema, Slim CI (state:modified+) so only changed models and their children rebuild, plus SQL linting and the full test suite. Wiz Code scans the Terraform that provisions the platform for misconfigurations on the same PR. A failing test or a critical IaC finding blocks the merge.
  2. On merge to main, the change is promoted; dbt Cloud deploys it to production on the next scheduled run. Snowflake credentials for the pipeline are issued by HashiCorp Vault (key-pair auth, leased), so no warehouse password is stored in GitHub.

Component breakdown

Component Service / tool Role in the platform Key configuration choices
Operational sources Amazon RDS (PostgreSQL, MySQL) Systems of record for claims and members Private subnets; logical replication enabled for CDC
Ingestion (CDC) Fivetran Managed log-based replication RDS → Snowflake RAW PrivateLink connectors; incremental syncs; schema-drift handling
File ingestion S3 + Snowpipe Auto-load broker/provider files into RAW S3 event-triggered; KMS-encrypted; private S3 integration
Warehouse Snowflake on AWS Governed store; medallion layers; compute warehouses PrivateLink; RBAC roles; per-team virtual warehouses
Transformation dbt Cloud SQL transformations, tests, snapshots, lineage Medallion project; ref() DAG; deferral to prod state
Identity / SSO Okta + SCIM Workforce SSO and provisioning into Snowflake SAML SSO; SCIM group → Snowflake role mapping
Secrets HashiCorp Vault Key-pair auth for CI/dbt service accounts Snowflake-issued key-pair; dynamic leases; no stored passwords
CI/CD GitHub Actions PR build/test, Slim CI, lint; promote on merge state:modified+; transient CI schema; OIDC where possible
IaC scanning Wiz Code Scan Terraform on PR for misconfig before apply Policy gate on critical findings
CSPM / data posture Wiz Continuous posture + sensitive-data exposure scan Snowflake + S3 + RDS connectors; public-exposure drift alerts
Cost guardrails Snowflake resource monitors Cap credit spend per warehouse; suspend on threshold Monthly quotas; notify → suspend actions
Observability Datadog Warehouse/query metrics, dbt run status, cost dashboards Snowflake integration; dbt webhook; per-team cost monitors
ITSM / approvals ServiceNow Access-request approval, change records for prod model deploys Change gate on schema-affecting releases
IaC Terraform Provision Snowflake objects, PrivateLink, IAM, monitors Snowflake + AWS providers; state in S3 + DynamoDB lock

A few of these choices deserve the why, because they are the ones teams get wrong.

Why medallion layering, not one big transformation. It is tempting to write one query that goes from raw claims to the loss-ratio number. Do not — it is untestable and unreusable. The staging layer (one model per source table, light cleaning, no joins) gives every downstream model a stable, renamed interface so a source-column change is absorbed in one place. The marts layer is the governed business vocabulary everyone queries. The intermediate layer holds the reusable join logic in between. The discipline pays off the first time a source system renames a column: you fix one staging model, not forty reports.

Why ingestion (EL) is separate from transformation (T). Fivetran’s job is to land raw source data faithfully — same shape as the source, no business logic. dbt’s job is all the business logic, in version control. Mixing them (transforming during ingestion) is how you end up with un-versioned logic hidden in a connector config that nobody can audit. Keeping EL and T separate — the ELT pattern — means the raw layer is a replayable source of truth and every transformation is reviewable code.

Why CI on every pull request. The whole value of analytics engineering is that a transformation change is reviewed and tested like application code. GitHub Actions builds the changed models into a throwaway CI schema and runs the tests before merge, using Slim CI so it rebuilds only what changed (and its descendants) rather than the entire warehouse — which keeps CI minutes and Snowflake credits sane. A pull request that breaks a referential-integrity test simply cannot reach production.

Implementation guidance

Provision with Terraform, and treat the network and RBAC as the first deliverables. Get PrivateLink and the role hierarchy right before any model is written, because retrofitting least-privilege onto a live warehouse full of analysts is painful.

  1. PrivateLink between the carrier’s AWS account and the Snowflake account (and Fivetran’s PrivateLink to both RDS and Snowflake), with the Snowflake account’s network policy set to allow only the private endpoint and block public access.
  2. A role hierarchy built on Snowflake’s RBAC: functional access roles (claims_reader, marts_reader) grant privileges on objects; team roles (actuarial, fraud, analytics_eng) inherit the access roles they need; service roles (dbt_prod, fivetran_loader, ci_runner) are scoped to exactly their job. Analysts are granted team roles, never raw-table grants directly — the auditor’s specific complaint.
  3. Per-team virtual warehouses so compute (and therefore cost) is attributable: wh_actuarial, wh_fraud, wh_dbt, each auto-suspending after a short idle window so an abandoned query stops burning credits.
  4. Resource monitors (below) attached to each warehouse.
  5. The dbt Cloud connection and the Fivetran destination, both over PrivateLink.

A minimal Terraform shape for the warehouse, a service role, and a cost guardrail communicates the intent — least-privilege compute that suspends itself when it overspends:

resource "snowflake_warehouse" "dbt" {
  name           = "WH_DBT"
  warehouse_size = "MEDIUM"
  auto_suspend   = 60          # seconds idle before suspend — stop paying for nothing
  auto_resume    = true
  initially_suspended = true
}

resource "snowflake_resource_monitor" "dbt_guardrail" {
  name         = "RM_DBT_MONTHLY"
  credit_quota = 400           # hard monthly credit cap for this warehouse
  frequency    = "MONTHLY"
  notify_triggers   = [75, 90] # email finance at 75% and 90%
  suspend_trigger   = 100      # stop new queries at 100% — finance never gets surprised
  suspend_immediate_trigger = 110
  warehouse_id = snowflake_warehouse.dbt.id
}

The pipeline that applies this runs in GitHub Actions, with Wiz Code scanning the Terraform on the same pull request so a misconfiguration — a warehouse with no monitor, a role with ACCOUNTADMIN granted too broadly — is caught before apply.

Identity: federate the humans, key-pair the services. Human access flows Okta → Snowflake over SAML SSO, and SCIM provisioning maps Okta groups to Snowflake roles so a joiner in the actuarial Okta group automatically receives the actuarial role and a leaver is deprovisioned the same day — no orphaned analyst logins, the lifecycle gap auditors love to find. Service accounts (dbt Cloud’s dbt_prod, Fivetran’s loader, the CI runner) authenticate with key-pair authentication, and the private keys are issued and leased by HashiCorp Vault rather than pasted into a connector config or a GitHub secret. Password authentication for service accounts is disabled at the account level.

dbt project conventions that keep it governed. Adopt a few non-negotiable patterns from day one. Name layers by folder (staging/, intermediate/, marts/) and materialize them deliberately — staging as views (cheap, always fresh), marts as tables (fast for BI). Put tests in every model’s YAML — at minimum a unique + not-null test on each primary key and a relationship test on each foreign key. Use dbt snapshots for any dimension where history matters (member plan tier, provider contract status) so point-in-time questions are answerable. Define metrics in one place (a metrics or semantic layer model) so “loss ratio” has exactly one SQL definition, ending the per-team reinvention. And use deferral in CI (--defer --state prod) so a PR build references unchanged production models instead of rebuilding the world.

# models/marts/_marts.yml — tests turn assumptions into enforced contracts
models:
  - name: fct_claim
    columns:
      - name: claim_id
        tests: [unique, not_null]
      - name: member_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_member')
              field: member_id
      - name: claim_status
        tests:
          - accepted_values: { values: ['submitted','adjudicated','paid','denied'] }

Enterprise considerations

Security & governance. The platform is least-privilege by construction: PrivateLink-only data movement, RBAC with no direct raw-table grants, SCIM-driven role lifecycle. Layer on top: (a) column-level security on PHI — Snowflake dynamic data masking so a member’s SSN or full name is masked unless the querying role is explicitly cleared, and row access policies so the fraud team sees only flagged cohorts; (b) object tagging on every PHI column so the masking policies attach by tag and coverage is auditable; © Wiz running continuous CSPM and sensitive-data-exposure scanning across Snowflake, S3, and RDS, alerting the moment a share is created to an unknown account or an S3 bucket drifts toward public — the posture backstop behind the in-warehouse controls; (d) Snowflake’s access_history and query_history retained and exported for the audit trail, so every read and transformation of PHI is on record; (e) a sensitive-data-exposure finding or a failed compliance check auto-raises a ServiceNow incident so security has a ticket, not just an alert. Snowflake’s native logging plus Wiz’s independent posture view means the controls are both enforced and verified.

Cost optimization. Snowflake spend is the line item finance watches, and it grows quietly, so engineer guardrails from day one.

Lever Mechanism Typical effect
Resource monitors Per-warehouse credit quota; suspend at 100% Hard cap — no surprise overruns
Auto-suspend Short idle timeout on every warehouse Stops paying for idle/abandoned queries
Right-sized warehouses Match warehouse size to workload; scale only the heavy ones Avoids paying XL for staging refreshes
Slim CI state:modified+ rebuilds only changed models Cuts CI credits and minutes sharply
Per-team warehouses Each team’s compute is separately metered Makes each team own its spend (chargeback)
Incremental models is_incremental() on large facts; process only new rows Avoids full-table rebuilds of huge claim tables

Meter credits per warehouse and pipe the metric to Datadog, which the platform team uses for the cost-by-team dashboard the CDO and finance both watch. Resource monitors are the hard stop; Datadog is the early-warning radar.

Scalability. Each layer scales independently. Snowflake separates storage from compute, so a warehouse can be resized (bigger node) for a heavy quarter-close job and shrunk after, or set to multi-cluster to fan out under concurrent query load at month-end without queries queuing. Ingestion scales with Fivetran’s managed connectors (add a source, not infrastructure) and Snowpipe’s serverless auto-ingest. The dbt DAG scales by adding models; CI stays fast because Slim CI only rebuilds the changed subgraph. The natural ceiling is concurrency on a single warehouse, which is why heavy BI workloads get multi-cluster warehouses and batch transformation runs on its own WH_DBT.

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Decide the numbers per layer. Snowflake’s Time Travel lets you query or restore a table as of a point in the recent past (up to 90 days on Enterprise edition), which covers “an analyst dropped the wrong table” and “last night’s load corrupted a mart.” Fail-safe adds a further recovery window Snowflake operates. For regional resilience, database replication to a Snowflake account in a second AWS region with failover groups gives a warm standby; the raw layer is fully rebuildable from the RDS sources and S3 (the durable sources of truth) if needed. A pragmatic target for this platform: RTO 1 hour, RPO 15 minutes for the governed marts via replication, with the warehouse fully reconstructable from source within a day in a true regional loss — acceptable because the systems of record are the operational databases, not the warehouse.

Observability. Instrument the platform in Datadog: the Snowflake integration streams warehouse credit usage, query counts, queue depth, and storage; a dbt Cloud webhook posts run status so a failed or long-running production build surfaces as a monitor, not a Slack message someone missed. Emit the metrics the business actually cares about — credits per team, model build duration trend, test pass-rate, data freshness (lag from source to mart), and failed-build count. dbt’s freshness checks on source tables catch a stalled Fivetran connector before stale data reaches a report. Schema-affecting production releases pass through a ServiceNow change record, giving governance a documented gate on changes that alter the contract downstream consumers depend on.

Governance. Pin dbt and connector versions explicitly so behavior does not drift, and promote model changes only through the CI gate. Keep the entire transformation layer — models, tests, metric definitions — in version control, reviewable and instantly revertable, which is the whole point: the answer to “who changed the loss-ratio definition and when” is a Git blame. Publish dbt docs and lineage so any analyst can trace a number end to end. Apply Terraform as the single source of truth for Snowflake objects, roles, and monitors, with Wiz as the independent check that the controls are real. Tag PHI columns and govern access by tag so a new mart inherits the right masking automatically.

Explicit tradeoffs

Accept these or do not build it. This stack adds real moving parts — an ingestion service to operate, a dbt project to maintain, a CI pipeline to keep green, and a role hierarchy to administer. The ELT pattern means raw data lands before it is cleaned, so the RAW layer holds messy source data you must not let analysts query directly — governance depends on people respecting the layering. dbt’s value is proportional to test coverage: a project with models but no tests is just SQL in Git and catches nothing. Snowflake’s pay-per-second compute is wonderfully elastic and just as wonderfully easy to overspend, which is exactly why the resource monitors and auto-suspend are not optional. And the PrivateLink posture that makes the security team sign costs you setup complexity and rules out public-internet debugging shortcuts; the price of a DNS misconfig is a silent hang, not a clear error.

The alternatives, and when they win. If your data is small and the transformations are simple, a single managed database with scheduled SQL is less machinery and perfectly adequate — graduate to this when teams start arguing over numbers. If you are all-in on one cloud and want a tighter native stack, Redshift with dbt (or BigQuery, Databricks SQL) trades Snowflake’s cross-cloud portability for deeper integration with the rest of that provider’s services. If your problem is heavy data-science and ML feature engineering rather than governed BI metrics, a lakehouse (Databricks, or Snowflake’s own Snowpark) fits better than a pure SQL-transformation stack — and the two compose. And if you genuinely need only one team’s dashboards with no reconciliation problem, the full governance apparatus here is overhead you can defer; it earns its keep the moment multiple teams must agree on one number a regulator will read.

The shape of the win

For the carrier’s data org, the payoff is not “a warehouse.” It is that the loss-ratio number in the board deck is produced by one versioned dbt model, traces cleanly from fct_claim back through staging to the raw RDS column, was tested for uniqueness and referential integrity before it shipped, and was computed under role-scoped access that the audit log can prove — so the quarterly filing stops slipping and the meeting that used to argue over reconciliation is now spent on what the number means. That last sentence is the one that funds the platform. Everything upstream — the PrivateLink, the Okta-to-Snowflake SCIM, the Vault-issued key-pairs, the dbt tests and snapshots, the Slim CI gate, the resource monitors, the Wiz posture scanning, the Datadog cost radar — exists to make a CDO, a CISO, and an auditor each say yes to one governed source of truth. The architecture here is the destination; start with a few marts and a handful of tests if you must, but this is where governed, regulated analytics engineering has to land.

AWSSnowflakedbtAnalytics EngineeringPrivateLinkData Platform
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading