A data lake is the single architecture that almost every enterprise eventually needs and almost every enterprise gets wrong on the first attempt. The pull is obvious: finance, marketing, operations, and data science all want the same raw transactional and event data, joined and queryable, without each team running its own brittle export jobs against production. The failure mode is just as obvious in hindsight — the lake fills with undocumented Parquet files nobody trusts, costs creep because every query scans terabytes, and the security team discovers six months in that customer PII is sitting in a container with allow blob public access quietly enabled.
This article lays out a reference architecture that avoids those traps. It is built on four Azure services that are designed to work together: Azure Data Lake Storage (ADLS) Gen2 as the storage substrate, Azure Data Factory (ADF) as the ingestion and orchestration plane, Azure Synapse Analytics as the serving and compute layer, and Microsoft Purview as the governance, catalog, and lineage backbone. It scales down to a single business unit ingesting a dozen sources and up to a federated data mesh across an enterprise, and the same control-plane decisions hold at both ends.
The business scenario
Picture a mid-market enterprise — a few thousand employees, a handful of acquired subsidiaries, the usual sprawl of an ERP, a CRM, a clutch of SaaS tools, and a transactional database behind the customer-facing product. The data lives in silos. The finance team reconciles revenue by exporting CSVs from the ERP and the payment processor and stitching them together in spreadsheets. Marketing pulls campaign data from the CRM and the ad platforms into a separate BI tool. The product team has clickstream events landing in a message queue that nobody outside engineering can see. Data scientists who want to build a churn model spend 70% of their time just finding and cleaning data, and every one of those export jobs hits production systems during business hours.
The asks that finally force the issue are concrete. The CFO wants a single revenue number that ties out across every source, refreshed daily, with an audit trail. The CMO wants customer-level attribution that joins ad spend to product usage to subscription revenue. The Chief Risk Officer, after a near-miss audit, wants to know — for any column in any report — where the data came from, who can see it, and whether it contains regulated personal data. And the data science team wants a place to train models on years of history without anyone asking them to “please not run that during the day.”
A data lake and analytics platform solves all four, but only if it is built with governance and cost discipline from day one rather than bolted on later. The remainder of this article is that platform.
Architecture overview
The architecture is organized around the medallion pattern — three logical zones (Bronze, Silver, Gold) that data flows through, each adding structure, quality, and trust — overlaid on a clear separation between a storage plane, an ingestion/orchestration plane, a compute/serving plane, and a governance plane that spans all three.
Data enters from two directions. Batch sources — the ERP, CRM, on-prem SQL databases, SaaS APIs, flat-file drops — are pulled by Azure Data Factory on a schedule or on a trigger. ADF connects through a Self-hosted Integration Runtime for anything behind a corporate firewall and through managed connectors for cloud SaaS. Streaming sources — clickstream, IoT telemetry, application events — flow through Event Hubs and are captured to the lake either by Event Hubs Capture (the cheap, code-free path) or by a Synapse/Spark structured-streaming job when transformation-on-arrival is needed.
Everything lands first in the Bronze zone of ADLS Gen2 as raw, immutable, append-only data — exactly as the source produced it, partitioned by ingest date, in Parquet or the source’s native format. Bronze is the system of record for “what arrived”; you never edit it, and you can always replay from it.
From Bronze, Synapse Spark pools (or Data Flows in ADF, or increasingly Synapse’s integration of Spark) clean, deduplicate, conform schemas, and apply data-quality rules, writing the result to the Silver zone as Delta Lake tables. Silver is the trusted, query-ready, enterprise-conformed layer — one customer table, one product table, deduplicated and typed, slowly-changing-dimension history preserved. Most data engineering effort lives in the Bronze-to-Silver hop.
Business logic — aggregations, joins, KPI definitions, star schemas — runs Silver-to-Gold. Gold tables are purpose-built for consumption: the finance revenue mart, the marketing attribution mart, feature tables for ML. These are served two ways. Synapse Serverless SQL exposes Gold (and Silver) directly over the lake files via external tables and views — pay-per-query, no data movement, ideal for ad-hoc analysis and for Power BI in DirectQuery. For high-concurrency BI and sub-second dashboards, curated Gold marts are loaded into a Synapse Dedicated SQL pool (a provisioned MPP data warehouse) or served through Power BI Import mode.
Wrapping all of this, Microsoft Purview scans every zone on a schedule, builds a searchable data catalog, classifies columns automatically (detecting PII, financial identifiers, etc.), and — critically — captures end-to-end lineage from source through ADF pipelines and Synapse jobs into the Gold marts and out to Power BI. Purview is also where data owners publish glossary terms, certify datasets, and (via the newer Microsoft Purview unified governance and Data Use Management) attach access policies to lake paths.
The request path for a typical analytical query is therefore: a Power BI report issues a query against a Synapse Serverless SQL view → the view resolves to Delta/Parquet files in the Gold zone of ADLS Gen2 → Synapse reads only the partitions and columns needed → results return to Power BI, with Purview having already classified every column the report touches and lineage showing exactly which source rows fed each number.
Identity threads through everything via Microsoft Entra ID. Pipelines and pools authenticate with managed identities, not keys. Storage exposes no public endpoint — all access is over Private Endpoints inside a managed virtual network — and authorization to the lake is governed by a combination of Azure RBAC at the container level and POSIX-style ACLs at the directory level for fine-grained control.
Component breakdown
| Component | Role in the architecture | Why it’s here | Key configuration choices |
|---|---|---|---|
| ADLS Gen2 | Storage substrate for all zones | Hierarchical namespace gives real directories, atomic renames, and POSIX ACLs — none of which flat blob storage offers | Enable Hierarchical Namespace; one storage account, three+ containers (bronze/silver/gold + sandbox); lifecycle policies to tier Bronze to Cool/Cold/Archive; soft delete + versioning on |
| Azure Data Factory | Batch ingestion + orchestration | Managed connectors to 100+ sources, code-free Copy at scale, scheduling, and the Self-hosted IR for on-prem reach | Managed VNet IR for cloud sources; Self-hosted IR for on-prem; parameterized, metadata-driven pipelines; managed identity to storage and Key Vault |
| Event Hubs (+ Capture) | Streaming ingestion | Decouples high-volume producers from the lake; Capture writes Avro/Parquet to Bronze with zero code | Capture enabled to the bronze container; partition count sized to throughput; Schema Registry for event contracts |
| Synapse Spark pools | Heavy transformation (Bronze→Silver→Gold) | Distributed compute for big joins, dedup, SCD, and Delta writes; same engine for batch and structured streaming | Autoscale + auto-pause; Delta Lake as table format; node size matched to workload; small/medium pools per environment |
| Synapse Serverless SQL | Pay-per-query serving over the lake | T-SQL over Parquet/Delta with no infrastructure; perfect for ad-hoc and Power BI DirectQuery | External tables/views over Gold; OPENROWSET for exploration; cost guardrail via the per-query/per-workspace data-processed limits |
| Synapse Dedicated SQL pool | Provisioned MPP warehouse for curated marts | Predictable sub-second performance for high-concurrency BI and complex star-schema queries | Hash-distribute large facts, replicate small dims; columnstore; pause when idle; right-size DWU (start small, scale on demand) |
| Microsoft Purview | Catalog, classification, lineage, policy | Turns an opaque lake into a governed, discoverable, auditable asset; satisfies the Risk Officer’s requirements | Scheduled scans per zone; auto-classification rules + custom classifiers for domain PII; lineage harvested from ADF + Synapse; glossary + certified datasets; Data Use Management for path-level access policy |
| Azure Key Vault | Secret/key management | Centralizes credentials and customer-managed keys; nothing in connection strings | Managed-identity access policies; CMK for storage encryption; secret rotation |
| Microsoft Entra ID | Identity for users and workloads | Single identity plane; managed identities remove standing secrets | Managed identities for ADF/Synapse; Entra security groups mapped to lake RBAC/ACLs; PIM for privileged data roles |
A few of these choices deserve emphasis.
One storage account, multiple containers — not one account per zone. A single ADLS Gen2 account with bronze/silver/gold containers keeps lineage and cross-zone joins simple while still giving you a clean RBAC and ACL boundary per container. You split into multiple accounts only when you hit account-level throughput limits or need hard blast-radius isolation between, say, a regulated and a non-regulated domain.
Delta Lake in Silver and Gold, raw Parquet in Bronze. Bronze is immutable and append-only, so it does not need ACID updates — plain Parquet (or the source’s native format) keeps it cheap and replayable. Silver and Gold need upserts, schema evolution, time travel, and the MERGE semantics that slowly-changing dimensions require, so they use Delta. Synapse Spark reads and writes Delta natively, and Serverless SQL can query Delta directly.
Serverless first, Dedicated only where it pays. Serverless SQL has no idle cost and bills purely on data scanned, which makes it the right default for most Gold consumption. You introduce a Dedicated SQL pool only for the marts that demand consistent sub-second response under heavy concurrent BI load — and even then you pause it outside business hours.
Implementation guidance
Provision with infrastructure-as-code, not the portal. The entire platform should be a Terraform (or Bicep) module so that dev, test, and prod are byte-for-byte identical and every change is reviewed. The dependency order matters: network and DNS first, then identity and Key Vault, then storage, then the data services, then Purview and the policy layer.
A representative Terraform layout:
azurerm_storage_accountwithis_hns_enabled = true,account_tier = "Standard",account_replication_type = "ZRS"(orGZRSfor cross-region durability),public_network_access_enabled = false, blob versioning and soft-delete on, and acustomer_managed_keyblock pointing at Key Vault.azurerm_storage_containerresources forbronze,silver,gold,sandbox.azurerm_storage_management_policyto move Bronze blobs to Cool after 30 days, Cold after 90, Archive after 365.azurerm_data_factorywithidentity { type = "SystemAssigned" }, plusazurerm_data_factory_integration_runtime_azure(managed VNet) and, where on-prem reach is needed, a self-hosted IR registered on a hardened VM.azurerm_synapse_workspace(which creates its own managed identity and a primary ADLS filesystem),azurerm_synapse_spark_poolwithauto_pause { delay_in_minutes = 15 }andauto_scale, andazurerm_synapse_sql_poolonly if a dedicated warehouse is in scope.azurerm_purview_accountplus role assignments granting its managed identity Storage Blob Data Reader on the lake and Reader on the data services so scans and lineage work.azurerm_private_endpointresources for the storage account (one each for thedfsandblobsub-resources), Synapse (Sql,SqlOnDemand,Dev), Key Vault, and Purview, all wired intoazurerm_private_dns_zonerecords.
Networking and identity wiring. The non-negotiables:
- No public endpoints. Set
public_network_access_enabled = falseon storage, Synapse, and Key Vault. All service-to-service and user traffic flows over Private Endpoints in a hub-and-spoke VNet. Synapse runs in a Managed VNet with Managed Private Endpoints out to the lake and Key Vault, so its compute never traverses the public internet. - Managed identities everywhere. ADF’s and Synapse’s system-assigned identities are granted Storage Blob Data Contributor (or finer ACLs) on the specific containers they touch — never account keys, never SAS tokens baked into pipelines. Key Vault references replace any remaining secrets.
- Layered authorization on the lake. Use Azure RBAC for the coarse grant (a data-engineering group gets Contributor on
silver), then directory-level POSIX ACLs for the fine-grained cut (a finance group getsr-xon/gold/financeonly). Map every grant to an Entra security group, never to individual users, so access reviews and joiner/mover/leaver flows stay manageable. - Private DNS must resolve the private endpoints, or clients silently fall back to (now-blocked) public names and fail. This is the single most common deployment bug — get the DNS zones and VNet links right in IaC.
The metadata-driven ingestion pattern. Do not build one ADF pipeline per source — you will end up with hundreds. Build a small set of parameterized, generic pipelines (one for “copy a SQL table to Bronze,” one for “pull a REST API page,” etc.) driven by a control table that lists each source, its watermark column, its target path, and its schedule. A master pipeline reads the control table and fans out. Adding a new source becomes a row insert, not a development cycle. Watermarks stored in the control table give you incremental loads and restartability.
CI/CD. ADF and Synapse both support Git integration with a collaboration branch and a publish branch; wire them to Azure DevOps or GitHub. Promote ARM/JSON artifacts through dev → test → prod with environment-specific parameter files. Notebooks and Spark jobs live in the same repo and are deployed as part of the Synapse workspace artifacts.
Enterprise considerations
Security and Zero Trust. The architecture is built so that no component trusts the network. Identity is the perimeter: every access is an authenticated, authorized Entra principal, every credential is a managed identity or a Key Vault reference, and every data path is a Private Endpoint. Encryption is on at rest (with customer-managed keys in Key Vault for regulated data) and in transit. Double encryption (infrastructure + service) is available for the most sensitive accounts. Purview’s auto-classification is the Zero-Trust feedback loop on data itself — it continuously discovers where PII and financial data actually live, so access policy can follow the sensitivity rather than guesswork. Privileged data roles (who can read /gold/finance, who can administer the warehouse) go through Entra PIM for just-in-time elevation and audit. Diagnostic logs from storage, ADF, Synapse, and Key Vault flow to a central Log Analytics workspace feeding Microsoft Sentinel for threat detection on the data estate.
Cost optimization. A data lake’s cost story is almost entirely about idle compute and wasteful scans:
- Auto-pause and autoscale Spark pools. A pool left running overnight is pure waste;
auto_pauseafter 15 idle minutes and demand-based scaling typically cut Spark spend by more than half versus a fixed cluster. - Pause the Dedicated SQL pool when idle. A provisioned DWU pool bills whether or not it serves a query. Pausing it outside business hours (and using Serverless for off-hours ad-hoc work) is the single biggest warehouse saving.
- Serverless SQL’s per-scan billing rewards good file layout. Partition Gold by the common filter columns, compact small files (Delta
OPTIMIZE), and store as columnar Parquet/Delta so queries read megabytes, not terabytes. Set the Serverless data-processed cost-control limits as a guardrail against a runawaySELECT *. - Storage lifecycle tiering. Bronze is written once and read rarely after its first transformation; tiering it Cool → Cold → Archive over time, while keeping Silver/Gold Hot, can cut storage cost dramatically with zero impact on the active analytics path.
- Reserved capacity for the steady-state Dedicated SQL DWUs and committed storage where usage is predictable.
Scalability. Storage scales effectively without limit; the levers are on compute. Spark pools scale out by node count for bigger transformations; Serverless SQL scales transparently per query; the Dedicated pool scales by DWU and can be resized online. The metadata-driven ingestion framework scales by adding control-table rows, and ADF’s parallel Copy activities and Integration Runtime sizing absorb growth in source count and volume. When a single team’s lake outgrows central ownership, the same per-domain container/ACL/Purview model extends naturally into a data mesh, with each domain owning its Silver/Gold and publishing certified products into a shared Purview catalog.
Reliability and DR (RTO/RPO). Set targets explicitly. For most analytical platforms an RPO of 24 hours and an RTO of a few hours is appropriate — analytics is not a 24/7 transactional system, and Bronze’s replayability is a safety net. To meet it: use GZRS/RA-GZRS replication so lake data survives a region loss with a readable secondary; keep all IaC in Git so the entire platform can be redeployed into the paired region from code; and, because Bronze is immutable and Silver/Gold are deterministically derived from it, you can rebuild Silver and Gold by replaying transformations rather than backing them up separately. For tighter RPO, replicate Gold marts and the Dedicated pool on a schedule. Test the failover — an untested DR plan is a hope, not a control.
Observability. Centralize everything in Log Analytics: storage transaction/throttling metrics, ADF pipeline run status and durations, Synapse Spark application logs and SQL request history, and Key Vault access logs. Build alerts on pipeline failure, on Spark job duration regressions, on Serverless data-scanned anomalies (cost spikes), and on storage throttling. Data-quality observability is distinct from infra observability — bake quality checks (row counts, null-rate thresholds, referential checks) into the Bronze→Silver step and surface failures the same way you surface a pipeline error. Purview’s lineage doubles as operational observability: when a Gold number looks wrong, lineage tells you exactly which upstream job and source to inspect.
Governance. This is where the architecture earns its keep. Purview gives the enterprise a single catalog where any analyst can search for “revenue” and find the certified Gold table, see its owner, its glossary definition, its classification, and its full lineage back to source. Auto-classification flags regulated data wherever it lands. Glossary terms tie business language to physical columns. And Data Use Management lets owners attach access policy to lake paths centrally, so “finance data is restricted to the finance group” is enforced and auditable rather than tribal knowledge. The combination directly answers the Risk Officer’s three questions — where did it come from, who can see it, is it regulated — for every column in every report.
Reference enterprise example
NorthForge Logistics is a fictional freight and supply-chain company: ~4,000 employees, three acquired regional carriers each with its own systems, and a customer-facing shipment-tracking product generating heavy clickstream and IoT telemetry from vehicle sensors. Their data was scattered across an SAP ERP, a Salesforce CRM, three on-prem SQL Server databases (one per acquired carrier), a payments processor, and an Event Hubs stream of tracking events. Finance closed the books with a 9-day manual reconciliation; the data science team’s ETA-prediction model was stalled because nobody could assemble clean historical shipment data; and a customer audit had flagged that NorthForge couldn’t prove where a regulated shipper-identity field flowed.
What they built. A single ADLS Gen2 account (ZRS, HNS on, CMK, private endpoints, public access off) with bronze/silver/gold/sandbox containers. A metadata-driven ADF framework with one self-hosted IR reaching the three on-prem SQL Servers and managed-VNet connectors for SAP, Salesforce, and the payment API — 47 sources driven by a 47-row control table. Event Hubs Capture landed tracking telemetry straight into Bronze. Synapse Spark (small autoscaling pool, auto-pause 15 min) ran Bronze→Silver dedup and conformance into Delta, and Silver→Gold built three marts: a finance revenue mart, a customer-360 mart, and a shipment feature table for ML. Power BI served finance and ops in DirectQuery over Synapse Serverless; one high-concurrency executive dashboard ran on a small Dedicated SQL pool paused nightly. Purview scanned all zones nightly, auto-classified the shipper-identity and payment fields as sensitive, and harvested lineage from ADF and Synapse end to end.
The numbers. Steady-state monthly Azure spend landed around $8,400: roughly $1,100 storage (most of Bronze tiered to Cool/Cold), $2,600 Synapse Spark, $1,900 Serverless SQL, $1,500 for the paused-when-idle Dedicated pool, $600 ADF, $400 Purview, and the remainder networking and Key Vault. The decisions that drove that figure were deliberate: Serverless-first (a fixed Dedicated pool sized for the same workload would have roughly doubled the warehouse line), aggressive auto-pause on Spark, and Bronze lifecycle tiering.
The outcome. Finance close dropped from 9 days to under 1, because the revenue mart tied out across SAP, the three carriers, and the payment processor automatically with a daily audit trail. The data science team shipped the ETA model in a quarter instead of stalling, training on years of clean Silver shipment history. And in the follow-up audit, NorthForge answered the regulator’s lineage question in minutes — Purview showed the shipper-identity field’s path from the SAP source table through the exact ADF pipeline and Synapse job into the customer-360 mart, with its classification and the finance-group-only access policy attached. The platform paid for itself on the close-time reduction alone.
When to use it
Use this architecture when you have multiple data sources that need to be joined and served to more than one consumer (BI, ad-hoc SQL, data science) and where governance — lineage, classification, access control — is a real requirement, not an afterthought. It is the right backbone for enterprise analytics, regulated-data reporting, and ML feature platforms, and it scales cleanly from one business unit to a federated data mesh.
Trade-offs to go in with eyes open. This is a platform, not a weekend project — it carries real operational ownership (pipeline monitoring, data-quality SLAs, catalog curation). The medallion discipline and metadata-driven ingestion are upfront investments that pay off only at a certain scale of sources and consumers.
Anti-patterns to avoid:
- The “data swamp.” Skipping Silver/Gold conformance and letting consumers query raw Bronze directly. It feels faster on day one and is unmaintainable by month three.
- One pipeline per source. Hand-built, non-parameterized pipelines multiply into an unmaintainable thicket. Go metadata-driven from the start.
- Governance later. Standing up the lake and deferring Purview is how you get the audit finding. Classification and lineage are cheapest when they’re there from the first scan.
- Dedicated SQL pool as the default. Provisioning a warehouse you leave running for workloads Serverless would handle is the most common cost mistake in this stack.
- Account keys and public endpoints “just for now.” They never get cleaned up. Start with managed identities and private endpoints.
Alternatives. If your organization is standardizing on a lakehouse-first, notebook-centric data-engineering culture, Azure Databricks over the same ADLS Gen2 lake is a strong alternative to (or complement alongside) Synapse, with the same medallion and Purview governance model. Microsoft Fabric is the newer SaaS-unified option that folds OneLake, Data Factory, Synapse engines, and Power BI into a single capacity-based product — compelling when you want less infrastructure to manage and a tighter Power BI integration, and worth evaluating against this composed-services approach when greenfield. For purely streaming-analytics needs without a broad lake, Azure Stream Analytics or Event Hubs + a real-time warehouse may be lighter-weight. But when the requirement is a governed, multi-consumer enterprise data platform with strong cost control and auditability, the ADLS Gen2 + ADF + Synapse + Purview architecture described here remains the proven, reusable Azure reference.