Data Azure

Azure Enterprise Architecture: Intelligent Data Warehousing & Lakehouse

Most enterprises do not get to design a data platform on a blank page. They inherit one: a SQL Server data warehouse that loads overnight, a pile of CSV extracts on a file share, a “data lake” that turned into a data swamp, three teams each running their own copy of the same customer table, and a finance leader asking why two dashboards built from “the same data” show different revenue. The intelligent lakehouse on Azure is the architecture that consolidates all of that into one governed copy of the data, serves it to both SQL analysts and data scientists from the same storage, and puts a catalog and lineage layer over the top so you can answer where did this number come from without a war room.

This article is a reference architecture in the spirit of the Azure Architecture Center: it picks real Azure services, wires them end to end, and makes the trade-off calls explicit. The non-negotiable design choice is that ADLS Gen2 holding Delta tables is the single source of truth, and every compute engine — Azure Databricks, Microsoft Fabric, Azure Synapse, Power BI — reads and writes that same storage rather than keeping its own copy. The running example is a mid-market retailer, but the pattern scales down to a single team and up to a regulated bank without changing shape.

The business scenario

Picture a company — call the segment “small-to-large enterprise” because the architecture genuinely spans it. The symptoms are remarkably consistent regardless of size:

The business outcome the lakehouse is chartered to deliver: one governed copy of the data, fresh enough for operational decisions, queryable by every persona in their preferred tool, with auditable lineage and a cost model that scales with usage. Concretely — month-end close that does not require a 6-hour batch window, a fraud signal visible in minutes not hours, a single certified revenue figure, and the ability to hand an auditor a lineage graph from raw file to the cell in the board deck.

A lakehouse hits this because it puts a transactional table layer (Delta Lake) over cheap object storage (ADLS Gen2), so the same files behave like a database to SQL engines and like a dataset to Spark — no second copy, no second pipeline.

Architecture overview

The end-to-end data path moves left to right through five planes — ingest, store, transform, serve, govern — with governance and security wrapping the whole thing rather than sitting at the end.

1. Ingest. Source data arrives by three doors. Batch and CDC from operational databases (Azure SQL, on-prem SQL Server, SAP, Oracle) lands via Azure Data Factory or Fabric Data Factory pipelines and mirroring. High-volume events (clickstream, IoT, application telemetry) stream through Azure Event Hubs (Kafka-compatible). Files from partners and SaaS exports drop into a landing container. All three doors write into the lake; nothing transforms data on the way in.

2. Store — the source of truth. Everything lands in Azure Data Lake Storage Gen2 (a storage account with hierarchical namespace enabled), organized as Delta Lake tables in a medallion layout: Bronze (raw, append-only, faithful to source), Silver (cleaned, conformed, deduplicated, typed), Gold (business-level aggregates and dimensional models curated for consumption). Delta’s transaction log under _delta_log/ gives ACID commits, so a reader never sees a half-written partition and streaming and batch writers can hit the same table without corrupting each other. This storage account is the centre of gravity; every other component points at it.

3. Transform. The transformation engine is the one genuine fork in this architecture, and it is Azure Databricks or Microsoft Fabric (or both, sharing storage — covered below). Spark structured streaming and batch jobs read Bronze, apply quality rules and MERGE upserts to build Silver, and aggregate Silver into Gold. Crucially the same transformation code services both the streaming and batch path, so the two cannot drift. Orchestration is Databricks Workflows / Delta Live Tables or Fabric Data Pipelines and Spark notebooks.

4. Serve. Consumers read Gold (and selectively Silver) through the engine that suits them, all pointed at the same Delta tables:

5. Govern (the wrapper). A single governance plane spans all of the above. Unity Catalog (Databricks) and/or OneLake Catalog (Fabric) provide the technical catalog, fine-grained access control, and column/row security on the tables. Microsoft Purview sits above as the enterprise data map: it scans the lake, classifies PII, captures end-to-end lineage from source through Bronze/Silver/Gold to the Power BI report, and hosts the business glossary and data-quality/access policies. Microsoft Entra ID is the identity backbone — every engine authenticates with Entra, and storage access flows through managed identities and Entra groups, never storage keys.

If you sketch it: a horizontal pipeline (Event Hubs / ADF on the left → ADLS Gen2 Delta medallion in the centre → Synapse / Fabric / Power BI on the right), with Databricks or Fabric Spark as the engine bolted onto the lake in the middle, and two horizontal bands underneath everything — Entra ID + Private Link (identity and network) and Purview + Unity/OneLake Catalog (governance and lineage) — that touch every box.

Azure lakehouse reference architecture: events and CDC ingest through Event Hubs and Data Factory into ADLS Gen2 Delta tables in a Bronze/Silver/Gold medallion, transformed by Databricks or Fabric Spark and served to Synapse SQL, Databricks notebooks and Power BI Direct Lake, with Purview, Unity/OneLake Catalog and Entra ID + Private Link governing every layer

Component breakdown

Component Azure service What it does Why it’s here / key config
Lake storage ADLS Gen2 (storage account, HNS on) Single source of truth for all Delta tables Hierarchical namespace required for directory-atomic operations Delta relies on; one container per medallion zone (bronze/silver/gold) or per domain; lifecycle rules to cool/archive old Bronze; soft delete + versioning on
Batch & CDC ingest Azure Data Factory / Fabric Data Factory Pulls from databases, SAP/Oracle, SaaS; CDC and mirroring 100+ connectors; Mapping Data Flows or copy-into-Bronze only (no business logic on ingest); self-hosted IR for on-prem sources
Streaming ingest Azure Event Hubs (Kafka API) Ingests high-volume events with replay Capture to ADLS or structured-streaming read into Bronze; partition count sized to throughput units; 1–7 day retention as the replay buffer
Table format Delta Lake (open format) ACID transactions, time travel, schema enforcement, change data feed Snapshot isolation via optimistic concurrency lets streaming + batch share a table; OPTIMIZE/VACUUM for file sizing; liquid clustering or Z-ORDER on high-cardinality filter columns
Transformation engine Azure Databricks or Microsoft Fabric Spark Builds Silver/Gold; same code for stream + batch Databricks for deep ML/heavy Spark; Fabric for Power BI-centric, SaaS-simple shops; DLT / pipelines for declarative ETL with expectations
SQL serving Synapse serverless SQL, Fabric Warehouse / SQL endpoint, Databricks SQL T-SQL/ANSI SQL over Delta for BI and ad-hoc Serverless = pay per TB scanned, no cluster to manage; dedicated pools only if you need predictable, heavily concurrent warehouse SLAs
BI semantic layer Power BI + Fabric Direct Lake Dashboards, certified datasets, self-service Direct Lake reads Delta from OneLake with no import/refresh and no DirectQuery latency; certified/promoted datasets enforce one version of the metrics
Technical catalog & ACL Unity Catalog / OneLake Catalog Catalog, fine-grained grants, row/column security, lineage within the engine Catalog → schema → table hierarchy; ABAC/row filters and column masks; managed identities for storage credentials
Enterprise governance Microsoft Purview Cross-service data map, PII classification, end-to-end lineage, glossary, policy Scheduled scans of ADLS/Synapse/Databricks/Power BI; auto-classification rules; lineage stitched source→report; DLP and access policies
Identity Microsoft Entra ID AuthN for every engine; group-based authZ Managed identities for compute→storage; Entra groups map to catalog roles; Conditional Access and PIM for privileged data access
Networking Private Link, VNet, NSGs Keeps the data plane off the public internet Private endpoints for the storage account, Databricks (VNet injection / NCC), Synapse, Event Hubs; storage firewall denies public by default

A few configuration choices carry disproportionate weight:

Implementation guidance

Storage and layout. Provision one ADLS Gen2 account (or a small number, split by sensitivity/domain) with HNS enabled, soft delete, blob versioning, and Deny public network access. Create containers bronze, silver, gold, plus landing (raw drops) and checkpoints (streaming state). Within each, partition Delta tables by a low-cardinality column (typically date) and let liquid clustering handle high-cardinality filters. Tier Bronze to Cool after 30–90 days and Archive after a year via lifecycle management — Silver/Gold stay Hot.

Engine wiring (the two-engine “one copy” pattern). The strongest version of this architecture refuses to copy data between Databricks and Fabric. Two viable patterns:

  1. Fabric-led: OneLake is the ADLS Gen2 layer. Databricks (or Fabric Spark) writes Delta into OneLake; the Fabric SQL endpoint and Power BI Direct Lake read it with zero copy. Use OneLake shortcuts to surface existing ADLS Gen2 containers or even S3 buckets into OneLake without moving bytes.
  2. Databricks-led: Databricks + Unity Catalog own the lake on a standalone ADLS Gen2 account. Fabric reads those Delta tables via a OneLake shortcut to the ADLS Gen2 path, and Power BI uses Direct Lake over the shortcut. Unity Catalog remains the access-control authority; Purview scans both.

Either way the rule holds: one set of Delta files, many engines reading it. Resist the gravitational pull toward “just export a copy for the BI team.”

Orchestration. Use Delta Live Tables (Databricks) or Fabric Data Pipelines + notebooks to express the medallion graph declaratively, with built-in expectations (e.g., expect not null email, expect order_total >= 0) that quarantine or drop bad rows instead of failing the run silently. Schedule batch and run structured streaming continuously off the same notebook logic so the two paths cannot diverge.

Infrastructure as Code. Treat the platform as code from day one. Terraform is the pragmatic default because it spans both the Azure control plane and the Databricks workspace (Unity Catalog metastore, catalogs, grants) through the databricks provider, which Bicep cannot reach. Reserve Bicep/ARM for pure-Azure resource groups and policy if your org standardizes there. A sane module split:

Keep state in a remote backend (an ADLS Gen2 container with a lease lock), gate apply behind a pipeline with plan review, and never put secrets in .tfvars — pull them from Key Vault at deploy time.

Identity and networking wiring. No storage account keys, anywhere. Compute reaches storage via managed identity: the Databricks access connector / Fabric workspace identity is granted Storage Blob Data Contributor on the account (or, better, on specific containers), and Unity Catalog storage credentials + external locations scope which paths a catalog can touch. Human access is Entra-group basedgrp-data-analyst, grp-data-engineer, grp-pii-reader — mapped to Unity Catalog roles and Power BI workspace roles, with PIM for just-in-time elevation to PII. Lock the data plane down with Private Link: private endpoints for the storage account, Synapse, Event Hubs, and the Databricks control/data plane (VNet injection + NCC), the storage firewall set to deny public, and Conditional Access requiring compliant devices for anyone touching sensitive catalogs.

Enterprise considerations

Security & Zero Trust. The Zero Trust posture rests on four pillars. Identity: every engine authenticates with Entra; no shared keys or SAS tokens for the data plane; Conditional Access + MFA + PIM gate privileged data. Network: Private Link end to end, storage firewall denying public, no public IPs on Databricks/Synapse. Data: encryption at rest (Microsoft- or customer-managed keys in Key Vault), TLS in transit, and column masking + row filters in Unity Catalog so a grp-analyst sees a tokenized SSN while grp-pii-reader sees the value — enforced at the table, not the report. Classification: Purview auto-classifies PII on scan and drives DLP/access policies, so sensitivity labels travel with the data into Power BI. Assume breach: audit logs from Storage, Databricks, Synapse, and Purview flow to a Log Analytics workspace feeding Microsoft Sentinel for anomaly detection (e.g., a service principal suddenly reading the entire customer table).

Cost optimization. The lakehouse’s economic advantage is decoupled storage and compute — exploit it. Keep Bronze on lifecycle tiers (Hot→Cool→Archive); it is the bulk of the bytes and rarely re-read. Use serverless/auto-stopping SQL and job clusters so you pay for seconds of query, not idle warehouses, and set aggressive auto-termination (10–15 min) on interactive clusters. Right-size the transformation: OPTIMIZE and good file sizing cut scan cost on every downstream query, and Direct Lake eliminates the compute of Power BI import refreshes entirely. Buy reserved capacity / pre-purchase DBUs or Fabric capacity for the steady-state baseline and burst on demand above it. Tag every resource by domain/cost-centre and watch it in Cost Management; the failure mode is a runaway notebook or an un-terminated GPU cluster, so set budgets and alerts.

Scalability. Compute scales horizontally and independently of storage: Spark autoscaling for transformation, serverless SQL scaling per query, Event Hubs throughput units for ingest spikes. Storage scales effectively without limit. The discipline that preserves performance at scale is table hygiene — partition by date, cluster on the columns you filter on, compact small files, and prune with VACUUM — because a lake of millions of tiny files will bring any engine to its knees regardless of cluster size.

Reliability & DR (RTO/RPO). Choose targets by layer, not one number for the platform. Storage is the crown jewel: use GZRS (geo-zone-redundant) on the account for zone + region durability, which gives a low RPO on the data itself. Delta time travel is your fast recovery from logical corruption — a bad MERGE is undone with RESTORE TABLE ... VERSION AS OF, RPO of one commit and RTO of minutes, no restore-from-backup. For regional DR, the pragmatic pattern is active/passive: storage geo-replicates, and the pipelines and compute (Databricks workspace, Fabric capacity, ADF) are re-deployable from Terraform into the paired region — RTO measured in the hour or two it takes to spin compute back up, because the expensive, stateful part (the data) is already there. Set RPO ≈ minutes for streaming Bronze (Event Hubs replay covers the gap) and RTO ≈ 1–4 hours for full analytics service, and write it down per layer.

Observability. Three layers of monitoring. Platform: Azure Monitor + Log Analytics for storage, Event Hubs, ADF, Databricks, and Synapse metrics and logs. Pipeline: DLT/Fabric run history, expectation pass/fail rates, and freshness SLAs (alert when Gold is older than N minutes). Data quality and lineage: Purview holds the lineage graph and data-quality scores, so when a number looks wrong you trace it from the Power BI visual back through Gold → Silver → Bronze → source file in a few clicks. Surface a freshness and quality dashboard the business actually looks at — it is the difference between trusting the platform and quietly rebuilding extracts on the side.

Governance. This is the pillar that makes it enterprise. Unity Catalog (or OneLake Catalog) is the access-control authority — grants live in code, reviewed in PRs. Purview is the system of record for what data exists, what it means, where it came from, and who may use it: the business glossary defines “active customer,” classification tags PII, lineage proves the board number’s provenance, and access policies are auditable. Certified Power BI datasets enforce one version of the metrics so two dashboards cannot disagree again.

Reference enterprise example

NorthPeak Outfitters is a fictional mid-market outdoor-gear retailer: 180 stores, an e-commerce site, ~₹4,200 crore annual revenue, ~40 million transactions a year, and a data team of nine (four engineers, three analysts, two data scientists). Their starting point was the scenario above — a SQL Server warehouse loading at 02:00, three copies of the customer table, a board revenue figure two dashboards disagreed on, and a CFO who had stopped trusting both.

Decisions they made.

Numbers and outcome.

The platform paid for itself not on cost alone but on trust — the CFO using one number again was the real ROI.

When to use it

Use this architecture when you have more than one analytics persona (SQL and data science), data arriving in both batch and streaming, a governance/audit obligation, and a desire to stop paying for compute to hold cold data. It is the right default for any organization consolidating multiple warehouses, lakes, and extract-copies into one governed platform — and it scales from a single team on a modest Fabric capacity up to a regulated enterprise on Databricks + Unity Catalog without changing shape.

Trade-offs and anti-patterns.

Alternatives to weigh. A classic dedicated Synapse / SQL warehouse if your world is genuinely all-SQL, all-batch, with predictable high-concurrency and no data science — simpler, but you pay for peak-sized coupled compute and lose the open-format flexibility. A pure streaming architecture (Azure Stream Analytics / Flink into a serving store) if your problem is overwhelmingly real-time with little historical/batch analytics — but you then lack the cheap, queryable history a lake gives you. Snowflake or BigQuery as a managed warehouse if you want to outsource more of the platform and accept a different (often higher and more opaque) cost and lock-in profile. For most enterprises spanning SQL, ML, batch, and streaming with a governance mandate, the Azure lakehouse on ADLS Gen2 + Delta with Databricks or Fabric is the architecture that satisfies the whole org from one copy of the data — which is exactly why the official architecture centres keep converging on it.

AzureArchitectureEnterpriseReference Architecture
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading