Azure Data

Azure Data Integration & Analytics: Data Factory, Synapse & Microsoft Fabric

Almost every “AI” or “analytics” project you will ever touch rests on the same unglamorous foundation: getting data out of the systems where it is created, into a place where it can be cleaned and combined, and out again in a shape that a dashboard, a report, or a model can use. That pipeline — ingest, store, transform, serve, govern — is the analytics data platform, and on Azure it is built from a small set of services you must know cold as an architect: Azure Data Factory, Azure Synapse Analytics, Azure Data Lake Storage Gen2, Microsoft Fabric, and Microsoft Purview.

This lesson is the senior-architect’s map of that landscape. We will walk the full data journey one stage at a time, build the mental model of the medallion architecture that organises a modern lake, and — most importantly for interviews and real projects — give you a clear, defensible answer to the question everyone is asking in 2026: “Synapse, Fabric, or Databricks — which do we use, and why?” Microsoft has made its strategic direction explicit, and you need to understand both where the platform is going and what to do with the large estates already running on the previous generation.

This is an Advanced lesson. We assume you already know the Azure basics (subscriptions, resource groups, regions, storage accounts, Entra identity) and have at least passing familiarity with SQL. You do not need to be a data engineer — we define every term — but by the end you should be able to design a platform, defend the choices, and speak fluently to the DP-900 and DP-700 certifications.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You should be comfortable with the Azure fundamentals — the tenant → subscription → resource group → resource hierarchy, storage accounts (Data Lake Gen2 is a storage-account feature), and Entra identity and RBAC, because a data platform is, at heart, identity plus storage plus compute. Basic SQL helps. This lesson sits in the Data module of the Azure Zero-to-Hero course and is deliberately broad-but-deep: rather than exhausting one service, it joins the whole platform together so you can reason about the system, not just its parts. It is the natural companion to the database deep-dives (Azure SQL, Cosmos DB) and the streaming services (Event Hubs).

Core concepts: the shape of a data platform

Before any service, fix the mental model. An analytics platform moves data through five stages. Everything Microsoft sells maps onto one or more of them.

Stage What it does Azure services
Ingest Copy/move data from sources (databases, SaaS apps, files, APIs, streams) into the platform Data Factory, Synapse pipelines, Fabric Data Factory, Event Hubs/IoT (streaming)
Store Hold raw and refined data cheaply and durably Data Lake Storage Gen2, Fabric OneLake, Blob
Transform Clean, join, aggregate, reshape — turn raw data into trustworthy tables Spark pools, Synapse/Fabric SQL, Databricks, Data Flows, dbt
Serve Expose modelled data for queries, reports and models with good performance Dedicated SQL pool / Fabric Warehouse, Power BI, Analysis Services
Govern Catalogue, classify, secure, and trace lineage across all of the above Microsoft Purview, Entra, RBAC, Private Link

Two cross-cutting ideas underpin everything:

A few more terms you will use constantly: ETL (Extract-Transform-Load — transform before loading, the classic warehouse pattern) versus ELT (Extract-Load-Transform — land raw data first, transform inside the platform using its compute, the modern lake pattern). Batch (process data on a schedule, e.g. nightly) versus streaming (process events continuously as they arrive). OLTP (transactional systems — your apps, optimised for many small writes) versus OLAP (analytical systems — your warehouse/lake, optimised for large aggregating reads). The whole platform exists to ferry data from OLTP systems into an OLAP store without slowing the apps down.

Ingest: Azure Data Factory and Synapse pipelines

Azure Data Factory (ADF) is Azure’s cloud data-integration (ETL/ELT orchestration) service. Think of it as a managed, serverless conductor: it does not store your data — it moves and orchestrates it. You define what to do and when, and ADF runs it on managed compute, scaling to zero when idle (you pay per activity run and per data-movement hour, not for an always-on server).

The same engine is embedded inside Azure Synapse Analytics as Synapse pipelines (and again inside Microsoft Fabric as Fabric Data Factory / Dataflows Gen2). The authoring experience and concepts are nearly identical across all three — learn ADF and you know all of them. The building blocks:

Concept What it is
Pipeline A logical container of activities that run as a unit — your workflow/DAG
Activity A single step. Three families: data movement (Copy), data transformation (Data Flow, Spark/SQL/Stored Proc, Databricks notebook), and control flow (ForEach, If, Until, Lookup, Execute Pipeline, Wait)
Dataset A named, typed view of data inside a store (a table, a folder of CSVs) — the what
Linked service A connection string / credential to a store or compute — the where/how to connect (200+ connectors)
Integration runtime (IR) The compute that actually executes the activity — the engine (see below)
Trigger What starts a pipeline: schedule, tumbling window (fixed, contiguous, back-fillable time slices), event (a blob lands / a custom event), or manual

The Copy activity is the workhorse — it bulk-moves data from a source to a sink, optionally with light mapping and type conversion, and is highly parallelised. For heavier reshaping without writing code, Mapping Data Flows give you a visual transformation canvas (joins, aggregates, derived columns, pivots) that ADF compiles to Apache Spark and runs on a managed cluster — so you get Spark power without managing Spark.

Integration runtimes — the most-tested ADF concept

The integration runtime (IR) is where an activity runs, and choosing the right one is a classic exam and interview question because it is where networking and security live.

IR type Where it runs Use it for
Azure IR Fully managed, serverless, in Azure (a region you pick or auto-resolved) Cloud-to-cloud copy and Data Flows when all endpoints are public/Azure. The default.
Self-hosted IR (SHIR) Software you install on a VM/machine inside your own network (on-prem or another VNet) Reaching private/on-prem sources (SQL Server, files, private endpoints) — it makes an outbound connection to ADF, so no inbound firewall holes
Azure-SSIS IR A managed cluster of VMs that runs the SSIS engine Lift-and-shift of existing SQL Server Integration Services packages to the cloud unchanged

The mental shortcut: cloud-only → Azure IR; on-prem or private data → Self-hosted IR; migrating old SSIS packages → Azure-SSIS IR. A self-hosted IR is also how you connect to data behind a private endpoint or a corporate firewall, because it dials out to Azure rather than requiring inbound access. For VNet-native private connectivity entirely within Azure, the Azure IR also supports a managed virtual network with managed private endpoints to your data stores.

ADF/Synapse pipelines also give you the production-grade plumbing you need around the moving of data: parameters and variables for reuse, integration with Git (Azure Repos/GitHub) for source-controlled pipelines with a publish branch, CI/CD via ARM templates, built-in monitoring of every run with retry and alerting, and managed-identity authentication so pipelines reach Key Vault, storage and databases without secrets in the definition.

Store: Data Lake Storage Gen2 and the medallion architecture

The store layer is Azure Data Lake Storage Gen2 (ADLS Gen2) — which is not a separate product but a capability of a standard storage account: you enable the hierarchical namespace (HNS) at creation and a flat blob store becomes a true file system with real directories, atomic directory rename/delete, and POSIX-style ACLs on files and folders. That hierarchy is what makes lake-scale analytics efficient (listing and securing a folder of a million files is cheap) and is the single setting that distinguishes a “data lake” from ordinary Blob storage. You get all the storage-account redundancy and tiering options (LRS/ZRS/GRS, Hot/Cool/Cold/Archive) underneath, so lake storage is cheap and durable by default.

A lake without organisation becomes a data swamp — a dumping ground nobody trusts. The industry-standard cure is the medallion architecture: organise the lake into three quality layers, each consuming the one before and improving it.

Layer Also called Contents Format Who uses it
Bronze Raw / landing Exact copies of source data, as ingested, append-only, immutable history Often source format (CSV/JSON) or Parquet Data engineers (reprocessing)
Silver Cleansed / conformed De-duplicated, validated, typed, joined into clean entity tables Delta/Parquet Engineers, data scientists
Gold Curated / serving Business-level aggregates and dimensional models, ready for reports Delta/Parquet, often a star schema Analysts, Power BI, business

The flow is always bronze → silver → gold, and each transition is a transformation job (Spark or SQL). Bronze preserves an immutable record of what arrived (so you can always reprocess if a bug is found downstream); silver is the clean, trustworthy single source of truth; gold is shaped for consumption — denormalised star schemas (fact and dimension tables) tuned for fast aggregation in dashboards. This separation is what keeps a lake usable for years: raw history is never lost, cleansing logic lives in one place, and consumers only ever see curated gold.

The de-facto open format for silver and gold is Delta Lake — Parquet files plus a transaction log that adds ACID transactions, time-travel (querying old versions), schema enforcement and evolution, and efficient upserts (MERGE) to plain files in the lake. Delta is what turns a “lake” into a “lakehouse”, and it is the native table format of both Microsoft Fabric and Databricks, which is why those two interoperate so cleanly over the same files.

Transform & serve: the engine landscape (Synapse, Fabric, Databricks)

Storage is cheap and standard; the interesting — and most-asked — decisions are about the engine that transforms and serves the data. There are two engine families, and the major platforms package them differently.

Azure Synapse Analytics (the previous-generation unified service)

Azure Synapse Analytics was Microsoft’s bid to put all of these under one roof in Synapse Studio: pipelines (the ADF engine), dedicated SQL pools (the former SQL Data Warehouse — an MPP, massively-parallel-processing warehouse you provision in DWUs), serverless SQL pools (query lake files on demand, billed per TB scanned), and Apache Spark pools. It is a genuinely capable platform and powers thousands of production estates. But Microsoft’s investment has clearly shifted: Synapse is now in maintenance mode and its capabilities are being delivered through Microsoft Fabric instead. You will still meet Synapse constantly in existing estates and on the DP-900/DP-700 exams, so you must understand it — but you would not start a new platform on it today.

Microsoft Fabric (the strategic direction)

Microsoft Fabric is Microsoft’s unified, SaaS analytics platform and the explicit strategic direction for data and analytics on Microsoft’s stack. It folds the whole platform — Data Factory, data engineering (Spark), data warehouse, real-time intelligence, data science, and Power BI — into a single product with one capacity-based licence (Fabric F-SKUs, billed by capacity units, with pause/resume). The keys to understanding Fabric:

Azure Databricks (the best-in-class Spark lakehouse)

Azure Databricks is a first-party Azure service (co-engineered with Databricks, the company that created Apache Spark, Delta Lake, and the lakehouse concept). It is the premier platform for large-scale Spark and machine-learning/AI workloads, with Unity Catalog for governance, Photon (a vectorised query engine), and MLflow for the ML lifecycle. It is multi-cloud (the same product runs on AWS and GCP), which is a deciding factor for organisations that refuse to be locked to one cloud. It excels where the work is heavy data engineering and data science; for self-service BI deeply integrated with Power BI and Office, Fabric is the more natural home.

Choosing between them — the decision an architect must defend

Dimension Microsoft Fabric Azure Synapse Azure Databricks
Strategic status Strategic direction; actively invested Maintenance/legacy; superseded by Fabric Premier Spark/ML platform; multi-cloud
Delivery model SaaS — minimal provisioning, capacity (F-SKU) PaaS — provision pools (DWU/Spark/serverless) PaaS — provision workspaces/clusters
Sweet spot Unified analytics + Power BI for business; analyst-friendly lakehouse Existing MPP warehouse + lake estates Large-scale Spark/ML/AI, advanced data science
Storage OneLake (one copy, Delta) ADLS Gen2 + dedicated pool storage ADLS Gen2 (Delta)
Governance Purview + Fabric-native Purview Unity Catalog (+ Purview)
Multi-cloud No (Azure/M365) No Yes (AWS/GCP/Azure)
Best for BI Excellent (DirectLake, native Power BI) Good (via Power BI) Good (via Power BI/Databricks SQL)

The architect’s heuristics:

Note these are not mutually exclusive: a very common 2026 architecture is Databricks for engineering/ML + Fabric/Power BI for serving, both reading the same Delta tables in the lake. Open formats are precisely what make that possible.

Serve: warehouse and Power BI

“Serving” is making curated gold data fast and friendly for consumers. Two layers:

Mode How it works Trade-off
Import Data is copied into Power BI’s in-memory (VertiPaq) engine on refresh Fastest queries, but data is a snapshot and limited by memory; needs scheduled refresh
DirectQuery Every visual issues a live query to the source Always current, no size limit, but slower and load on the source
DirectLake (Fabric) Reads Delta tables directly from OneLake in memory, no import, no per-query round-trip Import-grade speed and live data — the best of both, Fabric-only

DirectLake is the reason Fabric is so compelling for BI: it removes the historic choice between “fast but stale” (Import) and “fresh but slow” (DirectQuery). Power BI content is organised into datasets/semantic models (the modelled tables + relationships + DAX measures), reports (the visuals), and dashboards, published to workspaces and shared via apps, with row-level security (RLS) controlling who sees which rows.

Azure data & analytics platform

The diagram traces a record’s whole journey: sources flow through Data Factory ingestion (with the right integration runtime) into the bronze layer of a Data Lake Gen2 / OneLake medallion, are refined bronze → silver → gold by Spark and SQL engines, then surface through a warehouse and Power BI — with Purview governing and cataloguing everything end to end.

Govern: Microsoft Purview

A platform that moves an organisation’s data is worthless — and dangerous — without governance: knowing what data you have, where it is, who can see it, whether it is sensitive, and where it came from. Microsoft Purview is Azure’s unified data governance and compliance suite, and it spans the whole platform (and beyond Azure).

The architect’s takeaway: wire Purview in from day one. Retrofitting a catalogue and lineage onto a sprawling, ungoverned lake is painful; building governance in from the start keeps the platform trustworthy as it grows. In Fabric, governance is increasingly native and Purview-integrated out of the box.

Hands-on lab: a minimal lake + a Data Factory copy pipeline

We will build the smallest meaningful slice of the platform with the az CLI: a Data Lake Gen2 account (the store), a Data Factory (the ingest engine), and a Copy pipeline that lands a public sample file into a bronze container. It is essentially free (a few KB of storage and one tiny pipeline run) and we clean up at the end.

Prerequisites: an Azure subscription and Cloud Shell (or local az). Ensure the Data Factory CLI extension is present:

az extension add --name datafactory --upgrade

1. Set variables and create a resource group.

RG=rg-dataplatform-lab
LOC=centralindia
SA=lakedemo$RANDOM           # storage account names must be globally unique, lowercase
ADF=adf-dataplatform-$RANDOM

az group create --name $RG --location $LOC

2. Create the Data Lake Gen2 account (hierarchical namespace ON) and the medallion containers.

az storage account create \
  --name $SA --resource-group $RG --location $LOC \
  --sku Standard_LRS --kind StorageV2 \
  --hierarchical-namespace true            # <-- this is what makes it a Data Lake

# create bronze / silver / gold filesystems (containers)
for layer in bronze silver gold; do
  az storage fs create --name $layer --account-name $SA --auth-mode login
done

--hierarchical-namespace true is the single setting that turns a storage account into ADLS Gen2. Expected: three filesystems listed by az storage fs list --account-name $SA --auth-mode login -o table.

3. Create the Data Factory.

az datafactory create --resource-group $RG --factory-name $ADF --location $LOC

4. Author the linked service, datasets and a Copy pipeline. ADF objects are JSON. We copy a public sample CSV (an HTTP source) into the bronze container — a tiny end-to-end ingest.

# Linked service -> our lake (managed identity auth)
SA_DFS="https://$SA.dfs.core.windows.net/"
az datafactory linked-service create -g $RG --factory-name $ADF --name lsLake --properties "{
  \"type\": \"AzureBlobFS\",
  \"typeProperties\": { \"url\": \"$SA_DFS\" }
}"

# Grant the factory's managed identity write access to the lake
ADF_MI=$(az datafactory show -g $RG -n $ADF --query identity.principalId -o tsv)
SA_ID=$(az storage account show -g $RG -n $SA --query id -o tsv)
az role assignment create --assignee $ADF_MI \
  --role "Storage Blob Data Contributor" --scope $SA_ID

# Linked service -> public HTTP source
az datafactory linked-service create -g $RG --factory-name $ADF --name lsHttp --properties '{
  "type": "HttpServer",
  "typeProperties": { "url": "https://raw.githubusercontent.com/", "enableServerCertificateValidation": true, "authenticationType": "Anonymous" }
}'

Authoring full dataset + pipeline JSON is verbose on the CLI; in practice you would use the Data Factory Studio visual authoring (Author → new pipeline → drag a Copy data activity, pick source = HTTP dataset, sink = the bronze filesystem, then Debug to run). The CLI above provisions the linked services and identity; finish the Copy activity in Studio and click Debug, or import the pipeline JSON via az datafactory pipeline create.

5. Validate. In Studio, open Monitor → Pipeline runs and confirm the run Succeeded, then check the file landed:

az storage fs file list --file-system bronze --account-name $SA --auth-mode login -o table

You have now exercised the core loop: ingest (ADF) → store (bronze in the lake). From here, a real platform would add a Spark/SQL job to refine bronze→silver→gold, a serving warehouse, and Power BI — but the orchestration and storage pattern is exactly what you just built.

Cleanup — delete the whole resource group so nothing lingers on the bill:

az group delete --name $RG --yes --no-wait

Cost note: at lab scale this is effectively free — a Data Factory has no standing charge (you pay per activity run, ~₹0.001s of rupees here, and per data-movement DIU-hour), and a few KB in Standard LRS storage is negligible. The expensive parts of a real platform are always-on compute (a dedicated SQL pool, a running Spark cluster, a Fabric capacity) — so the cardinal cost rule is pause/stop compute when idle and prefer serverless/consumption for spiky work.

Common mistakes & troubleshooting

Symptom Likely cause Fix
ADF Copy from on-prem SQL fails / times out Using the Azure IR to reach a private source Install a Self-hosted IR on a machine inside that network and bind the linked service to it
Lake has millions of tiny files; queries crawl “Small-file problem” — many KB-sized files kill Spark/SQL performance Compact to larger Parquet/Delta files (OPTIMIZE); partition sensibly; avoid one-file-per-row writes
Storage account behaves like flat blob, ACLs missing Hierarchical namespace not enabled at creation HNS can’t be toggled later — create a new ADLS Gen2 account with HNS on and migrate
Pipeline can’t authenticate to storage/SQL Secrets in JSON, or managed identity lacks a role Use the factory’s managed identity + Key Vault references; grant Storage Blob Data Contributor/SQL role
Power BI report is always stale Import mode without a scheduled refresh Schedule dataset refresh, or switch to DirectLake (Fabric) / DirectQuery for live data
Dedicated SQL pool bill is huge even when unused Pool is provisioned and running 24×7 Pause the dedicated SQL pool when idle; or use serverless for intermittent querying
“Data swamp” — nobody trusts the lake No medallion structure, no catalogue Adopt bronze/silver/gold; scan with Purview so assets are discoverable with lineage
Started a new project on Synapse Unaware Synapse is superseded New work belongs in Fabric; reserve Synapse for operating existing estates

Best practices

Security notes

A data platform concentrates an organisation’s most sensitive information, so security is non-negotiable:

Interview & exam questions

Practise saying these answers out loud — they are the ones that come up.

Quick check

  1. Which single storage-account setting turns a plain blob account into a Data Lake Gen2, and can it be changed later?
  2. You must copy data from an on-prem SQL Server into the lake. Which integration runtime do you use, and why is it firewall-friendly?
  3. Name the three medallion layers in order and say, in one phrase, what each is for.
  4. A Power BI report must show live lake data at near-Import speed. Which storage mode, and on which platform?
  5. For a brand-new, Microsoft-centric, BI-led analytics platform in 2026, which service is the strategic default — and what is the right disposition for an existing Synapse estate?

Answers

  1. Hierarchical namespace (HNS), enabled at creation. It cannot be toggled on later — you create a new ADLS Gen2 account with HNS on and migrate.
  2. A Self-hosted Integration Runtime installed inside the on-prem network. It makes an outbound connection to ADF, so no inbound firewall ports need opening.
  3. Bronze = raw/immutable as-ingested; Silver = cleansed/conformed/trustworthy; Gold = curated business aggregates/star schema for serving. Order: bronze → silver → gold.
  4. DirectLake mode, on Microsoft Fabric (reads Delta directly from OneLake — Import-grade speed with live data).
  5. Microsoft Fabric is the strategic default. Keep operating the existing Synapse estate but plan its migration to Fabric; start no new workloads on Synapse.

Exercise

Design (on paper, then sketch the resources/az outline) an end-to-end platform for a small retailer that has: an on-prem SQL Server of orders, a SaaS CRM with a REST API, and a stream of website clickstream events. Requirements: a single trustworthy analytics store, a daily refreshed sales dashboard in Power BI, and PII (customer email) must be discoverable and protected.

Produce: (a) the ingest design — which sources use which integration runtime, and batch vs streaming for each; (b) the lake design — the medallion containers and what lands in bronze/silver/gold; © the transform/serve engine choice with a one-line justification (Fabric? Synapse? Databricks? serverless SQL?); (d) the Power BI storage mode and why; (e) the governance plan — what Purview scans, classifies and traces; and (f) three security controls (identity, networking, access). Bonus: name one place where the small-file problem could bite and how you’d prevent it.

Certification mapping

DP-900 (Azure Data Fundamentals):

DP-700 (Fabric Data Engineer Associate):

(The platform concepts here also underpin DP-203-style data-engineering and AZ-305 “design data storage solutions” scenarios.)

Glossary

Next steps

AzureData FactorySynapseMicrosoft FabricData LakeDP-700
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading