Azure Data Engineering

DP-203: End-to-End Azure Data Engineering — Ingest, Store, Transform, Serve & Stream

Knowing the names of the Azure analytics services is one thing; being able to assemble them into a working pipeline that ingests yesterday’s orders, cleanses them overnight, joins a live clickstream, and lights up a dashboard by 8 a.m. is another entirely. This lesson is the second of those skills. It is a senior data engineer’s build walkthrough of an end-to-end Azure data platform — the kind of thing you design in an interview on a whiteboard and then implement for real on Monday.

We will follow data through the five stages every platform shares — ingest, store, transform, serve — with streaming running alongside as a first-class path, and orchestration, CI/CD, security and governance wrapped around the whole thing. At each stage we do not merely name the service: we choose between the realistic options (Data Factory vs Synapse pipelines, dedicated vs serverless SQL, Stream Analytics vs Spark Structured Streaming, Synapse vs Microsoft Fabric vs Databricks), say why, and show the az/JSON shape of how it is wired. This is the build-skills companion to the platform overview, and it maps directly to the DP-203: Data Engineering on Microsoft Azure objectives — ingestion and transformation, securing and monitoring data storage and processing, and (increasingly) the Microsoft Fabric-era successor skills.

A note on the 2026 landscape. Microsoft has frozen new investment in standalone Synapse Analytics and made Microsoft Fabric its strategic data platform; the DP-203 exam is being retired in favour of Fabric-centric certifications (DP-700, DP-600). We teach the DP-203 build because the concepts are identical — pipelines, medallion, Delta, Spark, streaming, serving — and the huge installed base of Synapse/ADF estates is exactly what you will operate and migrate. Every pattern here translates one-to-one to Fabric, and we flag the Fabric equivalent as we go.

Learning objectives

By the end of this lesson you can:

Prerequisites

You should be comfortable with the Azure fundamentals — the tenant → subscription → resource group → resource hierarchy, storage accounts (ADLS Gen2 is a storage-account feature), and Entra identity and RBAC, because a data platform is identity plus storage plus compute. Reading SQL and a little Python/PySpark helps but is not required — we explain every snippet. If you want the conceptual map first (what each service is and the Synapse-vs-Fabric-vs-Databricks decision in depth), read the companion Data Factory, Synapse & Microsoft Fabric deep dive — this lesson assumes that map and concentrates on building the pipeline. It sits in the Data Engineering module of the Azure Zero-to-Hero course and is an Advanced lesson; the natural predecessor in the certification ladder is AI-102: Building Production AI.

Core concepts: the shape of an end-to-end pipeline

Before any service, fix the mental model. A data platform is a conveyor belt with five stations. Data is pushed onto the belt by ingestion, parked in the lake, refined by transformation, and handed off to consumers by the serving layer — with a fast lane (streaming) running in parallel for events that cannot wait for the nightly batch.

Stage Job Primary Azure services (Synapse era → Fabric era)
Ingest Pull/push data from sources into the lake Data Factory / Synapse pipelines → Fabric Data Factory, Dataflows Gen2; Event Hubs/IoT (streaming)
Store Hold raw and refined data cheaply, durably, in open formats ADLS Gen2 (Delta/Parquet) → OneLake
Transform Cleanse, conform, aggregate — raw → trustworthy Synapse Spark & SQL pools, Databricks → Fabric notebooks / SQL / Dataflows
Serve Expose curated, modelled data with good performance Dedicated/serverless SQL → Fabric Warehouse/Lakehouse, Power BI (DirectLake)
Stream Process events continuously, low latency Event HubsStream Analytics / Spark Structured Streaming → Fabric Eventstream / Eventhouse (KQL)

Four cross-cutting ideas underpin everything and recur in every interview:

A few terms you will use constantly: OLTP (transactional source systems — your apps, many small writes) vs OLAP (the analytics store — large aggregating reads); CDC (Change Data Capture — capturing only inserts/updates/deletes since last run instead of re-reading the whole table); watermark (the high-water value, e.g. a LastModified timestamp, that marks where the last load stopped); and schema-on-read (the lake imposes structure when you read, not when you write).

Ingest: Data Factory, Synapse pipelines & integration runtimes

Azure Data Factory (ADF) is Azure’s serverless data-integration (ELT orchestration) service — a managed conductor that moves and orchestrates data without storing it. The same engine is embedded in Synapse as Synapse pipelines and in Microsoft Fabric as Fabric Data Factory / Dataflows Gen2; the authoring concepts are near-identical, so learn one and you know all three. Use standalone ADF when integration is the whole job; use Synapse pipelines when ingestion lives next to your Spark/SQL analytics in one workspace; use Fabric for new, BI-led platforms.

The building blocks:

Concept What it is
Pipeline A workflow — an ordered, branching graph of activities
Activity One step (Copy, Data Flow, notebook, stored proc, Lookup, ForEach, If, Until, Web)
Dataset A named, typed view of data in a store (a table, a folder of CSVs)
Linked service A connection string / credential to a store or compute (the “data source”)
Trigger What starts a pipeline — schedule, tumbling window, storage event, or manual
Integration runtime (IR) The compute that actually executes an activity

The integration runtime is the single most-tested ingestion concept because it decides where the work runs and what it can reach:

Integration runtime Where it runs Use it for Gotcha
Azure IR Microsoft-managed, in Azure Cloud-to-cloud copy, public endpoints, Data Flow execution (Spark under the hood) Auto-resolves region; pick region for data residency/latency
Self-hosted IR (SHIR) A VM/host you install inside your network On-prem or private sources (SQL Server, file shares, private endpoints) Dials outbound to ADF (firewall-friendly — no inbound ports); install ≥2 nodes for HA
Azure-SSIS IR Managed cluster of SSIS nodes in Azure Lift-and-shift existing SSIS packages unchanged Always-on cluster — pause/stop when idle or it bills 24×7

Interview classic — “How do you copy data from an on-prem SQL Server into the lake?” Install a Self-hosted Integration Runtime inside the on-prem network; it makes an outbound HTTPS connection to ADF, so no inbound firewall ports are opened. Use it as the IR on the source linked service.

Full vs incremental & CDC

Re-copying an entire table every night does not scale. The two patterns:

On the lake side, both patterns land into Delta and use MERGE (upsert) so a re-run is idempotent — late or duplicate rows update in place rather than appending.

// Copy activity source — incremental query using pipeline parameters
"source": {
  "type": "AzureSqlSource",
  "sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE LastModified > '@{pipeline().parameters.oldWatermark}' AND LastModified <= '@{pipeline().parameters.newWatermark}'"
}

Store: ADLS Gen2, the medallion & Delta/Parquet

The lake is Azure Data Lake Storage Gen2 (ADLS Gen2) — a storage account with the hierarchical namespace (HNS) enabled, which gives real folders and POSIX-style ACLs instead of a flat blob namespace. HNS must be enabled at creation and cannot be toggled on later — a favourite exam gotcha. Organise the lake with the medallion architecture: three layers (containers) of rising data quality.

Layer Also called Contents Format Who writes it
Bronze Raw / landing Data exactly as ingested, immutable, append-only, partitioned by load date Parquet/Delta (or raw CSV/JSON) Ingestion pipelines
Silver Cleansed / conformed De-duplicated, typed, validated, conformed keys; one row per business entity Delta Transform jobs
Gold Curated / serving Business-level aggregates, star-schema facts & dimensions, ready for reporting Delta Transform jobs

File format is a deliberate choice:

Format What it is When
CSV/JSON Row-based text Only at the very edge (raw bronze ingest); never for analytics — slow, untyped, huge
Parquet Columnar, compressed, typed Good general analytics format; great scan/compression
Delta Lake Parquet + a transaction log The default for silver/gold — adds ACID transactions, time-travel, schema enforcement/evolution and efficient MERGE/upserts; the open format Synapse, Fabric and Databricks all share

Partitioning is how you make the lake fast: physically split data into folders by a frequently-filtered column — usually date (/year=2026/month=06/day=15/) — so engines prune (skip) irrelevant partitions instead of scanning everything. Partition on columns you filter on, keep partition counts sane (avoid thousands of tiny partitions), and watch the small-file problem — many tiny files crush performance, so compact them (Delta OPTIMIZE) into ~128 MB–1 GB files.

# Land bronze → write Delta silver, partitioned, with an idempotent upsert
from delta.tables import DeltaTable
df = spark.read.format("delta").load("abfss://bronze@lake.dfs.core.windows.net/orders")
silver_path = "abfss://silver@lake.dfs.core.windows.net/orders"
(DeltaTable.forPath(spark, silver_path).alias("t")
  .merge(df.alias("s"), "t.order_id = s.order_id")
  .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute())   # idempotent

Transform (batch): Synapse Spark, SQL pools & Databricks

Transformation turns bronze into trustworthy silver and curated gold. You have three engines; choosing well is a core DP-203 skill.

Engine Language Best for Billing model Pick it when
Synapse Spark pool PySpark/Scala/Spark SQL Big, complex, semi-structured transforms; ML feature prep Per vCore-hour while the pool is active (auto-pause when idle) Heavy/flexible engineering inside Synapse
Synapse SQL — serverless T-SQL over lake files Ad-hoc exploration, light transforms, CREATE EXTERNAL TABLE/CETAS Per TB scanned, no infra Intermittent SQL over the lake; cheapest for occasional queries
Synapse SQL — dedicated pool T-SQL, MPP warehouse Steady, heavy warehouse loads and serving Provisioned DWU, billed 24×7 (pause when idle) Predictable high-volume warehouse workloads
Azure Databricks PySpark/Scala/SQL, Photon Premier Spark/ML/AI lakehouse; multi-cloud; large-scale engineering Per DBU + VM Serious data engineering/data science, multi-cloud, or where Synapse Spark falls short

The realistic 2026 pattern: Spark (Synapse or Databricks) for the bronze→silver→gold transforms, serverless SQL for cheap ad-hoc exploration, and a dedicated pool or Fabric Warehouse only when a high-concurrency relational serving layer is required. Synapse Spark and Mapping Data Flows (ADF’s visual, code-free Spark) cover most cleansing; drop to notebooks for complex logic.

Typical silver→gold transform steps: deduplicate, enforce/cast types, conform keys (map source codes to canonical dimension keys), handle slowly-changing dimensions (SCD), then aggregate into gold facts. Spark SQL or T-SQL both express this; the lake’s Delta format gives you MERGE for SCD Type-2 history.

Interview classic — “Dedicated vs serverless SQL pool?” Dedicated is a provisioned, always-on MPP warehouse sized in DWUs — fast and predictable for steady heavy loads, but you pay 24×7 so pause it when idle. Serverless queries lake files on demand and bills per TB scanned with no infrastructure — ideal for ad-hoc and intermittent work.

Stream: Event Hubs + Stream Analytics / Spark Structured Streaming

Some data cannot wait for the nightly batch — fraud signals, IoT telemetry, clickstream. The streaming fast lane has three parts: an ingestion buffer, a stream processor, and a sink (usually the lake and/or a hot store).

Azure Event Hubs is the ingestion buffer — a massively-scalable, Kafka-protocol-compatible event broker that absorbs millions of events/sec into partitions and retains them for a window so consumers can read at their own pace. (IoT Hub sits in front for device fleets; Event Hubs Capture can auto-archive raw events straight to the lake as bronze.) Then you pick a processor:

Azure Stream Analytics (ASA) Spark Structured Streaming (Synapse/Databricks)
Language SQL-like query PySpark/Scala
Ops model Fully managed, serverless (Streaming Units) You run a Spark cluster/job
Best for Quick, declarative windowed aggregations, routing, alerting Complex logic, ML scoring, joining streams to large lake tables, unified batch+stream code
Windowing Tumbling, hopping, sliding, session, snapshot Tumbling/sliding + custom; watermarks for late data
Sinks Power BI (live), SQL, ADLS, Cosmos, Event Hubs, Functions Anything Spark writes — Delta lake, DB, etc.

Windowing is the heart of stream processing — you cannot aggregate an infinite stream without bounding it in time:

Window Behaviour Example
Tumbling Fixed, non-overlapping buckets Count orders per 5-minute block
Hopping Fixed size, overlapping by a hop 10-min totals emitted every 5 min
Sliding Emits when events enter/leave the window Alert if >100 events in any 1-min window
Session Groups bursts separated by a gap of inactivity A user’s browsing session

Two more must-knows: a watermark tells the engine how long to wait for late-arriving events before closing a window (trade latency vs completeness), and exactly-once processing (checkpointing + idempotent sinks, e.g. Delta) prevents double-counting on failure/replay. In Fabric, this whole path becomes EventstreamEventhouse (KQL).

-- Azure Stream Analytics: orders per product per 5-minute tumbling window
SELECT productId, System.Timestamp() AS windowEnd, COUNT(*) AS orders
INTO   [powerbi-output]
FROM   [eventhub-input] TIMESTAMP BY eventTime
GROUP  BY productId, TumblingWindow(minute, 5)

Serve: dedicated SQL, Microsoft Fabric & Power BI

The serving layer exposes gold data to humans and tools with good performance and security. Options:

Serving surface What it is When
Synapse serverless SQL Pay-per-TB T-SQL over gold Delta/Parquet (views, external tables) Cheap, intermittent BI/ad-hoc on the lake
Synapse dedicated SQL pool Provisioned MPP warehouse; load gold into relational tables High-concurrency, low-latency enterprise reporting
Microsoft Fabric Warehouse / Lakehouse SQL endpoint SaaS warehouse over OneLake The 2026 default for new platforms
Power BI The BI/visualisation layer on top of any of the above Always — the dashboard the business actually sees

Power BI storage modes are the serving decision most often probed: Import (data copied into Power BI — fastest, but stale until refresh), DirectQuery (queries the source live — fresh, but slower and load-heavy), and DirectLake (a Fabric-only mode that reads Delta tables directly from OneLake in memory — Import-grade speed with live data, no import refresh or DirectQuery round-trip). DirectLake resolves the classic Import-vs-DirectQuery trade-off and is a key reason new platforms target Fabric.

Interview classic — “A report must show live lake data at near-Import speed. What do you choose?” DirectLake mode on Microsoft Fabric — it reads Delta straight from OneLake, giving Import-speed performance on live data.

Orchestrate, CI/CD, secure & govern

A pipeline that only runs when you click “Debug” is a prototype. Production needs the wrapper.

Orchestration. Pipelines are driven by triggers: schedule (wall-clock), tumbling window (fixed, dependency-aware, back-fillable slices — the right choice for incremental loads), storage-event (a new blob arrives), and manual. Inside a pipeline you express dependencies (activity success/failure/completion/skip), loops (ForEach, Until), branching (If/Switch) and parameters; chain pipelines with Execute Pipeline. A master pipeline typically: ingest → transform → load serving → refresh Power BI dataset, each step gated on the previous succeeding.

CI/CD. Never author straight into the Live (published) mode in production. The flow:

  1. Connect the ADF/Synapse workspace to a Git repo (Azure Repos or GitHub) — work happens on feature branches in collaboration mode.
  2. Merge to the collaboration branch; publish generates ARM templates (the adf_publish/workspace_publish branch) describing every pipeline, dataset and linked service.
  3. A release pipeline (Azure DevOps/GitHub Actions) deploys those ARM templates to dev → test → prod, overriding environment-specific values (connection strings, IRs) via parameters/override files. Infra (storage, Synapse, Key Vault, networking) is itself defined in Bicep/Terraform.
# Provision the lake (HNS on) and a Synapse workspace — infra-as-code outline
az storage account create -n kvlakedp203 -g rg-data \
  --sku Standard_LRS --kind StorageV2 --hns true            # HNS = ADLS Gen2
az synapse workspace create -n kv-syn-dp203 -g rg-data \
  --storage-account kvlakedp203 --file-system lake \
  --sql-admin-login-user sqladmin --sql-admin-login-password '<pwd>' \
  --location southindia

Security. Apply Zero-Trust to data: authenticate every hop with managed identity / Microsoft Entra (never account keys or passwords in pipelines — secrets live in Key Vault and are referenced by linked services); enforce least privilege with RBAC + ADLS ACLs; keep traffic off the public internet with private endpoints (and a managed VNet with managed private endpoints for Synapse/ADF) or a Self-hosted IR for on-prem; encrypt at rest (platform or customer-managed keys) and in transit. Lock storage account public network access down and reach it via Private Link.

Governance. Microsoft Purview provides the catalogue and control plane: automatically scan sources to build a discoverable data map, classify and sensitivity-label PII, trace end-to-end lineage (source → transform → report), and centralise access policies and insights — the cure for an ungoverned “data swamp”. Lineage in particular is what lets you answer “where did this number in the board report come from?”.

End-to-end Azure data engineering pipeline

The diagram traces a single record’s journey: an on-prem source crosses a Self-hosted IR into bronze, batch and streaming paths converge through silver into gold Delta tables, and the dedicated SQL pool / Fabric Warehouse serves Power BI — all wrapped by orchestration, managed identity, private endpoints and Purview.

Hands-on lab

We will build a mini end-to-end pipeline entirely from the az CLI and free/low-cost tiers: create an ADLS Gen2 lake with medallion containers, stand up a serverless SQL endpoint over the lake, run an incremental Copy, and query the result — then clean up. (Spark and dedicated pools cost real money per hour, so the lab uses serverless SQL and tiny data to stay near-free.)

Prerequisites: an Azure subscription and the az CLI logged in (az login).

1. Resource group + ADLS Gen2 lake with medallion containers.

az group create -n rg-dp203-lab -l southindia

az storage account create -n kvdp203lab$RANDOM -g rg-dp203-lab \
  --sku Standard_LRS --kind StorageV2 --hns true            # HNS => ADLS Gen2
# capture the exact name it created:
SA=$(az storage account list -g rg-dp203-lab --query "[0].name" -o tsv)

for layer in bronze silver gold; do
  az storage fs create -n $layer --account-name $SA --auth-mode login
done
az storage fs list --account-name $SA --auth-mode login -o table   # expect 3 rows

Expected: a storage account with isHnsEnabled: true and three filesystems (bronze/silver/gold).

2. Land a tiny bronze dataset (stand-in for an ingested file).

printf 'order_id,product,amount,LastModified\n1,pen,40,2026-06-15T09:00:00Z\n2,book,250,2026-06-15T09:05:00Z\n' > orders.csv
az storage fs file upload -f bronze -s orders.csv \
  -p year=2026/month=06/day=15/orders.csv --account-name $SA --auth-mode login

Expected: the file lands under a date-partitioned path in bronze.

3. Query the lake with serverless SQL (the cheapest transform/serve surface). Create a Synapse workspace (its serverless SQL endpoint is always-on and billed only per TB scanned):

az synapse workspace create -n kvsyndp203$RANDOM -g rg-dp203-lab \
  --storage-account $SA --file-system bronze \
  --sql-admin-login-user sqladmin --sql-admin-login-password 'P@ssw0rd-Change-Me!' \
  -l southindia
az synapse workspace firewall-rule create --workspace-name <wsName> -g rg-dp203-lab \
  -n allowAll --start-ip-address 0.0.0.0 --end-ip-address 255.255.255.255   # lab only!

Then in the Synapse serverless SQL endpoint (Synapse Studio or sqlcmd), run an OPENROWSET over the lake — this is a real “serve straight from the lake” query, no data loaded:

SELECT product, SUM(amount) AS total
FROM OPENROWSET(
  BULK 'https://<SA>.dfs.core.windows.net/bronze/year=2026/month=06/day=15/orders.csv',
  FORMAT='CSV', PARSER_VERSION='2.0', HEADER_ROW=TRUE) AS rows
GROUP BY product;

Expected: two rows — book 250, pen 40.

Validation: you ingested to bronze, partitioned by date, and served an aggregate from the lake with zero provisioned compute — the medallion + serverless pattern in miniature.

Cleanup (important — do this to stop charges):

az group delete -n rg-dp203-lab --yes --no-wait

Cost note (INR). Run end to end and tidied up within the hour this stays in the few-rupees range: ADLS Gen2 LRS storage on this data is effectively nil; serverless SQL bills per TB scanned and this query scans kilobytes (~₹0); the Synapse workspace itself is free (you pay for the pools, which we never started). The expensive things — dedicated SQL pools (DWU, ~₹thousands/day), Spark pools and Stream Analytics SUs — are all avoided here. The one rule that saves real money: pause dedicated pools and auto-pause Spark pools the instant they are idle, and never leave an Azure-SSIS IR running.

Common mistakes & troubleshooting

Symptom Likely cause Fix
“Cannot enable hierarchical namespace” on an existing account HNS is a create-time-only setting Create a new ADLS Gen2 account with --hns true and migrate data
On-prem Copy fails / times out No Self-hosted IR, or it cannot reach ADF Install SHIR inside the network (outbound 443); register it on the source linked service; add a 2nd node for HA
Re-running a load double-counts rows Append instead of upsert; no watermark Use Delta MERGE keyed on the business key + a watermark to load only new rows
Spark/SQL queries are mysteriously slow Small-file problem / no partition pruning OPTIMIZE (compact) Delta tables; partition on the filtered column; avoid one-file-per-record writes
Dedicated SQL pool bill is huge Pool left running 24×7 Pause the dedicated pool when idle; consider serverless for intermittent loads
Stream Analytics job lags / drops events Under-provisioned Streaming Units or skewed Event Hub partitions Scale up SUs; align query partitioning with Event Hub partition key; tune watermark/late-arrival
Pipeline runs in dev but the prod deploy points at dev sources Environment values not parameterised in CI/CD Override linked-service/IR values per environment via ARM template parameters
“Login failed” / key leaked in a linked service Using account keys/passwords instead of identity Switch to managed identity; store any unavoidable secret in Key Vault and reference it

Best practices

Security notes

Interview & exam questions

Quick check

  1. Why must hierarchical namespace be decided at storage-account creation, and what do you do if an existing account lacks it?
  2. You need only the rows changed since last night, including deletes, from an Azure SQL source. Which technique, and what lands in the lake to keep re-runs idempotent?
  3. Name the three medallion layers in order and the file format you would use for silver/gold.
  4. A clickstream must drive a live Power BI tile updated every few seconds with a 5-minute rolling count. Which streaming services and which window type?
  5. For a brand-new, Microsoft-centric, BI-led platform in 2026, which serving/platform choice is the strategic default, and which Power BI storage mode gives live data at Import speed?

Answers

  1. HNS is a create-time-only setting — it cannot be toggled on later; you create a new ADLS Gen2 account with HNS enabled and migrate the data.
  2. Native Change Data Capture (CDC) (it captures inserts/updates/deletes); land into Delta and apply with MERGE (upsert), keyed on the business key, so re-runs converge instead of duplicating.
  3. Bronze → silver → gold; use Delta for silver and gold (Parquet + transaction log).
  4. Event HubsStream Analytics (or Spark Structured Streaming) → Power BI live output, using a hopping window (5-minute size, short hop) — or a sliding/tumbling window depending on exact semantics.
  5. Microsoft Fabric (Warehouse/Lakehouse on OneLake) is the strategic default; DirectLake gives live data at Import-grade speed.

Exercise

Design (on paper, then sketch the az/JSON outline) an end-to-end platform for a mid-size retailer with three sources: an on-prem SQL Server of orders, a SaaS CRM exposing a REST API, and a website clickstream of events. Requirements: a single trustworthy analytics store; a sales dashboard refreshed by 8 a.m. daily; a live “orders in the last 5 minutes” tile; and customer PII that must be discoverable and protected.

Produce: (a) the ingest design — which integration runtime per source, and full vs incremental/CDC for each; (b) the lake design — medallion containers, formats, and the partitioning scheme; © the batch transform engine choice with a one-line justification, including how you make the load idempotent; (d) the streaming path — Event Hubs + processor + window type for the live tile; (e) the serving layer and the Power BI storage mode with reasons; (f) the orchestration (trigger types + dependency chain) and CI/CD approach; and (g) three security/governance controls (identity, networking, Purview). Bonus: name one place the small-file problem could bite and how you would prevent it.

Certification mapping

DP-203 (Data Engineering on Microsoft Azure):

Fabric-era successors (the live exams):

(These build skills also underpin AZ-305 “design data storage solutions” scenarios.)

Glossary

Next steps

AzureDP-203Data FactorySynapseStream AnalyticsDelta Lake
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading