Azure Data Fundamentals

DP-900: Non-Relational Data & Analytics on Azure

In the previous lesson we put data into neat, predictable tables and queried it with SQL. That model is superb when your data is regular and your relationships are well known — but a great deal of the world’s data is neither. A product catalogue where every item has different attributes, a stream of clickstream events arriving a million an hour, a social graph of who-follows-whom, a folder of PDFs and videos — none of these fit comfortably into rows and columns. This is the territory of non-relational (often called NoSQL) data stores, and Azure has a rich family of them.

There is also a second, separate question this lesson answers: once an organisation has data scattered across dozens of relational databases, NoSQL stores, files and SaaS apps, how does it bring everything together to make decisions? That is the job of an analytics pipeline — ingest the data, land it cheaply in a data lake, transform it into something clean and joined-up, serve it in a query-friendly shape, and finally visualise it in a report a human can act on. These two themes — non-relational storage and the analytics pipeline — are exactly the two halves of this lesson, and together they round out the storage-and-analytics portion of the DP-900: Azure Data Fundamentals certification.

We assume you have read the core data concepts lesson (structured vs semi-structured vs unstructured, OLTP vs OLAP, batch vs streaming) and the relational data on Azure lesson. Everything here is taught from first principles, but those two give you the vocabulary the rest of this builds on.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You need only basic IT literacy and the two earlier Data Fundamentals lessons noted above; no Azure account is required to follow the concepts, though the optional lab uses a free Cosmos DB tier and a few rupees of Blob storage. This is the third Data Fundamentals lesson in the Azure Zero-to-Hero course. It deliberately stays at fundamentals depth — broad coverage, every term defined — and then hands off to the advanced Data Factory, Synapse & Fabric deep-dive and the Cosmos DB partition-key & RU optimisation deep-dive for engineers who need to build, not just understand.


Part 1 — Non-relational data

When NoSQL beats relational

A relational database earns its keep when data is structured and relationships are stable: every row has the same columns, foreign keys join tables reliably, and you need ACID transactions (the guarantee that a multi-step change either fully happens or not at all). For an order-entry system or a bank ledger, that is exactly right.

Non-relational stores trade some of those guarantees for flexibility and scale. They shine when one or more of the following is true:

| Signal in your workload | Why relational struggles | Why NoSQL fits |
|---|---|---|
| Variable / evolving schema | Adding columns and migrating is painful; sparse rows waste space. | Each item carries its own fields; no migration to add one. |
| Massive horizontal scale | A single SQL server has a ceiling; sharding is hard to bolt on. | Built to partition (shard) data across many nodes from day one. |
| Very high write throughput / low latency | Locking and joins add overhead. | Simple lookups by key are extremely fast and predictable. |
| Global users needing local latency | Geo-replication is add-on and usually single-write-region. | Some stores replicate multi-region with multi-write natively. |
| Naturally hierarchical or connected data | Deep joins to model trees/graphs get ugly. | Document and graph models store the shape directly. |

The trade-off is real: many NoSQL stores favour availability and partition tolerance over strict consistency (the famous CAP theorem — you cannot have perfect consistency, availability and partition tolerance all at once), and they generally do not do rich cross-entity JOINs or multi-table transactions the way SQL does. The senior-architect’s rule of thumb: use relational by default for transactional business data; reach for NoSQL when schema flexibility, horizontal scale, global low latency, or a graph/document shape is the dominant requirement.

The four NoSQL data models

“NoSQL” is an umbrella over four quite different shapes, and DP-900 very commonly asks about them:

| Model | Stores data as… | Good for | Azure service |
|---|---|---|---|
| Key-value | A dictionary: a unique key → an opaque value. | Caching, session state, simple lookups by id. | Azure Cosmos DB for Table, Azure Table storage |
| Document | Self-describing documents (usually JSON), each with its own fields. | Catalogues, user profiles, content, event payloads. | Azure Cosmos DB for NoSQL / for MongoDB |
| Column-family (wide-column) | Rows that can each have different, very many columns, grouped into families. | Time-series, IoT, huge sparse tables. | Azure Cosmos DB for Apache Cassandra |
| Graph | Nodes (entities) joined by edges (relationships), both with properties. | Social networks, recommendations, fraud, knowledge graphs. | Azure Cosmos DB for Apache Gremlin |

Notice that Azure Cosmos DB appears in every row. That is the headline: Cosmos DB is a single, fully managed, multi-model database service that can present any of these models through a choice of API.

Azure Cosmos DB

Azure Cosmos DB is Azure’s flagship globally distributed, multi-model NoSQL (and now also relational-via-PostgreSQL) database. As a managed PaaS service you never patch a server or manage a cluster; you pick an API, set your throughput, choose your regions, and Azure runs the rest, with single-digit-millisecond reads and writes at the 99th percentile and SLAs covering availability, latency, throughput and consistency.

The five APIs

You choose an API when you create the account, and it is effectively permanent for that account. The API decides the data model, the wire protocol, and the query language your application speaks — which matters enormously, because it lets you lift an existing app onto Cosmos DB with little or no code change.

| API | Data model | Speaks the protocol of… | Pick it when… |
|---|---|---|---|
| API for NoSQL (formerly SQL/Core) | Document (JSON) | Cosmos DB’s native SQL-like query over JSON | It is a new project: it gets every new feature first and is the default, recommended choice. |
| API for MongoDB | Document (BSON) | MongoDB | You are migrating a MongoDB app or your team knows MongoDB tooling/drivers. |
| API for Apache Cassandra | Column-family | Cassandra (CQL) | You are moving a Cassandra workload and want a managed, elastic backend. |
| API for Apache Gremlin | Graph | Apache TinkerPop / Gremlin | You have graph data, where relationships are the point (social, recommendations, fraud). |
| API for Table | Key-value | Azure Table storage | You want a premium, globally distributed, low-latency upgrade for an Azure Table storage app. |

The exam framing to remember: the API is chosen at account creation, lets you reuse existing drivers/skills, and you cannot mix models in one account. For greenfield work the answer is almost always API for NoSQL.
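To make "the API is chosen at account creation" concrete, here is a small sketch that maps each API to the `az cosmosdb create` flags we believe select it (`--kind MongoDB`, or a `--capabilities` value such as `EnableCassandra`). The helper name `api_flags` is hypothetical, and the script only prints text; it never contacts Azure. Check `az cosmosdb create --help` before relying on the exact flags.

```shell
# Hypothetical helper: print the `az cosmosdb create` flags that select each API.
# It prints text only and never calls Azure.
api_flags() {
  case "$1" in
    nosql)     echo "--kind GlobalDocumentDB" ;;
    mongodb)   echo "--kind MongoDB" ;;
    cassandra) echo "--kind GlobalDocumentDB --capabilities EnableCassandra" ;;
    gremlin)   echo "--kind GlobalDocumentDB --capabilities EnableGremlin" ;;
    table)     echo "--kind GlobalDocumentDB --capabilities EnableTable" ;;
    *)         echo "unknown API: $1" >&2; return 1 ;;
  esac
}

# Show the command line each choice would lead to (placeholders left as-is).
for api in nosql mongodb cassandra gremlin table; do
  printf '%-10s az cosmosdb create --name <account> --resource-group <rg> %s\n' \
    "$api" "$(api_flags "$api")"
done
```

Because these flags exist only on the create command, switching API later means a new account plus a data migration, which is exactly the exam point.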

Global distribution and multi-region writes

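The defining Cosmos DB capability is turnkey global distribution. With a tick-box on a world map you replicate your data to any number of Azure regions, and Cosmos transparently routes each client to the nearest one for low latency. Two modes matter: single-region writes, where one region accepts writes and every replicated region serves reads, and multi-region writes, where every region accepts writes (multi-master) at the cost of conflict resolution.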

This is what people mean when they call Cosmos DB “globally distributed”: data and write capability follow your users around the planet, with a 99.999% availability SLA when configured multi-region with multi-write.

Consistency levels

Distributed systems must trade consistency (always seeing the latest write) against latency and availability. Cosmos DB is unusual in offering five well-defined consistency levels as a simple setting, from strongest to weakest:

| Level | Guarantee (plain English) | Trade-off |
|---|---|---|
| Strong | Every read sees the most recent committed write, everywhere. | Highest latency; not available on accounts with multi-region writes. |
| Bounded staleness | Reads lag the latest write by at most K versions or T seconds: a bounded “behind”. | Tunable freshness vs performance. |
| Session (default) | Within a single client session you always read your own writes, in order. | Best balance for most apps; the sensible default. |
| Consistent prefix | You never see writes out of order, but may see an older snapshot. | Cheaper, lower latency. |
| Eventual | Replicas converge “eventually”; reads may be stale and unordered. | Lowest latency and cost; weakest guarantee. |

You rarely need Strong globally; Session is the workhorse default and the one to quote in an interview.

Request Units (RUs) — the throughput and cost currency

This is the single most important Cosmos DB concept for DP-900. Cosmos DB does not bill you per CPU or per query type. Instead, every operation (a read, a write, a query) costs some number of Request Units (RUs), a normalised currency that blends CPU, memory and IOPS into one number. A point read of a 1 KB item (fetched by its id and partition key) costs 1 RU; writes and complex queries cost more.
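A back-of-the-envelope sketch of RU arithmetic may help. It assumes small items of about 1 KB, uses the 1 RU read cost quoted in this lesson, and takes 5 RU per write as an assumed ballpark rather than a guaranteed figure; the function name and the workload numbers are made up for illustration.

```shell
# Rough RU/s estimate for a steady workload of small (about 1 KB) items.
# READ_RU=1 is the figure quoted in this lesson for reading a 1 KB item.
# WRITE_RU=5 is an ASSUMED ballpark, not a guaranteed cost.
READ_RU=1
WRITE_RU=5

# Usage: estimate_rus <reads per second> <writes per second>
estimate_rus() {
  echo $(( $1 * READ_RU + $2 * WRITE_RU ))
}

# Hypothetical workload: 300 reads/s and 40 writes/s.
echo "Estimated throughput needed: $(estimate_rus 300 40) RU/s"
```

Real costs depend on item size, indexing and query shape, so treat the result as a starting point and confirm it against the request charge that Cosmos DB reports for your own operations.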

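You set throughput as RUs per second (RU/s) on a container (or on a database shared by its containers), and in the provisioned modes that is what you pay for; serverless instead bills only the RUs you actually consume. There are three capacity modes: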

| Mode | How you pay | Best for |
|---|---|---|
| Provisioned throughput | You reserve a fixed RU/s (e.g. 400 RU/s), billed whether used or not. | Steady, predictable traffic. |
| Autoscale | You set a maximum; Cosmos scales between 10% and 100% of it automatically. | Spiky or unpredictable traffic, with no manual tuning. |
| Serverless | You pay per RU consumed, nothing when idle. | Dev/test, intermittent or low-traffic workloads. |
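The autoscale row is easy to check with a line of arithmetic. This minimal sketch simply applies the 10% to 100% rule stated above; the function name `autoscale_range` is hypothetical.

```shell
# Autoscale moves between 10% and 100% of the maximum you set.
# Usage: autoscale_range <max RU/s>
autoscale_range() {
  echo "$(( $1 / 10 ))-$1 RU/s"
}

autoscale_range 1000   # prints: 100-1000 RU/s
autoscale_range 4000   # prints: 400-4000 RU/s
```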

If you exceed your provisioned RU/s, requests are throttled (HTTP 429 “request rate too large”) and must back off and retry — so RU planning and good partition-key design (so load spreads evenly) are the heart of running Cosmos DB well. That depth is the subject of the dedicated Cosmos DB partition-key & RU optimisation lesson; for DP-900 you simply need to know RUs are the currency, you provision RU/s, and over-running them causes throttling.
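The "back off and retry" behaviour can be sketched without touching Azure. Below, `fake_request` is a hypothetical stand-in that fails twice (our stand-in for HTTP 429) and then succeeds, and `retry_with_backoff` doubles the wait after each failure. This is an illustration of the pattern only, not how any particular SDK implements it.

```shell
# Hypothetical stand-in for a Cosmos DB call: it fails (our stand-in for
# HTTP 429) on the first two attempts, then succeeds.
attempts=0
fake_request() {
  attempts=$(( attempts + 1 ))
  if [ "$attempts" -lt 3 ]; then
    echo "attempt $attempts: 429 request rate too large" >&2
    return 1
  fi
  echo "attempt $attempts: 200 OK"
}

# Usage: retry_with_backoff <max tries> <first delay in seconds> <command...>
# Runs the command until it succeeds, doubling the wait after each failure.
retry_with_backoff() {
  max_tries=$1
  delay=$2
  shift 2
  try=1
  while [ "$try" -le "$max_tries" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$try" -lt "$max_tries" ]; then
      sleep "$delay"
      delay=$(( delay * 2 ))
    fi
    try=$(( try + 1 ))
  done
  echo "giving up after $max_tries tries" >&2
  return 1
}

retry_with_backoff 5 1 fake_request
```

In practice the official Cosmos DB SDKs generally retry throttled requests for you, so hand-written loops like this matter mostly for scripts and custom clients.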

Azure Storage

Before Cosmos DB existed, and still for an enormous range of jobs, the workhorse non-relational store on Azure is the humble storage account. A single storage account is a namespace that bundles four distinct data services, each a different shape of unstructured or semi-structured storage:

| Service | What it stores | Typical use | Access pattern |
|---|---|---|---|
| Blob storage | Binary Large OBjects: any file (images, video, backups, logs, Parquet). | The default place for unstructured data and the data lake (see Part 2). | REST/HTTPS, SDKs; URL per blob. |
| Table storage | A simple key-value / wide-column NoSQL store (partition key + row key). | Cheap, massive, schemaless lookup tables; metadata. | Key lookups; no joins. |
| File storage (Azure Files) | Fully managed SMB/NFS file shares in the cloud. | Lift-and-shift file shares; shared config; mount as a drive. | Mounted like a network drive. |
| Queue storage | A simple message queue for asynchronous work. | Decoupling app tiers; buffering work items. | Put/get messages, ~64 KB each. |

A few fundamentals to know for the exam:

Azure Storage is covered exhaustively in the storage accounts deep-dive; here, the fundamentals point is simply one account, four services — Blob, Table, File, Queue — and Blob is where your data lake lives.
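"One account, four services" shows up directly in the endpoint names. The sketch below builds the default public endpoints, assuming the public Azure cloud with no custom domain; the account name, container and blob path are hypothetical.

```shell
# Build the default public endpoint for a service in a storage account.
# Assumes the public Azure cloud; all names below are hypothetical.
# Usage: storage_endpoint <account> <service>   (blob | table | file | queue)
storage_endpoint() {
  echo "https://$1.$2.core.windows.net"
}

account="kvlake12345"
for service in blob table file queue; do
  storage_endpoint "$account" "$service"
done

# "URL per blob": endpoint + container + blob name.
echo "$(storage_endpoint "$account" blob)/bronze/sales/2024-01-01.csv"
```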


Part 2 — Analytics on Azure

We now switch from storing operational data to making sense of it all together. This is analytics: turning raw, scattered data into insight a human or a model can act on.

The modern analytics pipeline

Every analytics platform on Azure — whatever the branding — implements the same five-stage pipeline. Fix this mental model and the services fall into place:

| Stage | What happens | Azure services that own it |
|---|---|---|
| 1. Ingest | Pull/copy data from many sources (databases, APIs, files, streams) into the platform. | Azure Data Factory, Synapse pipelines, Microsoft Fabric Data Factory; Event Hubs / Stream Analytics for streaming. |
| 2. Store | Land it cheaply and at scale in a data lake, raw, before any cleaning. | Azure Data Lake Storage Gen2 (Blob + hierarchical namespace); Fabric OneLake. |
| 3. Transform | Clean, join, deduplicate, aggregate: turn raw into trustworthy. | Synapse Spark/SQL pools, Azure Databricks, Data Factory data flows, Fabric. |
| 4. Serve | Present the cleaned data in a query-friendly shape (a warehouse / model). | Synapse dedicated SQL pool / Fabric Warehouse / a relational warehouse. |
| 5. Visualise | Build reports and dashboards humans read and act on. | Power BI (reports, dashboards). |

Read it as a sentence: ingest the data, store it in a lake, transform it into something clean, serve it in a warehouse, and visualise it in Power BI. Almost every interview question about “the Azure data platform” is really asking you to recite and place services onto this pipeline.

The data lake (stage 2 in depth)

A data lake is a single, massively scalable store that holds data of any structure — structured tables, JSON, images, Parquet — in its raw form, cheaply, before you decide what to do with it. On Azure the lake is Azure Data Lake Storage Gen2 (ADLS Gen2), which is simply a Blob storage account with a “hierarchical namespace” turned on (giving it real directories and file-level security, which big-data engines need).

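The lake is usually organised into the medallion architecture, three quality layers that data flows through: bronze (raw data exactly as it landed), silver (cleaned, deduplicated and conformed) and gold (aggregated, business-ready data for reporting).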

Contrast the lake with a data warehouse: a lake stores raw, any-shape data cheaply and applies structure on read (“schema-on-read”); a warehouse stores cleaned, structured data with structure defined on write (“schema-on-write”) for fast SQL analytics. Modern platforms use both — lake for cheap raw storage and flexibility, warehouse to serve curated data fast — and the blend is increasingly called a lakehouse.
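A toy version of a lake on local disk can make the layering tangible. This sketch uses a temporary directory in place of ADLS Gen2 and the conventional medallion layer names (bronze for raw, silver for cleaned, gold for aggregated); the sales data is invented, and `awk` stands in for the Spark or SQL engine a real platform would use.

```shell
# Toy lake on local disk: bronze (raw) -> silver (cleaned) -> gold (aggregated).
# A temp directory stands in for the lake; the data is made up.
lake=$(mktemp -d)
mkdir -p "$lake/bronze" "$lake/silver" "$lake/gold"

# Bronze: land the raw file untouched (duplicate and bad row included).
cat > "$lake/bronze/sales.csv" <<'EOF'
order_id,category,amount
1,books,250
2,toys,100
2,toys,100
3,books,abc
4,toys,50
EOF

# Silver: drop the header, exact duplicates and rows with a non-numeric amount.
awk -F, 'NR > 1 && $3 ~ /^[0-9]+$/ && !seen[$0]++' \
  "$lake/bronze/sales.csv" > "$lake/silver/sales_clean.csv"

# Gold: total sales per category, ready for a report.
awk -F, '{ total[$2] += $3 } END { for (c in total) print c "," total[c] }' \
  "$lake/silver/sales_clean.csv" | sort > "$lake/gold/sales_by_category.csv"

cat "$lake/gold/sales_by_category.csv"
```

Note that structure is applied only when the silver step reads the file (schema-on-read), and the untouched bronze copy means the later layers can always be rebuilt.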

ETL vs ELT

The transform stage comes in two flavours, and the difference is a perennial exam favourite. Both move data from sources into a destination and clean it; they differ in when the transform happens relative to the load.

| | ETL (Extract → Transform → Load) | ELT (Extract → Load → Transform) |
|---|---|---|
| Order | Transform data before loading it into the destination. | Load raw data first, then transform inside the destination. |
| Where transform runs | A separate processing engine en route. | The powerful destination (data lake / warehouse) itself. |
| Raw data kept? | Often not; only the transformed result lands. | Yes; raw lands first (great for re-processing and audit). |
| Best when | Sensitive data must be cleansed/masked before it lands; smaller, on-prem-style volumes. | Big data and the cloud: scale the cheap lake, transform with elastic compute. |
| Classic on Azure | Data Factory mapping data flows / SSIS. | Land in ADLS Gen2, transform with Spark/Synapse/Databricks. |

Why the cloud pushed everyone to ELT: cloud storage is cheap and effectively limitless, and cloud compute is elastic, so it is now cheaper and more flexible to dump everything raw into the lake first and transform later with on-demand power — keeping the raw copy so you can always re-derive results when requirements change. ETL still wins when governance or compliance demands that data be masked or cleansed before it ever lands in the platform. The one-line answer: ETL transforms before load (clean-then-store); ELT loads then transforms (store-then-clean); the cloud favours ELT because cheap, scalable storage and elastic compute make load-first the natural pattern.
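The difference in ordering is easy to see on a tiny file. In this sketch the "transform" is masking an email column; the directories, file names and customer rows are all hypothetical, and local folders stand in for the real destination.

```shell
# Toy contrast between ETL and ELT on a made-up customer file.
# "Transform" here means masking the email column; all paths are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/source" "$work/etl_dest" "$work/elt_dest/raw" "$work/elt_dest/clean"
printf 'id,email\n1,asha@example.com\n2,ravi@example.com\n' > "$work/source/customers.csv"

# Mask everything before the @ in column 2, leaving the header row alone.
mask_email() {
  awk -F, 'BEGIN { OFS = "," } NR > 1 { sub(/^[^@]+/, "***", $2) } { print }' "$1"
}

# ETL: transform in flight, so only the masked result ever lands.
mask_email "$work/source/customers.csv" > "$work/etl_dest/customers.csv"

# ELT: land the raw copy first, then transform inside the destination.
cp "$work/source/customers.csv" "$work/elt_dest/raw/customers.csv"
mask_email "$work/elt_dest/raw/customers.csv" > "$work/elt_dest/clean/customers.csv"

echo "ETL destination holds:"; ls "$work/etl_dest"
echo "ELT destination holds:"; ls "$work/elt_dest"
```

The ETL destination never sees an unmasked address, which is the compliance argument; the ELT destination keeps the raw copy, which is the re-processing argument.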

The Azure analytics services

Here are the services you must recognise at DP-900 level, each placed on the pipeline:

Which tool for which job?

This table is the payoff — the one to internalise for both the exam and real architecture conversations:

| You need to… | Use | Why |
|---|---|---|
| Copy / move / orchestrate data from many sources on a schedule | Azure Data Factory | Purpose-built, low-code ingest and pipeline orchestration with 90+ connectors. |
| Store raw data of any shape, cheaply, at scale | Azure Data Lake Storage Gen2 | The lake: cheap, limitless, big-data-engine friendly. |
| Run an existing unified analytics platform (the previous generation) | Azure Synapse Analytics | Pipelines + Spark + SQL warehouse in one studio. |
| Start a new analytics project with everything unified as SaaS | Microsoft Fabric | All-in-one over OneLake; Microsoft’s strategic direction. |
| Do heavy Spark data engineering / data science / ML | Azure Databricks | Best-in-class managed Spark + collaborative notebooks. |
| Build reports and dashboards for business users | Power BI | The visualisation and self-service BI layer. |
| Process real-time streams | Azure Stream Analytics / Event Hubs / Fabric Real-Time | Continuous query over data in motion (the streaming counterpart). |

The architect’s summary: Data Factory ingests, Data Lake Gen2 stores, Synapse/Fabric/Databricks transform and serve, and Power BI visualises — with Fabric the unified, SaaS, go-forward choice for greenfield. The full depth lives in the Data Factory, Synapse & Fabric deep-dive.

[Figure: Non-relational data & analytics on Azure]

The diagram above stitches both halves together: the non-relational stores on the left (Cosmos DB’s five APIs and the four Azure Storage services) feeding the five-stage analytics pipeline on the right — ingest, lake, transform, serve, and finally Power BI.

Hands-on lab

A tiny, free hands-on to make both halves concrete: create a free-tier Cosmos DB account and a Blob container (your “data lake” landing zone), and confirm both work. We use the Azure CLI so the steps are copy-pasteable; everything here stays inside free limits or costs a few rupees.

1. Create a resource group

az group create \
  --name rg-dp900-nosql \
  --location centralindia

2. Create a free-tier Cosmos DB account (API for NoSQL)

The --enable-free-tier true flag gives you the first 1000 RU/s and 25 GB free on one account per subscription — perfect for learning.

az cosmosdb create \
  --name kvcosmos$RANDOM \
  --resource-group rg-dp900-nosql \
  --locations regionName=centralindia \
  --enable-free-tier true \
  --default-consistency-level Session

Note the account name printed in the output. Expected output: a JSON object with "provisioningState": "Succeeded" and "enableFreeTier": true.

3. Create a database and a container with autoscale throughput

# Replace <account> with the name from step 2
az cosmosdb sql database create \
  --account-name <account> \
  --resource-group rg-dp900-nosql \
  --name RetailDB

az cosmosdb sql container create \
  --account-name <account> \
  --resource-group rg-dp900-nosql \
  --database-name RetailDB \
  --name Products \
  --partition-key-path "/category" \
  --max-throughput 1000

Here --partition-key-path "/category" chooses category as the partition key (how data is spread for scale), and --max-throughput 1000 sets autoscale up to 1000 RU/s — inside the free allowance.
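Before settling on a partition key it is worth checking how evenly its values are spread. This sketch counts sample items per `category` value; the items are invented, and item counts are only a rough proxy, since real load also depends on which partitions your requests actually hit.

```shell
# Count items per partition-key value to spot a lopsided ("hot") key.
# The sample items and the choice of /category are illustrative only.
items='books,Azure Basics
books,Data 101
books,Cloud Patterns
books,SQL Primer
toys,Robot Kit
garden,Hose Reel'

# Input: lines of "<partition key value>,<item name>".
# Output: "<count> <value>", largest count first.
partition_counts() {
  printf '%s\n' "$1" | cut -d, -f1 | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

partition_counts "$items"
```

Here `books` holds four of the six items, so a catalogue like this would concentrate storage and traffic on one partition; a key with more distinct, evenly used values would spread it better.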

4. Create a Blob storage “data lake” landing container

az storage account create \
  --name kvlake$RANDOM \
  --resource-group rg-dp900-nosql \
  --location centralindia \
  --sku Standard_LRS \
  --kind StorageV2

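# Use the storage-account name printed above.
# --auth-mode login signs in with your Microsoft Entra ID (Azure AD) identity,
# which needs a data-plane role such as Storage Blob Data Contributor on the
# account; being Owner of the subscription is not enough on its own.
az storage container create \
  --account-name <storage-account> \
  --name bronze \
  --auth-mode login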

Validation

# Confirm the Cosmos container and its partition key
az cosmosdb sql container show \
  --account-name <account> --resource-group rg-dp900-nosql \
  --database-name RetailDB --name Products \
  --query "{name:name, pk:resource.partitionKey.paths[0]}" -o table

# Confirm the Blob container exists
az storage container show \
  --account-name <storage-account> --name bronze \
  --auth-mode login --query name -o tsv

You should see the Products container with partition key /category, and bronze returned for the Blob container. You have just built, in miniature, a non-relational store and the landing layer of a data lake.

Cleanup — delete the whole resource group so nothing keeps billing:

az group delete --name rg-dp900-nosql --yes --no-wait

Cost note (INR): the Cosmos DB account is on the free tier (₹0 for the first 1000 RU/s and 25 GB). The Standard LRS storage account costs roughly ₹1.6–₹2 per GB per month for Hot blobs, and you stored nothing, so the lab total is effectively ₹0 if you clean up the same day. Always run the cleanup step — an idle provisioned-throughput Cosmos container (outside free tier) is the classic surprise on a learner’s bill.

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| HTTP 429 “request rate too large” from Cosmos DB | You exceeded provisioned RU/s (often a hot partition). | Raise RU/s or switch to autoscale; choose a better-distributed partition key; implement retry-with-backoff. |
| Want to change a Cosmos account from MongoDB to NoSQL API | The API is fixed at account creation. | Create a new account with the desired API and migrate the data. |
| Blob “Archive” data won’t read | Archive tier is offline; blobs must be rehydrated first. | Rehydrate to Hot/Cool (hours) before reading, or keep frequently read data in Hot/Cool. |
| Big-data engine can’t do directory operations on your lake | You used a plain Blob account, not ADLS Gen2 (no hierarchical namespace). | Enable the hierarchical namespace: set it at creation, or upgrade the existing account (a one-way change that cannot be reverted). |
| Data warehouse load is slow and expensive | You forced ETL on huge volumes through a small engine. | Switch to ELT: land raw in the lake, transform with elastic Spark/SQL compute. |
| Power BI report is stale | Dataset refresh not scheduled, or source not refreshed. | Configure scheduled refresh; for big models consider DirectQuery / Direct Lake. |
| Surprise Cosmos bill on an idle dev database | Provisioned RU/s bills whether used or not. | Use serverless or autoscale for dev/test; delete idle resources. |
| Choosing Synapse for a brand-new project | Defaulting to the previous generation. | For greenfield, evaluate Microsoft Fabric (the strategic direction) first. |

Best practices

Security notes

Interview & exam questions

  1. When would you choose a non-relational store over a relational one? When the schema is variable/evolving, you need massive horizontal scale or very low-latency key lookups, you need global multi-write low latency, or the data is naturally a document/graph — and you don’t need rich cross-table JOINs or multi-table ACID transactions.
  2. Name the four NoSQL data models and an Azure service for each. Key-value (Table storage / Cosmos Table), document (Cosmos DB for NoSQL/MongoDB), column-family (Cosmos DB for Cassandra), graph (Cosmos DB for Gremlin).
  3. What are the five Cosmos DB APIs and when is each chosen? NoSQL (new projects, default), MongoDB (migrate Mongo apps), Cassandra (migrate Cassandra), Gremlin (graph data), Table (upgrade Azure Table apps). The API is fixed at account creation.
  4. What is a Request Unit (RU)? A normalised currency blending CPU, memory and IOPS; every operation costs RUs (a 1 KB read = 1 RU). You provision RU/s; exceeding it causes 429 throttling.
  5. Explain Cosmos DB consistency levels. Strong, Bounded staleness, Session (default), Consistent prefix, Eventual — a spectrum trading freshness for latency/availability. Session is the usual default.
  6. What is global distribution / multi-region writes? Replicating data to multiple regions for local latency; multi-region writes let every region accept writes (multi-master), surviving regional outages at the cost of conflict resolution.
  7. What are the four Azure Storage services? Blob (objects/files), Table (key-value NoSQL), File (SMB/NFS shares), Queue (messages).
  8. What are Blob access tiers and why do they matter? Hot, Cool, Cold, Archive — they match storage/access cost to access frequency; Archive is offline and must be rehydrated. They are a major cost lever.
  9. Describe the five stages of an analytics pipeline and name a service per stage. Ingest (Data Factory), store (Data Lake Gen2), transform (Synapse/Databricks/Fabric), serve (warehouse), visualise (Power BI).
  10. ETL vs ELT — what’s the difference and why did the cloud favour ELT? ETL transforms before loading; ELT loads raw then transforms in the destination. The cloud favours ELT because cheap, limitless storage plus elastic compute make “load-first” cheaper and more flexible, while keeping the raw copy.
  11. What is a data lake, and how does it differ from a data warehouse? A lake stores raw, any-shape data cheaply with schema-on-read; a warehouse stores cleaned, structured data with schema-on-write for fast SQL. ADLS Gen2 is the Azure lake; combining both is a lakehouse.
  12. Synapse, Fabric, or Databricks — which for a new project? For greenfield, evaluate Microsoft Fabric first (the unified SaaS, go-forward direction over OneLake); use Databricks for heavy custom Spark/ML; keep Synapse for existing estates.

Quick check

  1. Which Cosmos DB API would you choose for a brand-new application, and why?
  2. True or false: you can change a Cosmos DB account’s API after it is created.
  3. What does a Request Unit measure, and what happens if you exceed your provisioned RU/s?
  4. Put these in pipeline order: serve, ingest, visualise, transform, store.
  5. In ELT, does the transform happen before or after the data is loaded into the destination?

Answers

  1. API for NoSQL — it is the native, default API and receives every new feature first.
  2. False — the API is fixed at account creation; migrate to a new account to change it.
  3. An RU is a normalised throughput currency (CPU + memory + IOPS); exceeding provisioned RU/s causes HTTP 429 throttling, requiring back-off and retry.
  4. Ingest → store → transform → serve → visualise.
  5. After — ELT loads raw data first, then transforms it inside the destination (lake/warehouse).

Exercise

Imagine a retail company with: (a) a fast-growing product catalogue where items have wildly different attributes; (b) a clickstream of millions of website events per hour; (c) a folder of product images and PDFs; and (d) a need for executives to see daily sales dashboards.

For each of (a)–(d), write down which Azure service you would use and one sentence of justification. Then sketch, in five boxes, the end-to-end analytics pipeline that would carry the clickstream from the website all the way to an executive dashboard, labelling each stage. (Suggested answer: (a) Cosmos DB for NoSQL, for its flexible document schema; (b) ingest via Event Hubs/Data Factory into ADLS Gen2, for scale and cheap raw storage; (c) Blob storage, for unstructured files; (d) Power BI, for interactive dashboards. Pipeline: ingest → data lake (bronze) → transform (Spark/Synapse to silver/gold) → serve (warehouse) → visualise (Power BI).)

Certification mapping

This lesson maps to the DP-900: Microsoft Azure Data Fundamentals certification, principally the exam areas “Describe considerations for working with non-relational data on Azure” (Cosmos DB, its APIs, and Azure Storage — Blob, Table, File, Queue) and “Describe an analytics workload on Azure” (the ingest → store → transform → serve → visualise pipeline, ETL vs ELT, data lakes, Data Factory, Synapse, Fabric, Databricks and Power BI). It is also a useful on-ramp to DP-203 / DP-700 (data engineering) and PL-300 (Power BI data analyst).

Glossary

Next steps

You now understand non-relational data and the analytics pipeline at fundamentals depth — the last big pillar of DP-900’s storage-and-analytics content. To go deeper:
