Almost every system you will ever build in Azure exists to do one of two things with data: record what happened (a customer placed an order, a sensor took a reading, a payment cleared) or make sense of what happened (which products sell best, which machines are about to fail, which customers are about to leave). Those two jobs pull a design in opposite directions, and almost every data decision you make for the rest of your career — which database, which storage, which file format, which Azure service — flows from understanding the difference.
This is the opening lesson of the DP-900: Azure Data Fundamentals track, and the foundation every later data lesson builds on. It assumes you have never written a line of SQL or designed a database in your life. By the end you will be able to look at any workload and say, with confidence, “that is structured, transactional, relational data — it belongs in a relational database” or “that is unstructured, analytical data — it belongs in a data lake”, and you will know who in a data team is responsible for each piece. Get these mental models right and the rest of DP-900 is largely vocabulary; get them wrong and nothing downstream will make sense.
Learning objectives
After this lesson you will be able to:
- Classify any dataset as structured, semi-structured or unstructured, and explain what each means for storage and querying.
- Distinguish relational from non-relational data and transactional (OLTP) from analytical (OLAP) workloads, and say why one database rarely serves both well.
- Describe the data lifecycle and the difference between batch and stream processing, with realistic Azure examples of each.
- Explain the modern data warehouse pattern and the medallion (bronze/silver/gold) architecture at a concept level.
- Identify the four core data roles — database administrator, data engineer, data analyst and data scientist — and who owns what.
- Recognise common data formats (CSV, JSON, XML, Parquet, Avro, ORC) and choose the right one for transactional versus analytical use.
Prerequisites
None beyond curiosity. It helps to have read Cloud Computing Fundamentals: IaaS, PaaS, SaaS & the Shared-Responsibility Model so that terms like PaaS and managed service are familiar, but this lesson defines everything it needs. This is Module 1, Lesson 1 of the DP-900 Data Fundamentals course in the KloudVin Azure Zero-to-Hero program.
How we classify data: structured, semi-structured, unstructured
The very first question to ask about any data is how regular is its shape? Data falls on a spectrum from rigidly tabular to completely freeform, and where it sits decides how you store it, how you query it and how much it costs to work with.
Structured data has a fixed, predefined schema — rows and columns, where every row has the same fields and every field has a declared type (an integer, a date, a 50-character string). A spreadsheet of employees is the everyday example; a relational database table is the production one. Because the shape is known in advance, structured data is easy to query precisely with a language like SQL and easy for the database to index and optimise.
Semi-structured data carries its own structure with it but does not force every record to look the same. A JSON document describing a product might have a discount field that some products simply omit; another might nest a list of reviews inside it. The structure is there — keys, values, nesting, arrays — but it is flexible and self-describing rather than fixed up front. This flexibility is exactly why semi-structured formats dominate web APIs and document databases.
Unstructured data has no inherent field-and-value model at all: photos, video, audio, PDFs, free-text emails, scanned documents. The bytes mean something to a human or to an AI model, but there is no schema a database can index on directly. You typically store the file as a blob and store metadata about it (name, size, tags, who uploaded it) separately. This is the fastest-growing category by far — most of the world’s data is now unstructured.
| Type | Schema | Examples | Where it lives in Azure | How you query it |
|---|---|---|---|---|
| Structured | Fixed, defined up front | Orders table, ledger, IoT readings in columns | Azure SQL Database, PostgreSQL/MySQL | SQL |
| Semi-structured | Self-describing, flexible | JSON/XML API payloads, product catalogues | Azure Cosmos DB, Azure Table storage | SQL-like / document queries |
| Unstructured | None (raw bytes + metadata) | Images, video, audio, PDFs, free text | Azure Blob Storage, Data Lake Storage Gen2 | Search, AI/ML, full-text |
A useful instinct: if you can draw it as a clean table, it is structured; if it is a document or message with optional, nested fields, it is semi-structured; if it is a file whose meaning lives in the bytes, it is unstructured.
Relational versus non-relational
Closely related to “shape” is the question of how records connect to each other.
A relational database stores data in tables and models relationships between them using keys. A Customers table and an Orders table are linked because each order carries a CustomerID that points back to a customer. The schema is enforced strictly: define a column as a date and the database refuses to store the word “Tuesday”. Relational databases excel when data is well-structured, relationships matter and you need strong consistency and complex joins — think finance, inventory, bookings. On Azure these are Azure SQL Database, Azure SQL Managed Instance, and Azure Database for PostgreSQL and MySQL.
A non-relational (often called NoSQL) database relaxes the fixed-schema rule and stores data in whatever shape suits the access pattern: documents (JSON), key-value pairs, wide columns or graphs. It trades rich cross-table joins and rigid schema for flexibility, massive horizontal scale and very fast reads/writes on known keys. It shines for catalogues, user profiles, IoT telemetry, gaming leaderboards and anything that must scale globally. On Azure the flagship is Azure Cosmos DB, alongside Azure Storage’s Table and Blob services.
Neither is “better” — they answer different questions. The exam loves a scenario that pushes you to pick: strong relationships and transactions → relational; flexible schema and global scale → non-relational.
Transactional versus analytical: OLTP vs OLAP
This is the single most important distinction in the whole lesson, and a guaranteed exam topic.
OLTP — Online Transaction Processing — is the world of recording what happens, right now. Each operation is small and touches a few rows: insert one order, update one balance, read one customer’s profile. There are many such operations per second, they must be fast, and they must be atomic and consistent — money must never disappear between two accounts. OLTP systems are optimised for lots of short read/write transactions and are usually relational, normalised (data stored once, without duplication) to keep writes clean and correct. Your e-commerce checkout, your banking app and your booking system are OLTP.
OLAP — Online Analytical Processing — is the world of understanding what happened, over time. A query here is large and read-only: “total revenue by product category by month for the last three years.” It scans millions or billions of rows but runs rarely compared with OLTP. OLAP systems are optimised for big aggregations and are usually denormalised (data deliberately duplicated and pre-joined into wide tables) and often column-oriented so that summing one column over a billion rows is cheap. Reporting, dashboards and business intelligence are OLAP.
The reason you cannot serve both well from one database is that their optimal designs are opposites: OLTP wants normalised tables and row storage for fast small writes; OLAP wants denormalised, columnar storage for fast huge reads. So in practice data is born in an OLTP system and copied into an OLAP system for analysis — which is exactly what the data lifecycle and the modern data warehouse describe.
| Aspect | OLTP (transactional) | OLAP (analytical) |
|---|---|---|
| Purpose | Run the business (record events) | Understand the business (analyse) |
| Typical operation | Small insert/update/read of a few rows | Large read-only aggregation over millions of rows |
| Frequency | Very high (many per second) | Lower (reports, dashboards) |
| Schema design | Normalised, row-oriented | Denormalised, often column-oriented |
| Consistency need | Strong (ACID transactions) | Read consistency; freshness can lag |
| Azure examples | Azure SQL Database, Cosmos DB, PostgreSQL | Azure Synapse Analytics, Microsoft Fabric, Databricks |
A quick memory hook: OLTP = Production/everyday transactions; OLAP = Analysis.
ACID: the promise that makes transactions safe
When people say a transactional database is “reliable”, they usually mean it guarantees ACID properties for every transaction:
- Atomicity — all the steps in a transaction happen, or none do. Debiting one account and crediting another either both succeed or both roll back; money never vanishes mid-transfer.
- Consistency — a transaction moves the database from one valid state to another, never leaving it half-broken or violating its rules.
- Isolation — concurrent transactions do not tread on each other; the result is as if they ran one after another.
- Durability — once a transaction is committed, it survives even a power cut or crash.
ACID is the hallmark of OLTP relational systems. Many non-relational systems relax some of these guarantees (for example offering eventual consistency) in exchange for scale and availability — a trade-off you choose deliberately based on the workload.
The data lifecycle
Data is not static; it moves through stages from creation to consumption. Understanding this flow is what ties every Azure data service together.
- Generate / capture — data is created by an application, device or user (an order is placed, a sensor emits a reading).
- Ingest — it is brought into a data platform, either in bulk on a schedule or continuously as events arrive.
- Store — it lands in an appropriate store: a relational database for transactions, a data lake for raw analytical data, a warehouse for curated data.
- Process / transform — it is cleaned, joined, reshaped and enriched into a form fit for analysis (this is where ETL/ELT lives — covered in the analytics lesson).
- Serve / analyse — it is queried by analysts, fed to dashboards or used to train machine-learning models.
- Visualise / act — insights reach humans (a Power BI dashboard) or drive automated decisions.
- Archive / retire — old data is moved to cheap cold storage or deleted per retention policy.
The two ingestion styles in step 2 — bulk-on-a-schedule versus continuous — are important enough to have their own names: batch and streaming.
Batch versus stream processing
Batch processing collects data over a period and processes it together in scheduled runs. You might load yesterday’s sales every night at 02:00, or reprocess a month of logs in one large job. Batches handle huge volumes efficiently and tolerate latency — nobody minds that last night’s sales report is a few hours old. The trade-off is freshness: results are only as current as the last run.
Stream processing handles data continuously, as each event arrives, with latency measured in seconds or milliseconds. A fraud-detection system scoring each card swipe, a live operations dashboard, or alerting on a sudden temperature spike are streaming workloads. Streaming gives you near-real-time insight but is more complex to build and operate, and typically processes smaller pieces of data at a time.
| Dimension | Batch | Streaming |
|---|---|---|
| When it runs | Scheduled / on demand, in chunks | Continuously, event by event |
| Latency | Minutes to hours | Seconds to milliseconds |
| Data size per run | Large (the whole batch) | Small (one event / micro-batch) |
| Complexity & cost | Lower, simpler | Higher, always-on |
| Azure services | Azure Data Factory, Synapse/Fabric pipelines, Databricks | Azure Stream Analytics, Event Hubs, Functions |
| Example | Nightly sales load, monthly billing | Live fraud scoring, IoT telemetry alerts |
Many real platforms use both: streaming for the live view and batch for the deep, historical reprocessing — an idea sometimes called a lambda architecture. For DP-900, the key is simply being able to tell a batch scenario from a streaming one by its latency requirement.
The modern data warehouse and the medallion idea
When organisations need to analyse data from many sources together, they build a modern data warehouse. The pattern is consistent regardless of the tool:
Ingest data from operational systems (databases, apps, files, streams) → store the raw data cheaply in a data lake (Azure Data Lake Storage Gen2) → transform it into clean, modelled tables → serve it from an analytical engine (Azure Synapse Analytics or Microsoft Fabric) → visualise it in Power BI. A data lake is simply massive, cheap storage that holds data of any shape — structured, semi-structured and unstructured — in its raw form, so you keep everything and decide how to use it later. A data warehouse is the curated, query-optimised store of clean, structured data ready for reporting.
A popular way to organise the lake is the medallion architecture, which refines data through three quality layers:
- Bronze — raw, untouched data exactly as ingested (your immutable source of truth).
- Silver — cleaned, validated, de-duplicated and conformed data.
- Gold — business-level, aggregated tables shaped for specific reports and dashboards.
Data flows bronze → silver → gold, getting more refined and more valuable at each step. You do not need the implementation detail for DP-900 — just the concept that raw data is progressively cleaned and curated, and that the lake holds everything while the warehouse holds the polished result.
The diagram above ties it together: the same event is captured by an OLTP system, lands raw in the data lake, is refined through the medallion layers, and is finally served for OLAP analysis and visualisation — with each of the four data roles working at a different stage of that journey.
The four data roles: who does what
Data work in any serious organisation is divided among four roles. DP-900 expects you to match a responsibility to the right one.
Database administrator (DBA). Owns the health of operational databases. Provisions and configures them, manages security and access, plans backups and disaster recovery, tunes performance, applies patches and keeps the lights on. The DBA’s world is mostly the OLTP side — keeping production databases fast, secure and available. Azure tools: Azure SQL Database, SQL Managed Instance, backups, security tooling.
Data engineer. Builds the pipelines that move and transform data. They ingest data from many sources, design the data lake and warehouse, write the ETL/ELT that turns raw data into clean, usable datasets, and make sure analysts and scientists have trustworthy data to work with. The data engineer is the plumber of the data world. Azure tools: Azure Data Factory, Synapse/Fabric pipelines, Data Lake Storage Gen2, Databricks.
Data analyst. Turns curated data into insight for business decisions. They explore data, build reports and dashboards, define metrics and tell the story of what the numbers mean — usually for non-technical stakeholders. Azure tools: Power BI, Synapse/Fabric SQL queries.
Data scientist. Applies statistics and machine learning to find patterns and make predictions — forecasting demand, detecting fraud, predicting churn, building models. They work with both raw and curated data and need experimentation tooling. Azure tools: Azure Machine Learning, Databricks, notebooks.
| Role | Core question they answer | Typical tasks | Azure tools |
|---|---|---|---|
| Database administrator | “Is the database healthy, secure and fast?” | Provision, secure, back up, tune, patch | Azure SQL, Managed Instance, backups |
| Data engineer | “How do I move and shape this data reliably?” | Build pipelines, ETL/ELT, lake/warehouse | Data Factory, Synapse/Fabric, Data Lake, Databricks |
| Data analyst | “What is the data telling the business?” | Reports, dashboards, metrics, exploration | Power BI, Synapse/Fabric SQL |
| Data scientist | “What can we predict from this data?” | ML models, statistics, experimentation | Azure ML, Databricks, notebooks |
The roles overlap in practice — in a small team one person may wear several hats — but the exam treats them as distinct, so anchor each to its core question.
Common data formats
How data is stored on disk or in a file matters enormously for cost and performance, especially in analytics. DP-900 expects familiarity with these.
CSV (Comma-Separated Values). Plain text, one row per line, fields separated by commas. Simple, universal and human-readable; great for small exports and exchange. But it has no types (everything is text), no compression and must be read whole — poor for big analytics. Row-oriented.
JSON (JavaScript Object Notation). The standard for semi-structured data and web APIs. Self-describing, supports nesting and arrays, very flexible. More verbose than CSV and still text-based, so not ideal for huge analytical scans, but perfect for documents and message payloads. This is what document databases like Cosmos DB store.
XML (eXtensible Markup Language). An older, tag-based, self-describing format (<order><id>1</id></order>). Still common in enterprise integration and legacy systems, but heavier and more verbose than JSON, which has largely replaced it for new work.
Parquet. A columnar, compressed, binary format built for analytics. Because it stores data by column, a query that needs only two of fifty columns reads only those two — dramatically faster and cheaper for big aggregations. It also carries a schema and compresses extremely well. This is the workhorse of the data lake and the format you reach for when building analytical tables.
Avro. A row-based, compact, binary format with the schema stored alongside the data. It is strong for streaming and message-passing (it handles schema evolution gracefully) and for write-heavy ingestion where you read whole records. Often paired with Event Hubs / Kafka-style pipelines.
ORC (Optimised Row Columnar). Another columnar, compressed analytical format, common in the Hadoop/Hive ecosystem; conceptually similar to Parquet in its benefits.
| Format | Layout | Human-readable | Compressed | Schema | Best for |
|---|---|---|---|---|---|
| CSV | Row | Yes | No | No | Small exports, simple exchange |
| JSON | Row (document) | Yes | No | Self-describing | APIs, semi-structured documents |
| XML | Row (document) | Yes | No | Self-describing | Legacy/enterprise integration |
| Parquet | Column | No (binary) | Yes | Yes | Analytics, data lakes, big scans |
| Avro | Row | No (binary) | Yes | Yes (with data) | Streaming, write-heavy ingestion |
| ORC | Column | No (binary) | Yes | Yes | Analytics (Hadoop/Hive ecosystem) |
The exam-ready instinct: row formats (CSV, JSON, Avro) for transactional / write-heavy / whole-record reads; columnar formats (Parquet, ORC) for analytical / read-heavy / aggregate-a-few-columns workloads.
Hands-on lab: see structured, semi-structured and the OLTP→OLAP split for yourself
This lab needs nothing but a free Azure account and the Cloud Shell, and it uses the always-free Azure Storage tier within a normal subscription. The goal is to feel the three data shapes and the lifecycle, not to build a warehouse.
Open the Azure Cloud Shell (the >_ icon in the portal) and choose Bash. Then create a resource group and a storage account that will act as a tiny data lake:
# Variables
RG="rg-dp900-lab"
LOC="centralindia"
SA="dp900lab$RANDOM" # storage account names must be globally unique, lowercase
# 1. Resource group
az group create --name $RG --location $LOC
# 2. Storage account with hierarchical namespace = Data Lake Storage Gen2
az storage account create \
--name $SA --resource-group $RG --location $LOC \
--sku Standard_LRS --kind StorageV2 --hierarchical-namespace true
# 3. A container to act as our "bronze" raw layer
az storage container create --account-name $SA --name bronze --auth-mode login
Now create one file of each shape locally and upload them, mimicking the ingest → store stages of the lifecycle:
# Structured (CSV) — a tiny orders table
printf "order_id,customer,amount\n1,Asha,1200\n2,Ben,450\n" > orders.csv
# Semi-structured (JSON) — a product document with optional/nested fields
printf '{"id":"p1","name":"Keyboard","price":1999,"tags":["input","usb"]}\n' > product.json
# Upload both to the bronze layer
az storage blob upload --account-name $SA -c bronze -f orders.csv -n raw/orders.csv --auth-mode login
az storage blob upload --account-name $SA -c bronze -f product.json -n raw/product.json --auth-mode login
# List what landed in the lake
az storage blob list --account-name $SA -c bronze --query "[].name" -o table --auth-mode login
Expected output: the final command lists raw/orders.csv and raw/product.json. You have just ingested structured and semi-structured data into a data lake — the bronze layer of a medallion design. (An unstructured file would simply be a blob too: try uploading any image with the same command and it lands right beside them — the lake does not care about shape.)
Validation: in the portal, open the storage account → Storage browser → Blob containers → bronze → raw, and confirm both files are present. Notice that the CSV is a neat grid (structured) while the JSON carries its own keys and a nested array (semi-structured).
Cleanup (important — avoid charges):
az group delete --name $RG --yes --no-wait
Cost note: kept to a single LRS storage account holding a few kilobytes, this lab costs effectively nothing — well under ₹1 — and the resource group is deleted immediately afterwards. Storage in India regions is billed per GB-month, so a few kilobytes for a few minutes rounds to zero; the discipline of always running the cleanup command is what keeps a learning subscription cheap.
Common mistakes & troubleshooting
| Symptom / mistake | Cause | Fix |
|---|---|---|
| Calling JSON “unstructured” | Confusing flexible with no structure | JSON is semi-structured — it is self-describing with keys and nesting |
| Running heavy reports against the production OLTP database | Treating one database as both transactional and analytical | Copy data to an OLAP store (Synapse/Fabric); keep OLTP for transactions |
| Choosing CSV for a large analytics dataset | Picking a familiar format over an efficient one | Use Parquet — columnar and compressed, far cheaper to scan |
| Storage account name rejected in the lab | Names must be globally unique, 3–24 chars, lowercase letters/numbers only | Append $RANDOM or pick another suffix |
| Expecting near-real-time results from a nightly batch | Mismatching latency need with processing style | Use stream processing (Stream Analytics/Event Hubs) for real-time |
| Mixing up data engineer and data analyst | Both “work with data” | Engineer moves and shapes data (pipelines); analyst interprets it (reports) |
Forgetting --auth-mode login on storage commands |
Defaults to key-based auth which may be disabled | Add --auth-mode login to use your Entra identity |
Best practices
- Match the store to the workload. Transactional and relational → relational database; flexible and globally scaled → non-relational; raw and analytical → data lake. Do not force one store to do every job.
- Separate OLTP from OLAP. Never let reporting queries compete with live transactions; move analytical data into a dedicated analytical store.
- Prefer columnar formats for analytics. Parquet/ORC cut scan cost and time dramatically over CSV/JSON.
- Keep raw data (bronze) immutable. Refine forward into silver and gold; never overwrite your source of truth.
- Choose batch or streaming by latency, not habit. Real-time need → streaming; tolerant of delay → batch (cheaper and simpler).
- Label data by sensitivity early. Knowing which datasets hold personal or financial data shapes every later security and governance decision.
Security notes
Even at fundamentals level, three security ideas underpin everything in DP-900. First, encryption: Azure data services encrypt data at rest by default and support encryption in transit (TLS), so data is protected both on disk and on the wire. Second, identity-based access: prefer Microsoft Entra ID identities and role-based access control (RBAC) over shared keys or passwords — in the lab we used --auth-mode login precisely so the storage operations ran as you, not as an account key that could leak. Third, classification and least privilege: identify which data is sensitive (personal, financial, health) and grant each role only the access it needs — a data analyst rarely needs write access to raw operational tables. These principles recur in every later data lesson, so internalise them now.
Interview & exam questions
-
What is the difference between structured, semi-structured and unstructured data? Give an example of each. Structured has a fixed schema (a SQL table); semi-structured is self-describing but flexible (JSON); unstructured has no inherent field model (an image or PDF).
-
Explain OLTP versus OLAP. OLTP handles many small, fast read/write transactions to run the business (and is normalised, row-oriented, ACID); OLAP handles large read-only aggregations to analyse the business (and is denormalised, often columnar). Data is born in OLTP and copied to OLAP.
-
Why can’t one database serve both transactional and analytical workloads well? Their optimal designs are opposites — OLTP wants normalised rows for fast small writes; OLAP wants denormalised columnar storage for fast huge reads. Mixing them degrades both.
-
What do the letters in ACID stand for, and which workload depends on them? Atomicity, Consistency, Isolation, Durability — the guarantees that make OLTP transactional databases reliable.
-
What is the difference between relational and non-relational data? Relational stores data in tables linked by keys with a strict schema and rich joins; non-relational (NoSQL) relaxes the schema for flexibility and horizontal scale (documents, key-value, columns, graphs).
-
Batch versus stream processing — when would you use each? Batch processes large volumes on a schedule with higher latency (nightly sales load); streaming processes events continuously with sub-second latency (live fraud detection). Choose by the latency requirement.
-
Describe the four data roles. Database administrator (health, security and performance of operational databases); data engineer (builds pipelines that move and transform data); data analyst (turns curated data into reports and insight); data scientist (builds ML models and predictions).
-
A team needs to analyse three years of sales by region in Power BI without slowing the checkout system. Outline an approach. Leave checkout on its OLTP database; use a data engineer’s pipeline to ingest the sales data into a data lake, refine it (bronze→silver→gold), serve it from Synapse/Fabric, and connect Power BI to that analytical store.
-
Why is Parquet preferred over CSV for large analytical datasets? Parquet is columnar and compressed, so a query reads only the columns it needs and far fewer bytes — dramatically faster and cheaper than scanning a whole row-based, uncompressed CSV.
-
What is a data lake, and how does it differ from a data warehouse? A data lake is massive, cheap storage holding raw data of any shape; a data warehouse holds curated, structured, query-optimised data ready for reporting. Data flows from lake to warehouse as it is refined.
-
What does the medallion architecture describe? Three progressive quality layers in a data lake — bronze (raw), silver (cleaned/conformed) and gold (business-ready aggregates) — through which data is refined.
-
Which file format would you choose for a streaming ingestion pipeline that reads whole records, and why? Avro — a compact, row-based binary format that stores its schema with the data and handles schema evolution well, suiting write-heavy, whole-record streaming.
Quick check
- JSON is an example of which data category?
- Which workload type is optimised for many small, fast transactions: OLTP or OLAP?
- True or false: streaming processing has higher latency than batch.
- Which role builds the pipelines that move and transform data?
- Which file format is columnar and best suited to large analytical scans?
Answers
- Semi-structured — it is self-describing and flexible but not freeform.
- OLTP (Online Transaction Processing).
- False — streaming has lower latency (seconds/ms); batch has higher latency.
- The data engineer.
- Parquet (ORC is the other columnar analytical format).
Exercise
Pick three systems you use or know — for example a banking app, a video-streaming service and a hospital’s patient-records system. For each, write a short note answering: (a) what kind of data does it mainly hold — structured, semi-structured or unstructured? (b) is its primary workload transactional (OLTP) or analytical (OLAP)? © would its reporting needs be better served by batch or streaming? (d) which two data roles would be most involved in running it? There is no single right answer — the value is in justifying each choice with the concepts from this lesson. As a stretch, sketch how data would flow through the lifecycle (generate → ingest → store → transform → serve) for one of them.
Certification mapping
This lesson maps directly to the DP-900: Microsoft Azure Data Fundamentals exam, principally the “Describe core data concepts” domain (the largest scored area), which covers ways to represent data (structured/semi-structured/unstructured and file formats), data workload options (batch vs streaming, transactional vs analytical) and common data roles and the tasks performed by each. The OLTP/OLAP, relational/non-relational and modern-data-warehouse concepts here also underpin the relational and non-relational/analytics domains covered in the following lessons. Everything in this lesson is foundational knowledge expected before any other DP-900 topic.
Glossary
- Structured data — data with a fixed, predefined schema of rows and typed columns (e.g. a SQL table).
- Semi-structured data — self-describing data with flexible structure (e.g. JSON), where records need not be identical.
- Unstructured data — data with no inherent field model (images, video, audio, free text), stored as files with separate metadata.
- Relational — data modelled as tables linked by keys, with a strict, enforced schema.
- Non-relational / NoSQL — databases that relax the fixed schema for flexibility and scale (document, key-value, column, graph).
- OLTP — Online Transaction Processing; many small, fast, ACID read/write transactions that run the business.
- OLAP — Online Analytical Processing; large read-only aggregations that analyse the business.
- ACID — Atomicity, Consistency, Isolation, Durability; the guarantees that make transactions reliable.
- Normalisation — organising data so each fact is stored once, without duplication (favoured by OLTP).
- Denormalisation — deliberately duplicating/pre-joining data into wide tables for fast reads (favoured by OLAP).
- Batch processing — processing data in scheduled chunks; higher latency, large volumes.
- Stream processing — processing events continuously as they arrive; very low latency.
- Data lake — massive, cheap storage holding raw data of any shape (Azure Data Lake Storage Gen2).
- Data warehouse — curated, query-optimised store of clean, structured data for analytics.
- Medallion architecture — bronze (raw) → silver (cleaned) → gold (business-ready) refinement layers in a lake.
- ETL / ELT — Extract-Transform-Load (or Extract-Load-Transform); the process of moving and reshaping data.
- Database administrator (DBA) — owns the health, security and performance of operational databases.
- Data engineer — builds the pipelines that ingest, move and transform data.
- Data analyst — turns curated data into reports, dashboards and business insight.
- Data scientist — applies statistics and machine learning to make predictions from data.
- Parquet / ORC — columnar, compressed binary formats optimised for analytical scans.
- Avro — a compact, row-based binary format with embedded schema, suited to streaming.
Next steps
You now have the vocabulary and the mental models the rest of the track depends on. Next, go hands-on with the most common workload of all in DP-900: Relational Data on Azure — tables, keys, normalisation, basic SQL and the Azure relational family (Azure SQL Database, Managed Instance, PostgreSQL and MySQL). From there the track moves to non-relational data and analytics, where the data lake, Cosmos DB, Synapse, Microsoft Fabric and Power BI ideas introduced here are explored in full.