In the space of a couple of years, generative AI went from a research curiosity to something your bank, your hospital, and your favourite app all quietly use. If the previous lessons taught you the classic AI building blocks — vision, language, speech, document intelligence — this one tackles the wave that changed everything: models that don’t just analyse content but create it. Ask one to draft an email, summarise a contract, write a SQL query, or explain a photograph, and it produces fluent, original output in seconds.
This is Part of Module AI Fundamentals in the Azure Zero-to-Hero course, and it maps directly to the generative-AI objectives of AI-900: Azure AI Fundamentals. We will build the ideas from the ground up. You do not need any maths, any coding, or any prior exposure to AI beyond the earlier fundamentals lessons. By the end you will understand what a large language model (LLM) actually is, the vocabulary that trips everyone up at first — tokens, prompts, completions, temperature, embeddings, vectors — how Microsoft packages OpenAI’s models as the Azure OpenAI Service, how to steer a model with good prompts, and the single most important production pattern in the field: retrieval-augmented generation (RAG), which grounds a model on your own data so it answers from facts instead of guessing. We close with copilots and agents and with the part no responsible architect skips — responsible generative AI: hallucination, grounding, and content safety.
Learning objectives
By the end of this lesson you can:
- Explain in plain English what generative AI and a large language model are, and describe a transformer at a teachable level.
- Define the core vocabulary — token, prompt, completion, context window, temperature, top-p — and predict how each affects a model’s output.
- Explain what embeddings and vectors are and why they let a computer measure meaning rather than matching words.
- Describe the Azure OpenAI Service: its model families, the idea of a deployment, content filters, and data residency, and how it differs from public ChatGPT.
- Apply prompt-engineering basics — clear instructions, examples, system messages, and grounding — to get better results.
- Explain the RAG pattern end to end (retrieve → augment → generate) and why grounding a model on your data matters, using Azure AI Search as the retriever.
- Define copilots and agents at a concept level, and state the key responsible-AI risks of generative models — hallucination, grounding, and content safety — and Azure’s mitigations.
Prerequisites & where this fits
You need only basic IT literacy and the earlier AI-900 fundamentals lessons — the AI & machine-learning fundamentals lesson (what AI is, the six Responsible AI principles) and the Azure AI Services lesson (the applied vision/language/speech building blocks). No coding and no maths are assumed; every term is defined the first time it appears. A free Azure account is enough for the conceptual lab, though note that the Azure OpenAI Service itself requires an approved subscription, so the hands-on section is written so you can follow it whether or not you have OpenAI access. This is the fourth lesson in the AI Fundamentals module, and it is the bridge from “AI that understands” to “AI that creates.”
What is generative AI?
Most of the AI you met earlier is discriminative: it looks at an input and puts it into a bucket. Is this email spam or not? Is the sentiment positive or negative? What objects are in this photo? The model draws a line between categories and tells you which side your input falls on.
Generative AI does the opposite. Instead of sorting existing content, it produces new content — text, code, images, audio — that did not exist before. You give it a starting instruction and it generates something plausible and original in response. The same underlying idea powers a chatbot that writes a poem, a tool that turns a description into a picture, and an assistant that drafts a function from a comment.
For AI-900 the spotlight is on text and code generation, which is driven by large language models. So that is where we will spend most of our time, with a short detour to images near the end.
A quick map of what generative models can produce:
| Modality | What it generates | Everyday example |
|---|---|---|
| Text | Prose, summaries, translations, answers, classifications expressed in words | “Summarise this 20-page report in five bullets.” |
| Code | Source code, queries, scripts, tests | “Write a Python function that validates an email address.” |
| Images | Pictures from a text description (text-to-image) | “A watercolour of the Mumbai skyline at dawn.” |
| Audio / speech | Synthesised speech, music | A natural-sounding voice reading this article aloud. |
| Multimodal | Output that reasons over mixed input — e.g. text + an image | “What is unusual about this photo?” with a picture attached. |
Large language models, explained from scratch
A large language model (LLM) is, at heart, a system that is extraordinarily good at one deceptively simple task: predicting the next word (more precisely, the next token — we’ll get there) given everything that came before. That is genuinely the whole trick. Show it “The capital of France is” and it predicts “Paris.” Show it the first half of a story and it predicts a plausible continuation. Scale that ability up — across billions of examples and billions of internal parameters — and “predict the next word really, really well” turns into something that can summarise, translate, reason step by step, and hold a conversation.
Where does the skill come from? Training. The model is shown an enormous amount of text — books, articles, code, websites — and repeatedly asked to predict the next token. Each time it guesses wrong, its internal numbers (its parameters, also called weights) are nudged slightly so it would guess better next time. Do this trillions of times and the parameters end up encoding a rich, statistical picture of how language — and a surprising amount of the world described in language — fits together. “Large” refers to exactly this: the sheer number of parameters (often hundreds of billions) and the size of the training data.
Two facts about training matter enormously and recur on the exam:
- A model’s knowledge has a cut-off. It only knows what was in its training data, which stops at some date. It has never heard of anything that happened afterwards, and — crucially — it knows nothing about your private documents. This single limitation is the entire reason the RAG pattern (later in this lesson) exists.
- The model is not a database; it is a predictor. It does not “look up” facts; it generates the most probable next token. Usually the most probable continuation is also the true one — but not always, which is why models sometimes produce confident-sounding nonsense, a failure we call hallucination.
The transformer, at a teachable level
Modern LLMs are built on an architecture called the transformer (the T in GPT — Generative Pre-trained Transformer). You do not need the maths for AI-900, but one idea is worth carrying with you because it explains why these models are so good: attention.
When a transformer processes a sentence, every word is allowed to “look at” — to pay attention to — every other word and weigh how relevant each one is to understanding it. Take “The trophy didn’t fit in the suitcase because it was too big.” What does “it” refer to — the trophy or the suitcase? A transformer learns to attend more strongly to “trophy”, because the context (something being too big to fit) makes that the sensible reading. This self-attention mechanism, applied in many layers, lets the model build a context-aware understanding of language rather than treating words in isolation. That is the single sentence to remember: a transformer uses attention to weigh how every token relates to every other token.
The vocabulary that trips everyone up
Before we touch Azure, let’s nail the handful of terms that confuse every newcomer. Get these right and most of generative AI clicks into place.
Tokens
Models do not read words; they read tokens. A token is a chunk of text — often a whole short word, sometimes part of a longer word, sometimes a space or punctuation mark. As a rough English rule of thumb, one token is about four characters, and 100 tokens is about 75 words. The word “tokenisation” might split into token + isation; “cat” is a single token.
Why care? Three reasons, and all three show up in practice and on the exam:
- You pay per token. Azure OpenAI bills by tokens — both the tokens you send (input/prompt tokens) and the tokens the model generates (output/completion tokens). Wordier prompts cost more.
- There is a hard limit. Every model has a context window — the maximum number of tokens it can consider at once, counting prompt plus completion together. Exceed it and the request fails or older context is dropped. Windows range from a few thousand tokens to hundreds of thousands depending on the model.
- Speed scales with tokens. More tokens in and out means a slower, more expensive response.
| Term | Plain meaning | Why it matters |
|---|---|---|
| Token | A chunk of text (~4 characters / ~¾ of a word) | The unit the model reads, the unit you are billed in |
| Input / prompt tokens | Tokens in what you send | Counted towards cost and the context window |
| Output / completion tokens | Tokens the model generates | Counted towards cost and the context window |
| Context window | Max tokens (prompt + completion) the model can handle at once | Caps how much you can feed in and get back |
Prompts and completions
The prompt is the input you give the model — your instruction, question, and any context. The completion is the text the model generates in response. The entire craft of using LLMs well comes down to writing a prompt that makes a good completion likely; that craft has a name, prompt engineering, and we devote a section to it below.
Temperature and top-p
By default an LLM’s next-token prediction is a probability distribution — a ranked list of candidate next tokens with a likelihood attached to each. Two settings let you control how adventurously the model picks from that list:
- Temperature (typically 0 to 1, sometimes up to 2). Low temperature (near 0) makes the model deterministic and focused — it almost always takes the single most probable token, giving consistent, “safe” output. High temperature flattens the probabilities so less-likely tokens get a chance, producing more varied, creative — and less predictable — output.
- Top-p (nucleus sampling, 0 to 1). Instead of scaling probabilities, top-p narrows the pool of candidates: top-p = 0.9 means “only consider the most likely tokens whose probabilities add up to 90%, and ignore the long tail.” Lower top-p = safer; higher = more diverse.
The practical guidance — and the exam answer — is simple: for factual, repeatable tasks (extraction, classification, code) use a low temperature; for creative tasks (brainstorming, marketing copy) raise it. Adjust one of temperature or top-p, not both at once.
| Setting | Range | Low value → | High value → |
|---|---|---|---|
| Temperature | 0–1 (–2) | Focused, deterministic, repeatable | Creative, varied, unpredictable |
| Top-p | 0–1 | Considers only the top few tokens | Considers a wider range of tokens |
Embeddings and vectors: teaching a computer about meaning
Here is a problem classic search cannot solve. Search the word “car” in a document that only ever says “automobile” and a keyword search finds nothing — the letters don’t match even though the meaning is identical. Generative AI systems get around this with embeddings.
An embedding is a way of turning a piece of text (a word, a sentence, a whole paragraph) into a vector — a long list of numbers, often hundreds or thousands of them. A special embedding model produces these vectors so that text with similar meaning ends up with similar numbers. The vector for “car” sits very close to the vector for “automobile” and far from the vector for “banana”. The numbers themselves are not human-readable; what matters is the distances between them.
Once meaning is expressed as numbers, the computer can do something powerful: measure how close two pieces of text are by measuring how close their vectors are. This is vector search (or semantic search), and it is the engine behind the RAG pattern. You store the vectors in a vector database (or a search index that supports vectors, such as Azure AI Search), then, given a question, you embed the question and ask “which stored chunks have the nearest vectors?” — i.e. which are the most semantically relevant, regardless of the exact words used.
A one-line mental model to remember: an embedding turns text into coordinates on a “map of meaning,” and similar meanings sit close together on that map.
The Azure OpenAI Service
OpenAI builds the models (the GPT family and others). Azure OpenAI Service is Microsoft’s way of delivering those same models inside Azure, with the security, compliance, networking, and governance an enterprise needs. It is the difference between using a model on the public internet and running it within your Azure subscription, under your controls.
Why an organisation chooses Azure OpenAI over the public ChatGPT website:
| Capability | What it gives you |
|---|---|
| Enterprise security & identity | Microsoft Entra ID authentication, role-based access control, integration with the rest of Azure |
| Networking controls | Private endpoints / VNet integration so traffic never traverses the public internet |
| Data privacy | Your prompts and completions are not used to train the models, and your data stays within your Azure tenant |
| Data residency | Choose the region your deployment lives in to meet sovereignty rules (e.g. keep data in a chosen geography) |
| Content filtering | Built-in content filters screen prompts and responses for harmful content |
| Compliance & SLA | Azure’s compliance certifications and a service-level agreement |
Model families
Azure OpenAI offers several families of models, each suited to different jobs:
| Family | What it does | Typical use |
|---|---|---|
| GPT chat/completion models (the GPT family) | Generate and reason over text and code; newer versions are multimodal (accept images too) | Chatbots, summarisation, drafting, code, Q&A |
| Embedding models | Turn text into vectors | Semantic search, the retrieve step of RAG, clustering |
| Image-generation models (e.g. the DALL·E family) | Create images from text descriptions | Marketing visuals, concept art |
| Speech models (e.g. the Whisper family) | Transcribe speech to text | Meeting transcripts, captions |
You do not need to memorise specific version numbers for AI-900 — they change often. You should know the categories: a model for chat/text generation, a model for embeddings, a model for images, a model for speech.
Deployments — the idea you must understand
In Azure OpenAI you never call a “model” directly. You create a deployment: a named, callable instance of a chosen model version within your resource. Your application then sends requests to the deployment’s name and endpoint, authenticated with a key or with Microsoft Entra ID. This indirection is deliberate — it lets you pin to a model version, manage capacity, and swap the underlying model later without changing your application’s wiring. The mental model: a deployment is your own named “phone line” to a specific model.
Content filters and data residency
Two governance features deserve a special mention because they appear on the exam and matter in production:
- Content filters. Azure OpenAI automatically runs a content-filtering system over both the prompt and the completion, screening categories such as hate, sexual, violence, and self-harm at configurable severity levels. This is part of Azure AI Content Safety and is on by default — you do not bolt it on afterwards.
- Data residency. Because you choose the Azure region for your resource, you control where your data is processed and stored, which is how organisations meet data-sovereignty obligations.
Prompt engineering basics
A model’s output is only as good as the prompt you give it. Prompt engineering is the practical skill of writing inputs that reliably produce the output you want — no model retraining required. A few high-leverage techniques, all examinable:
- Be clear and specific. Vague in, vague out. “Write something about Azure” is weak; “Write a 100-word introduction to Azure Storage for a beginner, in British English, with no marketing fluff” is strong.
- Use a system message. Most chat models accept a system message that sets the assistant’s role, tone, and rules for the whole conversation — e.g. “You are a concise Azure tutor. Answer in British English. If unsure, say so.” It steers every later reply.
- Give examples (few-shot prompting). Showing the model one or more worked examples of the input→output you want (“few-shot”) dramatically improves consistency, versus zero-shot (no examples). For a classifier, show two or three labelled samples and the model copies the pattern.
- Ask for structure. If you need machine-readable output, say so: “Return the answer as JSON with keys
nameandtotal.” - Encourage reasoning for hard tasks. For multi-step problems, asking the model to “work through it step by step” (chain-of-thought) often improves accuracy.
- Ground the model. The most powerful technique of all: put the relevant facts directly in the prompt so the model answers from them rather than from memory. “Using only the text below, answer the question…” This is grounding, and it leads us straight to RAG.
The RAG pattern: grounding a model on your data
We now arrive at the most important architectural pattern in applied generative AI. Recall the two hard limits of an LLM: it knows nothing past its training cut-off, and it knows nothing about your private documents. So how do you build an assistant that answers questions about your company handbook, your product catalogue, or this month’s policy — none of which the model was trained on?
You do not retrain the model (slow, expensive, and overkill). Instead you use Retrieval-Augmented Generation (RAG): at the moment of the question, you fetch the relevant facts from your own data and hand them to the model as part of the prompt, so it generates an answer grounded in those facts.
The pattern has three steps, and the names give it away:
- Retrieve. Take the user’s question, search your knowledge base for the most relevant chunks of text. This is typically a vector search over embeddings (often combined with keyword search — a hybrid search) using Azure AI Search as the retriever.
- Augment. Insert those retrieved chunks into the prompt alongside the question — “Here are the relevant passages: […]. Using only these, answer: […].” You have augmented the prompt with grounding data.
- Generate. The LLM produces an answer based on the supplied facts, ideally citing which passage each statement came from.
The diagram traces a single question through the full pipeline — embedding the query, retrieving the nearest chunks from an Azure AI Search index, augmenting the prompt with those grounded passages, and generating a cited answer with an Azure OpenAI deployment.
Why RAG is such a big deal, in one paragraph: it gives you a model that answers from current, private, authoritative data; it slashes hallucination because the model is told to use the supplied facts; it lets you cite sources so users can verify; and it needs no retraining — update the documents and the answers update too. The typical Azure shape is Azure AI Search (the retriever, holding your indexed and vectorised content) plus an Azure OpenAI chat deployment (the generator), with an embedding deployment turning both your documents and the incoming question into vectors. Azure even offers a built-in “on your data” capability that wires Azure OpenAI to an Azure AI Search index for you, so a basic RAG chatbot can be stood up without writing the retrieval plumbing by hand.
A crisp contrast to keep the alternatives straight:
| Approach | What it does | When to use it |
|---|---|---|
| Prompt engineering | Steer the base model with better instructions/examples | First resort; cheap; no extra infrastructure |
| RAG | Inject your retrieved facts into the prompt at query time | The model must answer from private/current data — the common enterprise case |
| Fine-tuning | Further-train the model on your examples to change its style/behaviour | You need a consistent format/tone/skill, not new facts; heavier and costlier |
The exam-ready distinction: RAG adds knowledge; fine-tuning adjusts behaviour. If the problem is “the model doesn’t know our facts,” reach for RAG, not fine-tuning.
Copilots and agents
Two words you will hear constantly, defined simply:
- A copilot is an AI assistant embedded inside an application to help you do that app’s work — drafting in a word processor, suggesting code in an editor, summarising a meeting. The human stays in control; the copilot assists. Microsoft’s family of these is branded Copilot, and they are built on the same generative models and grounding patterns covered here.
- An agent goes a step further: given a goal, it can plan and take actions — calling tools, querying systems, chaining several steps — to accomplish a task with more autonomy, rather than only responding to a single prompt. “Find the three cheapest flights and draft an email comparing them” is agent-shaped work: it must search, compare, then write. On Azure, such assistants are built with services like Azure AI Foundry and the Azure AI Agent Service.
For AI-900 you need the concepts: a copilot assists inside an app; an agent pursues a goal by taking actions. Both are applications built on top of models like those in Azure OpenAI — they are not models themselves.
Responsible generative AI
Generative AI inherits all six Microsoft Responsible AI principles you met earlier — fairness, reliability & safety, privacy & security, inclusiveness, transparency, accountability — but it adds risks specific to generated content. Three matter most:
- Hallucination. Because the model predicts plausible text rather than looking up true text, it can produce confident, fluent statements that are simply wrong — a fabricated citation, an invented figure, a made-up policy. The primary defence is grounding: use RAG so the model answers from supplied facts, ask it to cite sources, lower the temperature for factual tasks, and instruct it to say when it does not know.
- Grounding (the fix and the goal). Grounding means tying the model’s output to verifiable source data rather than its parametric memory. A grounded answer can be traced back to a document; an ungrounded one cannot. Grounding is the single most effective lever against hallucination — which is exactly why RAG is so widely adopted.
- Content safety. Generative models could, if unguarded, produce or be coaxed into producing harmful content (hate, violence, self-harm, sexual content), or be manipulated by prompt-injection attacks where hidden instructions in input try to override your rules. Azure’s answer is Azure AI Content Safety and the content filters built into Azure OpenAI, which screen both prompts and completions across harm categories at configurable severities, plus features such as groundedness detection and prompt-shield protections.
The senior-architect summary: never ship a generative feature without grounding, content filtering, a human-in-the-loop for high-stakes output, and clear disclosure that responses are AI-generated and may be imperfect. Transparency and accountability are not optional extras; they are the price of using this technology responsibly.
Hands-on lab
The Azure OpenAI Service requires an approved subscription, so this lab is written in two tiers. Everyone can do Part A (it costs nothing and proves the core concepts); do Part B only if your subscription has Azure OpenAI access.
Part A — Tokens, prompts and temperature (no special access, free)
- Open the Azure OpenAI Studio tokenizer page, or any OpenAI-compatible tokenizer, and paste a sentence such as “Generative AI predicts the next token.” Observe how it splits into tokens and note the token count. Try a long, rare word and watch it split into several tokens. Validation: you can state roughly how many tokens your sentence uses and why a wordier prompt costs more.
- In a chat playground you do have (for example the Azure AI Foundry chat playground if available, or the conceptual exercise on paper), write a system message: “You are a concise Azure tutor; answer in British English; if unsure, say so.” Then ask the same factual question twice — once at temperature 0 and once at temperature 0.9. Validation: the low-temperature answer is steady and repeatable; the high-temperature answer varies in wording and length. You have now seen what temperature does.
- Rewrite a vague prompt (“tell me about storage”) into a specific, grounded one (“Using only this paragraph: ‘…’, list in three bullets when to choose Azure Blob Storage”). Validation: the grounded version answers from your text rather than from the model’s memory — a hand-built taste of RAG.
Part B — A grounded chatbot with “on your data” (requires Azure OpenAI access)
- In the Azure portal, create an Azure OpenAI resource in a region close to you (this sets your data residency).
az cognitiveservices account create --name myopenai --resource-group rg-genai-lab --kind OpenAI --sku S0 --location eastus. - In Azure AI Foundry / Azure OpenAI Studio, create a deployment of a chat model and a deployment of an embedding model.
- Create an Azure AI Search service and upload a handful of your own documents (a few PDFs or text files).
- In the chat playground, use “Add your data” to point the chat deployment at your Azure AI Search index — this wires up RAG for you.
- Ask a question whose answer is only in your uploaded documents. Validation: the assistant answers correctly and cites the source document — proof of retrieval-augmented generation. Now ask something not in your data and watch a well-grounded setup say it cannot find the answer rather than hallucinating.
Cleanup. Delete everything so nothing keeps billing: az group delete --name rg-genai-lab --yes --no-wait (do the same for the resource group holding the Search service if separate).
Cost note. Part A is free. In Part B, Azure OpenAI bills per 1,000 tokens (input and output separately) and Azure AI Search bills per hour for the service tier. A short experiment with a small Search tier typically costs only a few rupees (well under ₹100) — but the Search service charges while it exists, whether or not you query it, so the single biggest cost mistake is leaving it running. Delete the resource group the moment you are done.
Common mistakes & troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Request rejected for being too long | Prompt + expected completion exceeds the context window | Shorten the prompt, retrieve fewer/smaller chunks, or use a model with a larger window |
| Model gives confident but wrong answers | Hallucination — answering from memory, not facts | Ground it with RAG, lower temperature, ask it to cite sources and to admit uncertainty |
| Bill is higher than expected | Billed per token in and out; long prompts/answers add up | Trim prompts, cap max output tokens, retrieve only the most relevant chunks |
| Output is inconsistent run to run | Temperature/top-p too high for a factual task | Lower temperature (towards 0) for extraction/classification/code |
| Chatbot ignores your documents | Retrieval misconfigured — wrong index, no embeddings, or data not added | Verify the Azure AI Search index, that content is vectorised, and that the deployment is pointed at it |
| “Model not found” when calling the API | Calling the model name instead of your deployment name | Call the deployment name and endpoint, not the raw model id |
| Harmful or off-policy text slips through | Content filtering not configured, or a prompt-injection attack | Use Azure AI Content Safety/content filters and prompt-shield protections; keep a human in the loop |
Best practices
- Start with the cheapest tool that works: try prompt engineering first, add RAG when the model needs your facts, and only fine-tune when you need consistent behaviour, not new knowledge.
- Always ground high-stakes answers and surface citations so users can verify.
- Pick the right model for the job — a small fast model for simple tasks, a larger one only where its capability earns the extra cost and latency.
- Tune temperature to the task: low for factual/extractive work, higher for creative work.
- Control the context window deliberately — retrieve only the most relevant chunks; more context is not always better and always costs more.
- Keep a human in the loop for anything consequential, and disclose clearly that content is AI-generated.
- Monitor cost by tokens and set budgets; tokens are the unit that moves the bill.
Security notes
- Use Microsoft Entra ID, not raw keys, where possible, and store any keys in Azure Key Vault — never in code or prompts.
- Lock down the network with private endpoints / VNet integration so traffic to Azure OpenAI stays off the public internet.
- Never put secrets or unnecessary personal data in prompts — prompts are sent to the service and may be logged for abuse monitoring; minimise what you send.
- Keep content filtering on and tune severities to your context; treat it as a required control, not optional.
- Defend against prompt injection: treat retrieved/user content as untrusted, separate instructions from data, and use prompt-shield/groundedness features.
- Remember data residency: choose the resource region to satisfy sovereignty rules, and recall that, in Azure OpenAI, your data is not used to train the models.
- Govern access with RBAC and audit usage — apply the same least-privilege discipline you would to any sensitive Azure resource.
Interview & exam questions
- What is a large language model, in one sentence? A model trained to predict the next token given preceding text, which at scale can generate, summarise, translate, and reason over language.
- What is a token, and why does it matter? A chunk of text (~4 characters / ~¾ word) — the unit the model reads and the unit you are billed in; prompt + completion tokens must also fit the context window.
- Explain temperature. A setting (≈0–1) controlling randomness: near 0 is focused/deterministic/repeatable; higher is more creative/varied. Use low for factual tasks, high for creative ones.
- What is an embedding, and what is it used for? A numeric vector representing the meaning of text so that similar meanings have similar vectors; it powers semantic/vector search and the retrieve step of RAG.
- What is the RAG pattern and why use it? Retrieve relevant facts from your data, augment the prompt with them, generate a grounded answer. It lets a model answer from private/current data, reduces hallucination, and enables citations — without retraining.
- RAG vs fine-tuning — when each? RAG adds knowledge (use it when the model lacks your facts); fine-tuning adjusts behaviour/style (use it for consistent format/tone). Most “it doesn’t know our data” problems are RAG problems.
- What is a deployment in Azure OpenAI? A named, callable instance of a specific model version in your resource; your app calls the deployment name + endpoint, not the raw model.
- How does Azure OpenAI differ from public ChatGPT? Enterprise identity (Entra ID), RBAC, private networking, data residency, built-in content filtering, compliance/SLA — and your data is not used to train the models.
- What is hallucination and how do you reduce it? Confident but false generated content; mitigate with grounding/RAG, citations, lower temperature, and instructing the model to admit uncertainty.
- Copilot vs agent? A copilot assists a human inside an app; an agent plans and takes actions (calls tools, chains steps) to pursue a goal more autonomously.
- Name the harm categories content filters screen for. Hate, sexual, violence, and self-harm — via Azure AI Content Safety, applied to both prompt and completion at configurable severities.
- What is grounding? Tying a model’s output to verifiable source data rather than its memory — the core technique behind trustworthy generative answers.
Quick check
- In your own words, what single task is an LLM fundamentally trained to do, and how does scaling that turn into useful abilities?
- Why does the same prompt cost more if it is wordier, and what two totals must fit inside the context window?
- A teammate wants a chatbot to answer questions about this quarter’s internal policy PDFs. Should they fine-tune a model or use RAG — and why?
- You are extracting structured fields from invoices and the output keeps varying between runs. Which setting do you change, and in which direction?
- Name the three risks specific to generative AI covered in this lesson and the single most effective mitigation for the first one.
Answers
- An LLM is trained to predict the next token given the preceding text. Scaled across vast data and parameters, “predict the next token really well” generalises into summarising, translating, reasoning, and conversing.
- You are billed per token, so more words = more tokens = more cost. The prompt tokens and the completion tokens together must fit inside the context window.
- Use RAG. The information is private and current (this quarter’s PDFs) — the model needs new knowledge, which RAG supplies by retrieving and grounding. Fine-tuning changes behaviour, not facts, and is heavier and costlier here.
- Lower the temperature (towards 0). Extraction is a factual task, so you want focused, repeatable output rather than creative variation.
- Hallucination, grounding (lack of), and content-safety risks. The most effective mitigation for hallucination is grounding the model on verifiable source data — i.e. use RAG (plus citations, low temperature, and admitting uncertainty).
Exercise
Design — on paper — a grounded customer-support assistant for a fictional company, then justify each choice using this lesson:
- State the data the assistant must answer from (e.g. product manuals, an FAQ, return policy) and explain why a plain LLM cannot answer these out of the box.
- Sketch the RAG pipeline: where embeddings are created, what plays the retriever (name the Azure service), and what plays the generator. Label the three steps retrieve → augment → generate.
- Choose a temperature for this assistant and justify it in one sentence.
- List three responsible-AI safeguards you will include (e.g. grounding with citations, content filtering, human escalation for refunds) and the risk each addresses.
- Name one cost lever you will pull to keep the token bill down.
If you can complete all five with a one-line justification each, you can explain applied generative AI on Azure end to end — exactly the level AI-900 expects.
Certification mapping
- AI-900 (Azure AI Fundamentals): Describe features of generative AI workloads on Azure — the headline objective this lesson covers end to end: what generative AI and LLMs are; tokens, prompts, completions, embeddings; the Azure OpenAI Service (models, deployments, content filters, responsible use); copilots; and grounding/RAG concepts. Expect plain definitional and scenario questions (“what is an embedding?”, “RAG vs fine-tuning?”, “what reduces hallucination?”).
- AI-102 (Azure AI Engineer Associate) on-ramp: this is the conceptual foundation for the generative-AI portions of AI-102, where you go hands-on building RAG solutions with Azure OpenAI and Azure AI Search, configuring deployments, and implementing content safety.
- It also underpins the Responsible AI thread across every Microsoft AI certification.
Glossary
- Generative AI — AI that creates new content (text, code, images, audio) rather than only classifying existing content.
- Large language model (LLM) — a model trained to predict the next token; at scale, capable of generating and reasoning over language.
- Transformer — the neural-network architecture behind modern LLMs, built on the attention mechanism (the T in GPT).
- Attention — the mechanism letting a model weigh how relevant every token is to every other token when building meaning.
- Token — a chunk of text (~4 characters / ~¾ word); the unit a model reads and is billed in.
- Prompt — the input/instruction you give the model. Completion — the text the model generates in response.
- Context window — the maximum number of tokens (prompt + completion) a model can consider at once.
- Temperature — a setting controlling randomness/creativity of output (low = focused, high = varied). Top-p — nucleus sampling; narrows the candidate pool to the most likely tokens summing to p.
- Embedding — a numeric vector representing the meaning of text, so similar meanings sit close together. Vector — the list of numbers itself.
- Vector / semantic search — finding text by closeness of meaning (vector distance) rather than exact keyword match.
- Azure OpenAI Service — Microsoft’s delivery of OpenAI models inside Azure with enterprise security, networking, residency, and content filtering.
- Deployment — a named, callable instance of a specific model version in your Azure OpenAI resource.
- RAG (Retrieval-Augmented Generation) — retrieve relevant facts from your data, augment the prompt with them, generate a grounded answer.
- Grounding — tying a model’s output to verifiable source data rather than its memory.
- Hallucination — confident but false content generated by a model.
- Fine-tuning — further-training a model on your examples to change its behaviour/style (not to add facts).
- Copilot — an AI assistant embedded in an app to help a human. Agent — an AI that plans and takes actions to pursue a goal.
- Azure AI Content Safety — Azure’s service (and the content filters in Azure OpenAI) that screen prompts and completions for harmful content.
- Prompt engineering — the practice of writing inputs that reliably produce the desired output without retraining.
Next steps
You can now explain generative AI from first principles — tokens and transformers through to a grounded, responsible RAG application on Azure — and answer the classic interview and exam questions on LLMs, embeddings, Azure OpenAI, and RAG.
- Next lesson: DP-900: Core Data Concepts, Roles & Workloads — generative AI runs on data, so the natural next move is the data fundamentals every architect needs.
Related reading to go deeper:
- AI-900: AI & Machine Learning Fundamentals on Azure (incl. Responsible AI) — the classic-ML foundation and the six Responsible AI principles this lesson builds on.
- AI-900: Azure AI Services — Vision, Language, Speech, Document Intelligence & Search — the applied building blocks, including Azure AI Search, the retriever at the heart of RAG.