Azure AI Fundamentals

AI-900: Generative AI & Azure OpenAI Fundamentals

In the space of a couple of years, generative AI went from a research curiosity to something your bank, your hospital, and your favourite app all quietly use. If the previous lessons taught you the classic AI building blocks — vision, language, speech, document intelligence — this one tackles the wave that changed everything: models that don’t just analyse content but create it. Ask one to draft an email, summarise a contract, write a SQL query, or explain a photograph, and it produces fluent, original output in seconds.

This is Part of Module AI Fundamentals in the Azure Zero-to-Hero course, and it maps directly to the generative-AI objectives of AI-900: Azure AI Fundamentals. We will build the ideas from the ground up. You do not need any maths, any coding, or any prior exposure to AI beyond the earlier fundamentals lessons. By the end you will understand what a large language model (LLM) actually is, the vocabulary that trips everyone up at first — tokens, prompts, completions, temperature, embeddings, vectors — how Microsoft packages OpenAI’s models as the Azure OpenAI Service, how to steer a model with good prompts, and the single most important production pattern in the field: retrieval-augmented generation (RAG), which grounds a model on your own data so it answers from facts instead of guessing. We close with copilots and agents and with the part no responsible architect skips — responsible generative AI: hallucination, grounding, and content safety.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You need only basic IT literacy and the earlier AI-900 fundamentals lessons — the AI & machine-learning fundamentals lesson (what AI is, the six Responsible AI principles) and the Azure AI Services lesson (the applied vision/language/speech building blocks). No coding and no maths are assumed; every term is defined the first time it appears. A free Azure account is enough for the conceptual lab, though note that the Azure OpenAI Service itself requires an approved subscription, so the hands-on section is written so you can follow it whether or not you have OpenAI access. This is the fourth lesson in the AI Fundamentals module, and it is the bridge from “AI that understands” to “AI that creates.”

What is generative AI?

Most of the AI you met earlier is discriminative: it looks at an input and puts it into a bucket. Is this email spam or not? Is the sentiment positive or negative? What objects are in this photo? The model draws a line between categories and tells you which side your input falls on.

Generative AI does the opposite. Instead of sorting existing content, it produces new content — text, code, images, audio — that did not exist before. You give it a starting instruction and it generates something plausible and original in response. The same underlying idea powers a chatbot that writes a poem, a tool that turns a description into a picture, and an assistant that drafts a function from a comment.

For AI-900 the spotlight is on text and code generation, which is driven by large language models. So that is where we will spend most of our time, with a short detour to images near the end.

A quick map of what generative models can produce:

Modality What it generates Everyday example
Text Prose, summaries, translations, answers, classifications expressed in words “Summarise this 20-page report in five bullets.”
Code Source code, queries, scripts, tests “Write a Python function that validates an email address.”
Images Pictures from a text description (text-to-image) “A watercolour of the Mumbai skyline at dawn.”
Audio / speech Synthesised speech, music A natural-sounding voice reading this article aloud.
Multimodal Output that reasons over mixed input — e.g. text + an image “What is unusual about this photo?” with a picture attached.

Large language models, explained from scratch

A large language model (LLM) is, at heart, a system that is extraordinarily good at one deceptively simple task: predicting the next word (more precisely, the next token — we’ll get there) given everything that came before. That is genuinely the whole trick. Show it “The capital of France is” and it predicts “Paris.” Show it the first half of a story and it predicts a plausible continuation. Scale that ability up — across billions of examples and billions of internal parameters — and “predict the next word really, really well” turns into something that can summarise, translate, reason step by step, and hold a conversation.

Where does the skill come from? Training. The model is shown an enormous amount of text — books, articles, code, websites — and repeatedly asked to predict the next token. Each time it guesses wrong, its internal numbers (its parameters, also called weights) are nudged slightly so it would guess better next time. Do this trillions of times and the parameters end up encoding a rich, statistical picture of how language — and a surprising amount of the world described in language — fits together. “Large” refers to exactly this: the sheer number of parameters (often hundreds of billions) and the size of the training data.

Two facts about training matter enormously and recur on the exam:

  1. A model’s knowledge has a cut-off. It only knows what was in its training data, which stops at some date. It has never heard of anything that happened afterwards, and — crucially — it knows nothing about your private documents. This single limitation is the entire reason the RAG pattern (later in this lesson) exists.
  2. The model is not a database; it is a predictor. It does not “look up” facts; it generates the most probable next token. Usually the most probable continuation is also the true one — but not always, which is why models sometimes produce confident-sounding nonsense, a failure we call hallucination.

The transformer, at a teachable level

Modern LLMs are built on an architecture called the transformer (the T in GPT — Generative Pre-trained Transformer). You do not need the maths for AI-900, but one idea is worth carrying with you because it explains why these models are so good: attention.

When a transformer processes a sentence, every word is allowed to “look at” — to pay attention to — every other word and weigh how relevant each one is to understanding it. Take “The trophy didn’t fit in the suitcase because it was too big.” What does “it” refer to — the trophy or the suitcase? A transformer learns to attend more strongly to “trophy”, because the context (something being too big to fit) makes that the sensible reading. This self-attention mechanism, applied in many layers, lets the model build a context-aware understanding of language rather than treating words in isolation. That is the single sentence to remember: a transformer uses attention to weigh how every token relates to every other token.

The vocabulary that trips everyone up

Before we touch Azure, let’s nail the handful of terms that confuse every newcomer. Get these right and most of generative AI clicks into place.

Tokens

Models do not read words; they read tokens. A token is a chunk of text — often a whole short word, sometimes part of a longer word, sometimes a space or punctuation mark. As a rough English rule of thumb, one token is about four characters, and 100 tokens is about 75 words. The word “tokenisation” might split into token + isation; “cat” is a single token.

Why care? Three reasons, and all three show up in practice and on the exam:

Term Plain meaning Why it matters
Token A chunk of text (~4 characters / ~¾ of a word) The unit the model reads, the unit you are billed in
Input / prompt tokens Tokens in what you send Counted towards cost and the context window
Output / completion tokens Tokens the model generates Counted towards cost and the context window
Context window Max tokens (prompt + completion) the model can handle at once Caps how much you can feed in and get back

Prompts and completions

The prompt is the input you give the model — your instruction, question, and any context. The completion is the text the model generates in response. The entire craft of using LLMs well comes down to writing a prompt that makes a good completion likely; that craft has a name, prompt engineering, and we devote a section to it below.

Temperature and top-p

By default an LLM’s next-token prediction is a probability distribution — a ranked list of candidate next tokens with a likelihood attached to each. Two settings let you control how adventurously the model picks from that list:

The practical guidance — and the exam answer — is simple: for factual, repeatable tasks (extraction, classification, code) use a low temperature; for creative tasks (brainstorming, marketing copy) raise it. Adjust one of temperature or top-p, not both at once.

Setting Range Low value → High value →
Temperature 0–1 (–2) Focused, deterministic, repeatable Creative, varied, unpredictable
Top-p 0–1 Considers only the top few tokens Considers a wider range of tokens

Embeddings and vectors: teaching a computer about meaning

Here is a problem classic search cannot solve. Search the word “car” in a document that only ever says “automobile” and a keyword search finds nothing — the letters don’t match even though the meaning is identical. Generative AI systems get around this with embeddings.

An embedding is a way of turning a piece of text (a word, a sentence, a whole paragraph) into a vector — a long list of numbers, often hundreds or thousands of them. A special embedding model produces these vectors so that text with similar meaning ends up with similar numbers. The vector for “car” sits very close to the vector for “automobile” and far from the vector for “banana”. The numbers themselves are not human-readable; what matters is the distances between them.

Once meaning is expressed as numbers, the computer can do something powerful: measure how close two pieces of text are by measuring how close their vectors are. This is vector search (or semantic search), and it is the engine behind the RAG pattern. You store the vectors in a vector database (or a search index that supports vectors, such as Azure AI Search), then, given a question, you embed the question and ask “which stored chunks have the nearest vectors?” — i.e. which are the most semantically relevant, regardless of the exact words used.

A one-line mental model to remember: an embedding turns text into coordinates on a “map of meaning,” and similar meanings sit close together on that map.

The Azure OpenAI Service

OpenAI builds the models (the GPT family and others). Azure OpenAI Service is Microsoft’s way of delivering those same models inside Azure, with the security, compliance, networking, and governance an enterprise needs. It is the difference between using a model on the public internet and running it within your Azure subscription, under your controls.

Why an organisation chooses Azure OpenAI over the public ChatGPT website:

Capability What it gives you
Enterprise security & identity Microsoft Entra ID authentication, role-based access control, integration with the rest of Azure
Networking controls Private endpoints / VNet integration so traffic never traverses the public internet
Data privacy Your prompts and completions are not used to train the models, and your data stays within your Azure tenant
Data residency Choose the region your deployment lives in to meet sovereignty rules (e.g. keep data in a chosen geography)
Content filtering Built-in content filters screen prompts and responses for harmful content
Compliance & SLA Azure’s compliance certifications and a service-level agreement

Model families

Azure OpenAI offers several families of models, each suited to different jobs:

Family What it does Typical use
GPT chat/completion models (the GPT family) Generate and reason over text and code; newer versions are multimodal (accept images too) Chatbots, summarisation, drafting, code, Q&A
Embedding models Turn text into vectors Semantic search, the retrieve step of RAG, clustering
Image-generation models (e.g. the DALL·E family) Create images from text descriptions Marketing visuals, concept art
Speech models (e.g. the Whisper family) Transcribe speech to text Meeting transcripts, captions

You do not need to memorise specific version numbers for AI-900 — they change often. You should know the categories: a model for chat/text generation, a model for embeddings, a model for images, a model for speech.

Deployments — the idea you must understand

In Azure OpenAI you never call a “model” directly. You create a deployment: a named, callable instance of a chosen model version within your resource. Your application then sends requests to the deployment’s name and endpoint, authenticated with a key or with Microsoft Entra ID. This indirection is deliberate — it lets you pin to a model version, manage capacity, and swap the underlying model later without changing your application’s wiring. The mental model: a deployment is your own named “phone line” to a specific model.

Content filters and data residency

Two governance features deserve a special mention because they appear on the exam and matter in production:

Prompt engineering basics

A model’s output is only as good as the prompt you give it. Prompt engineering is the practical skill of writing inputs that reliably produce the output you want — no model retraining required. A few high-leverage techniques, all examinable:

The RAG pattern: grounding a model on your data

We now arrive at the most important architectural pattern in applied generative AI. Recall the two hard limits of an LLM: it knows nothing past its training cut-off, and it knows nothing about your private documents. So how do you build an assistant that answers questions about your company handbook, your product catalogue, or this month’s policy — none of which the model was trained on?

You do not retrain the model (slow, expensive, and overkill). Instead you use Retrieval-Augmented Generation (RAG): at the moment of the question, you fetch the relevant facts from your own data and hand them to the model as part of the prompt, so it generates an answer grounded in those facts.

The pattern has three steps, and the names give it away:

  1. Retrieve. Take the user’s question, search your knowledge base for the most relevant chunks of text. This is typically a vector search over embeddings (often combined with keyword search — a hybrid search) using Azure AI Search as the retriever.
  2. Augment. Insert those retrieved chunks into the prompt alongside the question — “Here are the relevant passages: […]. Using only these, answer: […].” You have augmented the prompt with grounding data.
  3. Generate. The LLM produces an answer based on the supplied facts, ideally citing which passage each statement came from.

Generative AI & RAG on Azure OpenAI

The diagram traces a single question through the full pipeline — embedding the query, retrieving the nearest chunks from an Azure AI Search index, augmenting the prompt with those grounded passages, and generating a cited answer with an Azure OpenAI deployment.

Why RAG is such a big deal, in one paragraph: it gives you a model that answers from current, private, authoritative data; it slashes hallucination because the model is told to use the supplied facts; it lets you cite sources so users can verify; and it needs no retraining — update the documents and the answers update too. The typical Azure shape is Azure AI Search (the retriever, holding your indexed and vectorised content) plus an Azure OpenAI chat deployment (the generator), with an embedding deployment turning both your documents and the incoming question into vectors. Azure even offers a built-in “on your data” capability that wires Azure OpenAI to an Azure AI Search index for you, so a basic RAG chatbot can be stood up without writing the retrieval plumbing by hand.

A crisp contrast to keep the alternatives straight:

Approach What it does When to use it
Prompt engineering Steer the base model with better instructions/examples First resort; cheap; no extra infrastructure
RAG Inject your retrieved facts into the prompt at query time The model must answer from private/current data — the common enterprise case
Fine-tuning Further-train the model on your examples to change its style/behaviour You need a consistent format/tone/skill, not new facts; heavier and costlier

The exam-ready distinction: RAG adds knowledge; fine-tuning adjusts behaviour. If the problem is “the model doesn’t know our facts,” reach for RAG, not fine-tuning.

Copilots and agents

Two words you will hear constantly, defined simply:

For AI-900 you need the concepts: a copilot assists inside an app; an agent pursues a goal by taking actions. Both are applications built on top of models like those in Azure OpenAI — they are not models themselves.

Responsible generative AI

Generative AI inherits all six Microsoft Responsible AI principles you met earlier — fairness, reliability & safety, privacy & security, inclusiveness, transparency, accountability — but it adds risks specific to generated content. Three matter most:

The senior-architect summary: never ship a generative feature without grounding, content filtering, a human-in-the-loop for high-stakes output, and clear disclosure that responses are AI-generated and may be imperfect. Transparency and accountability are not optional extras; they are the price of using this technology responsibly.

Hands-on lab

The Azure OpenAI Service requires an approved subscription, so this lab is written in two tiers. Everyone can do Part A (it costs nothing and proves the core concepts); do Part B only if your subscription has Azure OpenAI access.

Part A — Tokens, prompts and temperature (no special access, free)

  1. Open the Azure OpenAI Studio tokenizer page, or any OpenAI-compatible tokenizer, and paste a sentence such as “Generative AI predicts the next token.” Observe how it splits into tokens and note the token count. Try a long, rare word and watch it split into several tokens. Validation: you can state roughly how many tokens your sentence uses and why a wordier prompt costs more.
  2. In a chat playground you do have (for example the Azure AI Foundry chat playground if available, or the conceptual exercise on paper), write a system message: “You are a concise Azure tutor; answer in British English; if unsure, say so.” Then ask the same factual question twice — once at temperature 0 and once at temperature 0.9. Validation: the low-temperature answer is steady and repeatable; the high-temperature answer varies in wording and length. You have now seen what temperature does.
  3. Rewrite a vague prompt (“tell me about storage”) into a specific, grounded one (“Using only this paragraph: ‘…’, list in three bullets when to choose Azure Blob Storage”). Validation: the grounded version answers from your text rather than from the model’s memory — a hand-built taste of RAG.

Part B — A grounded chatbot with “on your data” (requires Azure OpenAI access)

  1. In the Azure portal, create an Azure OpenAI resource in a region close to you (this sets your data residency). az cognitiveservices account create --name myopenai --resource-group rg-genai-lab --kind OpenAI --sku S0 --location eastus.
  2. In Azure AI Foundry / Azure OpenAI Studio, create a deployment of a chat model and a deployment of an embedding model.
  3. Create an Azure AI Search service and upload a handful of your own documents (a few PDFs or text files).
  4. In the chat playground, use “Add your data” to point the chat deployment at your Azure AI Search index — this wires up RAG for you.
  5. Ask a question whose answer is only in your uploaded documents. Validation: the assistant answers correctly and cites the source document — proof of retrieval-augmented generation. Now ask something not in your data and watch a well-grounded setup say it cannot find the answer rather than hallucinating.

Cleanup. Delete everything so nothing keeps billing: az group delete --name rg-genai-lab --yes --no-wait (do the same for the resource group holding the Search service if separate).

Cost note. Part A is free. In Part B, Azure OpenAI bills per 1,000 tokens (input and output separately) and Azure AI Search bills per hour for the service tier. A short experiment with a small Search tier typically costs only a few rupees (well under ₹100) — but the Search service charges while it exists, whether or not you query it, so the single biggest cost mistake is leaving it running. Delete the resource group the moment you are done.

Common mistakes & troubleshooting

Symptom Likely cause Fix
Request rejected for being too long Prompt + expected completion exceeds the context window Shorten the prompt, retrieve fewer/smaller chunks, or use a model with a larger window
Model gives confident but wrong answers Hallucination — answering from memory, not facts Ground it with RAG, lower temperature, ask it to cite sources and to admit uncertainty
Bill is higher than expected Billed per token in and out; long prompts/answers add up Trim prompts, cap max output tokens, retrieve only the most relevant chunks
Output is inconsistent run to run Temperature/top-p too high for a factual task Lower temperature (towards 0) for extraction/classification/code
Chatbot ignores your documents Retrieval misconfigured — wrong index, no embeddings, or data not added Verify the Azure AI Search index, that content is vectorised, and that the deployment is pointed at it
“Model not found” when calling the API Calling the model name instead of your deployment name Call the deployment name and endpoint, not the raw model id
Harmful or off-policy text slips through Content filtering not configured, or a prompt-injection attack Use Azure AI Content Safety/content filters and prompt-shield protections; keep a human in the loop

Best practices

Security notes

Interview & exam questions

  1. What is a large language model, in one sentence? A model trained to predict the next token given preceding text, which at scale can generate, summarise, translate, and reason over language.
  2. What is a token, and why does it matter? A chunk of text (~4 characters / ~¾ word) — the unit the model reads and the unit you are billed in; prompt + completion tokens must also fit the context window.
  3. Explain temperature. A setting (≈0–1) controlling randomness: near 0 is focused/deterministic/repeatable; higher is more creative/varied. Use low for factual tasks, high for creative ones.
  4. What is an embedding, and what is it used for? A numeric vector representing the meaning of text so that similar meanings have similar vectors; it powers semantic/vector search and the retrieve step of RAG.
  5. What is the RAG pattern and why use it? Retrieve relevant facts from your data, augment the prompt with them, generate a grounded answer. It lets a model answer from private/current data, reduces hallucination, and enables citations — without retraining.
  6. RAG vs fine-tuning — when each? RAG adds knowledge (use it when the model lacks your facts); fine-tuning adjusts behaviour/style (use it for consistent format/tone). Most “it doesn’t know our data” problems are RAG problems.
  7. What is a deployment in Azure OpenAI? A named, callable instance of a specific model version in your resource; your app calls the deployment name + endpoint, not the raw model.
  8. How does Azure OpenAI differ from public ChatGPT? Enterprise identity (Entra ID), RBAC, private networking, data residency, built-in content filtering, compliance/SLA — and your data is not used to train the models.
  9. What is hallucination and how do you reduce it? Confident but false generated content; mitigate with grounding/RAG, citations, lower temperature, and instructing the model to admit uncertainty.
  10. Copilot vs agent? A copilot assists a human inside an app; an agent plans and takes actions (calls tools, chains steps) to pursue a goal more autonomously.
  11. Name the harm categories content filters screen for. Hate, sexual, violence, and self-harm — via Azure AI Content Safety, applied to both prompt and completion at configurable severities.
  12. What is grounding? Tying a model’s output to verifiable source data rather than its memory — the core technique behind trustworthy generative answers.

Quick check

  1. In your own words, what single task is an LLM fundamentally trained to do, and how does scaling that turn into useful abilities?
  2. Why does the same prompt cost more if it is wordier, and what two totals must fit inside the context window?
  3. A teammate wants a chatbot to answer questions about this quarter’s internal policy PDFs. Should they fine-tune a model or use RAG — and why?
  4. You are extracting structured fields from invoices and the output keeps varying between runs. Which setting do you change, and in which direction?
  5. Name the three risks specific to generative AI covered in this lesson and the single most effective mitigation for the first one.

Answers

  1. An LLM is trained to predict the next token given the preceding text. Scaled across vast data and parameters, “predict the next token really well” generalises into summarising, translating, reasoning, and conversing.
  2. You are billed per token, so more words = more tokens = more cost. The prompt tokens and the completion tokens together must fit inside the context window.
  3. Use RAG. The information is private and current (this quarter’s PDFs) — the model needs new knowledge, which RAG supplies by retrieving and grounding. Fine-tuning changes behaviour, not facts, and is heavier and costlier here.
  4. Lower the temperature (towards 0). Extraction is a factual task, so you want focused, repeatable output rather than creative variation.
  5. Hallucination, grounding (lack of), and content-safety risks. The most effective mitigation for hallucination is grounding the model on verifiable source data — i.e. use RAG (plus citations, low temperature, and admitting uncertainty).

Exercise

Design — on paper — a grounded customer-support assistant for a fictional company, then justify each choice using this lesson:

  1. State the data the assistant must answer from (e.g. product manuals, an FAQ, return policy) and explain why a plain LLM cannot answer these out of the box.
  2. Sketch the RAG pipeline: where embeddings are created, what plays the retriever (name the Azure service), and what plays the generator. Label the three steps retrieve → augment → generate.
  3. Choose a temperature for this assistant and justify it in one sentence.
  4. List three responsible-AI safeguards you will include (e.g. grounding with citations, content filtering, human escalation for refunds) and the risk each addresses.
  5. Name one cost lever you will pull to keep the token bill down.

If you can complete all five with a one-line justification each, you can explain applied generative AI on Azure end to end — exactly the level AI-900 expects.

Certification mapping

Glossary

Next steps

You can now explain generative AI from first principles — tokens and transformers through to a grounded, responsible RAG application on Azure — and answer the classic interview and exam questions on LLMs, embeddings, Azure OpenAI, and RAG.

Related reading to go deeper:

AzureGenerative AIAzure OpenAIRAGPrompt EngineeringAI-900
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading