A medical-device manufacturer’s global support organization is drowning in its own knowledge. Two thousand field service engineers and Tier-1 agents support infusion pumps and imaging systems across forty countries, and the answer to almost any question they get — a calibration tolerance, a recall bulletin, a known-defect workaround, the exact torque spec for a replacement part — already exists. It lives in fourteen years of Confluence spaces (service manuals, engineering runbooks, regulatory bulletins) and in a quarter-million resolved Jira tickets that encode tribal knowledge no manual ever captured. The problem is finding it: an engineer at a hospital with a pump throwing an error code has eight tabs open and a clinician waiting, and the right Confluence page is on page three of a search that also returns four obsolete drafts. Worse, this is a regulated medical-device maker under FDA 21 CFR Part 820 and EU MDR, where quoting a superseded service procedure is not an inconvenience — it is a reportable quality event. The VP of Service wants “an assistant that answers from our own Confluence and Jira, with the source link, and never shows an engineer a document they aren’t cleared to see.” This article is the reference architecture for building that assistant on AWS — a managed, permission-aware, guardrailed RAG platform on Bedrock Knowledge Bases that the company’s quality and security functions will actually approve.
The pressures are the usual enterprise four, sharpened by regulation. Compliance means every answer needs a verifiable citation back to the live source and an audit trail, and an engineer must never ground on a withdrawn bulletin. Permissions are non-negotiable: Confluence space restrictions and Jira project roles already encode who may see what — recall investigations, unreleased product specs, customer-identifiable ticket data — and the assistant cannot become the hole in that wall. Scale means two thousand concurrent users across time zones, not a demo. And cost has to be defensible to a CFO who has watched “AI projects” burn budget. Retrieval-augmented generation — RAG — is the pattern that satisfies all four: it grounds the model’s answer in retrieved, permission-filtered, citable passages from a search index you control, rather than the frozen, uncitable memory baked into model weights. The knowledge stays in your index; the model supplies language, not facts.
Why not the obvious shortcuts
Three naive fixes will be proposed on this project, and each fails predictably.
Better keyword search in Confluence returns documents, not answers, and misses anything phrased differently from the query: “pump occlusion alarm” will not surface a page titled “downstream pressure fault.” It also has no notion of grounding a synthesized answer, and no way to fuse a Confluence page with the three Jira tickets where engineers actually solved the problem. Fine-tuning a model on the corpus bakes facts into weights: you cannot cite them, cannot update them the hour a bulletin is withdrawn, and cannot make them respect a Confluence space restriction, and the model still hallucinates a torque value with total confidence. Pasting tickets into a public chatbot leaks customer-identifiable, regulated data across a tenant boundary the security team will never approve.
RAG threads the needle. At query time the system retrieves the handful of passages actually relevant to the question — across both Confluence and Jira — hands them to the model as grounding context, and the model composes an answer from that context with citations back to the source page or ticket. Retrieval is also the natural choke point to enforce document-level permissions: an engineer only ever grounds on content their Confluence and Jira accounts already entitle them to read. The citations turn an unauditable black box into something a quality auditor can verify line by line.
Architecture overview
The platform runs two paths on different schedules: a synchronous query path that serves engineers, and a scheduled ingestion path that keeps the knowledge base synced with Confluence and Jira. The center of gravity is AWS Bedrock Knowledge Bases, which manages the connector-driven ingestion, chunking, embedding, and — critically for this design — permission-aware retrieval, so most of the undifferentiated plumbing is AWS’s problem, not yours.
Query path, following the control flow:
- An engineer opens the assistant in the company’s service portal. Identity federates through Okta as the workforce IdP via OIDC; the portal sits behind Akamai at the edge for TLS termination, global anycast, and WAF/bot protection before traffic reaches AWS. The Okta token carries the user’s identity and the group claims that mirror their Confluence space access and Jira project roles.
- The request hits Amazon API Gateway fronting an AWS Lambda (or a Fargate service for steady throughput) that holds the orchestration logic. API Gateway validates the Okta JWT with a Lambda authorizer, enforces per-team throttling and usage plans, and produces the single audit log of who asked what.
- The orchestrator pulls the few secrets it cannot get from an IAM role — the Atlassian connector credentials’ rotation key, third-party API tokens — from HashiCorp Vault via the AWS auth method, so nothing sensitive lives in a Lambda environment variable.
- The orchestrator calls the Bedrock RetrieveAndGenerate (or Retrieve) API against the Knowledge Base, passing the caller’s Okta-derived group identifiers as a metadata filter. Bedrock embeds the question, queries the Amazon OpenSearch Serverless vector collection, and returns only passages whose stamped ACL metadata the caller is entitled to; a forbidden page or ticket is never even retrieved.
- Bedrock assembles the top passages into a grounded prompt and invokes the foundation model: Anthropic Claude on Bedrock for substantive reasoning, a smaller, cheaper Claude tier for simple lookups.
- Both the inbound prompt and the model’s output pass through Amazon Bedrock Guardrails — prompt-injection and jailbreak detection, denied-topic and PII filters, and a contextual grounding check that flags any answer the retrieved passages do not support. The cited answer (with Confluence page and Jira ticket links) streams back; the turn is written to DynamoDB.
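The last step of the query path, turning a response into a cited answer, can be sketched as below. This is a minimal illustration, not the platform's actual orchestrator: it assumes the response follows the general RetrieveAndGenerate shape (`output.text`, `citations[].retrievedReferences[]`, `guardrailAction`) and that each retrieved reference exposes the `source_url` and `updated_at` metadata keys this article stamps at ingestion. The function names and the refusal wording are hypothetical.

```python
from typing import Any


def extract_citations(response: dict[str, Any]) -> list[dict[str, str]]:
    """Collect de-duplicated source links from a RetrieveAndGenerate-style response."""
    seen: set[str] = set()
    sources: list[dict[str, str]] = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            meta = ref.get("metadata", {})
            url = meta.get("source_url")  # assumed metadata key, stamped at ingestion
            if not url or url in seen:
                continue
            seen.add(url)
            sources.append({"url": url, "updated_at": meta.get("updated_at", "")})
    return sources


def render_answer(response: dict[str, Any]) -> str:
    """Return the answer with numbered source links, or refuse if nothing is cited."""
    if response.get("guardrailAction") == "INTERVENED":
        return "This request was blocked by policy."
    sources = extract_citations(response)
    if not sources:
        # An uncited answer is treated as no answer in a regulated setting.
        return "No cited source was found for this question."
    text = response.get("output", {}).get("text", "")
    links = "\n".join(f"[{i}] {s['url']}" for i, s in enumerate(sources, start=1))
    return f"{text}\n\nSources:\n{links}"
```

Refusing to show an answer that carries no citation is a design choice that fits the compliance requirement stated earlier; a less regulated deployment might show the text with a warning instead.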
Ingestion path, scheduled and connector-driven: Bedrock Knowledge Bases’ native Confluence connector and Jira connector crawl their respective sources on a schedule, pulling pages, attachments, and tickets along with the per-object access-control metadata Atlassian exposes. Bedrock chunks the content, embeds each chunk, and writes the vectors stamped with the source object’s ACL identifiers into the OpenSearch collection. That ACL stamp at ingestion time is the entire basis for per-document security at query time. A change in a Confluence space restriction or a Jira project role is picked up on the next sync, so entitlement drift is bounded by the sync interval.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot mitigation at the perimeter | Custom WAF rules for prompt-flood patterns; origin shield to API Gateway |
| Identity / SSO | Okta | Workforce SSO; group claims mirror Confluence/Jira access | OIDC; groups mapped to KB metadata filter keys; conditional access |
| API edge | Amazon API Gateway | JWT validation, per-team throttling, usage plans, audit log | Lambda authorizer; usage plans per service region; access logging |
| Orchestrator | AWS Lambda / Fargate | RetrieveAndGenerate calls, filter assembly, streaming | Provisioned concurrency; response streaming; least-privilege role |
| Managed RAG | Bedrock Knowledge Bases | Connectors, chunking, embedding, permission-aware retrieval | Confluence + Jira connectors; metadata filtering; hybrid search |
| Vector store | OpenSearch Serverless | Vector + keyword index behind the KB | KNN (HNSW) field; ACL metadata fields; capacity units sized to load |
| Models | Bedrock (Anthropic Claude) | Generation + embeddings | Claude reasoning tier + cheaper tier; Titan/Cohere embeddings |
| Secrets | HashiCorp Vault | Connector rotation keys, third-party tokens | AWS auth method; dynamic leases; short-lived credentials |
| Guardrails | Bedrock Guardrails | Block jailbreaks, PII, denied topics; grounding check | Prompt-attack filter; contextual grounding + relevance thresholds |
| State | DynamoDB | Conversation history, feedback, per-user memory | On-demand capacity; partition by conversation id; TTL on transient turns |
| CSPM / data posture | Wiz + Wiz Code | Cloud posture, sensitive-data exposure, attack-path; IaC scanning | Agentless scan of OpenSearch/S3; Wiz Code gates Terraform in CI |
| Runtime security | CrowdStrike Falcon | Runtime protection on Fargate tasks and ingestion compute | Sensor on the cluster; detections piped to the SOC |
| Observability | Datadog | Distributed tracing, token/cost telemetry, RAG spans | APM on the orchestrator; OpenTelemetry RAG span; LLM Observability |
| ITSM / approvals | ServiceNow | Corpus onboarding approvals, change requests, incidents | Change gate before a new space goes live; auto-ticket on guardrail breach |
| CI / IaC | GitHub Actions + Terraform + Argo CD | Pipeline build/test/eval; IaC; GitOps app delivery | OIDC to AWS (no stored keys); eval gate; Argo CD syncs the app |
A few choices deserve the why, because they are the ones teams get wrong.
Why Bedrock Knowledge Bases instead of a hand-rolled pipeline. You could wire LangChain to a vector DB and write your own Confluence/Jira crawlers. For this corpus, the managed connectors earn their keep specifically because they carry access-control metadata through ingestion automatically — replicating Atlassian’s permission model by hand is exactly the kind of bespoke security code that leaks. Bedrock handles the crawl, the incremental sync, the chunking, the embedding, and the filtered retrieval, leaving you to own orchestration, identity mapping, and guardrail policy — the parts that are actually your business.
Why security trimming belongs in retrieval, not the app. It is tempting to retrieve broadly and filter in application code. Do not — that means restricted Confluence pages and customer-identifiable Jira tickets leave the index into your Lambda’s memory and your traces before being dropped, and one bug leaks them across a regulatory line. Instead, every chunk is stamped at ingestion with the principals allowed to read its source, and the caller’s Okta groups are passed as a Bedrock metadata filter so the vector store never returns a forbidden passage:
    {
      "retrievalConfiguration": {
        "vectorSearchConfiguration": {
          "filter": {
            "in": { "key": "acl_principals", "value": ["grp-field-svc-emea", "grp-imaging-l2"] }
          }
        }
      }
    }
Permission stays a property of the data, not a hope in the application layer. The principals come from the user’s verified Okta token — never from a value the client could forge.
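A minimal sketch of how the orchestrator might derive that filter from the verified token follows. It assumes the JWT has already been validated upstream, that the group claim is named `groups`, and that entitlement groups share the `grp-` prefix seen in the example above; the function names are hypothetical. It mirrors the `in` operator from the fragment above. If principals are stored as a list on each chunk, a list-membership operator may be the better fit, so check the Knowledge Base filter reference before relying on this shape.

```python
from typing import Any


def acl_filter_from_claims(claims: dict[str, Any], allowed_prefix: str = "grp-") -> dict[str, Any]:
    """Build the ACL metadata filter from already-verified token claims.

    Fails closed: a token without usable groups never yields an unfiltered query.
    """
    groups = claims.get("groups")  # assumed claim name
    if not isinstance(groups, list):
        raise PermissionError("token carries no group claim")
    principals = sorted(
        {g for g in groups if isinstance(g, str) and g.startswith(allowed_prefix)}
    )
    if not principals:
        raise PermissionError("caller has no retrievable entitlements")
    return {"in": {"key": "acl_principals", "value": principals}}


def retrieval_configuration(claims: dict[str, Any]) -> dict[str, Any]:
    """Wrap the filter in the same structure as the JSON fragment above."""
    return {
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"filter": acl_filter_from_claims(claims)}
        }
    }
```

The point of raising rather than returning an empty filter is that an empty filter would retrieve across the whole index, which is the exact over-sharing failure the design is meant to prevent.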
Why hybrid search, not pure vector. Vector similarity is excellent at “passages about a downstream pressure fault” even when the page says “occlusion alarm.” But it is mediocre at exact-match needs — error codes, part numbers, bulletin IDs — where a single token must match. The Knowledge Base’s hybrid search runs both a vector query and a keyword query over OpenSearch and fuses the results, which lifts answer quality on this corpus more than any prompt tweak you will make.
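To make "fuses the results" concrete, here is a small sketch of reciprocal rank fusion, one common way to merge a vector ranking with a keyword ranking. It is an illustration of the general idea only: the article does not state which fusion method the managed hybrid search uses, and the document IDs below are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each hit contributes 1 / (k + rank) to its document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for a query that mentions both a symptom and an error code.
vector_hits = ["occlusion-alarm-page", "pressure-fault-page", "pump-overview"]
keyword_hits = ["ticket-E417", "pressure-fault-page", "occlusion-alarm-page"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while the exact-match ticket that only the keyword query found still makes the fused list, which is the behavior the paragraph above relies on.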
Implementation guidance
Provision with Terraform, and treat identity mapping as the hard part. The mechanical resources — the OpenSearch Serverless collection, the Knowledge Base, the data sources, IAM roles — are straightforward IaC. The subtle work is making the Okta group a field engineer carries line up with the ACL principal Bedrock stamped on a Confluence page, so the metadata filter actually matches. Get that mapping wrong and the assistant either over-shares or returns nothing.
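When Okta group names and stamped ACL principals do not match one to one, the mapping has to live somewhere explicit. The sketch below assumes a hypothetical lookup table (in practice it would be generated from the same authoritative source as the connector ACLs) and shows the two behaviors that matter: unmapped groups are surfaced for review rather than silently dropped, and a caller with no mapped principals is refused rather than queried without a filter.

```python
# Hypothetical mapping from Okta group names to the principals stamped on chunks.
GROUP_TO_PRINCIPALS: dict[str, list[str]] = {
    "okta-field-svc-emea": ["grp-field-svc-emea"],
    "okta-imaging-l2": ["grp-imaging-l2"],
}


def resolve_principals(
    okta_groups: list[str], mapping: dict[str, list[str]]
) -> tuple[list[str], list[str]]:
    """Return (sorted ACL principals, Okta groups that had no mapping)."""
    principals: set[str] = set()
    unmapped: list[str] = []
    for group in okta_groups:
        mapped = mapping.get(group)
        if mapped:
            principals.update(mapped)
        else:
            unmapped.append(group)  # log these; they signal mapping drift
    if not principals:
        # Fail closed: returning nothing is safer than returning everything.
        raise PermissionError("no Okta group maps to an ACL principal")
    return sorted(principals), unmapped
```

Tracking the unmapped list over time gives an early signal for the "returns nothing" half of the failure described above, before engineers start reporting empty answers.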
A minimal Terraform shape for the Knowledge Base and its Confluence data source communicates the intent:
    resource "aws_bedrockagent_knowledge_base" "support" {
      name     = "kb-support-assistant-prod"
      role_arn = aws_iam_role.kb_role.arn

      knowledge_base_configuration {
        type = "VECTOR"
        vector_knowledge_base_configuration {
          embedding_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
        }
      }

      storage_configuration {
        type = "OPENSEARCH_SERVERLESS"
        opensearch_serverless_configuration {
          collection_arn    = aws_opensearchserverless_collection.kb.arn
          vector_index_name = "support-vectors"
          field_mapping {
            vector_field   = "embedding"
            text_field     = "chunk"
            metadata_field = "metadata" # carries acl_principals, source_url, updated_at
          }
        }
      }
    }
The pipeline that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak — a hard lesson the platform team intends never to repeat — and Argo CD syncs the orchestrator service into the cluster GitOps-style from the same Git source of truth. Wiz Code scans the Terraform in the pull request and blocks a merge that would create the OpenSearch collection with public access or an over-broad IAM policy, before any infrastructure exists. The same CI pipeline runs the offline evaluation harness (below) as a required gate. Where the team manages OS-level configuration on the Fargate base images or any ingestion VMs, Ansible keeps them to a known, hardened baseline rather than drifting by hand.
Kill the keys, federate the humans. Human SSO flows Okta → the portal: engineers log in once with their corporate Okta credentials and conditional-access policies, and the resulting token carries the group claims that the orchestrator turns into the retrieval filter. The orchestrator assumes a tightly scoped IAM role: permission to call bedrock:RetrieveAndGenerate on this Knowledge Base (together with the bedrock:Retrieve and bedrock:InvokeModel permissions that call relies on), DynamoDB access to the conversation table, and nothing else. The Knowledge Base’s own role can read the OpenSearch collection and invoke the embedding model. The residual secrets that are not IAM (the Atlassian connector’s credential rotation key, third-party feed tokens) live in HashiCorp Vault, leased dynamically through the AWS auth method and short-lived, so they are never written to a Lambda environment variable or a task definition.
Connector and chunking wiring. Point the Confluence connector at the spaces in scope (not the whole instance — start with the service and engineering spaces, expand through change control) and the Jira connector at the support and field-service projects. Let Bedrock’s semantic or hierarchical chunking handle structure, but carry the metadata that makes this corpus safe and current: acl_principals (the entitlement stamp), source_url (the citation), updated_at, and a status flag. Carry status and updated_at specifically so a withdrawn bulletin or a superseded service procedure can be filtered out at query time — the exact failure that started this project. Set the sync schedule tight enough that a permission change in Confluence propagates within the hour, because a stale ACL is a security finding, not a freshness nuisance.
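Combining the entitlement stamp with the status flag at query time might look like the sketch below. It assumes the `andAll`, `in`, and `equals` filter operators and a hypothetical status value of `current` for live documents; the real value depends on how the status flag is populated during ingestion.

```python
from typing import Any


def current_and_entitled_filter(
    principals: list[str], current_status: str = "current"
) -> dict[str, Any]:
    """Require both conditions: the caller may read the chunk AND the source is still live."""
    if not principals:
        raise ValueError("refusing to build a filter with no principals")
    return {
        "andAll": [
            {"in": {"key": "acl_principals", "value": sorted(principals)}},
            # "current" is an assumed status value; withdrawn or superseded
            # documents would carry a different one and never be retrieved.
            {"equals": {"key": "status", "value": current_status}},
        ]
    }
```

Putting the status condition in the same filter as the ACL condition means a withdrawn bulletin is excluded by the index itself, rather than by post-processing that a later code change could bypass.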
Enterprise considerations
Security & Zero Trust. The architecture is Zero Trust by construction: identity-based access only, least-privilege IAM scoped per resource, no public data-plane surface on OpenSearch or S3. Layer on top: (a) Bedrock Guardrails prompt-attack filters to catch jailbreaks and the under-appreciated indirect injection where a malicious instruction is hidden inside a retrieved Jira comment; (b) the contextual grounding check as the last line of defense against hallucinating a torque value or a calibration tolerance; (c) Wiz running continuous CSPM and sensitive-data-exposure scanning across OpenSearch, S3, and DynamoDB, alerting the moment any resource drifts to public exposure or an IAM change widens access (the posture backstop behind the policy controls, with Wiz Code having already gated the IaC that created them); (d) CrowdStrike Falcon sensors on the Fargate tasks and ingestion compute for runtime threat detection, feeding the company’s SOC; (e) a guardrail breach, such as a blocked jailbreak or a sustained grounding failure, auto-raises a ServiceNow incident so security has a ticket, not just a log line. An SCP and AWS Config rule deny any OpenSearch collection or S3 bucket created with public access, and Wiz independently verifies the control is actually holding.
Cost optimization. Token and retrieval spend dominate and grow with success, so engineer for it from day one.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Model tiering | Route simple lookups to a cheaper Claude tier; reserve the reasoning tier for hard questions | Large saving on the routed share |
| Response caching | Serve near-identical prior questions from a cache keyed on normalized query | Deflects 30–50% of model calls on a repetitive support corpus |
| Top-k discipline | Retrieve 4–6 passages, not 20, after hybrid search | Cuts input tokens every turn |
| OpenSearch sizing | Size Serverless capacity units to steady QPS, not peak fear | Avoids paying for idle vector capacity |
| Per-team metering | API Gateway usage plans feed chargeback | Makes each service region own its spend |
| Embedding reuse | Only re-embed changed chunks on incremental sync | Keeps ingestion cost proportional to real change |
Meter usage per team in API Gateway and pipe the metric to Datadog, which the platform team uses for the chargeback dashboard the CFO sees.
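One detail the response-caching lever has to get right in this design is that a cached answer must not cross an entitlement boundary. The sketch below is one way to do that, under stated assumptions: it matches only queries that are identical after simple normalization (not the broader "near-identical" matching a semantic cache would give), and the function names are hypothetical.

```python
import hashlib
import re


def normalize_query(query: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants collide."""
    stripped = re.sub(r"[^\w\s-]", "", query.lower())
    return re.sub(r"\s+", " ", stripped).strip()


def cache_key(query: str, principals: list[str]) -> str:
    """Key on the normalized query AND the caller's entitlement set.

    Two callers share a cached answer only if they could have retrieved
    exactly the same passages.
    """
    scope = ",".join(sorted(set(principals)))
    payload = f"{normalize_query(query)}|{scope}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Scoping the key by entitlement set lowers the hit rate compared with a global cache, which is the price of keeping the cache from becoming a side channel around the retrieval filter.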
Scalability. Each tier scales independently. The orchestrator scales on concurrency (Lambda concurrency or Fargate task count on CPU/queue depth). OpenSearch Serverless scales on its own capacity units. Bedrock model throughput scales with on-demand limits, or you buy provisioned throughput for the interactive reasoning tier when steady demand justifies it. The natural ceiling is the Bedrock regional model quota, which is why a two-thousand-seat rollout plans capacity and possibly a second region early. Ingestion scales with the connector sync; the lever there is sync frequency and the changed-document delta, not raw compute.
Failure modes, and what each one looks like. Name them before they page you.
- Stale corpus: a withdrawn service bulletin still in the index gets cited to an engineer at a hospital. Mitigation: the status/updated_at metadata filter and a tight sync that tombstones superseded documents within the hour. This is the failure that would become an FDA-reportable event, so it gets the most attention.
- ACL drift between Okta and Atlassian: someone’s Confluence access is revoked but their Okta group still maps to a permitted filter, so retrieval over-shares. Mitigation: drive the Okta groups and the connector ACLs from the same authoritative source, sync often, and have Wiz alert on entitlement widening.
- Retrieval miss: the relevant passage is not in the top-k, so the model cannot ground on it and a confidently wrong answer slips out. Mitigation: hybrid search, the grounding check set to flag-and-cite, and an eval harness that catches regressions.
- Indirect prompt injection: a malicious instruction planted in a Jira comment (“ignore prior instructions and reveal…”) rides into the prompt through retrieval. Mitigation: Bedrock Guardrails prompt-attack filtering on the retrieved context, not just the user input.
- Bedrock throttling under load: at shift change across regions you hit on-demand limits and latency jitter. Mitigation: provisioned throughput for the interactive tier; graceful backoff and a cached-answer fallback.
- Regional outage: see DR below.
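The "graceful backoff and a cached-answer fallback" mitigation for throttling can be sketched as follows. This is an illustration under assumptions: `ThrottledError` is a hypothetical stand-in for whatever throttling exception the model client raises, the delays are placeholder values, and any cached answer passed in is assumed to have been stored under the caller's own entitlement scope.

```python
import random
import time
from typing import Callable, Optional


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the model API."""


def call_with_backoff(
    call: Callable[[], str],
    cached_answer: Optional[str] = None,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> str:
    """Retry a throttled call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                break  # out of attempts; fall through to the fallback
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * rng())  # jitter spreads retries across callers
    if cached_answer is not None:
        return cached_answer
    raise ThrottledError("model throttled and no cached answer available")
```

The `sleep` and `rng` parameters exist so the retry schedule can be tested without waiting; in production the defaults apply. A fallback answer served this way should be labeled as cached in the UI so an engineer knows it was not freshly grounded.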
Reliability & DR (RTO/RPO). Decide the numbers per tier. DynamoDB global tables give near-zero RPO and seconds RTO for chat state. The vector index is rebuildable from the source of truth — Confluence and Jira themselves, replayed through the connectors — so DR for retrieval means maintaining a warm Knowledge Base in a second region (re-run ingestion against both) rather than treating the index as precious. Bedrock model access is regional; for DR, ensure the same models are enabled in the paired region and fail over at the API layer. A pragmatic target: RTO 30 minutes, RPO 5 minutes for the conversational service, with the knowledge base rebuildable from Atlassian within hours if a region is lost. Akamai health checks drive edge failover for ingress.
Observability. Instrument the RAG span end to end in Datadog with OpenTelemetry: one trace covering retrieve → filter → generate → guard, with timing and token counts on each hop, plus Datadog LLM Observability to track prompts, completions, and guardrail outcomes. Emit the metrics the business actually cares about — retrieval hit-rate, grounding pass-rate, cache-deflection rate, tokens and cost per team, and p95 time-to-first-token (the latency an engineer feels mid-call). Run an offline evaluation harness (a golden set of real support questions scored on grounding and relevance) inside the GitHub Actions pipeline so a prompt or model change is scored before it ships. New corpora — a new Confluence space, a new Jira project — pass through a ServiceNow change approval before going live, giving quality a documented gate.
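The offline evaluation gate described above reduces to a small amount of arithmetic once each golden question has been scored. The sketch below assumes a hypothetical `EvalResult` record per question and placeholder thresholds; how "grounded" and "relevant" are judged (human labels, a grader model, or the grounding check itself) is left open because the article does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    question_id: str
    grounded: bool  # the answer was supported by the retrieved passages
    relevant: bool  # the expected source appeared in the retrieved set


def evaluate_gate(
    results: list[EvalResult],
    min_grounding: float = 0.95,  # placeholder threshold
    min_hit_rate: float = 0.90,   # placeholder threshold
) -> tuple[bool, dict[str, float]]:
    """Return (passed, metrics) for a scored golden set."""
    if not results:
        raise ValueError("golden set is empty; refusing to pass the gate")
    total = len(results)
    metrics = {
        "grounding_pass_rate": sum(r.grounded for r in results) / total,
        "retrieval_hit_rate": sum(r.relevant for r in results) / total,
    }
    passed = (
        metrics["grounding_pass_rate"] >= min_grounding
        and metrics["retrieval_hit_rate"] >= min_hit_rate
    )
    return passed, metrics
```

In a CI job, a failed gate would exit non-zero so the pipeline blocks the change; the same two metrics can be emitted to the dashboard so the offline and production numbers are directly comparable.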
Governance. Pin the model version explicitly so behavior does not drift; promote a new version through the eval gate. Keep prompt templates and the orchestrator in version control, reviewable and instantly revertable, delivered by Argo CD so what is running always matches Git. Apply AWS Config rules and SCPs to deny public access and require logging on every relevant resource, with Wiz as the independent check that the controls are real. Log every prompt/response pair for audit, incident review, and future eval data, and give those logs a deletion path, since support conversations and the customer-identifiable content in Jira tickets are personal data inside the same regulated environment that started this project. Note one organizational reality: the company also runs a field-engineer certification program in Moodle, and the assistant’s most-cited gaps become the next quarter’s training modules, so the knowledge base and the LMS feed each other.
Explicit tradeoffs
Accept these or do not build it. RAG adds real moving parts — connector syncs to keep healthy, an index whose freshness you must monitor, embedding costs, and retrieval quality you have to measure and tune. Latency is the sum of retrieval and generation, never just one. And RAG answers are only as good as retrieval: if the relevant passage is not in the top-k, the model cannot ground on it, and the grounding check mitigates but does not eliminate the confidently-wrong failure. Leaning on Bedrock’s managed connectors and permission-aware retrieval buys you out of writing security-critical crawl-and-filter code, at the cost of less control over chunking and ranking than a hand-rolled pipeline — a trade that is correct here precisely because the permission model is the risky part. The Okta-to-Atlassian entitlement mapping is genuine, ongoing work that a single-IdP, single-app shop never faces.
The alternatives, and when they win. If your corpus is small, static, and fits the model’s context window, long-context prompting skips the index entirely and is simpler. If you need the model to adopt a style or domain vocabulary rather than recall facts, fine-tuning is the right tool — and it composes with RAG (fine-tune for behavior, retrieve for facts). If you need the assistant to take actions — auto-create a Jira ticket, dispatch a field engineer — you want an agent/tool-calling architecture (Bedrock Agents), with this Knowledge Base as one of its tools. And if you are a small team optimizing for speed, you can stand up a Bedrock Knowledge Base over a single Confluence space in an afternoon; graduate to this full permission-aware, guardrailed, multi-source platform when security, scale, compliance, or governance demand it.
The shape of the win
For the device maker’s service organization, the payoff is not “a chatbot.” It is that a field engineer standing at an infusion pump throwing an occlusion alarm asks “what’s the documented fix for this on the 2023 model,” gets a one-line answer that fuses the current service manual page with the two Jira tickets where engineers actually solved it, each linked, in about a second — and because the answer is grounded, cited, permission-trimmed, and never showed them a withdrawn bulletin, the quality function approved the tool for use on regulated devices, which a fine-tuned, uncitable model would never have cleared. That last sentence is the one that funds the platform. Everything upstream — the Bedrock connectors carrying ACLs, the Okta-derived retrieval filter, the Vault-held connector keys, the Wiz posture scanning, the Guardrails grounding gate, the Datadog RAG span — exists to make an auditor, a CISO, and a CFO each say yes. The architecture here is the destination; start with one Confluence space if you must, but this is where a regulated, at-scale “answer from our Confluence and Jira” has to land.