A health-insurance group running 40 AWS accounts under Organizations gets a pointed question from its CFO after the quarterly close: the cloud bill crossed $3.1M a month, it is growing 9% a quarter, and not one business unit believes the number is theirs. The claims-processing division swears the analytics team’s machine-learning training is what blew up the December bill; the analytics team says claims’ always-on RDS fleet is the real cost; and finance cannot adjudicate because the only artifact anyone has is a single consolidated invoice with no owner on any line. The mandate that lands on the platform team is specific and unglamorous: “Tell each business unit exactly what they spent, prove it, and bill it back to their cost center every month — automatically.” In a regulated payer where every dollar eventually maps to a medical-loss-ratio calculation a regulator audits, “roughly” is not acceptable. This article is the reference architecture for building that showback-and-chargeback platform on AWS — one a CFO will trust, a business-unit GM cannot dispute, and an auditor can trace end to end.
The pressures here are the ones that make FinOps hard rather than the ones that make it interesting in a slide. Accuracy means every dollar of a $3M bill must land on exactly one owner, including the dollars nobody tagged. Auditability means the chargeback a business unit is billed must be reproducible six months later from immutable source data. Timeliness means the report has to be ready a few days after month-end close, not three weeks later when the next month’s spend has already moved. And fairness means shared costs — the data-transfer backbone, the security tooling, the support plan — have to be split by a rule everyone agreed to in advance, not a number finance invented. The pattern that satisfies all four is tag-based cost allocation driven off the Cost and Usage Report (CUR) — the most granular, line-item-level billing data AWS produces — queried in place and turned into per-unit statements.
Why not the obvious shortcuts
Three cheaper approaches get proposed on every one of these projects, and each fails in a way worth naming before someone burns a sprint on it.
The AWS Cost Explorer console is excellent for a human poking at trends, but it is not a chargeback engine: its data is aggregated and rounded, its API is rate-limited and not built to feed a billing run, and you cannot reproduce an exact historical statement from it months later for an audit. A monthly spreadsheet built by hand from the invoice is what most companies actually do — and it is unauditable, breaks the moment one analyst is on leave, silently drops the untagged spend, and gives every business unit a standing reason to dispute the number. Splitting the bill evenly across business units is the laziest option and the most corrosive: it punishes the frugal team that runs three Lambdas and rewards the one training models on a fleet of GPUs, which destroys the incentive a chargeback model exists to create in the first place.
The CUR-driven approach threads the needle. The CUR is the system of record AWS itself bills from — every line item, every resource, every tag, hourly, delivered to your own S3 bucket. Querying it directly means your chargeback is derived from the same data AWS used to charge you, the allocation logic lives in version-controlled SQL anyone can review, and any statement is reproducible from immutable source files. Tags become the mechanism that maps each line item to an owner; the untagged residue becomes a problem you handle explicitly rather than one you hide.
Architecture overview
The platform is fundamentally a monthly batch pipeline with a self-service analytics layer on top, not a real-time system — and recognizing that shapes every decision. There are two flows that share storage but run on different clocks: a daily ingestion-and-validation flow that keeps the cost data current and the tag hygiene visible, and a month-end allocation-and-chargeback flow that produces the statements finance actually bills from.
The defining property of the topology is that the CUR is the single source of truth and is never mutated. AWS delivers the report to a locked-down S3 bucket; everything downstream reads from it and writes derived artifacts elsewhere. That immutability is what makes a chargeback defensible six months later in front of an auditor — you can always re-derive the exact statement from the exact bytes AWS delivered.
Ingestion and validation flow, following the data:
- AWS Organizations consolidates billing across all 40 accounts into the management account, and the Cost and Usage Report 2.0 is configured there to deliver hourly, resource-level line items — with cost-allocation tags activated — to a dedicated S3 bucket in Apache Parquet, partitioned by billing period. CUR overwrites the current month’s files daily as charges finalize, which is why downstream reads always re-query rather than cache.
- An AWS Glue crawler (or the CUR’s own Athena integration) keeps a Glue Data Catalog table in sync with the report’s evolving schema, so new services and new columns appear without a manual change.
- A daily EventBridge schedule triggers a Step Functions state machine that runs the validation pass: an Athena query computes the untagged-spend ratio per account and per service, and a second query checks that every active account carries the mandatory tag keys (`cost-center`, `business-unit`, `environment`, `application`). The results land in a small DynamoDB table that drives the tag-hygiene dashboard.
- Any account drifting past a threshold (say, more than 5% of spend untagged) auto-raises a ServiceNow ticket assigned to that account’s owning team, so tag debt has a name and a due date instead of accumulating silently.
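The validation pass described in the list above boils down to a ratio and a threshold. The following is a minimal sketch of that logic in plain Python, assuming line items have already been fetched as dictionaries; the field names (`account_id`, `cost`, `cost_center`) and the sample account IDs are hypothetical, and in the platform itself the ratio is computed in Athena.

```python
from collections import defaultdict
from decimal import Decimal

UNTAGGED_THRESHOLD = Decimal("0.05")  # the 5% example threshold from the text


def untagged_ratios(line_items):
    """Return {account_id: untagged share of that account's spend}."""
    total = defaultdict(Decimal)
    untagged = defaultdict(Decimal)
    for item in line_items:
        cost = Decimal(item["cost"])
        total[item["account_id"]] += cost
        if not item.get("cost_center"):  # missing, None, or empty string
            untagged[item["account_id"]] += cost
    return {
        acct: (untagged[acct] / spend if spend else Decimal(0))
        for acct, spend in total.items()
    }


def accounts_needing_ticket(ratios, threshold=UNTAGGED_THRESHOLD):
    """Accounts whose untagged share is strictly above the threshold."""
    return sorted(acct for acct, ratio in ratios.items() if ratio > threshold)


# Hypothetical sample: account 1111... is 10% untagged, account 2222... is 2%.
sample = [
    {"account_id": "111111111111", "cost": "900.00", "cost_center": "CC-100"},
    {"account_id": "111111111111", "cost": "100.00", "cost_center": ""},
    {"account_id": "222222222222", "cost": "980.00", "cost_center": "CC-200"},
    {"account_id": "222222222222", "cost": "20.00", "cost_center": None},
]
ratios = untagged_ratios(sample)
print(accounts_needing_ticket(ratios))
```

Using `Decimal` rather than floats keeps the ratio comparison exact at the threshold boundary, which matters when a ticket is raised on the result.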
Allocation and chargeback flow, fired by EventBridge a few days after AWS finalizes the prior month (the CUR’s bill/InvoiceId populating is the real “books are closed” signal):
- Step Functions runs the allocation Athena queries in sequence: directly-attributable cost is summed per `cost-center` from tags; untagged and shared cost is apportioned by the agreed rule (more on this below); and the results are written as a derived, partitioned “chargeback ledger” table in S3, itself immutable once written for that period.
- Amazon QuickSight reads the chargeback ledger through SPICE, serving each business unit a row-level-security-scoped dashboard so a GM sees their own spend, trend, and top resources, and nobody else’s.
- A Lambda function renders each business unit’s signed monthly statement (PDF + CSV), writes it to a per-unit S3 prefix, and pushes the chargeback record into ServiceNow as a financial transaction against that unit’s cost center — closing the loop from raw billing data to a line on the unit’s internal P&L.
- In parallel, AWS Cost Anomaly Detection runs continuously against the same spend, segmented by the same cost-allocation dimensions, and alerts the owning team plus FinOps the moment a unit’s daily run-rate spikes — so a runaway cost is caught mid-month, not discovered in next month’s statement.
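The statement-rendering step in the flow above calls for signed, idempotent artifacts, but the text does not say how they are signed. As one possible sketch, the code below renders a deterministic CSV, signs the exact bytes with HMAC-SHA256, and derives a stable object key per unit and period. The signing scheme, column names, key layout, and demo key are all assumptions for illustration; an asymmetric signature may suit an external auditor better.

```python
import csv
import hashlib
import hmac
import io


def render_statement_csv(period, cost_center, rows):
    """Render ledger rows as CSV bytes with a stable column and row order."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["period", "cost_center", "category", "amount_usd"])
    for category, amount in sorted(rows):  # sorting makes re-renders byte-identical
        writer.writerow([period, cost_center, category, amount])
    return buf.getvalue().encode("utf-8")


def sign_statement(body, key):
    """Hex HMAC-SHA256 over the exact statement bytes."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_statement(body, key, signature):
    """Constant-time comparison of a recomputed signature with the stored one."""
    return hmac.compare_digest(sign_statement(body, key), signature)


def statement_key(period, cost_center):
    """Deterministic object key, so a re-run targets the same location."""
    return f"statements/{cost_center}/{period}/statement.csv"


demo_key = b"demo-only-key"  # placeholder; the article keeps the real key in Vault
body = render_statement_csv(
    "2026-05", "CC-100", [("shared", "250.00"), ("direct", "1000.00")]
)
signature = sign_statement(body, demo_key)
print(statement_key("2026-05", "CC-100"), verify_statement(body, demo_key, signature))
```

Because the rendering is deterministic, re-running the step for the same period yields the same bytes and the same signature, which is one way to get the idempotence the component table asks for.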
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Billing source | AWS Organizations + CUR 2.0 | Consolidated, hourly, resource-level line items — the system of record | Management-account delivery; Parquet; cost-allocation tags activated; daily refresh |
| Raw store | Amazon S3 (CUR bucket) | Immutable landing zone for billing data | Versioning + Object Lock; bucket policy locks writes to the billing service; SSE-KMS |
| Schema catalog | AWS Glue Data Catalog | Tracks CUR’s evolving schema for SQL access | Crawler or CUR-native Athena integration; partition projection by month |
| Query engine | Amazon Athena | Serverless SQL for validation + allocation | Workgroup with result location + bytes-scanned guardrail; partitioned, columnar reads |
| Orchestration | Step Functions + EventBridge | Daily validation and month-end chargeback runs | Schedule on CUR finalization; retries; per-step state in DynamoDB |
| Tag-hygiene state | Amazon DynamoDB | Untagged ratios, mandatory-tag coverage per account | On-demand capacity; feeds the hygiene dashboard and ServiceNow tickets |
| Anomaly detection | AWS Cost Anomaly Detection | Continuous spike detection per cost dimension | Monitors by linked account + cost-allocation tag; SNS to owning team |
| BI / showback | Amazon QuickSight | Per-unit dashboards and trend analysis | SPICE; row-level security by business unit; scheduled refresh post-run |
| Statement rendering | AWS Lambda | Signed PDF/CSV statements per unit; pushes to ITSM | Renders from chargeback ledger; signs artifacts; idempotent per period |
| Identity / SSO | Okta + Microsoft Entra ID | Workforce SSO into QuickSight and the FinOps console | OIDC/SAML federation; group claims map to QuickSight RLS groups |
| Secrets | HashiCorp Vault | ServiceNow API creds, signing keys, third-party tokens | Dynamic leases; AWS auth method; no long-lived creds in Lambda env |
| ITSM / chargeback book | ServiceNow | Receives chargeback records; raises tag-debt tickets | Financial-transaction record per unit; auto-ticket on tag drift |
| CSPM / posture | Wiz + Wiz Code | Guards the data store and the IaC that builds it | Alerts on CUR-bucket public exposure; Wiz Code scans Terraform pre-merge |
| Runtime security | CrowdStrike Falcon | Runtime protection on any rendering/ETL compute | Sensor on containers/instances if the render path is not pure Lambda |
| Observability | Datadog | Pipeline health, run duration, freshness SLOs | Step Functions + Lambda metrics; monitor on missed/late chargeback run |
| CI / IaC | GitHub Actions + Terraform | Builds the platform; version-controls allocation SQL | OIDC to AWS (no stored keys); SQL change-reviewed like code |
A few of these choices carry the weight of the design and deserve the why.
Why Athena on the CUR, not a data warehouse. The CUR for a $3M/month estate is large but queried in a bursty, monthly cadence — a few heavy allocation queries plus a daily validation sweep. Standing up Redshift to hold it means paying for a cluster that is idle 95% of the time and an ETL job to load it. Athena queries the Parquet directly in S3, you pay only for bytes scanned, and partition projection by billing month means a single month’s allocation touches only that month’s data. The tradeoff — covered below — is that you must discipline your queries against scanning the whole report.
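To make the pay-per-scan argument concrete, here is a rough back-of-the-envelope sketch comparing a query that reads one billing month with one that reads the whole report. The price per terabyte, the binary definition of a terabyte, the monthly data size, and the retention period are all assumptions for illustration; check current Athena pricing and your own CUR volume before relying on any figure.

```python
# Assumed on-demand price per TB scanned; verify against current pricing.
USD_PER_TB = 5.0
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def scan_cost_usd(bytes_scanned, usd_per_tb=USD_PER_TB):
    """Approximate on-demand cost of a query that scans `bytes_scanned`."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


def bytes_read(bytes_per_month, months_retained, has_partition_predicate):
    """A predicate on the billing-period partition limits the read to one month."""
    months = 1 if has_partition_predicate else months_retained
    return bytes_per_month * months


GIB = 1024 ** 3
per_month = 50 * GIB  # hypothetical size of one month of Parquet CUR data
retained = 24         # hypothetical number of months kept in the bucket

pruned = scan_cost_usd(bytes_read(per_month, retained, True))
full = scan_cost_usd(bytes_read(per_month, retained, False))
print(f"one month: ${pruned:.2f}   whole report: ${full:.2f}")
```

The model ignores column pruning, which reduces the bytes read further when a query selects only the columns it needs.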
Why the CUR bucket is the most locked-down resource in the account. This bucket is the system of record for a number a regulator may audit, so it gets S3 Object Lock and versioning (the billing data cannot be altered or deleted, even by an admin), an SSE-KMS key, and a bucket policy that permits writes only from the AWS billing service and reads only from the FinOps roles. Wiz scans it continuously and pages the moment its posture drifts toward public or its policy widens — because a tampered or leaked billing record poisons every downstream statement.
Why row-level security in QuickSight is non-negotiable. Showback only changes behavior if a GM can self-serve their own numbers — but a payer’s business units include lines that must not see each other’s cost (a competitive analytics unit, an M&A-sensitive workload). QuickSight RLS ties a viewer’s Okta-federated group to a permissions dataset so the same dashboard transparently filters to only the rows that viewer’s cost centers own. One dashboard, many tenants, zero cross-unit leakage.
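A QuickSight permissions dataset is essentially a small table that pairs a group with the field values that group may see. The sketch below generates such a table as CSV from a group-to-cost-center mapping and emulates the resulting filter locally so the rules can be checked before upload. The group names, cost centers, and the `cost_center` field name are hypothetical, and the local filter is only an approximation of what QuickSight applies itself.

```python
import csv
import io


def rls_rules_csv(group_to_cost_centers):
    """Build a rules table: one row per (group, cost_center) pair."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["GroupName", "cost_center"])
    for group in sorted(group_to_cost_centers):
        for cost_center in sorted(group_to_cost_centers[group]):
            writer.writerow([group, cost_center])
    return buf.getvalue()


def visible_rows(ledger_rows, rules_csv, viewer_groups):
    """Local emulation of the filter: keep rows whose cost center is granted."""
    allowed = {
        rule["cost_center"]
        for rule in csv.DictReader(io.StringIO(rules_csv))
        if rule["GroupName"] in viewer_groups
    }
    return [row for row in ledger_rows if row["cost_center"] in allowed]


rules = rls_rules_csv({
    "bu-claims": ["CC-110", "CC-100"],   # hypothetical federated group names
    "bu-analytics": ["CC-200"],
})
ledger = [
    {"cost_center": "CC-100", "total_chargeback": "1250.00"},
    {"cost_center": "CC-200", "total_chargeback": "9800.00"},
]
print(visible_rows(ledger, rules, {"bu-claims"}))
```

Keeping the mapping in version control next to the allocation SQL means a change to who can see which cost center goes through the same review as a change to the allocation rule.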
Handling untagged and shared cost — the part everyone underestimates
The honest truth of cloud chargeback is that tags are never 100% clean, and how you handle the gap is what makes the model fair or fraudulent. Three categories of cost resist direct attribution: genuinely untagged resources (someone forgot), untaggable charges (some data-transfer, certain support and tax line items, savings-plan amortization that does not carry a resource tag), and deliberately shared infrastructure (the central logging account, the transit gateway, the security tooling every unit benefits from). Hiding these or dropping them silently is how a chargeback loses credibility the first time the numbers do not reconcile to the invoice.
The platform handles each explicitly. Untagged spend is surfaced loudly — the daily validation sweep computes the untagged ratio per account, the hygiene dashboard ranks the worst offenders, and a ServiceNow ticket lands on the owning team with a deadline; the cost of staying untagged is that it gets apportioned back via the shared-cost rule, so there is a financial nudge to fix it. Untaggable and shared costs are pooled and split by a pre-agreed allocation key ratified by the FinOps council — most commonly proportional to each unit’s directly-attributed spend (the unit consuming 30% of the attributable bill absorbs 30% of the shared pool), though some pools split by headcount or by a usage proxy where that is fairer. The rule lives in version-controlled SQL, so it is transparent, reviewable, and identical every month.
```sql
-- Apportion the unallocated pool to each cost center,
-- proportional to that center's directly-tagged spend.
-- element_at() returns NULL when the tag key is absent from the map;
-- the [] subscript can fail the query on a missing key in Athena.
WITH attributed AS (
    SELECT element_at(resource_tags, 'user_cost_center') AS cost_center,
           SUM(line_item_unblended_cost)                 AS direct_cost
    FROM cur.chargeback
    WHERE billing_period = DATE '2026-05-01'
      AND element_at(resource_tags, 'user_cost_center') <> ''
    GROUP BY 1
),
shared_pool AS (  -- everything with no usable cost center
    -- COALESCE keeps the pool at 0 (not NULL) when every line item is tagged
    SELECT COALESCE(SUM(line_item_unblended_cost), 0) AS pool
    FROM cur.chargeback
    WHERE billing_period = DATE '2026-05-01'
      AND COALESCE(element_at(resource_tags, 'user_cost_center'), '') = ''
)
SELECT a.cost_center,
       a.direct_cost,
       ROUND(p.pool * a.direct_cost / SUM(a.direct_cost) OVER (), 2) AS shared_alloc,
       a.direct_cost
         + ROUND(p.pool * a.direct_cost / SUM(a.direct_cost) OVER (), 2) AS total_chargeback
FROM attributed a
CROSS JOIN shared_pool p
ORDER BY total_chargeback DESC;
```
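The property that matters: the per-unit totals reconcile exactly to the AWS invoice. Direct cost plus the apportioned pool sums to 100% of the unblended bill, with zero unallocated remainder, which is the first thing finance checks and the thing a hand-built spreadsheet always gets wrong. One caveat: rounding each unit’s share to the cent independently can leave the shares a cent or two away from the pool, so the pipeline has to assign that residue by a fixed rule for the totals to tie out to the cent.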
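Rounding each share on its own does not guarantee that the shares add back up to the pool. One common way to close that gap is a largest-remainder split: round every share down to the cent, then hand the leftover cents to the units with the largest fractional remainders. The sketch below is an illustration of that technique, not the article's own implementation; it assumes a non-negative pool already expressed in whole cents, and the cost-center names and amounts are made up.

```python
from decimal import Decimal, ROUND_DOWN

CENT = Decimal("0.01")


def apportion(pool, weights):
    """Split `pool` across keys in proportion to `weights`, exact to the cent."""
    total_weight = sum(weights.values(), Decimal(0))
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive amount")
    exact = {k: pool * w / total_weight for k, w in weights.items()}
    # Round every share down, then distribute the leftover cents.
    shares = {k: v.quantize(CENT, rounding=ROUND_DOWN) for k, v in exact.items()}
    leftover_cents = int((pool - sum(shares.values(), Decimal(0))) / CENT)
    # Largest fractional remainder first; the key breaks ties deterministically.
    by_remainder = sorted(
        weights, key=lambda k: (exact[k] - shares[k], k), reverse=True
    )
    for k in by_remainder[:leftover_cents]:
        shares[k] += CENT
    return shares


direct = {  # hypothetical directly-attributed spend per cost center
    "CC-100": Decimal("700.00"),
    "CC-200": Decimal("200.00"),
    "CC-300": Decimal("100.00"),
}
pool = Decimal("333.33")  # hypothetical shared and untagged pool
shares = apportion(pool, direct)
print(shares, sum(shares.values()) == pool)
```

The weights do not have to be direct spend. Passing headcount or a usage proxy as `weights` gives the alternative allocation keys the text mentions, with the same cent-exact guarantee.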
| Cost category | How it’s attributed | Why this rule |
|---|---|---|
| Tagged resources | Direct, by `cost-center` tag | Unambiguous ownership; the goal state |
| Untagged but taggable | Apportioned via shared pool; ticketed to owner | Creates a financial nudge to tag, never silently dropped |
| Untaggable (some transfer, tax, support) | Pooled, split proportionally | No resource to tag; fairness by consumption share |
| Deliberately shared (logging, TGW, security) | Pooled, split by agreed key | Everyone benefits; council-ratified rule |
| Savings Plans / RI amortization | Amortized view, allocated to the consuming account | Reflects effective cost, not lumpy upfront purchase |
Implementation guidance
Provision with Terraform, and treat the billing data store as the first deliverable. The order matters: the CUR can take up to 24 hours to begin delivering after you enable it, so stand up the bucket, the report definition, and the Glue catalog on day one and let data accumulate while you build the rest.
- The locked-down CUR S3 bucket — versioning, Object Lock in compliance mode, SSE-KMS, and a bucket policy scoped to the billing service and FinOps roles.
- The CUR 2.0 report definition in the management account: hourly granularity, resource IDs included, Parquet, cost-allocation tags activated, Athena integration enabled.
- The Glue Data Catalog table (CUR-native integration or a crawler) with partition projection on billing period so Athena never has to list partitions.
- The Athena workgroup with an enforced result location and a per-query bytes-scanned cap as a guardrail against an accidental full-report scan.
- Step Functions, EventBridge schedules, Lambda, DynamoDB, QuickSight, and the ServiceNow integration, with the allocation SQL stored as versioned files the pipeline reads at runtime.
Activating cost-allocation tags is the step teams forget, and a tag is not retroactive — it only allocates cost from the moment it is activated forward, which is why this is day-one work:
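```hcl
# Activate the tag key for cost allocation; repeat for each mandatory tag key.
resource "aws_ce_cost_allocation_tag" "cost_center" {
  tag_key = "cost-center"
  status  = "Active"
}

# Note: aws_cur_report_definition creates a legacy CUR definition. CUR 2.0 is
# delivered through Data Exports (aws_bcmdataexports_export), so confirm which
# resource matches the report you intend to run. The CUR API is served from
# us-east-1, so this resource needs a provider in that region; s3_region is
# the region of the destination bucket.
resource "aws_cur_report_definition" "finops" {
  report_name                = "finops-chargeback-cur2"
  time_unit                  = "HOURLY"
  format                     = "Parquet"
  compression                = "Parquet"
  additional_schema_elements = ["RESOURCES"]
  s3_bucket                  = aws_s3_bucket.cur.id
  s3_region                  = "ap-south-1"
  additional_artifacts       = ["ATHENA"]
  refresh_closed_reports     = true
  report_versioning          = "OVERWRITE_REPORT"
}
```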
The pipeline that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak — and the allocation SQL is reviewed in pull requests exactly like application code, because a change to an allocation rule is a change to what each business unit pays. Wiz Code scans the Terraform on the same pull request, blocking a merge that would, say, drop Object Lock or open the bucket policy.
Identity: federate the humans, lease the machine creds. FinOps analysts and business-unit viewers reach QuickSight and the internal FinOps console through Okta as the workforce IdP, federated to Microsoft Entra ID where the corporate identity estate lives, with the resulting group claims mapping straight to QuickSight RLS groups so a viewer’s unit determines the rows they see. The pipeline’s one sensitive dependency — the ServiceNow API credential it uses to post chargeback records and raise tickets, plus the key that signs each statement — lives in HashiCorp Vault, leased dynamically via the AWS auth method and never written into a Lambda environment variable, so a leaked function config exposes nothing.
Enterprise considerations
Security & data integrity. The crown jewel is the billing data itself; compromise it and every statement downstream is wrong. The controls layer accordingly: (a) S3 Object Lock + versioning make the CUR tamper-evident and tamper-resistant; (b) SSE-KMS with a dedicated key and a tight key policy; (c) Wiz running continuous CSPM so any drift of the CUR bucket toward public exposure or a widened policy pages immediately, with Wiz Code catching the same risk in IaC before it ships; (d) CrowdStrike Falcon sensors on any non-Lambda ETL or rendering compute for runtime threat detection into the SOC; (e) least-privilege IAM throughout: the validation role can read the CUR and nothing else, and the render Lambda can read the chargeback ledger but not the raw report. A tag-drift breach raises a ServiceNow ticket so tag debt is tracked work, not a forgotten log line.
Cost optimization (of the FinOps platform itself). A cost-management platform that is itself expensive is a bad look, and the dominant cost here is Athena bytes scanned. Engineer for it.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Partition projection | Query only the relevant billing month’s partition | Cuts a full-report scan to a single month |
| Columnar Parquet | CUR in Parquet; select only needed columns | Athena reads a fraction of the bytes |
| Bytes-scanned guardrail | Workgroup limit aborts a runaway query | Caps the cost of a bad ad-hoc query |
| SPICE in QuickSight | Dashboards read cached SPICE, not live Athena | Viewers never trigger a query per page load |
| Right-sized cadence | Heavy allocation runs monthly, light validation daily | Avoids re-deriving the world every day |
The platform’s own bill should be a rounding error — low hundreds of dollars a month on a $3M estate — and Datadog tracks Athena spend as a first-class metric so the cost-management tool does not quietly become a cost problem.
Scalability. The architecture scales with the number of accounts and the size of the bill almost for free because the heavy lifting is serverless: Athena scales query concurrency on demand, Step Functions and Lambda scale with the run, and S3 has no capacity to manage. The real scaling axis is organizational — more business units means more RLS groups, more allocation rules, and more shared-cost pools to ratify — which is governance work, not infrastructure work. The one genuine ceiling is Athena’s per-query and per-workgroup concurrency limits during a heavy month-end run; sequencing the allocation queries through Step Functions rather than firing them all at once keeps you well under it.
Failure modes, and what each one looks like. Name them before they corrupt a chargeback.
- Running the chargeback before the books close: the CUR overwrites the current month’s data daily as charges finalize, so a statement generated too early is simply wrong. Mitigation: gate the month-end run on the `bill/InvoiceId` field populating, not a fixed calendar date.
- A schema change in the CUR breaks the allocation query: AWS adds columns and services over time. Mitigation: the Glue crawler / CUR-native integration tracks the schema, and the allocation SQL selects named columns defensively rather than `SELECT *`.
- Tag coverage silently degrades: a new team launches untagged and the shared pool quietly balloons, distorting everyone’s apportioned share. Mitigation: the daily untagged-ratio check, the hygiene dashboard, and the auto-raised ServiceNow ticket make the drift visible the next morning.
- An Athena query scans the whole report: a missing partition predicate turns a cheap query into an expensive one. Mitigation: partition projection plus the workgroup bytes-scanned cap, which aborts it.
- The statement does not reconcile to the invoice: the cardinal sin. Mitigation: an automated reconciliation assertion in the pipeline that the sum of per-unit chargebacks equals the unblended invoice total to the cent, failing the run if it does not.
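Two of the mitigations in this list are simple guards that can be expressed in a few lines. The sketch below shows a books-closed gate and a reconciliation assertion. It assumes line items arrive as dictionaries, that the invoice ID is exposed under a hypothetical `bill_invoice_id` key, and that every line item is expected to carry an invoice ID once the month is finalized; adjust that last assumption if your report contains lines that never receive one.

```python
from decimal import Decimal


def books_closed(line_items):
    """True only when there is data and every line item carries an invoice ID."""
    seen_any = False
    for item in line_items:
        seen_any = True
        if not (item.get("bill_invoice_id") or "").strip():
            return False
    return seen_any


def assert_reconciles(chargebacks, invoice_total):
    """Fail the run unless per-unit chargebacks sum to the invoice, to the cent."""
    total = sum(chargebacks.values(), Decimal(0))
    if total != invoice_total:
        raise ValueError(f"chargebacks {total} != invoice {invoice_total}")
    return total


# Hypothetical month: one line item is still waiting for its invoice ID.
open_month = [
    {"bill_invoice_id": "INV-0001"},
    {"bill_invoice_id": ""},
]
closed_month = [{"bill_invoice_id": "INV-0001"}, {"bill_invoice_id": "INV-0001"}]
print(books_closed(open_month), books_closed(closed_month))
```

In a Step Functions run, the gate would sit in front of the allocation queries and the assertion after them, so a failure in either stops the statement from being rendered.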
Reliability & DR (RTO/RPO). This is a batch system, so the SLO is freshness, not uptime — the chargeback must be ready by an agreed day of the month. Because the entire pipeline re-derives its output from the immutable CUR in S3 (which AWS replicates durably and you can additionally cross-region replicate), recovery is “re-run the pipeline against the same source.” A pragmatic posture: RPO is effectively zero (the source of truth is immutable and durable), and RTO is a single pipeline re-run — a few hours — because nothing is lost that cannot be recomputed. Datadog monitors enforce the freshness SLO and page FinOps if a scheduled run is late or fails, which for a batch platform is the failure that actually matters.
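A freshness SLO of this kind reduces to a date comparison. As a minimal sketch, assuming the agreed deadline is a fixed day of the month after the billing period (the default of day 5 is a made-up value, not one the article specifies), the check could look like this:

```python
from datetime import date


def chargeback_due(period_year, period_month, due_day=5):
    """Deadline: `due_day` of the month following the billing period."""
    if period_month == 12:
        return date(period_year + 1, 1, due_day)
    return date(period_year, period_month + 1, due_day)


def freshness_breached(period_year, period_month, today,
                       ledger_written_on=None, due_day=5):
    """True if the ledger was written late, or is still missing past the deadline."""
    due = chargeback_due(period_year, period_month, due_day)
    if ledger_written_on is not None:
        return ledger_written_on > due
    return today > due


# May 2026 statement, checked on 6 June with no ledger written yet.
print(freshness_breached(2026, 5, today=date(2026, 6, 6)))
```

A monitor would evaluate this daily and page only when the result is true, which keeps the alert tied to the business deadline rather than to individual task failures that a retry already handles.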
Observability. Instrument the pipeline in Datadog as an end-to-end batch SLO: run duration, success/failure of each Step Functions stage, the data-freshness lag (how stale the latest chargeback ledger is), and the reconciliation delta between summed chargebacks and the invoice. Emit the business metrics FinOps lives by — untagged-spend ratio, shared-pool size as a percentage of total, per-unit month-over-month variance, and anomaly-alert count per unit — so a degrading tag culture or a runaway workload surfaces on a dashboard rather than in a quarterly surprise. New allocation rules pass through a ServiceNow change approval before going live, giving finance a documented gate on changes to how units are billed.
Governance. The allocation logic is financial logic, so govern it like financial controls: the FinOps council ratifies the shared-cost key and the mandatory tag taxonomy, the SQL that implements them lives in version control and changes only through reviewed pull requests, and every monthly chargeback ledger is retained immutably for the audit horizon the regulator requires. Pin the report definition and never let allocation rules drift silently; promote changes through the change gate; and keep a documented, reproducible path from any historical statement back to the exact CUR bytes that produced it — which is the artifact that ends a dispute.
Explicit tradeoffs
Accept these or do not build it. A CUR-driven chargeback platform is a batch system with a lag — the freshest possible chargeback is still a few days after month-end, because that is when AWS finalizes the bill, and no architecture changes that physics. It lives or dies on tag discipline, which is an organizational problem the platform can surface and nudge but cannot solve by itself; the shared-pool apportionment is a real, defensible answer to imperfect tags, but it is an approximation, and a unit that stays sloppy on tags will rightly feel its apportioned share is fuzzy. The Athena-not-warehouse choice keeps cost near zero but demands query discipline — one un-partitioned ad-hoc query can scan the whole report. And the Okta-to-Entra federation plus QuickSight RLS plus Vault-held creds are all overhead a five-account startup can skip and a 40-account regulated payer absolutely cannot.
The alternatives, and when they win. If you run a handful of accounts and a small bill, AWS Cost Explorer with cost categories and a couple of activated tags is enough showback and you do not need this pipeline at all. If your finance team has standardized on a third-party FinOps platform — CloudHealth, Apptio Cloudability, or similar — those ingest the CUR and do much of this allocation out of the box, and the build-versus-buy decision turns on whether your allocation rules are unusual enough, and your data-residency or customization needs strict enough, to justify owning the code; for a regulated payer with bespoke shared-cost rules and an auditor to satisfy, owning the SQL is often the right call. If you need cost data joined to business metrics — cost per claim processed, cost per member — you extend this exact pipeline by joining the chargeback ledger to operational data in Athena, which is the natural next step once the unit-cost discipline is in place.
The shape of the win
For the payer’s CFO, the payoff is not “a dashboard.” It is that at the next quarterly close, finance hands each business-unit GM a signed statement that reconciles to the AWS invoice to the cent, the claims division can see its own RDS fleet is in fact its largest line, the analytics team can see its training spend with the untagged residue fairly apportioned and nobody else’s costs on its bill — and because every number traces back to immutable CUR data through version-controlled SQL, the dispute that used to eat the first week of every quarter simply does not happen. That last point is what funds the platform: the chargeback is no longer a fight, it is a fact. Everything upstream — the locked-down CUR bucket, the Athena allocation queries, the explicit untagged handling, the QuickSight row-level security, the ServiceNow chargeback push, the Cost Anomaly Detection guardrail, the Datadog freshness SLO — exists so that a CFO, a business-unit GM, and an auditor each look at the same monthly number and agree it is right. Start narrower if your estate is small, but for a regulated organization spending real money across many accounts, this is where accountable cloud cost has to land.