FinOps Showback and Chargeback Platform on AWS

A health-insurance group running 40 AWS accounts under Organizations gets a pointed question from its CFO after the quarterly close: the cloud bill crossed $3.1M a month, it is growing 9% a quarter, and not one business unit believes the number is theirs. The claims-processing division swears the analytics team’s machine-learning training is what blew up the December bill; the analytics team says claims’ always-on RDS fleet is the real cost; and finance cannot adjudicate because the only artifact anyone has is a single consolidated invoice with no owner on any line. The mandate that lands on the platform team is specific and unglamorous: “Tell each business unit exactly what they spent, prove it, and bill it back to their cost center every month — automatically.” In a regulated payer where every dollar eventually maps to a medical-loss-ratio calculation a regulator audits, “roughly” is not acceptable. This article is the reference architecture for building that showback-and-chargeback platform on AWS — one a CFO will trust, a business-unit GM cannot dispute, and an auditor can trace end to end.

The pressures here are the ones that make FinOps hard rather than the ones that make it interesting in a slide. Accuracy means every dollar of a $3M bill must land on exactly one owner, including the dollars nobody tagged. Auditability means the chargeback a business unit is billed must be reproducible six months later from immutable source data. Timeliness means the report has to be ready a few days after month-end close, not three weeks later when the next month’s spend has already moved. And fairness means shared costs — the data-transfer backbone, the security tooling, the support plan — have to be split by a rule everyone agreed to in advance, not a number finance invented. The pattern that satisfies all four is tag-based cost allocation driven off the Cost and Usage Report (CUR) — the most granular, line-item-level billing data AWS produces — queried in place and turned into per-unit statements.

Why not the obvious shortcuts

Three cheaper approaches get proposed on every one of these projects, and each fails in a way worth naming before someone burns a sprint on it.

The AWS Cost Explorer console is excellent for a human poking at trends, but it is not a chargeback engine: its data is aggregated and rounded, its API is rate-limited and not built to feed a billing run, and you cannot reproduce an exact historical statement from it months later for an audit. A monthly spreadsheet built by hand from the invoice is what most companies actually do — and it is unauditable, breaks the moment one analyst is on leave, silently drops the untagged spend, and gives every business unit a standing reason to dispute the number. Splitting the bill evenly across business units is the laziest option and the most corrosive: it punishes the frugal team that runs three Lambdas and rewards the one training models on a fleet of GPUs, which destroys the incentive a chargeback model exists to create in the first place.

The CUR-driven approach threads the needle. The CUR is the system of record AWS itself bills from — every line item, every resource, every tag, hourly, delivered to your own S3 bucket. Querying it directly means your chargeback is derived from the same data AWS used to charge you, the allocation logic lives in version-controlled SQL anyone can review, and any statement is reproducible from immutable source files. Tags become the mechanism that maps each line item to an owner; the untagged residue becomes a problem you handle explicitly rather than one you hide.

Architecture overview

[Figure: FinOps Showback and Chargeback Platform on AWS, architecture diagram]

The platform is fundamentally a monthly batch pipeline with a self-service analytics layer on top, not a real-time system — and recognizing that shapes every decision. There are two flows that share storage but run on different clocks: a daily ingestion-and-validation flow that keeps the cost data current and the tag hygiene visible, and a month-end allocation-and-chargeback flow that produces the statements finance actually bills from.

The defining property of the topology is that the CUR is the single source of truth and is never mutated. AWS delivers the report to a locked-down S3 bucket; everything downstream reads from it and writes derived artifacts elsewhere. That immutability is what makes a chargeback defensible six months later in front of an auditor — you can always re-derive the exact statement from the exact bytes AWS delivered.

Ingestion and validation flow, following the data:

AWS Organizations consolidates billing across all 40 accounts into the management account, and the Cost and Usage Report 2.0 is configured there to deliver hourly, resource-level line items — with cost-allocation tags activated — to a dedicated S3 bucket in Apache Parquet, partitioned by billing period. CUR overwrites the current month’s files daily as charges finalize, which is why downstream reads always re-query rather than cache.
An AWS Glue crawler (or the CUR’s own Athena integration) keeps a Glue Data Catalog table in sync with the report’s evolving schema, so new services and new columns appear without a manual change.
A daily EventBridge schedule triggers a Step Functions state machine that runs the validation pass: an Athena query computes the untagged-spend ratio per account and per service, and a second query checks that every active account carries the mandatory tag keys (cost-center, business-unit, environment, application). The results land in a small DynamoDB table that drives the tag-hygiene dashboard.
Any account drifting past a threshold — say, more than 5% of spend untagged — auto-raises a ServiceNow ticket assigned to that account’s owning team, so tag debt has a name and a due date instead of accumulating silently.
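The validation pass above boils down to a ratio and a threshold. The following is a minimal Python sketch of that logic, assuming the Athena query has already been reduced to (account, cost center, cost) rows; the function names and the sample rows are hypothetical, and the 5% figure is the example threshold from the text rather than a recommendation.

```python
from collections import defaultdict
from decimal import Decimal

UNTAGGED_THRESHOLD = Decimal("0.05")  # the 5% example threshold from the text


def untagged_ratios(line_items):
    """Return {account_id: untagged share of spend} from (account, cost_center, cost) rows."""
    total = defaultdict(Decimal)
    untagged = defaultdict(Decimal)
    for account_id, cost_center, cost in line_items:
        cost = Decimal(str(cost))
        total[account_id] += cost
        if not cost_center:  # None and empty string both count as untagged
            untagged[account_id] += cost
    return {
        acct: (untagged[acct] / total[acct] if total[acct] else Decimal("0"))
        for acct in total
    }


def accounts_over_threshold(line_items, threshold=UNTAGGED_THRESHOLD):
    """Accounts whose untagged share exceeds the threshold, worst offender first."""
    ratios = untagged_ratios(line_items)
    offenders = [(acct, ratio) for acct, ratio in ratios.items() if ratio > threshold]
    return sorted(offenders, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample rows standing in for the Athena result set.
rows = [
    ("111111111111", "CC-100", "900.00"),
    ("111111111111", "", "100.00"),    # 10% of this account is untagged
    ("222222222222", "CC-200", "980.00"),
    ("222222222222", None, "20.00"),   # 2% of this account is untagged
]
flagged = accounts_over_threshold(rows)
```

Each entry in `flagged` is what would become a ServiceNow ticket; accounts under the threshold still show up on the hygiene dashboard through `untagged_ratios`.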

Allocation and chargeback flow, fired by EventBridge a few days after AWS finalizes the prior month (the real “books are closed” signal is the CUR’s bill/InvoiceId column, bill_invoice_id in Athena, being populated):

Step Functions runs the allocation Athena queries in sequence: directly-attributable cost is summed per cost-center from tags; untagged and shared cost is apportioned by the agreed rule (more on this below); and the results are written as a derived, partitioned “chargeback ledger” table in S3 — itself immutable once written for that period.
Amazon QuickSight reads the chargeback ledger through SPICE, serving each business unit a row-level-security-scoped dashboard so a GM sees their own spend, trend, and top resources — and nobody else’s.
A Lambda function renders each business unit’s signed monthly statement (PDF + CSV), writes it to a per-unit S3 prefix, and pushes the chargeback record into ServiceNow as a financial transaction against that unit’s cost center — closing the loop from raw billing data to a line on the unit’s internal P&L.
In parallel, AWS Cost Anomaly Detection runs continuously against the same spend, segmented by the same cost-allocation dimensions, and alerts the owning team plus FinOps the moment a unit’s daily run-rate spikes — so a runaway cost is caught mid-month, not discovered in next month’s statement.
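The "books are closed" gate that starts this flow can be sketched as a small predicate. This is an illustration under a stated assumption: the month counts as final only when no line item for the period still has a blank invoice ID. The function name and the sample values are hypothetical, and an estate with unusual invoicing may need a looser rule.

```python
def books_closed(invoice_ids):
    """Return True when there is at least one line item and none has a blank invoice ID."""
    ids = list(invoice_ids)
    if not ids:
        return False  # no data delivered yet: treat the period as still open
    return all(i is not None and str(i).strip() != "" for i in ids)


# Hypothetical results of selecting the distinct invoice IDs for one billing period.
mid_month = ["", None]
after_close = ["1234567890", "1234567891"]
partially_final = ["1234567890", ""]
```

A Step Functions choice state could branch on this result and wait a day before checking again, rather than running the allocation on a fixed calendar date.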

Component breakdown

Component	Service / tool	Role in the platform	Key configuration choices
Billing source	AWS Organizations + CUR 2.0	Consolidated, hourly, resource-level line items — the system of record	Management-account delivery; Parquet; cost-allocation tags activated; daily refresh
Raw store	Amazon S3 (CUR bucket)	Immutable landing zone for billing data	Versioning + Object Lock; bucket policy locks writes to the billing service; SSE-KMS
Schema catalog	AWS Glue Data Catalog	Tracks CUR’s evolving schema for SQL access	Crawler or CUR-native Athena integration; partition projection by month
Query engine	Amazon Athena	Serverless SQL for validation + allocation	Workgroup with result location + bytes-scanned guardrail; partitioned, columnar reads
Orchestration	Step Functions + EventBridge	Daily validation and month-end chargeback runs	Schedule on CUR finalization; retries; per-step state in DynamoDB
Tag-hygiene state	Amazon DynamoDB	Untagged ratios, mandatory-tag coverage per account	On-demand capacity; feeds the hygiene dashboard and ServiceNow tickets
Anomaly detection	AWS Cost Anomaly Detection	Continuous spike detection per cost dimension	Monitors by linked account + cost-allocation tag; SNS to owning team
BI / showback	Amazon QuickSight	Per-unit dashboards and trend analysis	SPICE; row-level security by business unit; scheduled refresh post-run
Statement rendering	AWS Lambda	Signed PDF/CSV statements per unit; pushes to ITSM	Renders from chargeback ledger; signs artifacts; idempotent per period
Identity / SSO	Okta + Microsoft Entra ID	Workforce SSO into QuickSight and the FinOps console	OIDC/SAML federation; group claims map to QuickSight RLS groups
Secrets	HashiCorp Vault	ServiceNow API creds, signing keys, third-party tokens	Dynamic leases; AWS auth method; no long-lived creds in Lambda env
ITSM / chargeback book	ServiceNow	Receives chargeback records; raises tag-debt tickets	Financial-transaction record per unit; auto-ticket on tag drift
CSPM / posture	Wiz + Wiz Code	Guards the data store and the IaC that builds it	Alerts on CUR-bucket public exposure; Wiz Code scans Terraform pre-merge
Runtime security	CrowdStrike Falcon	Runtime protection on any rendering/ETL compute	Sensor on containers/instances if the render path is not pure Lambda
Observability	Datadog	Pipeline health, run duration, freshness SLOs	Step Functions + Lambda metrics; monitor on missed/late chargeback run
CI / IaC	GitHub Actions + Terraform	Builds the platform; version-controls allocation SQL	OIDC to AWS (no stored keys); SQL change-reviewed like code

A few of these choices carry the weight of the design and deserve the why.

Why Athena on the CUR, not a data warehouse. The CUR for a $3M/month estate is large but queried in a bursty, monthly cadence — a few heavy allocation queries plus a daily validation sweep. Standing up Redshift to hold it means paying for a cluster that is idle 95% of the time and an ETL job to load it. Athena queries the Parquet directly in S3, you pay only for bytes scanned, and partition projection by billing month means a single month’s allocation touches only that month’s data. The tradeoff — covered below — is that you must discipline your queries against scanning the whole report.

Why the CUR bucket is the most locked-down resource in the account. This bucket is the system of record for a number a regulator may audit, so it gets S3 Object Lock and versioning (the billing data cannot be altered or deleted, even by an admin), an SSE-KMS key, and a bucket policy that permits writes only from the AWS billing service and reads only from the FinOps roles. Wiz scans it continuously and pages the moment its posture drifts toward public or its policy widens — because a tampered or leaked billing record poisons every downstream statement.

Why row-level security in QuickSight is non-negotiable. Showback only changes behavior if a GM can self-serve their own numbers — but a payer’s business units include lines that must not see each other’s cost (a competitive analytics unit, an M&A-sensitive workload). QuickSight RLS ties a viewer’s Okta-federated group to a permissions dataset so the same dashboard transparently filters to only the rows that viewer’s cost centers own. One dashboard, many tenants, zero cross-unit leakage.
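To make the group-to-rows mapping concrete, here is a hedged sketch that renders a rules file from a group-to-cost-center mapping. It assumes a layout with a GroupName column plus the filtered field, and a comma-separated list when a group owns several cost centers; check the exact format against the QuickSight RLS documentation before relying on it. The group names and cost centers are hypothetical.

```python
import csv
import io


def build_rls_rules(group_cost_centers):
    """Render one rule row per group: GroupName plus a comma-separated cost_center list."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["GroupName", "cost_center"])
    for group in sorted(group_cost_centers):
        centers = sorted(set(group_cost_centers[group]))
        if not centers:
            # Assumption: an empty cell could be read as "no restriction",
            # so a group without cost centers gets no row at all.
            continue
        writer.writerow([group, ",".join(centers)])
    return buf.getvalue()


# Hypothetical federated groups and the cost centers each may see.
rules_csv = build_rls_rules({
    "bu-claims": ["CC-110", "CC-100"],
    "bu-analytics": ["CC-300"],
    "bu-empty": [],
})
```

Generating the rules file from the same source that defines the federated groups keeps the dashboard filter and the identity system from drifting apart.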

Handling untagged and shared cost — the part everyone underestimates

The honest truth of cloud chargeback is that tags are never 100% clean, and how you handle the gap is what makes the model fair or fraudulent. Three categories of cost resist direct attribution: genuinely untagged resources (someone forgot), untaggable charges (some data-transfer, certain support and tax line items, savings-plan amortization that does not carry a resource tag), and deliberately shared infrastructure (the central logging account, the transit gateway, the security tooling every unit benefits from). Hiding these or dropping them silently is how a chargeback loses credibility the first time the numbers do not reconcile to the invoice.

The platform handles each explicitly. Untagged spend is surfaced loudly — the daily validation sweep computes the untagged ratio per account, the hygiene dashboard ranks the worst offenders, and a ServiceNow ticket lands on the owning team with a deadline; the cost of staying untagged is that it gets apportioned back via the shared-cost rule, so there is a financial nudge to fix it. Untaggable and shared costs are pooled and split by a pre-agreed allocation key ratified by the FinOps council — most commonly proportional to each unit’s directly-attributed spend (the unit consuming 30% of the attributable bill absorbs 30% of the shared pool), though some pools split by headcount or by a usage proxy where that is fairer. The rule lives in version-controlled SQL, so it is transparent, reviewable, and identical every month.

-- Apportion the unallocated pool to each cost center,
-- proportional to that center's directly-tagged spend.
-- element_at() returns NULL when a line item lacks the tag key;
-- the map subscript operator would fail the query on such rows.
WITH attributed AS (
  SELECT  element_at(resource_tags, 'user_cost_center') AS cost_center,
          SUM(line_item_unblended_cost)                  AS direct_cost
  FROM    cur.chargeback
  WHERE   billing_period = DATE '2026-05-01'
    AND   element_at(resource_tags, 'user_cost_center') <> ''
  GROUP BY 1
),
shared_pool AS (              -- everything with no usable cost center
  SELECT  COALESCE(SUM(line_item_unblended_cost), 0) AS pool  -- 0 when nothing is untagged
  FROM    cur.chargeback
  WHERE   billing_period = DATE '2026-05-01'
    AND   COALESCE(element_at(resource_tags, 'user_cost_center'), '') = ''
)
-- Per-row ROUND can leave a cent-level residual against the pool;
-- assign it explicitly before asserting reconciliation.
SELECT  a.cost_center,
        a.direct_cost,
        ROUND(p.pool * a.direct_cost / SUM(a.direct_cost) OVER (), 2) AS shared_alloc,
        a.direct_cost
          + ROUND(p.pool * a.direct_cost / SUM(a.direct_cost) OVER (), 2) AS total_chargeback
FROM    attributed a CROSS JOIN shared_pool p
ORDER BY total_chargeback DESC;

The property that matters: the per-unit totals reconcile exactly to the AWS invoice. Direct cost plus the apportioned pool sums to 100% of the unblended bill with zero unallocated remainder, provided any cent-level residual left by per-row rounding is assigned to a unit rather than dropped. That reconciliation is the first thing finance checks and the thing a hand-built spreadsheet always gets wrong.
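Rounding each unit's share independently can lose or gain a cent: three equal units splitting a 100.00 pool each round to 33.33, which sums to 99.99. One way to keep the reconciliation exact is a largest-remainder split, sketched below in Python. This is an illustration, not the platform's confirmed method; it assumes non-negative costs and a pool expressed in whole cents, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

CENT = Decimal("0.01")


def apportion_pool(direct_costs, pool):
    """Split `pool` in proportion to direct cost so the shares sum exactly to the pool."""
    direct = {k: Decimal(str(v)) for k, v in direct_costs.items()}
    total_direct = sum(direct.values(), Decimal("0"))
    if total_direct <= 0:
        raise ValueError("no directly attributed spend to apportion against")
    pool = Decimal(str(pool)).quantize(CENT, rounding=ROUND_HALF_UP)
    if pool < 0:
        raise ValueError("this sketch assumes a non-negative pool")

    # Exact proportional share, then round every share down to the cent.
    exact = {k: pool * v / total_direct for k, v in direct.items()}
    alloc = {k: x.quantize(CENT, rounding=ROUND_DOWN) for k, x in exact.items()}

    # Hand the leftover cents to the units with the largest dropped fractions
    # (ties broken by key so the result is deterministic).
    leftover_cents = int((pool - sum(alloc.values(), Decimal("0"))) / CENT)
    ranked = sorted(exact, key=lambda k: (exact[k] - alloc[k], k), reverse=True)
    for k in ranked[:leftover_cents]:
        alloc[k] += CENT
    return alloc


shares = apportion_pool(
    {"CC-100": "1000", "CC-200": "1000", "CC-300": "1000"},
    "100.00",
)
```

The tie-break rule is arbitrary but deterministic, which matters more than which unit receives the extra cent: the same inputs must always produce the same statement.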

Cost category	How it’s attributed	Why this rule
Tagged resources	Direct, by `cost-center` tag	Unambiguous ownership; the goal state
Untagged but taggable	Apportioned via shared pool; ticketed to owner	Creates a financial nudge to tag, never silently dropped
Untaggable (some transfer, tax, support)	Pooled, split proportionally	No resource to tag; fairness by consumption share
Deliberately shared (logging, TGW, security)	Pooled, split by agreed key	Everyone benefits; council-ratified rule
Savings Plans / RI amortization	Amortized view, allocated to the consuming account	Reflects effective cost, not lumpy upfront purchase

Implementation guidance

Provision with Terraform, and treat the billing data store as the first deliverable. The order matters: the CUR can take up to 24 hours to begin delivering after you enable it, so stand up the bucket, the report definition, and the Glue catalog on day one and let data accumulate while you build the rest.

The locked-down CUR S3 bucket — versioning, Object Lock in compliance mode, SSE-KMS, and a bucket policy scoped to the billing service and FinOps roles.
The CUR 2.0 report definition in the management account: hourly granularity, resource IDs included, Parquet, cost-allocation tags activated, Athena integration enabled.
The Glue Data Catalog table (CUR-native integration or a crawler) with partition projection on billing period so Athena never has to list partitions.
The Athena workgroup with an enforced result location and a per-query bytes-scanned cap as a guardrail against an accidental full-report scan.
Step Functions, EventBridge schedules, Lambda, DynamoDB, QuickSight, and the ServiceNow integration, with the allocation SQL stored as versioned files the pipeline reads at runtime.

Activating cost-allocation tags is the step teams forget, and a tag is not retroactive — it only allocates cost from the moment it is activated forward, which is why this is day-one work:

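# Activate the tag key for cost allocation. The key generally has to have been
# applied to a resource already before AWS lists it for activation.
resource "aws_ce_cost_allocation_tag" "cost_center" {
  tag_key = "cost-center"
  status  = "Active"
}

# Note: aws_cur_report_definition provisions a legacy Cost and Usage Report.
# CUR 2.0 is delivered through AWS Data Exports, so adapt this block if you
# need the 2.0 schema (for example the resource_tags map column).
# The CUR API is served only from us-east-1; point this resource at a provider
# alias for that region if your default provider is elsewhere.
resource "aws_cur_report_definition" "finops" {
  report_name                = "finops-chargeback-cur2"
  time_unit                  = "HOURLY"
  format                     = "Parquet"
  compression                = "Parquet"
  additional_schema_elements = ["RESOURCES"]
  s3_bucket                  = aws_s3_bucket.cur.id
  s3_prefix                  = "cur" # illustrative value; the report definition needs a prefix
  s3_region                  = "ap-south-1"
  additional_artifacts       = ["ATHENA"]
  refresh_closed_reports     = true
  report_versioning          = "OVERWRITE_REPORT"
}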

The pipeline that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak — and the allocation SQL is reviewed in pull requests exactly like application code, because a change to an allocation rule is a change to what each business unit pays. Wiz Code scans the Terraform on the same pull request, blocking a merge that would, say, drop Object Lock or open the bucket policy.

Identity: federate the humans, lease the machine creds. FinOps analysts and business-unit viewers reach QuickSight and the internal FinOps console through Okta as the workforce IdP, federated to Microsoft Entra ID where the corporate identity estate lives, with the resulting group claims mapping straight to QuickSight RLS groups so a viewer’s unit determines the rows they see. The pipeline’s one sensitive dependency — the ServiceNow API credential it uses to post chargeback records and raise tickets, plus the key that signs each statement — lives in HashiCorp Vault, leased dynamically via the AWS auth method and never written into a Lambda environment variable, so a leaked function config exposes nothing.
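The article does not say how statements are signed, so the sketch below uses HMAC-SHA256 from the Python standard library purely as a stand-in; an asymmetric signature would be an equally plausible choice. The key is assumed to arrive from Vault at runtime, and the object-key layout and function names are hypothetical. The deterministic key per unit and period is what makes a re-run overwrite the same artifact instead of creating a second one.

```python
import hashlib
import hmac


def statement_key(business_unit, period):
    """Deterministic object key so a re-run for the same period targets the same path."""
    return f"statements/{business_unit}/{period}/statement.csv"


def sign_statement(content: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the rendered statement bytes."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_statement(content: bytes, key: bytes, signature: str) -> bool:
    """Constant-time check that the statement bytes still match the signature."""
    return hmac.compare_digest(sign_statement(content, key), signature)


# Hypothetical values; in the pipeline the key would be leased from Vault.
signing_key = b"example-key-leased-at-runtime"
statement = b"cost_center,total_chargeback\nCC-100,1033.33\n"
signature = sign_statement(statement, signing_key)
```

Anyone holding the key can later confirm that a statement handed to an auditor is byte-for-byte the one the pipeline produced.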

Enterprise considerations

Security & data integrity. The crown jewel is the billing data itself; compromise it and every statement downstream is wrong. The controls layer accordingly: (a) S3 Object Lock + versioning make the CUR tamper-evident and tamper-resistant; (b) SSE-KMS with a dedicated key and a tight key policy; (c) Wiz running continuous CSPM so any drift of the CUR bucket toward public exposure or a widened policy pages immediately, with Wiz Code catching the same risk in IaC before it ships; (d) CrowdStrike Falcon sensors on any non-Lambda ETL or rendering compute for runtime threat detection into the SOC; (e) least-privilege IAM throughout: the validation role can read the CUR and nothing else, and the render Lambda can read the chargeback ledger but not the raw report. A tag-drift breach raises a ServiceNow ticket so tag debt is tracked work, not a forgotten log line.

Cost optimization (of the FinOps platform itself). A cost-management platform that is itself expensive is a bad look, and the dominant cost here is Athena bytes scanned. Engineer for it.

Lever	Mechanism	Typical effect
Partition projection	Query only the relevant billing month’s partition	Cuts a full-report scan to a single month
Columnar Parquet	CUR in Parquet; select only needed columns	Athena reads a fraction of the bytes
Bytes-scanned guardrail	Workgroup limit aborts a runaway query	Caps the cost of a bad ad-hoc query
SPICE in QuickSight	Dashboards read cached SPICE, not live Athena	Viewers never trigger a query per page load
Right-sized cadence	Heavy allocation runs monthly, light validation daily	Avoids re-deriving the world every day

The platform’s own bill should be a rounding error — low hundreds of dollars a month on a $3M estate — and Datadog tracks Athena spend as a first-class metric so the cost-management tool does not quietly become a cost problem.
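A back-of-envelope estimate shows why the partition predicate dominates the platform's own bill. The sketch below assumes a price of $5 per TB scanned and a 10 MB billing minimum per query; both are assumptions to check against current Athena pricing for your region, and the data sizes are invented for illustration.

```python
TB = 1024 ** 4
GB = 1024 ** 3
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # assumed 10 MB billing minimum per query


def athena_cost_usd(bytes_scanned, price_per_tb=5.0):
    """Estimated query cost, assuming a flat per-TB price (an assumption, not a quote)."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / TB * price_per_tb


# Hypothetical sizes: 40 GB of Parquet per billing month, 24 months retained.
one_month = athena_cost_usd(40 * GB)
full_report = athena_cost_usd(24 * 40 * GB)
```

Under these invented sizes a full-report scan costs 24 times the single-month query, which is the gap the workgroup bytes-scanned cap exists to catch.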

Scalability. The architecture scales with the number of accounts and the size of the bill almost for free because the heavy lifting is serverless: Athena scales query capacity on demand, Step Functions and Lambda scale with the run, and S3 has no capacity to manage. The real scaling axis is organizational: more business units means more RLS groups, more allocation rules, and more shared-cost pools to ratify, which is governance work, not infrastructure work. The one genuine ceiling is Athena’s quota on concurrently running queries, which applies per account and Region, during a heavy month-end run; sequencing the allocation queries through Step Functions rather than firing them all at once keeps you well under it.

Failure modes, and what each one looks like. Name them before they corrupt a chargeback.

Running the chargeback before the books close — the CUR overwrites the current month’s data daily as charges finalize, so a statement generated too early is simply wrong. Mitigation: gate the month-end run on the bill/InvoiceId field populating, not a fixed calendar date.
A schema change in the CUR breaks the allocation query — AWS adds columns and services over time. Mitigation: the Glue crawler / CUR-native integration tracks the schema, and the allocation SQL selects named columns defensively rather than SELECT *.
Tag coverage silently degrades — a new team launches untagged and the shared pool quietly balloons, distorting everyone’s apportioned share. Mitigation: the daily untagged-ratio check, the hygiene dashboard, and the auto-raised ServiceNow ticket make the drift visible the next morning.
An Athena query scans the whole report — a missing partition predicate turns a cheap query into an expensive one. Mitigation: partition projection plus the workgroup bytes-scanned cap, which aborts it.
The statement does not reconcile to the invoice — the cardinal sin. Mitigation: an automated reconciliation assertion in the pipeline that the sum of per-unit chargebacks equals the unblended invoice total to the cent, failing the run if it does not.
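The reconciliation assertion named in the last item can be as small as the following sketch. It assumes the per-unit totals and the invoice total are available as decimal strings; the exception and function names are hypothetical.

```python
from decimal import Decimal


class ReconciliationError(Exception):
    """Raised when summed chargebacks do not equal the invoice total."""


def assert_reconciles(chargebacks, invoice_total):
    """Fail the run unless per-unit chargebacks sum to the invoice total, to the cent."""
    total = sum((Decimal(str(v)) for v in chargebacks.values()), Decimal("0"))
    delta = total - Decimal(str(invoice_total))
    if delta != 0:
        raise ReconciliationError(
            f"chargebacks sum to {total}, invoice is {invoice_total}, delta {delta}"
        )
    return total


# Hypothetical ledger for one period.
ledger = {"CC-100": "1033.33", "CC-200": "1033.33", "CC-300": "1033.34"}
checked_total = assert_reconciles(ledger, "3100.00")
```

Using `Decimal` rather than floats keeps the comparison exact, so a one-cent discrepancy fails the run instead of disappearing into floating-point noise.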

Reliability & DR (RTO/RPO). This is a batch system, so the SLO is freshness, not uptime — the chargeback must be ready by an agreed day of the month. Because the entire pipeline re-derives its output from the immutable CUR in S3 (which AWS replicates durably and you can additionally cross-region replicate), recovery is “re-run the pipeline against the same source.” A pragmatic posture: RPO is effectively zero (the source of truth is immutable and durable), and RTO is a single pipeline re-run — a few hours — because nothing is lost that cannot be recomputed. Datadog monitors enforce the freshness SLO and page FinOps if a scheduled run is late or fails, which for a batch platform is the failure that actually matters.
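The freshness SLO reduces to a date comparison. The sketch below assumes the agreed deadline is a fixed day of the month after the billing period; the default of day 5 is an invented placeholder, and the function names are hypothetical.

```python
from datetime import date


def chargeback_due_date(period_start: date, due_day: int = 5) -> date:
    """Deadline: `due_day` of the month following the billing period."""
    year = period_start.year + (period_start.month // 12)  # December rolls the year
    month = period_start.month % 12 + 1
    return date(year, month, due_day)


def is_late(period_start: date, today: date, ledger_written: bool, due_day: int = 5) -> bool:
    """True when the deadline has passed and no ledger exists for the period."""
    return (not ledger_written) and today > chargeback_due_date(period_start, due_day)
```

A scheduled monitor could evaluate `is_late` once a day and page FinOps on the first `True`, which is the batch-platform equivalent of an uptime alert.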

Observability. Instrument the pipeline in Datadog as an end-to-end batch SLO: run duration, success/failure of each Step Functions stage, the data-freshness lag (how stale the latest chargeback ledger is), and the reconciliation delta between summed chargebacks and the invoice. Emit the business metrics FinOps lives by — untagged-spend ratio, shared-pool size as a percentage of total, per-unit month-over-month variance, and anomaly-alert count per unit — so a degrading tag culture or a runaway workload surfaces on a dashboard rather than in a quarterly surprise. New allocation rules pass through a ServiceNow change approval before going live, giving finance a documented gate on changes to how units are billed.

Governance. The allocation logic is financial logic, so govern it like financial controls: the FinOps council ratifies the shared-cost key and the mandatory tag taxonomy, the SQL that implements them lives in version control and changes only through reviewed pull requests, and every monthly chargeback ledger is retained immutably for the audit horizon the regulator requires. Pin the report definition and never let allocation rules drift silently; promote changes through the change gate; and keep a documented, reproducible path from any historical statement back to the exact CUR bytes that produced it — which is the artifact that ends a dispute.

Explicit tradeoffs

Accept these or do not build it. A CUR-driven chargeback platform is a batch system with a lag — the freshest possible chargeback is still a few days after month-end, because that is when AWS finalizes the bill, and no architecture changes that physics. It lives or dies on tag discipline, which is an organizational problem the platform can surface and nudge but cannot solve by itself; the shared-pool apportionment is a real, defensible answer to imperfect tags, but it is an approximation, and a unit that stays sloppy on tags will rightly feel its apportioned share is fuzzy. The Athena-not-warehouse choice keeps cost near zero but demands query discipline — one un-partitioned ad-hoc query can scan the whole report. And the Okta-to-Entra federation plus QuickSight RLS plus Vault-held creds are all overhead a five-account startup can skip and a 40-account regulated payer absolutely cannot.

The alternatives, and when they win. If you run a handful of accounts and a small bill, AWS Cost Explorer with cost categories and a couple of activated tags is enough showback and you do not need this pipeline at all. If your finance team has standardized on a third-party FinOps platform — CloudHealth, Apptio Cloudability, or similar — those ingest the CUR and do much of this allocation out of the box, and the build-versus-buy decision turns on whether your allocation rules are unusual enough, and your data-residency or customization needs strict enough, to justify owning the code; for a regulated payer with bespoke shared-cost rules and an auditor to satisfy, owning the SQL is often the right call. If you need cost data joined to business metrics — cost per claim processed, cost per member — you extend this exact pipeline by joining the chargeback ledger to operational data in Athena, which is the natural next step once the unit-cost discipline is in place.

The shape of the win

For the payer’s CFO, the payoff is not “a dashboard.” It is that at the next quarterly close, finance hands each business-unit GM a signed statement that reconciles to the AWS invoice to the cent, the claims division can see its own RDS fleet is in fact its largest line, the analytics team can see its training spend with the untagged residue fairly apportioned and nobody else’s costs on its bill — and because every number traces back to immutable CUR data through version-controlled SQL, the dispute that used to eat the first week of every quarter simply does not happen. That last point is what funds the platform: the chargeback is no longer a fight, it is a fact. Everything upstream — the locked-down CUR bucket, the Athena allocation queries, the explicit untagged handling, the QuickSight row-level security, the ServiceNow chargeback push, the Cost Anomaly Detection guardrail, the Datadog freshness SLO — exists so that a CFO, a business-unit GM, and an auditor each look at the same monthly number and agree it is right. Start narrower if your estate is small, but for a regulated organization spending real money across many accounts, this is where accountable cloud cost has to land.

Written by Vinod
