PCI-DSS Cardholder Data Environment on AWS: Segmentation and Scope Reduction

A mid-market e-commerce retailer — call it a fashion marketplace doing about 40 million card transactions a year — gets the letter every CISO dreads: its acquiring bank has bumped it from a self-assessment questionnaire to a full Report on Compliance (RoC), because volume crossed the Level 1 threshold. A QSA will now walk the environment annually, and the auditor’s opening question is always the same and always the most expensive one: “Show me everything in scope for PCI.” The honest answer, the first time anyone asks, is usually “we don’t entirely know” — and that ambiguity is what turns a tractable audit into a six-figure, six-month ordeal. The retailer’s mandate to the platform team is blunt: build the cardholder data environment on AWS so that scope is small, provable, and stays small — so that the answer to the auditor’s question is a short list, not the whole estate. This article is the reference architecture for doing exactly that.

The pressures here are specific to payments and they compound. Scope is the master variable: every system that stores, processes, or transmits cardholder data (CHD), plus every system connected to it, falls under all twelve PCI-DSS requirements — so an un-segmented network drags your entire AWS estate into the audit. Regulation moved the goalposts: PCI-DSS v4.0 became mandatory in 2024, the future-dated requirements landed in March 2025, and it explicitly demands continuous control monitoring rather than a once-a-year snapshot. Cost is real money: every in-scope server is an instance a QSA inspects, a CrowdStrike sensor you license, a config you must evidence. And breach liability is existential — a card-data breach means fines, forensic investigation, brand damage, and potential loss of the ability to process cards at all. The single highest-leverage move against all four is the same: reduce scope, and prove the reduction holds.

Why segmentation and tokenization are the whole game

Two techniques do most of the work, and it is worth being precise about what each one buys, because teams routinely conflate them.

Network segmentation is what keeps your non-CDE systems out of scope. PCI-DSS scope follows connectivity: if your marketing analytics box can reach the CDE, the marketing box is in scope. Segmentation draws a hard, enforced boundary so that only a tiny, named set of systems can touch cardholder data, and everything else is demonstrably isolated. A QSA tests segmentation directly — they will try to route from a non-CDE subnet into the CDE and expect it to fail. Effective segmentation is the difference between a 30-system audit and a 3,000-system audit.

Tokenization is what keeps cardholder data out of your systems entirely, which shrinks scope from the inside. Instead of storing a primary account number (PAN), you swap it for a meaningless surrogate (a token) the instant it arrives, and store only the token. The real PAN lives in exactly one hardened vault (or, better, never lands in your environment at all because a payment provider tokenizes at capture). Systems that handle only tokens, and that have no way to detokenize or reach the vault, can be taken out of PCI scope, because a token that cannot be turned back into a PAN is not cardholder data. Tokenization is strictly better than encrypting-and-storing PANs yourself: encrypted CHD is still CHD in scope, with all the key-management burden that implies.

The design philosophy that follows is descope first, protect what remains second. You make the in-scope footprint as small as physically possible, then you wrap that small footprint in defense-in-depth so thorough that the QSA’s checklist is satisfied by configuration they can read, not by promises.

Architecture overview

[Figure: architecture diagram of the PCI-DSS cardholder data environment on AWS]

The estate is organized as an AWS Organizations multi-account structure. This is non-negotiable for serious PCI work, because an AWS account is the cleanest, most auditable isolation boundary AWS offers. A dedicated CDE account holds everything in scope; separate accounts hold the public storefront, shared security tooling, logging, and the rest of the business. Service Control Policies (SCPs) at the OU level enforce guardrails the CDE account cannot escape: no disabling CloudTrail, no creating internet gateways, no creating resources outside approved regions. One guardrail needs a different mechanism: SCP conditions cannot inspect the CIDR inside a security group rule, so blocking 0.0.0.0/0 on sensitive ports is enforced with AWS Config rules (with remediation) and the IaC gate rather than by SCP.

The defining property of the topology is the one the QSA tests hardest: the CDE is a network island. It is an isolated VPC with no internet gateway. Inbound payment traffic arrives only through a tightly controlled path; outbound traffic is forced through an inspection firewall; and the systems that handle a real PAN are a handful of components you can name on one hand.
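A minimal Terraform sketch of the island property, under stated assumptions: the CIDR ranges, the availability zone, and the `firewall_endpoint_id` and `flow_log_bucket_arn` variables are hypothetical placeholders, not values from this architecture. It shows the two things a QSA would look for: no `aws_internet_gateway` resource anywhere, and a default route that points at the Network Firewall endpoint.

```hcl
# Hypothetical inputs: the Network Firewall endpoint for this AZ and the
# central logging bucket. Neither value comes from the article.
variable "firewall_endpoint_id" {
  type = string
}

variable "flow_log_bucket_arn" {
  type = string
}

# The CDE VPC. Note what is absent: there is no aws_internet_gateway.
resource "aws_vpc" "cde" {
  cidr_block           = "10.50.0.0/16" # placeholder range
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# One private workload subnet; a real deployment repeats this per AZ.
resource "aws_subnet" "payments_a" {
  vpc_id                  = aws_vpc.cde.id
  cidr_block              = "10.50.1.0/24"
  availability_zone       = "us-east-1a" # placeholder AZ
  map_public_ip_on_launch = false
}

# All non-local traffic from the workload subnet goes to the firewall endpoint.
resource "aws_route_table" "payments_a" {
  vpc_id = aws_vpc.cde.id

  route {
    cidr_block      = "0.0.0.0/0"
    vpc_endpoint_id = var.firewall_endpoint_id
  }
}

resource "aws_route_table_association" "payments_a" {
  subnet_id      = aws_subnet.payments_a.id
  route_table_id = aws_route_table.payments_a.id
}

# Flow logs for the whole VPC, delivered to the central logging account.
resource "aws_flow_log" "cde" {
  vpc_id               = aws_vpc.cde.id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = var.flow_log_bucket_arn
}
```

What sits behind the firewall endpoint (the path the inspected traffic takes onward) is deliberately left out of this sketch.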

Transaction path, following the data and control flow:

  1. A shopper checks out on the storefront, which runs in the storefront account — ECS Fargate behind an Application Load Balancer, fronted by Akamai at the edge for TLS termination, global anycast, App & API Protector WAF, and Bot Manager to absorb credential-stuffing and carding attacks before they reach AWS. Critically, the storefront uses a hosted payment field / client-side tokenization so the raw PAN is captured directly into an iframe served by the payment tokenization provider and never touches the storefront’s own servers. This single decision keeps the entire storefront out of the strictest PCI scope (it qualifies for the much lighter SAQ A-style footprint for the web tier).
  2. For flows that must handle the PAN server-side — recurring billing, certain card-on-file operations, settlement reconciliation — traffic enters the CDE VPC through a private, internal path: an internal ALB reachable only from the storefront account over a peered, narrowly-routed connection or via PrivateLink, never from the internet.
  3. Inside the CDE, the payment microservices run on a private EKS cluster (or ECS Fargate) in private subnets only. The PAN is tokenized immediately by the tokenization service, which is the only component permitted to call the payment provider’s vault. From that point on, every other system — orders, fulfilment, analytics — sees only tokens.
  4. The tokenization service and any service that must briefly handle a PAN pull their secrets — provider API keys, the tokenization vault credentials, database passwords — from HashiCorp Vault running in the security account, via Kubernetes auth with short-lived dynamic leases. No long-lived card-provider credential ever sits in a Kubernetes Secret, an environment variable, or (the cardinal sin) a git repo.
  5. The token vault mapping, plus any card data that genuinely must be stored (such as the truncated PAN's last four digits and a token reference), lives in an Aurora PostgreSQL cluster encrypted with a customer-managed KMS key (envelope encryption), in CDE-only private subnets, reachable only from the payment service security group.
  6. All outbound traffic from the CDE — the calls to the payment provider, OS patch fetches, telemetry — is forced through AWS Network Firewall in a dedicated inspection subnet, with a strict domain allow-list (egress filtering). The CDE can reach the payment provider’s API endpoints and the patch mirror and nothing else. This is the control that stops a compromised container from exfiltrating tokens or beaconing to a C2 server.
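Step 5 of the path above can be sketched in Terraform roughly as follows. Every variable and resource name here is a hypothetical placeholder; in particular, the master credentials are passed in as variables only to keep the sketch short, whereas the article sources them from Vault.

```hcl
# Hypothetical inputs, not taken from the article.
variable "cde_vpc_id" {
  type = string
}

variable "payment_service_sg_id" {
  type = string
}

variable "cde_db_subnet_group" {
  type = string
}

variable "db_master_username" {
  type = string
}

variable "db_master_password" {
  type      = string
  sensitive = true
}

# Customer-managed key for the CDE, with automatic rotation.
resource "aws_kms_key" "cde" {
  description             = "CDE customer-managed key"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

# Security group for the token store; it carries no inline rules.
resource "aws_security_group" "token_db" {
  name   = "cde-token-db"
  vpc_id = var.cde_vpc_id
}

# The only ingress: PostgreSQL from the payment service security group.
resource "aws_vpc_security_group_ingress_rule" "db_from_payments" {
  security_group_id            = aws_security_group.token_db.id
  referenced_security_group_id = var.payment_service_sg_id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

# Aurora PostgreSQL cluster encrypted with the CDE key.
resource "aws_rds_cluster" "token_store" {
  cluster_identifier     = "cde-token-store"
  engine                 = "aurora-postgresql"
  storage_encrypted      = true
  kms_key_id             = aws_kms_key.cde.arn
  db_subnet_group_name   = var.cde_db_subnet_group
  vpc_security_group_ids = [aws_security_group.token_db.id]
  master_username        = var.db_master_username
  master_password        = var.db_master_password
  deletion_protection    = true
}
```

Referencing the payment service's security group, rather than a CIDR range, is what makes "reachable only from the payment service" something an auditor can read straight off the configuration.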

Identity and operations path, layered across the whole thing: human operators reach the CDE only through Okta-federated SSO into AWS IAM Identity Center, mapped to least-privilege permission sets, with phishing-resistant MFA and just-in-time elevation — there are no IAM users with static keys in the CDE account. CrowdStrike Falcon sensors run on every CDE node and container for runtime threat detection and file-integrity monitoring (FIM). Wiz continuously scans the CDE account for misconfiguration, vulnerabilities, exposed data, and PCI control drift, generating the audit evidence. And ServiceNow is the system of record for every change into the CDE, every access request, and every security incident — the paper trail a QSA lives in.

Component breakdown

| Component | Service / tool | Role in the CDE | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot/carding mitigation at the perimeter | App & API Protector; Bot Manager rules for carding; rate controls on checkout |
| Storefront (out of scope) | ECS Fargate + ALB (storefront acct) | Web tier using hosted payment fields | Client-side tokenization iframe; no PAN on servers |
| Account isolation | AWS Organizations + SCPs | Hard isolation boundary for the CDE | Dedicated CDE account; SCPs deny public exposure, region lock |
| CDE network | Isolated VPC, private subnets, no IGW | The PCI network island | No internet gateway; flow logs on; segmentation tested |
| Ingress | Internal ALB / PrivateLink | Controlled inbound to payment services | Reachable only from storefront acct; no public listener |
| Payment compute | Private EKS / ECS Fargate | Tokenization + card-on-file logic | Private cluster; CDE-only security groups; least privilege |
| Tokenization | Provider token vault + tokenization svc | Swap PAN for token at the boundary | Only component allowed to call the vault; PAN never persisted elsewhere |
| Secrets | HashiCorp Vault | Provider keys, DB creds, vault credentials | Kubernetes auth; dynamic short-lived leases; no static keys |
| Data store | Aurora PostgreSQL + KMS CMK | Token mappings / minimal stored CHD | Customer-managed key; envelope encryption; CDE-only access |
| Egress control | AWS Network Firewall | Force + filter all outbound traffic | Domain allow-list; default-deny egress; TLS inspection where allowed |
| Identity | Okta + AWS IAM Identity Center | Human access to the CDE | SSO federation; least-privilege permission sets; JIT + phishing-resistant MFA |
| Runtime security | CrowdStrike Falcon | Runtime threat detection + FIM on CDE hosts | Sensor on all nodes/containers; FIM for PCI Req 11.5; detections to SIEM |
| Compliance posture | Wiz | Continuous CSPM, vuln, data-exposure, PCI evidence | PCI-DSS framework mapping; drift alerts; agentless data scanning |
| Logging | CloudTrail + Config + centralized account | Immutable audit trail and config history | Org-wide CloudTrail; Config rules; S3 Object Lock (WORM) |
| ITSM / change | ServiceNow | Change records, access requests, incidents | Change gate on CDE deploys; auto-ticket on Wiz/Falcon findings |
| CI / IaC | GitHub Actions + Terraform | Pipeline; infra as code with policy scanning | OIDC to AWS (no stored creds); Wiz Code / Checkov IaC gate |

A few of these choices carry the audit and deserve the why, because they are the ones teams get wrong and pay for later.

Why a separate AWS account, not just a separate VPC. A VPC boundary is real, but an account boundary is stronger and far easier to evidence: separate IAM, separate quotas, SCP guardrails the workload genuinely cannot override, and a billing/blast-radius line that a QSA understands instantly. When the auditor asks “what is in scope,” pointing at one account with a defined set of resources is a vastly cleaner answer than carving a VPC out of a shared account and arguing about the IAM and peering edges. The account is the scope boundary.

Why force and filter egress. Inbound filtering is obvious; the control that actually stops a breach from becoming a data exfiltration is egress control. A compromised payment container that can open arbitrary outbound connections will quietly ship tokens or PANs to an attacker. By routing all CDE outbound traffic through AWS Network Firewall with a default-deny, domain-allow-list policy, the only destinations reachable are the payment provider and the patch mirror — so even a fully owned container has nowhere to send the data. PCI-DSS v4.0 expects exactly this kind of outbound restriction, and it is the control that converts “we got popped” into “we got popped and lost nothing.”

Why tokenize at capture, not encrypt-and-store. If you encrypt PANs and store the ciphertext, that ciphertext is still cardholder data in scope, you own the entire key lifecycle, and a key compromise is a PAN compromise. If instead the PAN is tokenized at the moment of capture (ideally client-side, in the provider’s iframe) and you store only tokens, those tokens are not cardholder data, the systems holding them fall out of scope, and a database breach yields meaningless surrogates. The contrast is stark:

| Approach | What you store | PCI scope of that store | Breach impact |
|---|---|---|---|
| Encrypt PAN yourself | Encrypted PAN (still CHD) | Full PCI scope + key management | Key compromise = PAN compromise |
| Tokenize at capture | Surrogate token only | Out of scope (tokens ≠ CHD) | Tokens useless to an attacker |

The right answer is almost always to push tokenization as close to capture as possible and let the systems behind it deal only in tokens.

Implementation guidance

Provision with Terraform, and treat the account topology and network as the first deliverables. Get the boundaries right before any application code exists, because retrofitting segmentation onto a running system is how audits slip.

  1. The AWS Organizations layout: a Security OU (logging, Vault/security tooling, Wiz, CrowdStrike management), a dedicated CDE OU/account, and separate accounts for the storefront and the rest of the business. SCPs on the CDE OU deny internet gateway creation, deny disabling CloudTrail/Config, and lock the account to approved regions. Opening sensitive ports to 0.0.0.0/0 cannot be expressed as an SCP condition (there is no condition key for a rule's CIDR), so that guardrail is enforced with AWS Config rules and the IaC scan instead.
  2. The CDE VPC: private subnets across multiple AZs, no internet gateway, a separate inspection subnet for AWS Network Firewall, VPC Flow Logs enabled to the central logging account, and route tables that send all 0.0.0.0/0 egress to the firewall endpoint.
  3. Encryption everywhere: a customer-managed KMS key for the CDE with a tight key policy and automatic rotation; Aurora, EBS, and S3 in the CDE all bound to it; TLS enforced on every listener.
  4. AWS Network Firewall with a stateful rule group implementing default-deny egress and an explicit domain allow-list for the payment provider and patch mirror only.
  5. The private EKS/ECS workloads with CDE-only security groups, IRSA / task roles granting exactly the permissions each service needs, and CrowdStrike sensors deployed as a DaemonSet from the start.
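The guardrails in step 1 might look roughly like this as a Terraform-managed SCP. Treat it as a sketch under assumptions: `cde_ou_id` and `approved_regions` are hypothetical variables, and the list of global services exempted from the region lock is abbreviated and would need extending for a real account. The open-port guardrail is intentionally absent, since SCP conditions cannot match the CIDR of a security group rule; that check is assumed to live in AWS Config and the IaC scan.

```hcl
# Hypothetical inputs, not taken from the article.
variable "cde_ou_id" {
  type = string
}

variable "approved_regions" {
  type    = list(string)
  default = ["us-east-1", "us-west-2"] # placeholder regions
}

resource "aws_organizations_policy" "cde_guardrails" {
  name = "cde-guardrails"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Nobody in the CDE account can switch off the audit trail.
        Sid    = "DenyAuditTrailTampering"
        Effect = "Deny"
        Action = [
          "cloudtrail:StopLogging",
          "cloudtrail:DeleteTrail",
          "config:StopConfigurationRecorder",
          "config:DeleteConfigurationRecorder"
        ]
        Resource = "*"
      },
      {
        # Keeps the network island an island.
        Sid      = "DenyInternetGateways"
        Effect   = "Deny"
        Action   = ["ec2:CreateInternetGateway", "ec2:AttachInternetGateway"]
        Resource = "*"
      },
      {
        # Region lock; global services are exempted (abbreviated list).
        Sid       = "DenyUnapprovedRegions"
        Effect    = "Deny"
        NotAction = ["iam:*", "organizations:*", "sts:*", "support:*"]
        Resource  = "*"
        Condition = {
          StringNotEquals = {
            "aws:RequestedRegion" = var.approved_regions
          }
        }
      }
    ]
  })
}

# Attach the policy to the CDE OU so every account under it inherits it.
resource "aws_organizations_policy_attachment" "cde" {
  policy_id = aws_organizations_policy.cde_guardrails.id
  target_id = var.cde_ou_id
}
```

Because the policy is attached at the OU, an administrator inside the CDE account cannot remove it, which is the property that makes it usable as audit evidence.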

A minimal Terraform shape for the egress firewall policy communicates the intent — deny by default, allow a named few:

resource "aws_networkfirewall_rule_group" "cde_egress_allowlist" {
  capacity = 100
  name     = "cde-egress-allowlist"
  type     = "STATEFUL"
  rule_group {
    rules_source {
      rules_source_list {
        generated_rules_type = "ALLOWLIST"   # drops HTTP/TLS to any domain not listed; other protocols need their own rule
        target_types         = ["TLS_SNI", "HTTP_HOST"]
        targets = [
          "api.payments-provider.example",    # tokenization vault + gateway
          "patch-mirror.internal.example",    # OS/package updates only
        ]
      }
    }
  }
}

The pipeline that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak — and every Terraform plan is scanned by Wiz Code (and/or Checkov) on the pull request, with a critical IaC finding (a public S3 bucket, an over-broad security group, an unencrypted volume) failing the build before it ever reaches the CDE. This is PCI Requirement 6 — secure development — enforced mechanically rather than by reviewer goodwill.
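The OIDC trust described above could be expressed along these lines. This is a sketch that assumes the GitHub OIDC identity provider already exists in the account; the role name and the `example-org/cde-infra` repository are hypothetical, and no permissions policy is attached here.

```hcl
# Assumes the GitHub Actions OIDC provider was registered in the account already.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Role the pipeline assumes. Only workflows from one repository and one
# branch can obtain credentials, and those credentials are short-lived.
resource "aws_iam_role" "cde_terraform" {
  name                 = "cde-terraform-deploy" # hypothetical name
  max_session_duration = 3600

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = "sts:AssumeRoleWithWebIdentity"
        Principal = {
          Federated = data.aws_iam_openid_connect_provider.github.arn
        }
        Condition = {
          StringEquals = {
            "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
            # Hypothetical repository and branch.
            "token.actions.githubusercontent.com:sub" = "repo:example-org/cde-infra:ref:refs/heads/main"
          }
        }
      }
    ]
  })
}
```

Pinning the `sub` claim to a single repository and branch is the important part: without it, any workflow that can obtain a token from the same provider could assume the role.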

Identity: kill the static keys, federate the humans, scope the machines. No IAM users with long-lived access keys exist in the CDE account — that is a control you assert with an SCP and verify with Wiz. Human operators authenticate through Okta, which federates into AWS IAM Identity Center; Okta group membership maps to least-privilege permission sets, conditional access enforces device posture and phishing-resistant MFA, and any standing administrative access is replaced by just-in-time elevation with a logged approval (PCI v4.0 leans hard on least privilege and MFA for all access into the CDE). Workloads use IRSA (EKS) or task roles (ECS) scoped to the single resource each one needs. The residual secrets that are not IAM roles — payment-provider API keys, the tokenization vault credential, the Aurora master password — live in HashiCorp Vault, leased dynamically with short TTLs and injected at runtime, so they rotate automatically and never persist in an image or a config map.
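For the Vault side, a sketch using the Terraform Vault provider is shown here. It assumes a Kubernetes auth backend mounted at `kubernetes`, a database secrets engine at `database/`, and a KV v2 mount at `secret/`; the role, policy, service account, namespace, and path names are all hypothetical.

```hcl
# Policy: the tokenization service may read dynamic database credentials
# and the payment-provider API key, and nothing else.
resource "vault_policy" "tokenizer" {
  name = "cde-tokenizer"

  policy = <<-EOT
    path "database/creds/token-store-rw" {
      capabilities = ["read"]
    }

    path "secret/data/cde/payment-provider" {
      capabilities = ["read"]
    }
  EOT
}

# Kubernetes auth role: only the named service account in the named
# namespace can log in, and the resulting Vault token is short-lived.
resource "vault_kubernetes_auth_backend_role" "tokenizer" {
  backend                          = "kubernetes"
  role_name                        = "cde-tokenizer"
  bound_service_account_names      = ["tokenization-svc"]
  bound_service_account_namespaces = ["payments"]
  token_policies                   = [vault_policy.tokenizer.name]
  token_ttl                        = 900  # 15 minutes
  token_max_ttl                    = 3600 # 1 hour
}
```

The TTL values are illustrative. The point is that a pod's identity, not a stored secret, is what unlocks the credentials, and that whatever it obtains expires on its own.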

File-integrity monitoring and runtime detection are not optional. PCI-DSS Requirement 11.5 mandates change detection on critical files, and Requirement 10 mandates that you detect and respond to anomalies. CrowdStrike Falcon covers both: its sensor on every CDE node and container provides runtime threat detection (a process spawning a shell, an unexpected outbound attempt, a known-bad binary) and FIM on system and application binaries, streaming detections into the SIEM so the SOC has a real-time signal — not a log nobody reads.

Enterprise considerations

Security & Zero Trust. The architecture is Zero Trust by construction within the CDE: no implicit network trust (segmented island, default-deny egress), identity-based access only (Okta → IAM Identity Center, least privilege), and machine identity scoped per resource. Layer on top the controls a QSA itemizes: encryption of CHD at rest (KMS CMK envelope encryption) and in transit (TLS everywhere) for Requirements 3 and 4; Vault-managed secrets and rotation; CrowdStrike runtime protection and FIM for Requirements 10 and 11; Akamai WAF for Requirement 6.4’s public-facing-app protection and for carding defense; and AWS Network Firewall egress control as the exfiltration backstop. Every security event of consequence auto-raises a ServiceNow incident, so there is a ticket and an owner, not just a log line. The throughline: each control maps to a numbered PCI requirement, and each one is configured in a way an auditor can read.

Continuous compliance evidence — the v4.0 mandate. The defining shift in PCI-DSS v4.0 is from annual snapshot to continuous monitoring, and Wiz is the engine for it here. Wiz maps its findings directly to the PCI-DSS framework, continuously scans the CDE account for misconfiguration, unencrypted data, vulnerable images, exposed resources, and control drift — and crucially generates the evidence the QSA wants: a dated, exportable record that encryption was on, that no resource drifted to public, that vulnerabilities were remediated within the required window. When Wiz flags a drift (someone widened a security group, a new image shipped a critical CVE), it auto-tickets ServiceNow for closed-loop remediation. This turns the audit from a frantic six-week evidence-gathering scramble into “export the report.” That single capability is often the difference between passing the RoC on the first pass and not.

Cost optimization. Scope reduction is the cost story — fewer in-scope systems means fewer CrowdStrike licenses, fewer instances a QSA inspects, fewer controls to evidence. Beyond that:

| Lever | Mechanism | Typical effect |
|---|---|---|
| Aggressive tokenization | Push tokenization client-side / to capture | Removes whole subsystems from scope |
| Fargate for spiky payment load | Serverless containers, scale to demand | No idle CDE compute billed off-peak |
| Right-sized CDE only | Keep the island genuinely small | Fewer sensors, fewer audited instances |
| Consolidated logging | One central logging account with lifecycle tiering | Cheaper long-term WORM retention |
| Reserved capacity for steady core | RIs/Savings Plans on always-on CDE baseline | Discount on the predictable floor |

The counterintuitive truth: spending on tokenization and segmentation up front is the cheapest line item in the whole program, because it shrinks everything downstream.

Scalability. Each tier scales independently. The storefront (out of scope) scales freely on Fargate behind Akamai. The CDE payment services scale on EKS/ECS by concurrency — and because the storefront is decoupled and most traffic never enters the CDE, the in-scope tier handles only the genuine PAN-touching flows, which are a fraction of total volume. Aurora scales reads with replicas; the token vault is the provider’s concern at scale. AWS Network Firewall scales horizontally and is managed. The natural ceiling is the payment provider’s throughput and your KMS request rate, both of which are raised by request well ahead of a peak event.

Failure modes, and what each one looks like. Name them before they page you — or before they fail an audit.

Reliability & DR (RTO/RPO). Payment availability is revenue and reputation, so decide the numbers per tier and — importantly — replicate the controls, not just the data, because a DR region is in PCI scope too. Aurora Global Database gives cross-region replication with low RPO (seconds) for the token-mapping store; KMS keys are replicated as multi-region keys so the DR region can decrypt; the EKS/ECS workloads and the full CDE network (including Network Firewall rules, Vault, CrowdStrike) are defined in Terraform so the standby region is a terraform apply away and identically hardened. A pragmatic target for this platform: RTO 30 minutes, RPO 1 minute for the payment path, with the explicit discipline that the secondary region carries the same SCPs, the same egress allow-list, and the same sensors — a DR region with weaker controls is a compliance gap, not a safety net. Akamai health checks drive edge failover for ingress.
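The cross-region pieces named above (a multi-region KMS key and an Aurora global cluster) can be sketched in Terraform as follows. The two regions and all identifiers are hypothetical, and the regional member clusters that would join the global cluster are omitted to keep the sketch short.

```hcl
# Two provider aliases: primary and DR region (placeholder regions).
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Multi-region primary key in the primary region.
resource "aws_kms_key" "cde_primary" {
  provider            = aws.primary
  description         = "CDE multi-region key (primary)"
  multi_region        = true
  enable_key_rotation = true
}

# Replica of the same key in the DR region, so data encrypted under the
# primary key can be decrypted there after failover.
resource "aws_kms_replica_key" "cde_dr" {
  provider        = aws.dr
  description     = "CDE multi-region key (DR replica)"
  primary_key_arn = aws_kms_key.cde_primary.arn
}

# Global cluster shell; regional Aurora clusters attach to it through
# their global_cluster_identifier argument.
resource "aws_rds_global_cluster" "token_store" {
  provider                  = aws.primary
  global_cluster_identifier = "cde-token-store-global"
  engine                    = "aurora-postgresql"
  storage_encrypted         = true
}
```

Defining the DR key and database in the same configuration as the primary is one way to keep the standby region from drifting into weaker controls.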

Observability. Beyond CrowdStrike’s security telemetry, instrument the payment path for operations and for audit: CloudTrail organization-wide (immutable, to an S3 bucket with Object Lock / WORM so the audit trail cannot be altered — Requirement 10), AWS Config with PCI-relevant rules recording every configuration change, VPC Flow Logs on the CDE, and application metrics/traces for latency and error rate on the tokenization and settlement services. Centralize logs in the dedicated logging account that the CDE can write to but not read or delete, so an attacker who owns the CDE still cannot tamper with the evidence. The metrics that matter to the business and the auditor: tokenization success rate, p95 checkout latency, egress-deny count (a spike is an incident), control-drift count from Wiz, and time-to-remediate on findings.
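A sketch of the tamper-resistant audit trail, assuming it is applied from the organization's management account (or a delegated administrator) with the bucket in the logging account. The bucket name is hypothetical, the bucket policy that CloudTrail needs in order to write is omitted, and the 365-day retention is an assumption chosen to line up with the twelve months of log history PCI-DSS asks for.

```hcl
# WORM bucket for audit logs. Object Lock must be enabled at creation.
resource "aws_s3_bucket" "audit_logs" {
  bucket              = "example-org-cde-audit-logs" # hypothetical name
  object_lock_enabled = true
}

# Object Lock requires versioning.
resource "aws_s3_bucket_versioning" "audit_logs" {
  bucket = aws_s3_bucket.audit_logs.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Compliance mode: no principal, root included, can shorten or remove the
# retention on an object version before it expires.
resource "aws_s3_bucket_object_lock_configuration" "audit_logs" {
  bucket     = aws_s3_bucket.audit_logs.id
  depends_on = [aws_s3_bucket_versioning.audit_logs]

  rule {
    default_retention {
      mode = "COMPLIANCE"
      days = 365
    }
  }
}

# Organization-wide, multi-region trail with log file integrity validation.
resource "aws_cloudtrail" "org" {
  name                       = "org-trail"
  s3_bucket_name             = aws_s3_bucket.audit_logs.id
  is_organization_trail      = true
  is_multi_region_trail      = true
  enable_log_file_validation = true
}
```

Compliance mode is a one-way door: test the retention period in a scratch bucket first, because objects written under it cannot be deleted early.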

Governance. Every change into the CDE flows through a ServiceNow change record with approval — Requirement 6’s change control, enforced as a gate in the pipeline, not a courtesy. Access into the CDE is requested and granted through ServiceNow with a logged approver and an expiry. Pin and scan container base images; promote them through the Wiz Code / IaC gate. Apply AWS Config and SCPs as the policy backbone, with Wiz as the independent verifier that the policy is actually holding — defense in depth applied to governance itself. Keep all infrastructure and firewall policy in version control, reviewable and instantly revertable, so the state of the CDE is always a readable, auditable artifact.

Explicit tradeoffs

Accept these or do not build it. A scoped, multi-account, segmented CDE adds real complexity: an Organizations structure to run, a network island with no convenient public debugging path, default-deny egress that breaks the first time a developer assumes they can curl anything, and a tokenization boundary that every new feature must respect. Tokenization itself adds a hop and a dependency on the payment provider’s vault — and a detokenization path that is, by definition, the most sensitive operation in the system and must be guarded accordingly. The continuous-compliance tooling (Wiz, CrowdStrike, the ServiceNow change discipline) is ongoing license and process cost you cannot meaningfully skip for a Level 1 RoC. And the Okta-to-IAM-Identity-Center federation with JIT elevation adds friction to operator access that a “just give everyone admin” shop will resent — until the first audit, when that friction is exactly what passes Requirement 7.

The alternatives, and when they win. If your transaction volume is genuinely low and you can use a payment provider’s fully hosted checkout / redirect so that no card data ever touches your infrastructure in any flow, you may qualify for SAQ A and skip most of this — that is the simplest path and the right one for many small merchants; build the full CDE only when you have server-side flows (card-on-file, recurring billing, marketplaces splitting payments) that force a PAN into your environment. If you are deeply invested in a single cloud’s native primitives, AWS-native equivalents (Security Hub’s PCI standard, GuardDuty, Inspector) can substitute for parts of Wiz/CrowdStrike — at the cost of the unified, audit-ready, multi-cloud evidence those tools provide. And if you operate at the scale where you run your own tokenization vault rather than the provider’s, that is a legitimate architecture — but it pulls the vault firmly back into your most sensitive scope, which is exactly the burden tokenization-at-capture exists to avoid; only take it on when contractual or latency requirements truly demand it.

The shape of the win

For the retailer, the payoff is not “we passed the audit,” though they did. It is that when the QSA opens with “show me everything in scope,” the answer is a single AWS account, a named handful of payment services, one encrypted token store, and a Wiz report dated this morning proving every control held — instead of a frightened sweep of the whole estate. Because the storefront tokenizes at capture and never sees a PAN, the entire web tier — the largest, most-changed, most-exposed part of the business — sits outside the strictest scope, so the engineering teams that ship features daily are not dragged through PCI on every release. That is the sentence that pays for the architecture. Everything upstream — the account isolation, the segmented island, the forced egress, the KMS envelope encryption, the Vault-held secrets, the CrowdStrike sensors, the continuous Wiz evidence, the ServiceNow change trail — exists to make a QSA, an acquiring bank, and a CISO each say yes, and to keep the answer to one dangerous question short. Start narrower if your flows allow it, but when real card data must live in your environment, this is where a PCI-DSS cardholder data environment has to land.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
