A fraud team at a digital bank measures its life in milliseconds and basis points. Every card authorization, every account-to-account transfer, every login from a new device is a decision that must be made while the customer waits — the payment network gives you a budget of roughly 150–300ms end to end to say approve or decline, and your model is one hop inside that. Score too slowly and the issuer times out and approves by default, which is precisely the moment fraudsters probe for. Score too aggressively and you decline a legitimate ₹80,000 wedding purchase, the customer calls in furious, and a chargeback-prevention system has just manufactured churn. Underneath both failure modes sits a regulator — RBI, the FFIEC, PSD2’s SCA mandate — that expects you to explain every automated decision and prove the model is not quietly discriminating or quietly decaying. This article is a reference architecture for building that system properly on AWS: not a batch job that flags fraud the next morning, but a streaming machine-learning platform that scores in single-digit-to-low-double-digit milliseconds, keeps its features fresh to the second, watches itself for drift, and turns a confirmed fraud signal into an investigated, audited case.
The business scenario
The driver is adversarial, fast-moving loss against a hard latency ceiling. Consider Cresta Pay, a fictional Indian digital bank and payments processor: ~28 million cards, ~4,000 card authorizations per second at peak (festival sale evenings), plus a growing UPI and account-transfer book. Fraud here is not one thing. It is card-testing bots running thousands of $1 auths to find live numbers; account-takeover where a credential-stuffed login is followed by a payee addition and a drain; first-party “friendly” fraud; and mule-account rings that launder in fan-out patterns no single transaction looks wrong in.
The naive approaches each break in a specific way. Static rules (“decline if amount > X and country ≠ home”) are explainable and fast, but fraudsters reverse-engineer them in hours and they generate brutal false-positive rates on edge-case-but-legitimate behavior. Nightly batch scoring catches yesterday’s fraud — useless against a card-testing burst that empties a BIN range in ninety seconds. A model that only sees the current transaction is blind to the thing that actually distinguishes fraud: velocity and context — that this is the 19th transaction on this card in four minutes, from a device first seen ten seconds ago, on a card whose 90-day average ticket is ₹600 and whose merchant-category entropy just spiked.
Streaming ML threads this. The core idea: maintain a continuously-updated picture of each entity (card, device, account, merchant) as events flow, compute behavioral features in real time, and score each transaction against a model that has those features at its fingertips — all inside the authorization window. The decision is contextual (it knows the entity’s recent history), fresh (the history includes the transaction from 800ms ago), fast (sub-100ms, with headroom), and explainable (the features and their contributions are logged for every decision the regulator might ask about).
The scenario scales cleanly. A small lender runs one model, a few hundred transactions per second, a modest feature set. Cresta Pay runs an ensemble — a gradient-boosted model for card fraud, a separate model for account-takeover, a graph signal for mule rings — at thousands of TPS with a hard p99 latency SLO and a 24/7 fraud-ops floor. The architecture below is the same shape for both; what changes is partition counts, the number of model endpoints, and how aggressively you cache.
Architecture overview
The design has three planes that share infrastructure but run on different clocks: a streaming feature plane (always-on, computes and serves features), a synchronous scoring plane (per-transaction, must beat the latency budget), and a feedback / governance plane (catches confirmed outcomes, retrains, monitors, and feeds case management). Keeping them mentally separate is the first step to operating this without melting down.
Scoring path, numbered as in the diagram: (1) the authorization request hits the payments gateway and is published to Amazon MSK (managed Apache Kafka) on a transactions topic, partitioned by card/account so an entity’s events stay ordered on one partition. (2) The decision service — your low-latency scoring code on EKS, fronted by Akamai at the edge for TLS termination, geo-routing, and L7 DDoS absorption on the public API — consumes the event (or is called synchronously by the gateway and reads context). (3) It fetches the entity’s precomputed features from the online feature store — Amazon DynamoDB (or ElastiCache/Redis for the hottest keys) keyed by card and account, holding rolling aggregates kept current by the stream. (4) It assembles the feature vector and calls a SageMaker real-time endpoint hosting the fraud model(s); the endpoint returns a score and SHAP-style feature attributions in a handful of milliseconds. (5) A thin decisioning layer combines the model score with a small set of hard policy rules and a risk threshold to emit approve / decline / step-up, returns it inside the auth window, and emits a decisions event back to Kafka for audit and feedback.
Feature plane runs continuously and is what makes step (3) instant: an Apache Flink application (on Amazon Managed Service for Apache Flink) consumes the transactions topic and maintains stateful, windowed aggregates per entity — transaction count and sum over 1-minute / 1-hour / 24-hour sliding windows, distinct-device and distinct-merchant counts, time-since-last-transaction, velocity ratios — and writes them to the online store (DynamoDB) the instant they change. The same Flink job (or a sink) mirrors features to Amazon S3 as the offline store, so training data is computed by the exact same code that serves online — the single most important property for avoiding train/serve skew.
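To make the windowed aggregate concrete, here is a minimal plain-Java sketch of a per-card rolling count and sum. It is not Flink code: the class `SlidingVelocity` and its methods are hypothetical names, it assumes events arrive in event-time order (so it needs no watermarks), and it keeps raw events in memory where a production job would use Flink's keyed, checkpointed state.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Rolling count and sum of one card's transactions over a fixed look-back window. */
final class SlidingVelocity {
    private final long windowMillis;
    // Each entry is {eventTimeMillis, amountMinorUnits}, oldest first.
    private final Deque<long[]> events = new ArrayDeque<>();
    private long sumMinor = 0;

    SlidingVelocity(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Adds an event (assumed to arrive in event-time order) and evicts expired ones. */
    void add(long eventTimeMillis, long amountMinor) {
        events.addLast(new long[] {eventTimeMillis, amountMinor});
        sumMinor += amountMinor;
        evict(eventTimeMillis);
    }

    /** Drops events at or before (now - window), so the window covers (now - window, now]. */
    private void evict(long nowMillis) {
        while (!events.isEmpty() && events.peekFirst()[0] <= nowMillis - windowMillis) {
            sumMinor -= events.pollFirst()[1];
        }
    }

    int count() {
        return events.size();
    }

    long sum() {
        return sumMinor;
    }
}
```

One instance per card and per window length (1 minute, 1 hour, 24 hours) would give the velocity features described above. Flink maintains the equivalent aggregate in keyed state and handles out-of-order events through watermarks, which this sketch deliberately leaves out.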
Feedback & governance plane closes the loop: confirmed outcomes — chargebacks (which arrive days later via card-network files), customer-confirmed fraud, and analyst dispositions — are joined back to the original decisions to produce labels in S3; SageMaker Model Monitor and Pipelines watch live feature/score distributions for drift and retrain on schedule or on a drift trigger; and any decision that needs a human — a high-score decline, a flagged mule pattern — is opened as a ServiceNow case for the fraud-operations team to investigate, action, and audit.
Component breakdown
| Component | AWS / third-party | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge ingress | Akamai | TLS, geo-routing, L7 DDoS, bot mitigation on the public scoring/API surface | Bot Manager rules for card-testing patterns; origin to ALB over mTLS |
| Event backbone | Amazon MSK (Kafka) | Durable, ordered transaction + decision event log | Partition by card/account hash; RF=3 across AZs; transactions, decisions, labels topics |
| Stream processing | Managed Service for Apache Flink | Stateful windowed feature computation; exactly-once aggregation | RocksDB state backend; incremental checkpoints to S3; event-time + watermarks for out-of-order |
| Online feature store | DynamoDB (+ ElastiCache for hot keys) | Single-digit-ms feature reads keyed by entity | On-demand or provisioned w/ autoscaling; DAX or Redis for hottest cards; TTL on stale entities |
| Offline feature store | Amazon S3 (Parquet) | Point-in-time-correct training data, same code as online | Partitioned by date/entity; Glue catalog; backs SageMaker training |
| Model serving | SageMaker real-time endpoints | Low-latency scoring; returns score + attributions | Multi-model or multi-variant endpoints; autoscaling on InvocationsPerInstance; inference-recommender-sized instances |
| Decisioning | EKS service | Combine score + policy rules + threshold → approve/decline/step-up | Workload identity (IRSA); circuit breaker + fallback rules; p99 latency SLO budget |
| Drift & quality | SageMaker Model Monitor + Clarify | Data-drift, score-drift, bias, and feature-attribution monitoring | Baseline from training set; hourly monitoring schedule; CloudWatch alarms on violations |
| Retraining | SageMaker Pipelines | Scheduled / triggered retrain, evaluate, register, gated deploy | Model Registry approval gate; shadow/canary before promotion |
| Case management | ServiceNow | Human investigation, disposition, SLA tracking, audit trail | Fraud case table; bi-directional sync of disposition back to labels topic |
| Identity | Okta / Entra ID | SSO + MFA for analysts, engineers, and ServiceNow | SAML/OIDC federation; step-up auth for high-risk admin actions |
| Secrets | HashiCorp Vault | Kafka/SASL creds, DB creds, model-registry tokens, dynamic DB secrets | Short-TTL dynamic secrets; EKS auth via Kubernetes service-account JWT |
| Runtime security | CrowdStrike Falcon | EDR on EKS nodes + container runtime threat detection | Falcon sensor as DaemonSet; blocks lateral movement on the scoring fleet |
| Data posture | Wiz | CSPM / DSPM — finds exposed PII, public buckets, risky IAM across the data path | Agentless scan of S3/DynamoDB/MSK config; alerts on cardholder-data exposure |
| Observability | Datadog (or Dynatrace) | Distributed tracing of the scoring span, latency SLOs, business metrics | APM trace: ingest→feature→infer→decide; SLO monitors on p99; custom fraud-rate metrics |
| CI/CD + IaC | GitHub Actions + Terraform | Build/test/deploy app, infra, and model-promotion automation | OIDC to AWS (no static keys); Terraform for MSK/Flink/EKS/SageMaker; plan gated in PR |
A few choices deserve the why, because they are the ones teams get wrong.
Why a feature store, not features computed at scoring time. The intuitive design is to compute “transactions in the last hour” inside the decision service when a request arrives. Do not — that means a range query against your transaction store on the hot path (slow, and it scales with history), and it means your training code computes the same feature a different way (skew). The feature store inverts this: Flink computes the aggregate as events flow and writes the current value; the scoring path does a single O(1) key lookup. The online store (DynamoDB) serves it in single-digit milliseconds; the offline store (S3) holds the same values with point-in-time correctness for training. One definition, two surfaces, zero skew.
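Point-in-time correctness on the offline side can be illustrated with a small sketch: when building a training row for a decision made at time t, use the latest feature value written at or before t, never a later one. The class `FeatureHistory` and its method names are hypothetical, and an in-memory `TreeMap` stands in for the Parquet data in S3.

```java
import java.util.Map;
import java.util.TreeMap;

/** Offline feature history for one entity, supporting "value as of time t" lookups. */
final class FeatureHistory {
    // Write timestamp (epoch millis) -> feature value at that moment.
    private final TreeMap<Long, Double> valuesByTime = new TreeMap<>();

    void record(long writtenAtMillis, double value) {
        valuesByTime.put(writtenAtMillis, value);
    }

    /** Latest value written at or before asOfMillis, or defaultValue if none exists yet. */
    double asOf(long asOfMillis, double defaultValue) {
        Map.Entry<Long, Double> entry = valuesByTime.floorEntry(asOfMillis);
        return entry == null ? defaultValue : entry.getValue();
    }
}
```

Joining each labeled decision to `asOf(decisionTime, ...)` keeps information from the future out of the training set, which is the property the offline store has to guarantee.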
Why Flink for the feature plane specifically. You can window-aggregate in Kafka Streams or a hand-rolled consumer, but fraud features punish you on three axes Flink handles natively: event-time with watermarks (transactions arrive out of order across partitions and you must not count a late event into the wrong window), large keyed state (millions of active cards, each with rolling aggregates; the RocksDB state backend keeps this off-heap and checkpoints it incrementally), and exactly-once state consistency (a double-counted feature is a false decline, so checkpointed aggregates must not count an event twice after a restart; pair this with idempotent, overwrite-style writes to the online store so a replayed write cannot inflate a value). Flink’s windowed, stateful, exactly-once model is the right shape for “velocity per entity over sliding windows,” and it scales horizontally by adding task slots/parallelism against more Kafka partitions.
Why SageMaker real-time endpoints, not Lambda or in-process. The model must score in a few milliseconds, autoscale with traffic, return feature attributions for explainability, and be swappable without a code deploy. SageMaker real-time endpoints give a managed, autoscaling inference tier with production variants (run a champion and a challenger behind the same endpoint, split traffic for A/B or shadow testing) and multi-model endpoints (host the card-fraud, ATO, and mule models economically). In-process models drag model lifecycle into your app’s deploy cycle; Lambda’s cold starts and the lack of a warm GPU/optimized-CPU pool make tail latency unpredictable for a hard p99.
Why the decision is policy + model, not model alone. A pure-ML decision is a black box that a regulator and a fraud lead will both distrust, and it has no floor when the model is wrong. The decisioning layer wraps the score with a thin band of deterministic policy — hard declines for sanctioned BINs, mandatory step-up over a velocity ceiling, allow-list for known-good corporate cards — and a tunable threshold that the fraud team moves to trade false-positives against losses without retraining. It is also where the circuit breaker lives: if the SageMaker endpoint or feature store is slow or down, fall back to conservative rules rather than blowing the latency budget or failing open.
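A minimal sketch of such a decisioning layer follows. The names (`DecisionPolicy`, `Outcome`), the rule ordering, and the use of two thresholds (one for step-up, one for decline) are assumptions made for illustration; the text above only specifies hard declines, an allow-list, a velocity ceiling, and a tunable threshold.

```java
import java.util.Set;

enum Outcome { APPROVE, STEP_UP, DECLINE }

/** Combines deterministic policy rules with a model score (hypothetical rule order). */
final class DecisionPolicy {
    private final Set<String> blockedBins;
    private final Set<String> allowListedCards;
    private final int velocityCeiling;
    private final double stepUpThreshold;
    private final double declineThreshold;

    DecisionPolicy(Set<String> blockedBins, Set<String> allowListedCards,
                   int velocityCeiling, double stepUpThreshold, double declineThreshold) {
        this.blockedBins = blockedBins;
        this.allowListedCards = allowListedCards;
        this.velocityCeiling = velocityCeiling;
        this.stepUpThreshold = stepUpThreshold;
        this.declineThreshold = declineThreshold;
    }

    Outcome decide(String bin, String cardToken, int txCountLastHour, double score) {
        // Hard policy first: these outcomes never depend on the model.
        if (blockedBins.contains(bin)) {
            return Outcome.DECLINE;
        }
        if (allowListedCards.contains(cardToken)) {
            return Outcome.APPROVE;
        }
        // Model score mapped through the tunable thresholds.
        Outcome modelOutcome = score >= declineThreshold ? Outcome.DECLINE
                : score >= stepUpThreshold ? Outcome.STEP_UP
                : Outcome.APPROVE;
        // Velocity ceiling can only tighten an approval, never loosen a decline.
        if (txCountLastHour > velocityCeiling && modelOutcome == Outcome.APPROVE) {
            return Outcome.STEP_UP;
        }
        return modelOutcome;
    }
}
```

Because the thresholds are constructor arguments rather than part of the model, the fraud team can move them without a retrain, which is the property the paragraph above argues for.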
Implementation guidance
Provision with IaC, and treat the streaming backbone as the first deliverable. Use Terraform — the AWS provider covers MSK, Managed Flink, DynamoDB, EKS, and SageMaker. GitHub Actions runs terraform plan on every PR (gated, reviewed) and applies on merge via OIDC federation to AWS so there are no long-lived AWS keys in CI. The order matters: networking and MSK first (everything depends on the event log), then the feature plane (Flink + DynamoDB + S3), then serving (SageMaker), then the decision service (EKS), and finally the governance plane.
A minimal Terraform shape for the online feature table and a SageMaker endpoint communicates the intent:
```hcl
resource "aws_dynamodb_table" "online_features" {
  name         = "fraud-online-features-prod"
  billing_mode = "PAY_PER_REQUEST" # absorbs festival-evening spikes
  hash_key     = "entity_id"       # e.g. "card#<hash>" / "acct#<hash>"

  attribute {
    name = "entity_id"
    type = "S"
  }

  # Evict cold entities: items expire after the epoch time stored in expires_at.
  ttl {
    attribute_name = "expires_at"
    enabled        = true
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true # KMS
  }
}

resource "aws_sagemaker_endpoint_config" "fraud" {
  name = "fraud-card-endpoint-config"

  production_variants { # champion
    variant_name           = "champion"
    model_name             = aws_sagemaker_model.card_v7.name # model resources defined elsewhere
    initial_instance_count = 3
    instance_type          = "ml.c6i.xlarge"
    initial_variant_weight = 0.9
  }

  # Challenger: a weighted variant serves 10% of live traffic (a canary split).
  # To mirror traffic without returning its responses, use shadow_production_variants.
  production_variants {
    variant_name           = "challenger"
    model_name             = aws_sagemaker_model.card_v8.name
    initial_instance_count = 1
    instance_type          = "ml.c6i.xlarge"
    initial_variant_weight = 0.1
  }
}
```
Feature definitions live in one place. Express each feature once (e.g., in the Flink job and a shared schema) so the online aggregate and the offline training column cannot diverge. A representative Flink windowed aggregate — card velocity over a sliding 1-hour window — sketches the pattern:
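```java
// Assumes timestamps and watermarks were assigned upstream (event time).
transactions
    .keyBy(tx -> tx.cardId())
    .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(1)))
    .aggregate(new CardVelocityAggregate()) // count, sum, distinct-merchant
    .map(f -> toFeatureRow(f))
    // Emits one row per card per 1-minute slide; fresher values need a
    // per-event trigger or a KeyedProcessFunction.
    .sinkTo(dynamoDbSink); // online store; use addSink if yours is a SinkFunction
```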
Identity and secrets: no static keys anywhere. The EKS decision service authenticates to AWS with IRSA (IAM Roles for Service Accounts) — a scoped role granting exactly dynamodb:GetItem on the feature table and sagemaker:InvokeEndpoint on the fraud endpoint, nothing else. HashiCorp Vault issues short-TTL dynamic credentials for the Kafka SASL/SCRAM users and any database access, with EKS pods authenticating to Vault via their Kubernetes service-account JWT — so a leaked credential expires in minutes and there is nothing static to rotate. Human access (analysts into ServiceNow, engineers into the AWS console and Grafana/Datadog) federates through Okta or Microsoft Entra ID with MFA and step-up for privileged actions.
Wire the scoring path for the latency budget. Co-locate the decision service, DynamoDB (or its DAX/ElastiCache cache), and the SageMaker endpoint in the same region, deploy them across the same AZs, and prefer AZ-local routing where the service supports it so that most calls avoid a cross-AZ hop. Read features in a single GetItem (or one BatchGetItem when both card and account features are needed); keep the SageMaker payload compact; reuse HTTP connections (keep-alive pools) to the endpoint. Budget explicitly: ~2–5ms for the feature read, ~3–8ms for inference, and a couple of ms of overhead, roughly 7–15ms in total, which sits comfortably inside 100ms with margin for the network and the gateway. Instrument every hop as a span in Datadog (or Dynatrace) so you can see where a slow tail came from rather than guessing.
Enterprise considerations
Security and data protection. Cardholder data drags PCI-DSS scope across everything it touches, so minimize and protect it: tokenize the PAN at the edge so the platform scores on a token + features, never the raw number; encrypt every store with KMS (DynamoDB, S3, MSK, EBS); enforce TLS/mTLS on every hop; and keep the scoring fleet in private subnets reachable only through Akamai → ALB. CrowdStrike Falcon runs as a DaemonSet on the EKS nodes for runtime threat detection and to block lateral movement if a container is compromised — non-negotiable on a system this attractive to attackers. Wiz continuously scans the data path’s posture (DSPM): a public S3 bucket holding training data with PII, an over-broad IAM policy, an MSK cluster with plaintext listeners — Wiz surfaces these before an auditor or an attacker does. IAM is least-privilege per workload (IRSA), secrets are dynamic (Vault), and every admin action is behind SSO + MFA (Okta/Entra). Adversarial robustness is a first-class security concern here, not an afterthought: attackers actively probe the model (slow, low-amount card-testing tuned to stay under thresholds), so rate-limit per entity at the edge, monitor for probing patterns, and never expose raw scores or thresholds in any customer-facing response.
Cost optimization. Streaming-always-on plus ML inference is the cost center, so engineer it from day one. (1) Right-size SageMaker with Inference Recommender and autoscale variants on InvocationsPerInstance — pay for peak only at peak; multi-model endpoints host several models on shared instances. (2) DynamoDB on-demand absorbs spiky festival traffic without provisioning for peak 24/7; cache the hottest cards in DAX/ElastiCache so you are not paying read units (or latency) for the 5% of entities driving 50% of reads. (3) Tier compute on MSK and Flink — size Flink parallelism to steady TPS and let it scale, use MSK tiered storage so old log segments fall to cheap storage. (4) Don’t over-feature — every feature is Flink state to maintain and a byte to read on every score; prune features that don’t move the model. (5) Tag everything and feed Datadog/Cost Explorer per-stream so fraud-loss-avoided can be weighed against run cost — the only ROI conversation the CFO cares about.
Scalability. Each plane scales independently. MSK scales by adding partitions (and brokers) — partition count is your ceiling on Flink parallelism, so provision headroom. Flink scales by parallelism/task slots against those partitions; large keyed state lives in RocksDB and checkpoints incrementally to S3, so a rescale rehydrates from the last checkpoint rather than replaying history. DynamoDB scales effectively without limit on a good (high-cardinality entity) key; the cache layer protects hot partitions. SageMaker endpoints scale instances on invocation rate. The decision service on EKS scales pods on CPU/concurrency. The natural bottleneck is usually Flink state size and checkpoint duration — watch checkpoint times as your leading scaling indicator.
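The reason to provision partition headroom up front can be seen in a short sketch of key-to-partition mapping. This is an illustration only: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas the hypothetical `EntityPartitioner` below uses `String.hashCode()` to stay dependency-free. The modulo step, which is what matters here, has the same shape.

```java
/** Illustrative key-to-partition mapping (not Kafka's actual hash function). */
final class EntityPartitioner {
    /** The same key always maps to the same partition for a fixed partition count. */
    static int partitionFor(String entityKey, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(entityKey.hashCode(), partitionCount);
    }
}
```

With a fixed count, every event for `card#abc` lands on one partition and stays ordered. Changing the count (say from 48 to 64) changes the modulo result for most keys, so an entity's new events can land on a different partition from its old ones, and per-entity ordering across the change is no longer guaranteed.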
Reliability and failure modes (RTO/RPO). The unforgiving requirement is that a partial failure must not block authorizations. Design the circuit breaker explicitly: if the SageMaker endpoint times out or errors, the decision service falls back to deterministic policy rules (a conservative but available decision) and emits a degraded-mode metric — fail safe, not fail open or fail closed-on-everything. If the online feature store is unavailable, score on the current transaction with whatever cached/default features exist, again under stricter rules. Kafka with RF=3 across AZs and Flink’s exactly-once checkpointing means the feature plane survives a broker or task-manager loss and resumes from the last checkpoint with no double-counting (RPO ≈ checkpoint interval, tens of seconds; the scoring path itself is stateless and recovers in seconds). Run the scoring stack active-active across AZs; for region failure, a warm standby in a second region with MSK replication (MirrorMaker 2 / managed replication) and asynchronously-replicated features gives a low-minutes RTO. A pragmatic target: RTO 5 minutes, RPO under 1 minute for the scoring service, with the feature plane rebuildable by replaying the retained Kafka log if state is ever lost.
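The timeout-and-fallback half of that circuit breaker can be sketched in plain Java as follows. `GuardedCall` is a hypothetical name, and a full breaker would also track the recent failure rate and stop calling the endpoint while it is unhealthy; this sketch only bounds each call and counts how often the fallback was used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Runs a primary call under a time budget and falls back when it is slow or fails. */
final class GuardedCall {
    private final ExecutorService pool;
    private final long budgetMillis;
    private final AtomicLong degradedCount = new AtomicLong();

    GuardedCall(ExecutorService pool, long budgetMillis) {
        this.pool = pool;
        this.budgetMillis = budgetMillis;
    }

    <T> T callOrFallback(Callable<T> primary, Supplier<T> fallback) {
        Future<T> future = pool.submit(primary);
        try {
            return future.get(budgetMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true); // free the worker thread if the call is still running
            degradedCount.incrementAndGet();
            return fallback.get();
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            degradedCount.incrementAndGet();
            return fallback.get();
        }
    }

    /** Number of calls answered by the fallback: the degraded-mode metric. */
    long degradedCount() {
        return degradedCount.get();
    }
}
```

In the decision service, `primary` would wrap the model invocation and `fallback` would evaluate the conservative rule set, so an authorization always receives an answer within the budget.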
Model monitoring and governance: the part that keeps you compliant. A fraud model decays faster than almost any other because the adversary adapts to it; a model that was excellent in March can be dangerous by July. SageMaker Model Monitor baselines the training distribution and watches live data for data drift (the incoming feature distribution shifting), score drift (the model’s output distribution moving), and model quality once labels arrive. SageMaker Clarify monitors feature attribution drift (the model leaning on different features over time, often an early warning that behavior has changed before labels can confirm it) and bias (whether the decline rate is diverging across protected attributes, a regulatory landmine). Violations raise CloudWatch alarms that trigger a SageMaker Pipelines retrain → evaluate → register flow, with a Model Registry approval gate and a shadow/canary rollout (the challenger variant above) so a new model proves itself on live traffic before it takes over real decisions. Explainability is mandatory, not optional: log the per-decision feature attributions so that when a regulator (or a wronged customer) asks “why was this declined,” you have a concrete, contemporaneous answer. Attribution logs do for a fraud decision what source citations do for a RAG answer: they make it auditable. Pin model versions in the registry; never let a floating alias change behavior under you.
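As an illustration of what a score-drift check measures, the sketch below computes the Population Stability Index (PSI) between a baseline and a live histogram of model scores. PSI is a swapped-in example, not something the platform above is stated to use: SageMaker Model Monitor computes its own statistics, and the class name `DriftMetrics`, the binning, and the epsilon floor are all assumptions.

```java
/** Illustrative drift statistic over pre-binned score histograms. */
final class DriftMetrics {
    /** PSI = sum over bins of (live_i - base_i) * ln(live_i / base_i), on bin proportions. */
    static double psi(long[] baselineCounts, long[] liveCounts) {
        if (baselineCounts.length == 0 || baselineCounts.length != liveCounts.length) {
            throw new IllegalArgumentException("histograms must be non-empty and equally binned");
        }
        double baseTotal = 0;
        double liveTotal = 0;
        for (int i = 0; i < baselineCounts.length; i++) {
            baseTotal += baselineCounts[i];
            liveTotal += liveCounts[i];
        }
        if (baseTotal == 0 || liveTotal == 0) {
            throw new IllegalArgumentException("histograms must contain observations");
        }
        double epsilon = 1e-6; // assumed floor so empty bins do not produce log(0)
        double psi = 0;
        for (int i = 0; i < baselineCounts.length; i++) {
            double base = Math.max(baselineCounts[i] / baseTotal, epsilon);
            double live = Math.max(liveCounts[i] / liveTotal, epsilon);
            psi += (live - base) * Math.log(live / base);
        }
        return psi;
    }
}
```

Identical distributions give 0, and the value grows as live scores move away from the baseline. The alerting threshold is something to tune against your own false-alarm tolerance.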
Case management and the human loop. Most flagged events are auto-actioned, but the high-stakes ones — a high-confidence decline on a high-value customer, a suspected mule ring, a step-up that failed — open a ServiceNow fraud case automatically with the transaction, the score, the feature attributions, and the recommended action attached. The fraud-ops analyst (logged in via Okta SSO) investigates, dispositions it (confirmed fraud / false positive / inconclusive), and that disposition syncs back into the labels Kafka topic — closing the loop so today’s human judgment becomes tomorrow’s training label. ServiceNow also tracks SLA on investigations and gives the audit trail examiners ask for. This bi-directional sync is what turns a one-way scoring system into a system that learns from its own operations.
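The label join that closes this loop can be sketched as below. The class `LabelJoiner`, its nested types, and the disposition strings are hypothetical; in the platform described here the inputs would come from the decisions and labels topics rather than in-memory collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Joins analyst dispositions back to the decisions they refer to. */
final class LabelJoiner {
    static final class Decision {
        final String decisionId;
        final double score;

        Decision(String decisionId, double score) {
            this.decisionId = decisionId;
            this.score = score;
        }
    }

    static final class LabeledExample {
        final String decisionId;
        final double score;
        final boolean fraud;

        LabeledExample(String decisionId, double score, boolean fraud) {
            this.decisionId = decisionId;
            this.score = score;
            this.fraud = fraud;
        }
    }

    /** dispositions maps decisionId to CONFIRMED_FRAUD, FALSE_POSITIVE, or INCONCLUSIVE. */
    static List<LabeledExample> join(List<Decision> decisions, Map<String, String> dispositions) {
        List<LabeledExample> labeled = new ArrayList<>();
        for (Decision d : decisions) {
            String disposition = dispositions.get(d.decisionId);
            if ("CONFIRMED_FRAUD".equals(disposition)) {
                labeled.add(new LabeledExample(d.decisionId, d.score, true));
            } else if ("FALSE_POSITIVE".equals(disposition)) {
                labeled.add(new LabeledExample(d.decisionId, d.score, false));
            }
            // INCONCLUSIVE or missing: no trustworthy label yet, so the decision is skipped.
        }
        return labeled;
    }
}
```

Skipping inconclusive cases keeps ambiguous outcomes out of the training set. Whether to do that, or to treat them as a third class, is a modeling choice this sketch does not settle.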
Observability. Instrument the scoring span end to end in Datadog (or Dynatrace) APM: one trace covers ingest → feature read → inference → decision, with timing on each hop, so a p99 regression is attributable, not mysterious. Emit the metrics the business actually runs on — p99/p999 scoring latency (against the SLO), decline rate and step-up rate, model score distribution (your earliest drift signal), false-positive rate and fraud-loss-avoided as labels arrive, degraded-mode invocation count (how often the circuit breaker fired), and feature freshness lag (how stale the online store is behind the stream). SLO monitors page on latency-budget burn before customers feel it.
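For readers who want to sanity-check a latency SLO by hand, here is a minimal sketch of a nearest-rank percentile over a batch of latency samples. The class name `LatencyPercentiles` is hypothetical, and an APM tool such as Datadog typically computes percentiles from streaming sketches rather than by sorting raw samples, so treat this as a definition of the metric and not as how it is collected.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over a batch of latency samples. */
final class LatencyPercentiles {
    /** Smallest sample such that at least p percent of all samples are at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }
}
```

Calling `percentile(samples, 99)` on a window of scoring latencies gives the p99 to compare against the SLO; the same function with `99.9` gives p999, provided the window holds enough samples for that tail to be meaningful.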
Reference enterprise example
Cresta Pay, the fictional digital bank above (~28M cards, ~4,000 auth/s peak), built this platform after a card-testing campaign drained a BIN range in under two minutes while their nightly batch model slept.
Decisions they made. They partitioned MSK by card hash across 48 partitions / 3 brokers (RF=3) and ran the Flink feature job at parallelism 48 with a RocksDB state backend checkpointing to S3 every 30 seconds, maintaining 1m/1h/24h velocity aggregates, distinct-device and distinct-merchant counts, and time-since-last-transaction per card. Features landed in a DynamoDB on-demand table with the hottest ~3% of cards fronted by ElastiCache (Redis). Scoring ran an XGBoost card-fraud model and a separate account-takeover model on a SageMaker multi-model endpoint (ml.c6i.xlarge, autoscaled from 3 to 12 instances), with a challenger receiving 10% of traffic via production variants. The decision service on EKS (IRSA, no static keys) wrapped the score with ~12 hard policy rules, a fraud-team-tunable threshold, and a circuit breaker falling back to rules if inference exceeded its budget. Vault issued dynamic Kafka/DB creds; CrowdStrike Falcon ran on every node; Wiz watched the data path; Akamai fronted the public surface with Bot Manager tuned for card-testing patterns. Model Monitor + Clarify ran hourly; high-risk events opened ServiceNow cases that synced dispositions back to a labels topic; Datadog traced the scoring span with a p99 SLO.
The numbers. Median scoring latency ~9ms, p99 ~41ms — comfortably inside the auth window. ~4,000 TPS peak, ~120M decisions/day. Monthly run cost landed near ₹46 lakh (~$55,000): SageMaker inference ~$18,000, MSK + Managed Flink ~$11,000, DynamoDB + ElastiCache ~$7,500, EKS + networking ~$6,000, monitoring/Wiz/CrowdStrike/Akamai allocation the remainder. The challenger model, shadowed for three weeks before promotion via the Model Registry gate, cut the false-positive rate ~17% at constant catch-rate — worth several times the platform’s entire run cost in avoided customer friction and chargebacks.
The outcome. The next card-testing burst was throttled at the Akamai edge and the surviving probes were declined by velocity features within the first dozen transactions, not after two minutes. Confirmed-fraud basis points dropped while the false-positive rate fell (the contextual model declined fewer good customers than the old rules) — and because every decline carried logged feature attributions, the bank’s compliance team and the RBI examiners got the explainability the law requires. A region-failover game day held the 5-minute RTO: MSK replication and the warm secondary feature store let the standby region take scoring while the primary recovered, and Flink rehydrated state from its last checkpoint with no double-counted features.
When to use it
Use this architecture when decisions must be made synchronously inside a tight latency budget, the signal that matters is behavioral and time-sensitive (velocity, context, sequence), the adversary adapts so the model must be monitored and retrained continuously, and you must explain and audit automated decisions. That covers payments/card fraud, account-takeover and login-risk scoring, real-time AML transaction monitoring, ad-click fraud, and abuse/bot detection on high-traffic platforms.
Trade-offs to accept. This is a genuinely complex platform — a streaming backbone, a feature plane with large stateful jobs, a serving tier, and a governance loop, each needing operational expertise. Train/serve skew and feature freshness become things you must engineer and monitor, not assume. The model is only as good as its features and labels, and labels arrive late (chargebacks take days), so your “model quality” signal always lags reality — which is exactly why drift and score-distribution monitoring (leading indicators) matter more than waiting for ground truth.
Anti-patterns. (1) Computing features at scoring time — slow on the hot path and a skew bug waiting to happen; precompute in the stream. (2) Pure-ML decisions with no policy floor or circuit breaker — unexplainable and with no safe fallback when inference is down. (3) Failing open under load — a fraud system that approves-by-default when overloaded is worse than no system; fail to conservative rules. (4) No drift/bias monitoring — your model silently rots and may discriminate; you find out from a regulator. (5) Letting a model auto-promote without a gate and shadow period — a bad model takes real money decisions before anyone notices.
Alternatives, and when they win. If your volume is low and latency relaxed, a batch or micro-batch scoring job is far simpler and may suffice for slow-moving fraud (e.g., periodic AML reviews). If you need to catch ring/network fraud that no per-transaction view reveals, add a graph layer (Amazon Neptune or a graph-feature pipeline) as a complement, feeding graph signals in as features rather than replacing this. If you are a small team that wants a managed shortcut, Amazon Fraud Detector stands up a hosted fraud model without the streaming plumbing — graduate to this full streaming architecture when latency, scale, custom features, or governance control demand it. And if your decisions are not latency-bound at all, the whole synchronous scoring plane is over-engineering — the streaming feature plane plus batch scoring is enough. The architecture here is the destination for real-time, adversarial, regulated fraud — not always the starting line.