A payments processor moves money on the strength of a single field. Somewhere upstream, a service emits a transaction event with an amount expressed in minor units — cents, paise, pence — as an integer. Forty downstream consumers depend on that contract: the ledger, the fraud scorer, three regulatory reporting jobs, a real-time settlement engine, and a warehouse that the finance team queries to close the books. One Tuesday, a well-meaning engineer on the producer team “cleans up” the schema and changes amount from an integer of minor units to a decimal of major units, because that reads more naturally. The change passes their unit tests. It deploys at 14:00. By 14:06 the fraud scorer is flagging every transaction as anomalous, the settlement engine is off by a factor of one hundred, and a regulatory report that the firm is legally obligated to file is now silently wrong. Nobody finds out until reconciliation fails at midnight. This is the most expensive and most common class of data outage in the enterprise, and it is entirely preventable. The fix is not heroics or more monitoring — it is treating the shape of data as a versioned, owned, machine-enforced interface. That is what data contracts and a schema registry give you, and this article is a reference architecture for building them so the Tuesday above becomes a failed CI check instead of a midnight page.
The business scenario
The recurring driver is producer/consumer coupling that nobody designed on purpose. A logistics company is a good archetype: hundreds of warehouse, telematics, customs, and last-mile services, most owned by different squads, most emitting events onto a shared backbone — Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub, it does not matter which. Each event is consumed by jobs the producing team has never heard of. A field rename in a “scan” event silently breaks the ETA model; a producer that starts sending null for a field that used to be guaranteed crashes a Spark job at 03:00; a team adds a new required field and every consumer that does not yet know about it fails validation. The shared bus that made the company fast also made every team an unwitting upstream dependency of every other.
The naive fixes all fail in the same direction. Wiki documentation of “what’s in the topic” is stale the day it is written and read by no one. Defensive consumers, where every downstream job codes around every field it might receive, turn the whole estate into a swamp of “if field exists and is not null and is the right type” guards that still miss the case nobody anticipated. A central data team that reviews every change becomes a bottleneck that teams route around, and the review happens in a meeting, not in code, so it catches nothing mechanical. The common thread: the agreement between producer and consumer is implicit, undocumented, and unenforced, so it drifts until it breaks.
A data contract makes that agreement explicit. It is a versioned artifact — owned by the producing team, living in source control next to their code — that declares the schema of what they emit (fields, types, nullability), plus the semantics (units, enumerations, meaning), the guarantees (freshness, ordering, an SLA), and the rules for how it may change. A schema registry is the runtime system of record for those schemas: producers and consumers fetch the authoritative schema from it, serializers validate against it, and crucially it can refuse to register a new version that would break existing consumers. Put the contract under CI and the registry’s compatibility check becomes a gate, and the Tuesday outage turns into a red build with a message like Schema being registered is incompatible with an earlier schema: deleting field 'amount' is a breaking change. The producer learns at commit time, not from a furious downstream team at midnight.
This scales in both directions. A small data platform might have one Kafka cluster, a single registry, and twenty schemas; the discipline still pays for itself the first time it blocks a bad change. A large enterprise runs thousands of schemas across multiple buses and clouds, with a governance layer, automated lineage, and contracts that gate not just streaming topics but warehouse tables and API payloads too. The architecture below is the same at both ends — the difference is how much governance, automation, and federation you wrap around the core registry, not the core itself.
Architecture overview
The design has two planes that people constantly conflate. The control plane is where contracts are authored, reviewed, registered, and governed — it runs at the speed of CI, in pull requests and pipelines. The data plane is where events actually flow at runtime, where serializers consult the registry and validation either passes data through or quarantines it. Keeping them separate is the first step to operating this well: a contract change is a code event; a contract violation is a runtime event, and they are handled by different machinery.
Control plane, the path a schema change takes. (1) A producer team edits a contract — an Avro .avsc, a Protobuf .proto, or a JSON Schema, plus a YAML metadata sidecar describing owner, SLA, and semantics — in their repository. (2) On the pull request, GitHub Actions (or Jenkins for teams standardized on it) runs the contract pipeline: it lints the schema, runs a compatibility check against the schema registry for the target subject, and fails the build if the change would break existing consumers under the configured compatibility mode. (3) For a backward-compatible change the merge proceeds and the pipeline registers the new version in the registry. For a breaking change the pipeline blocks the merge and opens — or attaches to — a ServiceNow change request, because a breaking change is a coordinated event that needs consumer sign-off and a migration plan, not a silent override. (4) Terraform manages the registry infrastructure itself, the topics, the subject-level compatibility settings, and the IAM, so the governance configuration is itself code-reviewed and reproducible. Identity for both the registry UI and the CI service principals flows through Okta or Entra ID via OIDC/SAML, so who can register or evolve a schema is centrally controlled and audited.
Data plane, the path an event takes. (5) A producer serializes an event: the serializer fetches the schema’s ID from the schema registry, validates the payload against it, and writes the event to the bus prefixed with that schema ID (the Confluent wire format is a magic byte plus a 4-byte schema ID, so the payload itself carries no schema, only a pointer). (6) The event lands on the streaming backbone — a Kafka topic, Kinesis stream, Event Hub, or Pub/Sub topic. (7) A consumer deserializes by reading the schema ID off the message, fetching that exact schema version from the registry (cached locally), and decoding — guaranteeing it reads the data with the schema it was written with, not whatever the consumer happens to think the schema is. (8) Events that fail validation or cannot be deserialized are routed to a dead-letter queue (DLQ) rather than dropped or crashing the consumer, and a quarantine alert fires. (9) The whole flow is instrumented into Datadog or Dynatrace — validation failure rate, DLQ depth, schema-version skew across consumers, and registry latency are first-class metrics, because a rising validation-failure rate is the earliest possible signal of a contract problem in production.
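The framing described in step (5) can be sketched in a few lines. This is a minimal illustration of the 5-byte header only (a zero magic byte followed by a big-endian 4-byte schema ID); the function names `frame` and `unframe` are hypothetical, and the payload is assumed to be already serialized by an Avro or Protobuf encoder that is not shown here.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # the first byte of the framing is always zero


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an already-serialized payload with the magic byte and schema ID."""
    # ">bI" = big-endian, one signed byte, one unsigned 4-byte integer (5 bytes total)
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema_id, payload), rejecting unframed bytes."""
    if len(message) < 5:
        raise ValueError("message is shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]


framed = frame(42, b"encoded-event")
schema_id, payload = unframe(framed)
```

A consumer would use the returned `schema_id` to look up the writer's schema before decoding `payload`; a message that fails `unframe` is a candidate for the dead-letter queue.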
The defining property of the topology: no producer can put a shape on the bus that the registry has not blessed, and no breaking shape can be blessed without passing CI and a change process. The registry sits in the write path as the gatekeeper; CI sits in the merge path as the gate.
Component breakdown
| Component | Representative tooling | Role in the platform | Key configuration choices |
|---|---|---|---|
| Schema definition | Avro / Protobuf / JSON Schema | The machine-readable shape of an event or record | Avro for streaming (compact, rich evolution rules); Protobuf for gRPC-adjacent estates; JSON Schema for HTTP payloads |
| Contract metadata | YAML sidecar (e.g. an Open Data Contract Standard document) | Owner, SLA, semantics, units, classification, change policy | One contract file per data product; versioned in the producer’s repo |
| Schema registry | Confluent Schema Registry, AWS Glue Schema Registry, Apicurio, Azure Schema Registry | Runtime source of truth; serves schemas by ID; enforces compatibility | Per-subject compatibility mode; subject naming strategy; immutable versions |
| Streaming backbone | Kafka / MSK, Kinesis, Event Hubs, Pub/Sub | Transport for validated events | Topic-per-contract or topic-per-domain; retention sized to replay needs |
| CI enforcement | GitHub Actions, Jenkins | Lints, runs compatibility check, registers schemas, gates merges | Registry compat check as a required status check; register only on merge to main |
| IaC | Terraform | Registry config, topics, compat modes, IAM as code | confluent / aws providers; compatibility level set declaratively per subject |
| Identity / SSO | Okta, Entra ID | AuthN/Z for registry, UI, and CI principals | OIDC for CI; SAML for humans; RBAC scoping who can evolve which subject |
| Secrets | HashiCorp Vault | Registry/bus credentials, signing keys, mTLS certs | Dynamic short-lived credentials; no static keys in CI or apps |
| Approvals / ITSM | ServiceNow | Change request for breaking changes, consumer sign-off, audit trail | Auto-raised by CI on a detected breaking change; gates the migration |
| Observability | Datadog, Dynatrace | Validation failures, DLQ depth, version skew, registry SLOs | Alert on validation-failure-rate and DLQ growth; trace registry calls |
| Data posture | Wiz | CSPM/DSPM: finds PII in topics, mis-scoped registry/bus IAM | Scans for sensitive data classes; flags public or over-permissioned subjects |
| Runtime security | CrowdStrike Falcon | Threat detection on the broker, registry, and connector hosts | Workload sensors on the data-plane nodes; behavioral detection |
| Edge / API contracts | Akamai | Validates and shapes inbound API payloads at the edge against the same JSON Schema | Schema validation at the CDN tier; rejects malformed payloads before origin |
A few choices deserve the why, because they are the ones teams get wrong.
Why compatibility modes are the whole game. A schema registry’s superpower is that it can compare a proposed schema against the history of a subject and answer “is this safe?” — but only relative to a compatibility mode you choose deliberately. The modes are not academic; they encode who you are protecting:
| Mode | What it allows | Protects | Use when |
|---|---|---|---|
| BACKWARD (default) | Add optional fields; delete fields | Consumers on the new schema can read data written with the previous schema | Consumers upgrade before producers |
| FORWARD | Add fields; delete optional fields | Consumers on the previous schema can read data written with the new schema | Producers upgrade first, consumers lag |
| FULL | Add/remove only optional fields | Both directions | High-stakes contracts where either side may lead |
| *_TRANSITIVE | Same, but against all prior versions, not just the last | Long replay windows | You replay history and need every old version readable |
| NONE | Anything | Nothing | Almost never; it disables the gate |
The mistake is leaving it on the default BACKWARD for a topic you replay from the beginning of time, then discovering that a consumer on the latest schema, reprocessing history, cannot read data written with a version from six months ago. For event-sourced or audited topics, use the *_TRANSITIVE variant so compatibility holds across the entire history, not just the most recent version.
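The gap between BACKWARD and its transitive variant is easiest to see with a toy checker. The sketch below is a deliberately simplified model, not the registry's actual algorithm: schemas are plain dicts of field name to spec, any type change is treated as breaking, and Avro's type promotions, unions, and aliases are ignored. The names `can_read`, `backward`, and `backward_transitive` are hypothetical.

```python
from typing import Dict, List

Schema = Dict[str, dict]  # field name -> {"type": str, optional "default": value}


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if data written with `writer` can be decoded with `reader` (simplified)."""
    for name, spec in reader.items():
        if name not in writer:
            if "default" not in spec:
                return False  # reader needs a value the old data never carried
        elif writer[name]["type"] != spec["type"]:
            return False  # this sketch treats every type change as breaking
    return True


def backward(history: List[Schema], candidate: Schema) -> bool:
    """Check the candidate against the latest registered version only."""
    return not history or can_read(candidate, history[-1])


def backward_transitive(history: List[Schema], candidate: Schema) -> bool:
    """Check the candidate against every registered version."""
    return all(can_read(candidate, old) for old in history)


v1 = {"id": {"type": "string"}}
v2 = {"id": {"type": "string"}, "region": {"type": "string", "default": "EU"}}
v3 = {"id": {"type": "string"}, "region": {"type": "string"}}  # default removed

passes_latest_only = backward([v1, v2], v3)             # v3 can read v2 data
passes_all_history = backward_transitive([v1, v2], v3)  # v3 cannot read v1 data
```

In this example `v3` passes the non-transitive check because `region` exists in `v2`, yet it cannot decode records written under `v1`, which is exactly the replay failure described above.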
Why the contract is more than the schema. A schema says amount is an int. It does not say the int is minor currency units, that the currency is in a sibling currency field as ISO-4217, that the value is always non-negative, or that the producer guarantees the event within five seconds of the transaction. Those semantics are exactly what broke the payments processor — the type was fine, the meaning changed. The YAML metadata sidecar (an Open Data Contract Standard document, or a homegrown equivalent) captures units, enumerations, ranges, freshness SLA, ownership, and data classification. The registry enforces the schema; the contract enforces the meaning, and your CI can check semantic rules (e.g. “amount must declare units”) that a bare schema cannot express.
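A semantic rule such as “amount must declare units” can be checked in CI with a small linter over the parsed sidecar. The sketch below assumes the YAML has already been loaded into a dict; the key names (`owner`, `classification`, `fields`, `unit`) are illustrative assumptions and not taken from any particular contract standard.

```python
from typing import List

ALLOWED_CLASSIFICATIONS = {"public", "internal", "pii", "regulated"}
NUMERIC_TYPES = {"int", "long", "float", "double"}


def lint_contract(contract: dict) -> List[str]:
    """Return human-readable violations; an empty list means the contract passes."""
    problems = []
    if not contract.get("owner"):
        problems.append("contract must declare an owner")
    if contract.get("classification") not in ALLOWED_CLASSIFICATIONS:
        problems.append("contract must declare a known classification")
    for name, field in contract.get("fields", {}).items():
        # A bare numeric type says nothing about cents vs. dollars, so require a unit.
        if field.get("type") in NUMERIC_TYPES and not field.get("unit"):
            problems.append(f"{name} must declare units")
    return problems


contract = {
    "owner": "payments-squad",
    "classification": "regulated",
    "fields": {
        "amount": {"type": "long"},  # no unit declared: should be flagged
        "currency": {"type": "string"},
    },
}
violations = lint_contract(contract)
```

A CI job would fail the build whenever `violations` is non-empty, which is how a change in meaning gets caught even when the type itself is unchanged.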
Why a dead-letter queue, not a strict reject. When a malformed or genuinely incompatible event reaches a consumer at runtime — say from a producer that bypassed the registry, or a poison message — you have two bad options if you have not planned: crash the consumer (and stall the whole partition behind the poison message), or silently drop the event (and lose data you may be legally required to keep). The DLQ is the third way: the consumer routes what it cannot process to a side stream, commits the offset, and keeps moving, while an alert and the quarantined payloads let a human triage. For a regulated firm the DLQ is also the audit of every rejection — you can prove what you refused and why.
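The routing logic is small enough to sketch. This is an in-memory illustration under stated assumptions: messages are JSON strings in a list rather than records from a real bus, the dead-letter queue is a plain list, and `consume` and `handle` are hypothetical names. Offset commits are indicated only by a comment.

```python
import json
from typing import Callable, List


def consume(messages: List[str], handle: Callable[[dict], None], dead_letters: List[dict]) -> int:
    """Process each message; quarantine failures instead of crashing or dropping them."""
    processed = 0
    for offset, raw in enumerate(messages):
        try:
            handle(json.loads(raw))
            processed += 1
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the payload and the reason so a human can triage and audit later.
            dead_letters.append({"offset": offset, "payload": raw, "error": repr(exc)})
        # Either way the offset would be committed here, so the partition keeps moving.
    return processed


def handle(event: dict) -> None:
    """Toy handler: insists on an integer amount of minor units."""
    if not isinstance(event["amount"], int):
        raise TypeError("amount must be an integer of minor units")


messages = ['{"amount": 1250}', "not json", '{"amount": 12.5}', '{"amt": 3}', '{"amount": 99}']
dlq: List[dict] = []
count = consume(messages, handle, dlq)
```

The good events on either side of the bad ones are still processed, and each quarantined entry records the offset and error, which is the rejection audit trail mentioned above.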
Implementation guidance
Provision the registry and its governance with IaC. The registry’s compatibility configuration is the most important security control in this whole system, so it must not be a setting someone clicked in a UI. With Terraform the confluent provider (Confluent Cloud) or the aws provider (Glue) declares subjects and their compatibility levels alongside the topics:
resource "confluent_schema" "transaction_value" {
  subject_name = "payments.transaction-value"
  format       = "AVRO"
  schema       = file("${path.module}/schemas/transaction.avsc")

  schema_registry_cluster { id = confluent_schema_registry_cluster.main.id }
}

resource "confluent_subject_config" "transaction_value" {
  subject_name        = confluent_schema.transaction_value.subject_name
  compatibility_level = "FULL_TRANSITIVE" # money: protect both sides, all history

  schema_registry_cluster { id = confluent_schema_registry_cluster.main.id }
}
Pin high-stakes subjects to FULL_TRANSITIVE, default the rest to BACKWARD, and never leave anything on NONE. Because the compat mode is now code, lowering it (e.g. to sneak a breaking change through) is itself a reviewable, auditable pull request.
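A policy check over the declared compatibility modes makes that rule enforceable rather than aspirational. The sketch below assumes the subject-to-mode mapping has already been exported (from Terraform state or the registry's config API, neither shown); the `HIGH_STAKES_PREFIXES` value and the `audit` function are hypothetical.

```python
from typing import Dict, List

APPROVED_MODES = {
    "BACKWARD", "BACKWARD_TRANSITIVE",
    "FORWARD", "FORWARD_TRANSITIVE",
    "FULL", "FULL_TRANSITIVE",
}
HIGH_STAKES_PREFIXES = ("payments.",)  # assumption: namespaces that must be pinned


def audit(subject_modes: Dict[str, str]) -> List[str]:
    """Return one finding per subject that violates the compatibility policy."""
    findings = []
    for subject, mode in sorted(subject_modes.items()):
        if mode not in APPROVED_MODES:
            findings.append(f"{subject}: mode {mode} is not approved")
        elif subject.startswith(HIGH_STAKES_PREFIXES) and mode != "FULL_TRANSITIVE":
            findings.append(f"{subject}: high-stakes subject must be FULL_TRANSITIVE, found {mode}")
    return findings


findings = audit({
    "payments.transaction-value": "FULL_TRANSITIVE",
    "payments.refund-value": "BACKWARD",
    "telematics.scan-value": "NONE",
})
```

Run on a schedule or in the Terraform pipeline, a non-empty `findings` list is the alert that a subject has drifted out of an approved mode.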
Put the compatibility check in CI as a required status. This is the load-bearing automation. In a GitHub Actions workflow the producer’s pipeline runs a registry-side compatibility test before registering anything, and the job is a required status check on the protected branch so a red result cannot be merged:
# .github/workflows/data-contract.yml
name: data-contract
on: pull_request
jobs:
  contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate compatibility (no register)
        # Asks the registry whether the proposed schema is compatible; registers nothing.
        run: |
          curl -sf -u "$SR_USER:$SR_PASS" \
            -X POST "$SR_URL/compatibility/subjects/$SUBJECT/versions/latest?verbose=true" \
            -H "Content-Type: application/vnd.schemaregistry.v1+json" \
            --data @<(jq -n --arg s "$(cat schemas/transaction.avsc)" '{schema:$s}') \
            | jq -e '.is_compatible == true'  # fail the build if false
        env:
          SR_URL: ${{ vars.SCHEMA_REGISTRY_URL }}
          SUBJECT: payments.transaction-value
          SR_USER: ${{ secrets.SR_USER }}  # injected from Vault, short-lived
          SR_PASS: ${{ secrets.SR_PASS }}
The Maven/Gradle Confluent plugins (test-compatibility, register) do the same thing more ergonomically for JVM teams; the raw API call shows the mechanism. Registration runs only on merge to main, so the registry’s history tracks exactly what is in production.
Wire identity and secrets properly. The CI principal authenticates to the registry with an OIDC token federated from Okta or Entra ID, scoped by RBAC so a team can only evolve subjects it owns — a payments engineer cannot register a schema under telematics.*. Registry and bus credentials, broker mTLS certificates, and any signing keys are issued by HashiCorp Vault as short-lived dynamic secrets, so there are no static API keys sitting in GitHub Actions secrets or app config to leak or rotate. The serializer and the CI job both fetch a fresh credential per run.
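The namespace scoping rule reduces to a prefix check once the principal is known. This is a sketch of the decision only, assuming the caller's identity has already been verified from its OIDC token; the principal names and the `OWNED_NAMESPACES` mapping are hypothetical, and a real registry would enforce this through its own RBAC rather than application code.

```python
from typing import Dict, Tuple

# Assumption: each CI principal maps to the subject prefixes its team owns.
OWNED_NAMESPACES: Dict[str, Tuple[str, ...]] = {
    "payments-ci": ("payments.",),
    "telematics-ci": ("telematics.",),
}


def may_register(principal: str, subject: str) -> bool:
    """True only if the subject falls under a namespace the principal owns."""
    # Unknown principals get an empty tuple, and startswith(()) is always False.
    return subject.startswith(OWNED_NAMESPACES.get(principal, ()))


allowed = may_register("payments-ci", "payments.transaction-value")
denied = may_register("payments-ci", "telematics.scan-value")
```

Defaulting unknown principals to no namespaces keeps the check deny-by-default, which matches the intent that nobody evolves a subject they do not own.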
Generate code from contracts, do not hand-write it. Once a schema is the source of truth, generate the producer’s and consumer’s data classes from it (Avro/Protobuf compilers in the build) so the application objects cannot drift from the registered schema — if the schema changes, the generated types change, and the compiler catches the mismatch. This closes the last gap where a developer “knows” the schema in their head and gets it subtly wrong.
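To show the idea without a build plugin, the sketch below derives a frozen dataclass from an Avro-style record definition at import time. It is a toy: only primitive field types are mapped, unions and nested records are not handled, and Python does not enforce the annotations at runtime, so a real project would rely on the Avro or Protobuf compiler plus a type checker. The schema text and the `dataclass_from_avro` helper are hypothetical.

```python
import dataclasses
import json

AVRO_TO_PY = {
    "string": str, "int": int, "long": int,
    "float": float, "double": float, "boolean": bool, "bytes": bytes,
}

TRANSACTION_AVSC = """
{"type": "record", "name": "Transaction", "fields": [
  {"name": "amount", "type": "long"},
  {"name": "currency", "type": "string"},
  {"name": "note", "type": "string", "default": ""}
]}
"""


def dataclass_from_avro(schema_json: str) -> type:
    """Build a frozen dataclass whose fields mirror a record schema (primitives only)."""
    schema = json.loads(schema_json)
    required, optional = [], []
    for f in schema["fields"]:
        py_type = AVRO_TO_PY[f["type"]]
        if "default" in f:
            optional.append((f["name"], py_type, dataclasses.field(default=f["default"])))
        else:
            required.append((f["name"], py_type))
    # Dataclasses require fields without defaults to come first.
    return dataclasses.make_dataclass(schema["name"], required + optional, frozen=True)


Transaction = dataclass_from_avro(TRANSACTION_AVSC)
tx = Transaction(amount=1250, currency="EUR")
```

If a field is renamed or removed in the schema, the generated class changes with it, and any code that still passes the old keyword fails immediately instead of drifting silently.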
Enterprise considerations
Failure modes. Enumerate them and design for each. (1) A producer bypasses the registry and writes raw bytes — caught at the consumer as a deserialization failure routed to the DLQ, and prevented going forward by requiring the registry-aware serializer and locking the topic ACLs so only registered producers can write. (2) The registry is unavailable — producers and consumers must cache schemas locally so a registry outage degrades to “cannot register new schemas” rather than “the whole pipeline stops”; run the registry HA (multiple nodes, replicated _schemas topic) and treat its availability as a tier-1 SLO. (3) A breaking change slips through because someone set the subject to NONE — prevented by IaC-managing compat modes and alerting on any subject not in an approved mode (Wiz or a custom policy check can flag this). (4) Poison messages stall a partition — the DLQ pattern plus a bounded retry keeps one bad event from blocking everything behind it. (5) Schema-version skew — consumers running far behind producers; surfaced as a Datadog metric so you see drift before it becomes an incident.
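Failure mode (2) depends on a local schema cache, which can be sketched as a thin wrapper around whatever call fetches a schema by ID. The class name `CachingSchemaClient` and the injected `fetch` callable are assumptions for illustration; production serializers ship their own caches. Because a registered schema ID never changes meaning, the cache needs no invalidation.

```python
from typing import Callable, Dict


class CachingSchemaClient:
    """Resolve schema IDs through a local cache, calling the registry only on a miss."""

    def __init__(self, fetch: Callable[[int], str]) -> None:
        self._fetch = fetch  # hypothetical registry lookup; may raise during an outage
        self._cache: Dict[int, str] = {}

    def get(self, schema_id: int) -> str:
        if schema_id not in self._cache:
            # Only a cache miss touches the registry, so known IDs survive an outage.
            self._cache[schema_id] = self._fetch(schema_id)
        return self._cache[schema_id]
```

With this in place, a registry outage only affects messages carrying a schema ID the process has never seen, which is the degraded mode the text describes.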
Security. A schema registry is a high-value target: it describes the shape of every sensitive data flow in the company, and write access to it is write access to what every consumer trusts. Lock it down: authentication via Okta/Entra (no anonymous access), RBAC so subject ownership maps to teams, mTLS on the registry and bus with certs from Vault, and the registry’s backing _schemas topic protected like production data. Run Wiz in DSPM mode across the topics and the registry to detect PII landing in a stream that the contract did not classify as sensitive, and to flag a registry or bus IAM policy that is over-permissioned or exposed. CrowdStrike Falcon sensors on the broker, registry, and Kafka Connect hosts give runtime threat detection on the data-plane nodes themselves — a registry host is infrastructure, and infrastructure gets attacked. For API-sourced data, Akamai validates inbound payloads against the same JSON Schema at the edge, rejecting malformed requests before they reach the origin, so the contract is enforced at the very first hop.
Cost. The registry itself is cheap — it stores small text artifacts and serves cached lookups — but the surrounding choices have real cost levers. (1) Schema ID caching on producers and consumers is non-negotiable: without it every message does a registry round-trip, adding latency and load; with it the registry handles a trickle of cache-miss lookups. (2) Avro/Protobuf over JSON on the wire is a direct cost saving on a high-volume bus — binary encoding plus a 5-byte schema pointer is dramatically smaller than repeating field names in JSON on every message, cutting both broker storage and inter-AZ/egress bandwidth, which on a busy Kafka estate is a meaningful line item. (3) Retention sized to actual replay needs — *_TRANSITIVE compatibility lets you replay long histories, but topic retention is what you pay for; do not keep 90 days because it felt safe if you only ever replay 7. (4) The CI compatibility checks are seconds of compute per PR — trivially cheap insurance against an outage whose cost is measured in failed settlements and regulatory exposure.
Scaling and federation. A single registry scales to thousands of subjects, but large enterprises hit organizational, not technical, limits: one central team cannot own every contract. The pattern that works is federated governance — central platform owns the registry, the compatibility policy, the CI templates, and the security baseline; each domain team owns its own contracts within that frame, registering and evolving subjects under its namespace. This is data-mesh thinking applied to schemas: contracts are the interface of each domain’s data products, the registry is the shared catalog, and the global policy (compat modes, naming, classification) is the thin waist everyone agrees on. For genuinely multi-cluster or multi-cloud estates, registries can be linked or schemas mirrored so a consumer in one region/cloud can resolve a schema produced in another.
Observability. Instrument the contract system as a first-class service. The metrics that matter: validation-failure rate (the leading indicator of a contract problem in production), DLQ depth and growth (quarantined data needing triage), schema-version skew (how far consumers lag producers), registry request latency and error rate (a tier-1 dependency in the write path), and compatibility-check pass/fail trends in CI (how often producers attempt breaking changes — a culture signal). Pipe these to Datadog or Dynatrace with alerts on validation-failure-rate spikes and DLQ growth, and trace registry calls so a latency regression there is attributable. The goal: the first person to know about a contract problem is an on-call dashboard, not a downstream team.
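Two of these metrics are simple enough to define precisely. The sketch below shows one plausible definition of each, assuming the raw counts and per-consumer schema versions are already collected by some agent; the function names and the consumer labels are hypothetical, and emitting the values to a monitoring backend is not shown.

```python
from typing import Dict


def failure_rate(failed: int, total: int) -> float:
    """Fraction of events that failed validation in a window; 0.0 for an empty window."""
    return failed / total if total else 0.0


def version_skew(latest_version: int, consumer_versions: Dict[str, int]) -> Dict[str, int]:
    """How many schema versions each consumer lags behind the latest registered one."""
    return {name: max(0, latest_version - v) for name, v in consumer_versions.items()}


rate = failure_rate(5, 1000)
skew = version_skew(5, {"ledger": 5, "fraud-scorer": 3})
```

Alerting on a jump in `rate` or on any consumer whose skew exceeds an agreed threshold gives the on-call dashboard the early signal the paragraph above argues for.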
Governance and lineage. The contracts are also documentation that cannot go stale, because they are the runtime artifact. Feed the registry and contract metadata into a catalog so analysts can discover what exists and what each field means (units, enums, classification), and into lineage so you can answer “if I change this field, who breaks?” before you change it — turning the impact analysis from a guess into a query. Classify every contract (public / internal / PII / regulated) in the metadata so DSPM tooling and access policy can act on it. And keep the human process honest: a breaking change auto-raises a ServiceNow change request that captures the consumer sign-off and the migration plan, so the coordinated changes you cannot avoid are at least audited and orchestrated rather than sprung on people.
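The “who breaks?” question is a graph traversal once lineage exists. The sketch below assumes field-level lineage has been exported as a simple adjacency map; the node names and the `impacted` function are hypothetical, and a real catalog would expose this through its own query interface.

```python
from collections import deque
from typing import Dict, List

# Assumption: each key is an upstream field, each value lists fields derived from it.
EDGES: Dict[str, List[str]] = {
    "payments.transaction.amount": ["ledger.entry.amount", "fraud.features.amount_z"],
    "ledger.entry.amount": ["finance.daily_close.total"],
}


def impacted(node: str, edges: Dict[str, List[str]]) -> List[str]:
    """Return every downstream field reachable from `node`, direct or transitive."""
    seen = set()
    queue = deque([node])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


blast_radius = impacted("payments.transaction.amount", EDGES)
```

Attaching this list to the change request gives reviewers the consumers that need to sign off before the breaking change is scheduled.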
Reference enterprise example
Solstice Freight, a fictional global logistics operator (~6,000 employees, ~140 microservices across warehouse, telematics, customs, and last-mile domains), adopted this architecture after a year in which schema-drift incidents were their single largest source of data downtime — eleven Sev-2s, most of them a producer changing a field a downstream team depended on, found hours later in a broken dashboard or model.
Decisions they made. They standardized on Avro on a managed Kafka (Amazon MSK) backbone with Confluent Schema Registry, chosen for its mature compatibility engine. They defaulted every subject to BACKWARD and pinned the money- and customs-related subjects (where a regulatory report depends on the data) to FULL_TRANSITIVE, because customs declarations get replayed during audits and must be readable years later. The registry config, all topics, and the per-subject compat modes were managed in Terraform; humans authenticated through Okta, CI through OIDC scoped so each domain squad could only register under its own namespace. Each producer carried a contract: the Avro schema plus an Open Data Contract YAML declaring owner, freshness SLA, units, and a PII classification. GitHub Actions ran the registry compatibility check as a required status on every PR; a detected breaking change blocked the merge and auto-raised a ServiceNow change request that pulled in the listed consumers for sign-off. Validation failures and DLQ depth went to Datadog; Wiz scanned topics for unclassified PII and flagged two streams leaking driver phone numbers that no contract had marked sensitive; HashiCorp Vault issued all registry and broker credentials as short-lived secrets, retiring a set of static keys that had been sitting in CI.
The numbers. ~2,400 active subjects across four domains, ~3.1 billion events/day on the bus. After rollout, schema-drift Sev-2s went from eleven the prior year to zero — the breaking changes still happened (the CI gate caught and blocked 63 attempted breaking changes in the first six months), but they happened in pull requests instead of production. The DLQ caught ~40,000 genuinely malformed events/month, almost all from one legacy producer that predated the registry, which became a tracked migration. Registry run cost was negligible (~$1,400/month for the HA Confluent tier); the bus saw a ~22% storage and egress reduction purely from moving telematics topics off verbose JSON onto Avro, which more than paid for the registry. Wiring contracts and CI into the 140 services was the real cost — about a quarter of platform-engineering time for one quarter — but the firm’s own estimate put a single one of the previous year’s drift outages (a misreported customs feed) at more than the entire program.
The outcome. The line that landed with leadership was not the cost saving — it was that a breaking change became a failed build with a clear error message owned by the engineer who wrote it, at the moment they wrote it, instead of a midnight page owned by whichever downstream team noticed first. Data-product ownership got real: each squad now owns its contracts as the public interface of its data, the platform team owns the thin waist of policy, and “who breaks if I change this?” became a lineage query answered before the change, not an incident postmortem after it.
When to use it
Use this architecture when multiple independently-owned teams produce and consume data over a shared backbone; when a wrong-shaped record has real downstream cost (financial, regulatory, model-poisoning); when you replay history and need old data to stay readable; or when you simply have more than a handful of producers and the implicit-contract swamp has started to slow everyone down. That covers most event-driven enterprises — payments, logistics, ad-tech, IoT/telematics, and any data-mesh program where domains expose data products to each other.
Trade-offs to accept. Contracts add friction on purpose: a producer can no longer change a field on a whim, and a genuinely breaking change requires a coordinated migration through a change process. That is the point, but it is real work, and teams that have lived in a free-for-all will feel it. There is upfront cost to author contracts for an existing estate and retrofit serializers, and an ongoing discipline cost to keep contracts honest. The registry becomes a tier-1 dependency you must run HA and cache around. And contracts enforce structure and declared semantics, not correctness: the registry will happily let a producer send a perfectly-typed, perfectly-compatible amount of -1 if your contract did not declare the range. Semantic and quality checks (Great Expectations-style assertions, range/enum validation in the contract) are a complementary layer, not something the registry gives you for free.
Anti-patterns. (1) Compatibility mode NONE — disables the entire gate; if you see it on a subject, treat it as an incident. (2) Filtering or fixing bad shapes only in consumer code — turns every consumer into a swamp of defensive guards and still misses the case nobody anticipated; enforce at the registry instead. (3) No schema ID caching — a registry round-trip per message adds latency and makes the registry a throughput bottleneck and a single point of failure. (4) Hand-written data classes that “match” the schema — they drift; generate them. (5) A central team as the only approver — becomes a bottleneck teams route around; federate ownership, centralize policy. (6) Schema-only contracts with no semantics — the units/meaning changes that cause the worst outages slip right through a type check; capture meaning in the contract metadata and check it in CI.
Alternatives, and when they win. If you have exactly one producer and one consumer in one repo, a shared library and a code review are enough — skip the registry. If your data is purely batch into a warehouse, table contracts enforced by your transformation tool (dbt model contracts, or constraints in the warehouse) cover the same goal for that surface and may be all you need; the streaming registry and the warehouse contract layer are complementary and most large shops run both. If you want the strongest guarantee and your whole estate is gRPC, Protobuf with Buf’s breaking-change detection (buf breaking) gives an excellent CI-first contract experience tightly fit to that world. And if you are early — a small team, a single bus, twenty schemas — adopt the registry and the CI compatibility check first and add the governance, lineage, and federated-ownership machinery later; the compatibility gate alone prevents most of the pain, and the rest is what you grow into as the number of teams, not the number of messages, climbs.