
Data Quality and Observability Architecture

The most expensive data incident is the one nobody notices for three weeks. A pipeline does not crash; it succeeds — and quietly emits wrong numbers. A currency-conversion job starts treating a null FX rate as 1.0. A schema change upstream renames gross_amount to amount_gross, the column maps to null, and every downstream sum silently halves. The dashboards stay green, the jobs stay green, and the first person to find out is a regulator, an auditor, or a customer. This article is a reference architecture for the discipline that prevents that: data quality (catching bad data at the gate) and data observability (knowing the health of every table and pipeline in production, the way SRE teams know the health of services). It is not a one-off validation script bolted onto a DAG — it is a governed, multi-cloud control plane that holds up from a single warehouse to a thousand-table lakehouse.

The business scenario

The recurring driver is regulated numbers that have to be right the first time. Consider Northwind Capital, a fictional mid-tier asset manager (~2,400 employees, ~$140B AUM) running a lakehouse on Databricks over AWS S3, feeding a regulatory reporting stack and a client-facing performance portal. Their data crosses three jurisdictions, so a single bad figure is not a bug — it is a MiFID II / SEC reporting breach with a fine attached, a “Dear CEO” letter, and a remediation program that costs more than the entire data platform.

Their pain is the classic one: pipelines that are operationally healthy but semantically broken. Last quarter a vendor changed a holdings feed’s date format from YYYY-MM-DD to MM/DD/YYYY without notice. The ingestion job did not fail — it parsed 03/04/2026 as March 4th in one table and April 3rd in another, and the NAV (net asset value) reconciliation was off by a day’s market movement across 1,800 funds. It surfaced when a portfolio manager noticed a number “felt wrong.” That manual catch was luck. The board’s question afterward was blunt: how many of these have we shipped that nobody noticed?
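
The ambiguity is easy to reproduce. The following is a minimal sketch using only Python's standard library; `parse_strict` and `CONTRACT_FORMAT` are illustrative names, not Northwind's actual code. The same string parses to two different valid dates depending on the assumed format, so an "is it a valid date?" check cannot catch the flip, whereas a parser pinned to the contracted format rejects the changed feed outright.

```python
from datetime import date, datetime

raw = "03/04/2026"

# Both parses succeed, so a validity check alone cannot detect the flip.
as_us = datetime.strptime(raw, "%m/%d/%Y").date()   # read month-first: March 4th
as_eu = datetime.strptime(raw, "%d/%m/%Y").date()   # read day-first: April 3rd

CONTRACT_FORMAT = "%Y-%m-%d"  # assumed: the format the vendor originally delivered


def parse_strict(value: str) -> date:
    """Parse only the contracted format; raise ValueError on anything else."""
    return datetime.strptime(value, CONTRACT_FORMAT).date()
```

Pinning the format turns a silent misparse into a loud failure, which is the behavior the rest of this architecture tries to generalize.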

The naive fixes all fail predictably. Hand-written assertions in each job rot — they cover the bugs you already had, never the next one, and nobody updates them. Trusting upstream contracts fails because upstream is a third-party vendor who changes feeds without telling you. A nightly reconciliation report catches yesterday’s disaster today, after the bad data has already fanned out to 40 downstream tables and three dashboards the CEO reads. And “the data team will just know” does not survive past a few dozen tables; at a thousand, no human holds the mental model of what “normal” looks like for each one.

Data quality and observability thread this needle by splitting the problem in two. Quality is a gate: explicit, versioned expectations (column X is never null, sum(allocations) = 1.0, this enum has these six values) that a pipeline must pass before data is allowed to land in a trusted zone. Observability is a monitor: machine-learned baselines for freshness, volume, schema, and distribution across every table, so the unknown unknowns — the failures you did not write an assertion for — get caught by anomaly detection instead of by a portfolio manager’s gut. One is deterministic and you author it; the other is statistical and it authors itself. You need both.

Architecture overview

Figure: Data Quality and Observability Architecture (architecture diagram)

The design has two planes that share the same data but run on different triggers: an inline quality plane (synchronous, runs inside the pipeline and can block a bad batch from promoting) and an out-of-band observability plane (asynchronous, watches tables in production and detects anomalies after the fact). Keeping them mentally separate is the first step to operating this well — one stops bad data at the door, the other tells you when something slipped past or broke at the source.

Inline quality plane, following the data left to right: (1) raw data lands in a bronze / landing zone on S3 (or ADLS Gen2, or GCS — the pattern is cloud-agnostic) from vendor SFTP, Akamai-fronted partner APIs, and CDC streams. (2) A Databricks / Spark job, orchestrated by Airflow (or Dagster), reads the batch and runs a Great Expectations validation suite against it — a versioned set of expectations stored in Git. (3) On pass, the batch is promoted to the silver (trusted) zone; on fail, the job quarantines the batch to a side location, halts promotion, and raises an incident. (4) Trusted data flows to gold / serving tables (Delta, plus a Snowflake or BigQuery serving layer) that feed BI and the client portal. Crucially, the quality gate runs before promotion, so a failed batch never reaches a consumer.

The observability plane runs continuously and independently: (5) Monte Carlo connects read-only to the warehouse/lakehouse metadata and query history and learns a baseline for every table: when it normally updates (freshness), how many rows it normally gains (volume), what its schema is, and how each column’s distribution normally looks. (6) When a table breaks an SLA or a metric drifts beyond its learned threshold (a feed is six hours late, a table that gains ~2M rows/day gains 200, a column’s null rate jumps from 0.1% to 14%), it fires an anomaly. (7) Both planes converge on a single alert + incident router: a webhook into ServiceNow that opens an incident, sets priority by blast radius (the number of downstream assets affected, computed from Monte Carlo’s lineage graph), and pages the owning team via the on-call schedule. (8) Both planes emit metrics to Datadog / Dynatrace for the operational view (pipeline latency, job success, gate pass-rates), shown on dashboards the platform team watches alongside service health.
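
Blast radius is a graph traversal over lineage. The sketch below is an assumption-laden illustration, not Monte Carlo's implementation: the `LINEAGE` map, the table names, and the P1/P2/P3 policy in `priority` are all hypothetical. It walks the lineage breadth-first to collect every downstream asset and derives a priority from what it finds.

```python
from collections import deque

# Hypothetical lineage: table -> tables that read from it directly.
LINEAGE = {
    "bronze.holdings": ["silver.holdings"],
    "silver.holdings": ["gold.nav", "gold.exposure"],
    "gold.nav": ["report.regulatory_nav", "portal.performance"],
    "gold.exposure": ["report.regulatory_nav"],
}


def downstream_assets(lineage: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first walk collecting every asset reachable from `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def priority(affected: set[str], regulated_prefix: str = "report.") -> str:
    """Assumed policy: any regulated report affected is P1, else rank by count."""
    if any(asset.startswith(regulated_prefix) for asset in affected):
        return "P1"
    return "P2" if len(affected) >= 3 else "P3"
```

The same traversal answers the compliance question of which reports are affected and which are clean, since everything outside the returned set is untouched by the incident.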

The defining property of the whole topology: two independent detection mechanisms, deterministic and statistical, feeding one incident pipeline. Great Expectations catches what you can specify; Monte Carlo catches what you cannot. Neither is sufficient alone, and routing both through ServiceNow means a data incident is triaged with the same rigor as a production outage — owned, prioritized, SLA-tracked, and post-mortemed.

Component breakdown

| Component | Representative tool | Role in the platform | Key configuration choices |
| --- | --- | --- | --- |
| Orchestration | Airflow (MWAA / Composer) or Dagster | Schedules pipelines, runs the quality gate as a task, branches on pass/fail | Gate is a blocking task; failure short-circuits promotion, not the whole DAG |
| Quality engine | Great Expectations | Deterministic validation: nullness, ranges, set membership, row counts, custom SQL | Suites versioned in Git; Checkpoints run in-job; Data Docs published per run |
| Observability platform | Monte Carlo | ML baselines for freshness/volume/schema/distribution; lineage; anomaly detection | Read-only metadata + sample access; monitors auto-applied by importance; field-level lineage on |
| Compute / processing | Databricks (Spark) | Reads batches, applies GE checkpoints, writes Delta with promotion logic | Quarantine path on validation failure; idempotent, partition-scoped writes |
| Storage zones | S3 / ADLS / GCS + Delta Lake | Bronze (raw) → silver (validated) → gold (serving) medallion layout | Promotion only on green gate; quarantine bucket with lifecycle expiry |
| Serving | Snowflake / BigQuery + Delta | Trusted tables for BI, regulatory reports, client portal | Monte Carlo monitors these as the “customer-facing” tier; tightest SLAs |
| Incident routing | ServiceNow (ITSM) | Opens/prioritizes/assigns incidents; approvals for overrides; audit trail | Webhook from both planes; priority = blast radius; CMDB-linked data assets |
| Operational observability | Datadog / Dynatrace | Pipeline latency, gate pass-rate, job health next to service health | Custom metrics from GE/MC; SLO dashboards; on-call alerting for the platform itself |
| Identity / SSO | Okta or Entra ID | SSO + SCIM into Monte Carlo, the warehouse, ServiceNow; group-based access | SAML/OIDC; engineers see only their domain’s monitors; least-privilege scopes |
| Secrets | HashiCorp Vault | Warehouse creds, vendor SFTP keys, ServiceNow API tokens (dynamic, rotated) | Dynamic DB credentials; short TTL leases; no static secrets in Airflow vars |
| Data posture (CSPM) | Wiz | Finds PII in the lake, public buckets, over-broad grants on data stores | Data Security Posture Management scan of S3/warehouse; alerts on exposure |
| CI / IaC | GitHub Actions / Jenkins + Terraform | Validates expectation suites in CI; provisions the whole stack as code | Suite changes peer-reviewed; terraform plan gates infra; monitors-as-code |

A few choices deserve the why, because they are the ones teams get wrong.

Why a deterministic gate AND a statistical monitor, not one or the other. This is the central design decision, so it is worth being precise. Great Expectations encodes what you know must be true: a quantity is non-negative, a currency is one of your supported ISO codes, sum(target_weights) per portfolio equals 1.0 within tolerance. These are business rules, they are unambiguous, and you block on them: bad data does not get to land. But you can only write expectations for failure modes you have imagined. The date-format flip that broke Northwind’s NAV passed every expectation they had, because the dates were still valid dates. That class of failure, plausible but wrong, is what Monte Carlo’s anomaly detection exists for: it flags a downstream nav_date distribution that has shifted in a way it has never seen, even though no hand-written rule was violated. Deterministic catches the specifiable; statistical catches the surprising. An architecture with only one has a permanent blind spot.
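
To make "plausible but wrong" concrete, here is a deliberately simple stand-in for a distribution monitor. It is not Monte Carlo's algorithm; `lag_days`, `drifted`, and the 3 × MAD band are assumptions chosen for illustration. It tracks the lag between load date and nav_date: every flipped date is still a valid date, yet the lag lands far outside anything in the baseline.

```python
from datetime import date
from statistics import median


def lag_days(load_date: date, nav_dates: list[date]) -> list[int]:
    """Days between the load date and each row's nav_date."""
    return [(load_date - d).days for d in nav_dates]


def drifted(baseline: list[int], batch: list[int], min_tolerance: float = 1.0) -> bool:
    """Flag the batch when its median lag leaves the baseline's usual band.

    The band is the baseline median +/- 3 * MAD (median absolute deviation),
    floored at `min_tolerance` days so a perfectly stable history still
    tolerates small jitter.
    """
    center = median(baseline)
    mad = median(abs(x - center) for x in baseline)
    tolerance = max(3 * mad, min_tolerance)
    return abs(median(batch) - center) > tolerance
```

With a baseline of mostly one-day lags, a batch loaded on 2026-03-05 whose "03/04/2026" values were read day-first (as 2026-04-03) shows a lag of -29 days and is flagged, while a correctly parsed batch passes.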

Why blocking promotion, not blocking the pipeline. A naive integration fails the whole DAG on a validation error, which pages someone at 3 a.m. and stops all data — including the 95% that was fine. The right pattern is a quarantine-and-promote gate: the job always runs, validation failures route the offending batch (or rows) to a quarantine location and prevent promotion to silver, but good partitions still flow. The incident is raised for triage during business hours unless blast radius is high. Consumers never see bad data, engineers are not woken for a recoverable vendor hiccup, and the quarantined batch is sitting in S3 ready to reprocess once the feed is fixed.
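
The quarantine-and-promote pattern can be sketched as a pure function. This is a minimal illustration under stated assumptions: the `gate` function, the partition keys, and the `is_valid` predicate are hypothetical, and a real job would write to silver and quarantine paths rather than return dictionaries.

```python
from typing import Callable

Row = dict
Partition = list[Row]


def gate(
    partitions: dict[str, Partition],
    is_valid: Callable[[Row], bool],
) -> tuple[dict[str, Partition], dict[str, Partition]]:
    """Split partitions into (promote, quarantine) without raising on bad data."""
    promote: dict[str, Partition] = {}
    quarantine: dict[str, Partition] = {}
    for key, rows in partitions.items():
        # A single invalid row holds back its whole partition, nothing else.
        target = promote if all(is_valid(row) for row in rows) else quarantine
        target[key] = rows
    return promote, quarantine
```

Because the function returns instead of raising, the task itself succeeds and the orchestrator decides what to do with the quarantined set. Note that an empty partition passes this check trivially, which is one reason the volume monitor is still needed.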

Why freshness and volume SLAs are the highest-ROI monitors. Of all the things Monte Carlo can watch, freshness (did this table update when it should have?) and volume (did it get roughly the number of rows it always gets?) catch the overwhelming majority of real incidents for the least configuration, because almost every pipeline failure manifests as stale or wrong-sized before it manifests as anything subtle. A job that silently died shows up as a freshness breach. A feed that delivered an empty file shows up as a volume anomaly. You get these nearly for free from metadata — no row scanning, so no warehouse cost — and they should be the first monitors you turn on, on your most important tables, before you ever touch distribution-level checks.
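
A volume monitor needs nothing more than a history of daily row counts. The sketch below uses a robust z-score as a simple stand-in for a learned baseline; it is not Monte Carlo's model, and the function name and the threshold of 5 are assumptions.

```python
from statistics import median


def volume_anomaly(history: list[int], today: int, threshold: float = 5.0) -> bool:
    """Flag `today` if it sits far outside the table's typical daily row gain.

    Uses a robust z-score: distance from the median, scaled by the median
    absolute deviation (MAD). The 1.4826 factor makes MAD comparable to a
    standard deviation for roughly normal data.
    """
    center = median(history)
    mad = median(abs(x - center) for x in history)
    scale = 1.4826 * mad or 1.0  # fall back to 1.0 when the history is flat
    return abs(today - center) / scale > threshold
```

For a table that normally gains about 2M rows a day, a day with 200 rows scores far above the threshold, while ordinary day-to-day variation does not. The inputs are row counts from the catalog, so no table scan is involved.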

Why route through ServiceNow instead of just Slack. A Slack ping is a notification; it has no owner, no priority, no SLA, and no audit trail — it scrolls away. Regulated data incidents need to be managed: assigned to the owning team, prioritized by impact, tracked to resolution, and post-mortemed, with a record an auditor can review. Opening a ServiceNow incident makes a data quality failure a first-class operational event with the same lifecycle as a production outage. It also gives you the approval workflow for the dangerous action — overriding a quarantine to force-promote a batch should require a change request, not a Slack thumbs-up.
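
As a rough sketch of what the webhook might send, the function below builds (but does not send) a request to ServiceNow's Table API for the incident table. The instance name, the P1/P2/P3 to impact/urgency mapping, and the `owner_group` value are assumptions; ServiceNow typically derives its own priority field from impact and urgency through configurable rules, so adapt the mapping to your instance. Authentication is omitted and would come from Vault.

```python
import json
from urllib.request import Request


def build_incident_request(
    instance: str, table: str, summary: str, priority: str, owner_group: str
) -> Request:
    """Build a POST to the ServiceNow Table API incident endpoint (not sent)."""
    # Assumed mapping from this article's P1/P2/P3 to impact/urgency (1 = high).
    level = {"P1": "1", "P2": "2", "P3": "3"}[priority]
    body = {
        "short_description": f"Data incident on {table}: {summary}",
        "impact": level,
        "urgency": level,
        # Assumed to be a value the instance accepts, commonly a group sys_id.
        "assignment_group": owner_group,
    }
    return Request(
        url=f"https://{instance}.service-now.com/api/now/table/incident",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
```

Keeping request construction separate from sending makes the payload easy to unit-test and to log for the audit trail.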

Implementation guidance

Provision with IaC, and treat monitors as code. Use Terraform for the cloud substrate (S3/ADLS buckets, warehouse, IAM, networking) and keep both your Great Expectations suites and your Monte Carlo monitors in Git, reviewed like application code. Monte Carlo’s monitors-as-code lets you declare SLAs in YAML next to the table they govern, so a new table ships with its freshness and volume monitors in the same pull request — not bolted on weeks later after it has already broken once.

The deterministic gate, expressed as a Great Expectations suite, communicates the intent — these are business invariants, version-controlled:

# expectations/holdings_silver.yml — peer-reviewed, runs in-job before promotion
expectations:
  - expect_column_values_to_not_be_null:
      column: portfolio_id
  - expect_column_values_to_be_in_set:
      column: currency
      value_set: ["USD", "EUR", "GBP", "JPY", "CHF", "AUD"]
  - expect_column_values_to_be_between:
      column: market_value
      min_value: 0          # a holding's market value is never negative
  - expect_column_pair_values_a_to_be_greater_than_b:
      column_A: settlement_date
      column_B: trade_date
      or_equal: true         # settlement on or after trade date; you cannot settle before you trade
  - expect_multicolumn_sum_to_equal:
      column_list: ["alloc_equity", "alloc_fixed", "alloc_cash"]
      sum_total: 1.0
      # weights must sum to 100% — the kind of rule a date-flip won't violate,
      # which is exactly why you also need Monte Carlo watching distributions.
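
Read as plain Python, the same invariants look roughly like this. It is a sketch for intuition only, not how Great Expectations evaluates a suite: the `violations` function and the check names are hypothetical, and the tolerance on the weight sum is an assumed value.

```python
import math
from datetime import date

SUPPORTED_CURRENCIES = {"USD", "EUR", "GBP", "JPY", "CHF", "AUD"}


def violations(row: dict, tolerance: float = 1e-6) -> list[str]:
    """Return the names of the invariants this holdings row breaks."""
    failed = []
    if row.get("portfolio_id") is None:
        failed.append("portfolio_id_not_null")
    if row.get("currency") not in SUPPORTED_CURRENCIES:
        failed.append("currency_in_set")
    if row["market_value"] < 0:
        failed.append("market_value_non_negative")
    if row["settlement_date"] < row["trade_date"]:
        failed.append("settle_on_or_after_trade")
    weights = row["alloc_equity"] + row["alloc_fixed"] + row["alloc_cash"]
    # Compare with a tolerance: 0.6 + 0.3 + 0.1 is not exactly 1.0 in floats.
    if not math.isclose(weights, 1.0, abs_tol=tolerance):
        failed.append("allocations_sum_to_one")
    return failed
```

A row whose currency arrives as "US Dollar" fails the set-membership check immediately, which is the kind of failure the deterministic gate is built to stop.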

The statistical layer is declared, not coded — you tell Monte Carlo what matters, and it learns what normal is:

# monitors/holdings.yml: freshness + volume SLAs, the highest-ROI monitors
# (illustrative layout; check Monte Carlo's monitors-as-code reference for the exact keys)
montecarlo:
  - table: prod.silver.holdings
    freshness:
      schedule: "0 6 * * 1-5"      # expected updated by 06:00 on weekdays
      breach_after: 2h             # page if >2h late — feeds the NAV cycle
    volume:
      sensitivity: high            # MC learns the daily row delta; flag deviation
    field_health:
      columns: [market_value, nav_date, currency]   # null-rate & distribution drift
    notify:
      channel: servicenow
      priority_from: blast_radius  # impact = count of downstream assets in lineage

The gate task in the orchestrator. In Airflow, the validation Checkpoint is a blocking task between ingest and promote. On success it promotes bronze→silver; on failure it routes to quarantine, opens the incident, and stops promotion — without failing the unrelated branches of the DAG:

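# assumes validate_gx records the Checkpoint result instead of raising, so branch always runs
ingest >> validate_gx >> branch
branch >> promote_to_silver      # taken only if the GE Checkpoint passed
branch >> quarantine_and_alert   # taken on validation failure; opens ServiceNow
promote_to_silver >> build_gold  # skipped along with promote_to_silver on the quarantine path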
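
The branch decision itself is a small callable. Airflow's BranchPythonOperator expects a function that returns the task_id to follow; the sketch below shows one such function, assuming the validation result is available as a dictionary with a `success` key (that shape is an assumption, as is the function name).

```python
def choose_branch(checkpoint_result: dict) -> str:
    """Pick the downstream task_id from a validation result.

    Fails closed: anything other than an explicit success=True quarantines.
    """
    if checkpoint_result.get("success") is True:
        return "promote_to_silver"
    return "quarantine_and_alert"
```

Failing closed matters here: a missing or malformed result should never be read as a pass.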

Identity and secrets — no static credentials anywhere. Wire Okta or Entra ID as the SSO/SCIM source for Monte Carlo, the warehouse, and ServiceNow, so a data engineer sees only the monitors for their own domain and an offboarded user loses access everywhere at once. Pull every credential the pipeline needs — warehouse logins, vendor SFTP keys, the ServiceNow API token — from HashiCorp Vault with short-TTL dynamic leases, never from Airflow Variables or environment files. (KloudVin learned this the hard way: database passwords committed to a repo years ago are exactly the failure mode dynamic secrets eliminate — there is no static password to leak.)
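
On the consuming side, short-TTL leases mean the pipeline must be ready to refetch credentials. The sketch below shows that pattern without any Vault client: `LeasedCredential` and the `fetch` callable are hypothetical, and `fetch` stands in for whatever call returns a secret together with its lease duration in seconds.

```python
import time
from typing import Callable, Optional


class LeasedCredential:
    """Cache a short-lived secret and refetch it before the lease expires."""

    def __init__(
        self,
        fetch: Callable[[], tuple[str, float]],   # returns (secret, ttl_seconds)
        refresh_margin: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock
        self._secret: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a secret with at least `refresh_margin` seconds of lease left."""
        if self._secret is None or self._clock() >= self._expires_at - self._margin:
            self._secret, ttl = self._fetch()
            self._expires_at = self._clock() + ttl
        return self._secret
```

Injecting the clock keeps the refresh logic testable without waiting for a real lease to expire.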

Data posture and runtime. Point Wiz at the lake and warehouse as a Data Security Posture Management scan — it finds the PII you did not know was in a bronze bucket, the bucket someone made public, and the over-broad warehouse grant — because a quality incident and an exposure incident often share a root cause (a careless ingestion). On the compute that runs your Spark jobs, CrowdStrike Falcon provides runtime protection so a compromised node processing sensitive holdings data is detected, not assumed-safe.

Enterprise considerations

Security and least privilege. The observability plane should be read-only — Monte Carlo needs metadata, query logs, and small samples, nothing more; never grant it write access to the data it watches. Scope the pipeline’s own identity to exactly what it needs: read bronze, write silver/gold, write quarantine. Route warehouse and SFTP creds through Vault with rotation; SSO everything through Okta/Entra with SCIM deprovisioning. Run Wiz continuously so the data lake’s posture (public buckets, PII sprawl, stale grants) is monitored the same way the pipelines are, and keep Falcon on the processing nodes. The principle: the system that polices data quality must itself be the least-privileged, most-audited thing in the stack, because it can see everything.

Cost optimization. Observability cost is dominated by how much you scan, so engineer for it. (1) Lean on metadata-only monitors — freshness and volume read the catalog and query history, not the rows, so they are nearly free; turn these on broadly. (2) Reserve distribution/field-health scans for tier-1 tables — the regulatory and client-facing gold tables — not every bronze staging table, because column profiling runs queries and queries cost warehouse credits. (3) Sample, don’t full-scan, for distribution checks on large tables; a well-chosen sample catches drift without scanning a billion rows. (4) Right-size Great Expectations — validate on the incoming batch, not the full historical table, and push predicate-heavy checks down into Spark/SQL rather than pulling data into the driver. (5) Tier your SLAs — a tight 2-hour freshness SLA with paging belongs on the NAV feed, not on a quarterly reference table; over-monitoring low-value tables just generates noise and cost. A practical split: metadata monitors everywhere, deep monitors on the ~10% of tables that feed a regulator or a customer.
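
The tiering rule can be written down as policy. This is a hypothetical sketch: the tier numbers, monitor names, and both functions are illustrative, not a Monte Carlo feature.

```python
def monitors_for(tier: int) -> list[str]:
    """Assumed policy: metadata monitors everywhere, deep scans on tier 1 only."""
    monitors = ["freshness", "volume", "schema"]    # metadata-only, near-free
    if tier == 1:
        monitors.append("field_health_sampled")     # runs warehouse queries
    return monitors


def deep_scan_share(table_tiers: dict[str, int]) -> float:
    """Fraction of tables that incur query spend for deep monitoring."""
    deep = sum(1 for tier in table_tiers.values() if tier == 1)
    return deep / len(table_tiers)
```

Tracking the deep-scan share over time is a cheap way to notice when field-health monitors are creeping onto tables that do not justify the query spend.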

Scalability. Each plane scales independently. The quality gate scales with your Spark cluster — validation is just another transformation, and Great Expectations runs distributed over the batch, so it grows with compute. The observability plane scales by table count and is largely decoupled from data volume because it works off metadata; adding the 1,001st table is a monitor definition, not new infrastructure. The thing that does not scale is hand-authored expectations across thousands of tables — which is precisely why the architecture pairs a curated GE suite on your critical tables with broad auto-applied Monte Carlo coverage everywhere else. You author deep checks where correctness is non-negotiable and let statistical monitoring blanket the long tail.

Failure modes — and how the architecture handles each:

| Failure mode | Symptom | Caught by | Response |
| --- | --- | --- | --- |
| Job silently dies | Table never updates | Monte Carlo freshness SLA breach | ServiceNow incident, on-call paged; rerun |
| Vendor sends empty file | Row count collapses | Monte Carlo volume anomaly | Quarantine batch; hold promotion; alert |
| Schema change upstream | New/renamed/dropped column | GE schema expectation + MC schema monitor | Block promotion; incident to data owner |
| Plausible-but-wrong values (date flip, unit error) | Distribution shifts, totals look “off” | Monte Carlo field-health / distribution drift | Flag with lineage; portfolio team reviews |
| Business invariant violated (weights ≠ 100%) | sum(allocations) ≠ 1.0 | Great Expectations deterministic gate | Hard block; row-level quarantine |
| Observability platform itself down | No alerts firing | Datadog/Dynatrace heartbeat on MC connector | Page platform team; gate still independently blocks |

That last row matters: because the two planes are independent, an outage of the observability platform does not disable the quality gate, and vice versa — there is no single point through which all protection flows.

Reliability and the override path. The dangerous operation is force-promoting a quarantined batch — sometimes the vendor data really is fine and the expectation was too strict. That action must not be a quiet command-line flag; it should require a ServiceNow change request with an approver, so every override is recorded, justified, and reviewable. Tune freshness breach_after windows to the real cadence (a feed that is routinely 20 minutes late should not page), and set distribution sensitivity per table — too sensitive and you get alert fatigue, too loose and you miss the drift. Alert fatigue is the silent killer of observability programs: a team that gets paged for noise stops reading the pages, and then misses the real one.
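
The override rule can be enforced in code as well as in process. The sketch below assumes the change request has already been fetched from ServiceNow; the `ChangeRequest` fields, the state names, and the separate-approver rule are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    number: str
    state: str       # assumed states: "requested", "approved", "rejected"
    requester: str
    approver: str


class OverrideDenied(Exception):
    """Raised when a force-promote lacks a valid, approved change request."""


def authorize_force_promote(batch_id: str, cr: ChangeRequest) -> dict:
    """Allow a force-promote only with an approved, independently reviewed CR."""
    if cr.state != "approved":
        raise OverrideDenied(f"{cr.number} is {cr.state}, not approved")
    if cr.approver == cr.requester:
        raise OverrideDenied("approver must differ from requester")
    # The returned record is what you would append to the audit trail.
    return {"batch_id": batch_id, "change_request": cr.number, "approver": cr.approver}
```

Calling this at the top of the force-promote path means there is no code route to promotion that skips the approval record.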

Observability of the observability (and governance). Emit the meta-metrics to Datadog or Dynatrace: gate pass-rate per pipeline (a falling pass-rate means an upstream feed is degrading), time-to-detection and time-to-resolution per incident (your data-SRE SLOs), monitor coverage (what fraction of tier-1 tables have freshness + volume SLAs), and quarantine volume (how much data is being held). For governance, make data ownership explicit — every monitored table maps to an owning team in the ServiceNow CMDB, so an incident routes to a human, not a void. Version expectation suites and monitor definitions in Git, gate suite changes through GitHub Actions / Jenkins CI so a reviewer signs off before a rule changes, and keep an audit trail of every incident and override for the regulator who will, eventually, ask to see it.
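
These meta-metrics are simple to compute once incidents and gate results are recorded. A minimal sketch, with hypothetical function names and input shapes; in practice the values would be emitted as custom metrics rather than returned.

```python
from datetime import datetime, timedelta
from statistics import mean


def gate_pass_rate(results: list[bool]) -> float:
    """Share of gate runs that passed validation."""
    return sum(results) / len(results)


def mean_time_to_detect(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (detected_at - occurred_at) over (occurred, detected) pairs."""
    seconds = mean(
        (detected - occurred).total_seconds() for occurred, detected in incidents
    )
    return timedelta(seconds=seconds)


def monitor_coverage(tier1_tables: set[str], monitored: set[str]) -> float:
    """Share of tier-1 tables that have freshness + volume SLAs defined."""
    return len(tier1_tables & monitored) / len(tier1_tables)
```

A falling pass rate or shrinking coverage is itself an alertable signal, so these belong on the same dashboards as pipeline latency.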

Reference enterprise example

Northwind Capital rolled this out in two phases against their Databricks-on-AWS lakehouse. Phase one (observability first) connected Monte Carlo read-only to Databricks Unity Catalog and Snowflake and turned on freshness + volume SLAs across all ~1,200 tables, with field-health monitors on the ~110 gold tables feeding regulatory reports and the client portal. This required zero pipeline changes and started catching incidents in week one, including a reference-data feed that had been silently 4 hours late for months, which nobody had noticed because it “usually finished by lunch.” Phase two (the gate) added Great Expectations checkpoints to their ~30 most critical ingestion jobs (holdings, transactions, FX rates, benchmark constituents) with quarantine-and-promote semantics, enforcing ~50 deterministic business invariants as blocking checks.

Decisions they made. They routed both planes into ServiceNow, with priority set by Monte Carlo’s blast-radius (an anomaly on a table feeding the regulatory NAV report opened a P1; one on an internal analytics table opened a P3). They wired Okta SSO + SCIM so each of their four data-domain teams saw only their own monitors, and pulled all warehouse and vendor-SFTP credentials from HashiCorp Vault with 1-hour dynamic leases — no static secrets in Airflow. Wiz ran a DSPM scan that immediately flagged two bronze buckets containing client PII with broader grants than intended (an easy win that justified the program to the CISO on its own). The force-promote override was locked behind a ServiceNow change request after an engineer, early on, manually pushed a “looks fine” batch that turned out not to be.

The numbers. ~1,200 monitored tables; ~30 jobs with deterministic gates; ~110 gold tables on tight SLAs. Monte Carlo’s metadata monitors added negligible warehouse cost; field-health scans on the gold tier (sampled, not full) were the only meaningful query spend. Monthly platform run cost landed near ₹11.6 lakh (~$13,900): Monte Carlo SaaS the largest line, Databricks gate compute modest (validation rides existing jobs), Datadog/Dynatrace and the ServiceNow integration the remainder; Vault and Okta were shared enterprise services already in place. Mean time-to-detection for data incidents fell from an embarrassing ~9 days (when someone happened to notice) to ~25 minutes (freshness/volume SLA breach to ServiceNow incident).

The outcome. In the first quarter the gate blocked three would-be reporting incidents before they reached a regulator — including a transaction feed where a vendor’s currency field silently switched from ISO codes to full names ("US Dollar" instead of "USD"), which the expect_column_values_to_be_in_set check caught and quarantined in seconds. Monte Carlo caught a fourth that no expectation covered: a benchmark feed where the values were all valid numbers but the distribution had shifted because the vendor changed index methodology — exactly the plausible-but-wrong class that deterministic rules miss. The line that got the CFO’s attention: because every incident now carried lineage showing precisely which downstream reports and client-portal figures were affected, the team could tell compliance “these 12 reports are impacted, these 1,788 are clean” within minutes — turning what used to be a firm-wide fire drill into a scoped, auditable response.

When to use it

Use this architecture when you have data that feeds decisions where being wrong is expensive — regulatory reporting, financial figures, customer-facing numbers, ML feature stores — and a table count large enough that no human can hold “what normal looks like” in their head. That covers most of the enterprise data estate the moment it crosses a few dozen tables and one regulated consumer.

Trade-offs to accept. This is real infrastructure with ongoing cost: an observability SaaS bill, query spend for deep monitors, and the standing effort of curating expectation suites and tuning SLA thresholds so the signal stays trustworthy. Statistical detection has a warm-up period (Monte Carlo needs history to learn a baseline) and will produce false positives you must tune out — the first month is noisier than the steady state. And neither plane is free of judgment: an over-strict expectation blocks good data, an over-loose monitor misses real drift, and alert fatigue will quietly defeat the whole program if you let noise accumulate.

Anti-patterns. (1) Only deterministic checks — you cover the bugs you have imagined and stay blind to the plausible-but-wrong failures that cause the worst incidents. (2) Only statistical monitoring — no enforcement gate, so bad data still lands and you are merely informed of disasters rather than preventing them. (3) Failing the whole pipeline on any validation error — pages people for recoverable hiccups and stops good data; quarantine-and-promote instead. (4) Alerts into a Slack channel with no owner — notifications scroll away unmanaged; route to ServiceNow with ownership and SLAs. (5) Deep distribution scans on every table — needless warehouse cost and noise; reserve them for the tier-1 tables that feed a regulator or a customer. (6) A bare command-line override to force-promote quarantined data — make it a ServiceNow change request, or your gate is theater.

Alternatives, and when they win. If you have a handful of static tables and a small team, dbt tests in your transformation layer cover deterministic quality cheaply and may be all you need — graduate to a dedicated observability platform when table count and incident cost outgrow hand-authored tests. If you live entirely inside one cloud, the native services (AWS Glue Data Quality, Azure’s Microsoft Purview data quality, GCP Dataplex) integrate tightly and avoid a third-party bill — at the cost of the cross-cloud lineage and maturity that Monte Carlo brings. And if you are early — pre-regulatory, low volume, few consumers — start with freshness and volume monitors only on your most important tables; that single step catches the majority of incidents for almost no effort, and the full architecture here is the destination you grow into as the stakes rise, not always the starting line.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
