A retail bank’s Chief Data Officer gets the same question from three directions in the same week. The model-risk team asks which downstream reports consume the credit_score_v3 feature, because a regulator wants to know where an automated decision came from. A data scientist asks why the customer_ltv dashboard quietly dropped 8% overnight and which upstream table changed. And the privacy office asks for every column across the estate that holds a customer’s date of birth, so it can honor a deletion request under DPDP and GDPR before the deadline. Three questions, one root cause: nobody can answer “where did this data come from, where does it go, and who is allowed to touch it” without a forensic week of Slack archaeology. That is the gap a data catalog, lineage graph, and governance layer fill — and at enterprise scale it is not optional tooling, it is the difference between passing an audit and explaining yourself in a consent order.
This article is a reference architecture for building that capability properly: not a spreadsheet of table names, but a living catalog that harvests metadata automatically, computes column-level lineage end to end, enforces ownership through stewardship workflows, and turns “I need access to that dataset” into an auditable ServiceNow request instead of a favor. It holds up from a 200-table startup warehouse to a 200,000-asset multi-cloud estate.
The business scenario
The recurring driver is regulated data the business can no longer keep track of by hand. Our retail bank — call it the running example — runs a Snowflake warehouse fed by Kafka and dozens of batch jobs, a Databricks lakehouse for ML features, BI in Power BI and Looker, and a long tail of operational Postgres and Oracle databases behind microservices. Each system has its own notion of “the customer.” There are eleven different columns somewhere that all mean account balance, three of them stale. When BCBS 239 (risk-data aggregation) and the privacy regulators come knocking, the bank must produce, on demand, a defensible answer to provenance and access questions — and “we’ll get back to you” is itself a finding.
The naive fixes fail predictably. A wiki of table definitions is stale the day after it is written, because nobody updates docs when they alter a pipeline. Asking each team to maintain its own glossary produces a dozen incompatible glossaries and zero cross-system lineage. Grepping query logs by hand when an incident hits gives you lineage for exactly one question, once, and never again. And granting database access by email approval leaves no record a steward can defend two years later when an auditor asks who approved the analyst’s read on the PII schema and why.
A governance platform threads this needle. Metadata is harvested, not hand-typed — connectors crawl Snowflake, Databricks, Kafka, Power BI, and the rest on a schedule and on change, so the catalog reflects reality. Lineage is computed from SQL, from orchestrator run events, and from BI dataset definitions, so “what feeds this report” is a graph query, not an investigation. Ownership is a first-class property of every asset, so there is always a named steward to escalate to. And access is a workflow — a request, a policy check, an approval, a provisioning action, an audit record — not a Slack thumbs-up. The catalog becomes the system of record for meaning, provenance, and permission, sitting beside (never inside) the systems that hold the data itself.
The scenario scales cleanly. The small enterprise catalogs one warehouse, a few hundred assets, a handful of stewards. The large enterprise federates dozens of source systems across AWS, Azure, and on-prem, hundreds of thousands of assets, a stewardship org with a tiered escalation model, and a hard requirement that no PII column goes ungoverned. The same architecture serves both — the difference is connector count, the size of the graph store, and how deep the approval chains run, not the shape of the diagram.
Architecture overview
The platform has three planes that share a metadata core but run on different cadences: a harvest plane (scheduled + event-driven ingestion of technical metadata and lineage), a governance plane (humans curating glossary, ownership, classifications, and policy), and an access plane (request-to-grant workflows that actually change permissions in the source systems). Keeping them mentally separate is the first step to operating this well — most failed catalog projects collapse the second and third into “we’ll do governance later,” and never do.
At the center sits the metadata store and catalog — in this reference design, DataHub as the open-source backbone, with notes throughout on where Collibra is the better fit for a buy-over-build, heavily-regulated shop. DataHub holds every asset as a typed entity (dataset, column, dashboard, ML feature, pipeline) in a graph, backed by Elasticsearch for search and a relational store for the entity timeline; Collibra holds the same notions in its Data Intelligence Cloud with a stronger out-of-the-box workflow engine and operating-model templates.
Harvest path: scheduled ingestion recipes (DataHub) or Edge connectors (Collibra) crawl each source — Snowflake’s INFORMATION_SCHEMA and ACCESS_HISTORY/QUERY_HISTORY for tables and query lineage, the Unity Catalog API for Databricks tables and notebooks, Kafka schema registries for topics, the Power BI and Looker REST APIs for reports and their dataset bindings. Each connector authenticates with a read-only service identity whose credentials live in HashiCorp Vault and are injected at runtime, never baked into a recipe. Crawls run as containerized jobs orchestrated by Airflow or Argo and triggered both on a schedule and on change events (a Snowflake task, a Databricks job-completion webhook). The harvested metadata is emitted, not pushed blindly — DataHub ingestion produces Metadata Change Proposals onto a Kafka topic; the metadata service (GMS) validates them against the entity model and applies them, so a malformed crawl cannot corrupt the graph.
Lineage computation is the part that earns the platform its keep. Table-to-table lineage comes free from orchestrator run events and warehouse access history. Column-level lineage — report.gross_margin ← sales.net_revenue − costs.cogs, traced through five intermediate models — is parsed from SQL with SQLGlot/dbt manifests or harvested from Spark’s logical plans via the OpenLineage Spark agent. dbt is the highest-fidelity source: its manifest.json declares model dependencies and the compiled SQL gives column derivations, so a dbt-driven warehouse yields near-perfect lineage with almost no extra work. Everything emits in the OpenLineage standard, so Airflow, Spark, dbt, and Flink all speak one lineage vocabulary into the catalog.
Governance path: stewards work in the catalog UI. They link physical columns to business glossary terms (the canonical “Account Balance,” with a definition, an owner, and a steward), apply classification tags (PII, PCI, Confidential, Restricted) — which can be auto-suggested by a classifier scanning column names and sampled values — and resolve proposals raised by other users. New assets enter an onboarding workflow: an uncatalogued dataset appearing in a crawl raises a task assigned to the source system’s owner to classify and document it before it is consumable. Identity for every human is federated from Okta (or Microsoft Entra ID) via SAML/OIDC, so a steward’s group memberships — and the assets they own — flow from the IdP, not a local user table.
Access path: a user finds a dataset in the catalog and clicks Request Access. This does not grant anything directly. It opens a ServiceNow request (via DataHub Actions or Collibra’s workflow connector) carrying the asset URN, the requested role, the business justification, and the data’s classification. ServiceNow routes the approval — to the dataset’s steward, and, for PII/Restricted data, additionally to the data-protection officer — and on approval fires a fulfillment automation that calls the source system’s grant API (a Snowflake GRANT, a Unity Catalog permission, an Okta group assignment that maps to a Databricks entitlement). The catalog records the grant against the asset and the requester. The permission lives where the data lives; the catalog is the system of record for why it exists and who approved it.
Component breakdown
| Concern | Reference tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Catalog / metadata graph | DataHub (or Collibra) | System of record for assets, glossary, ownership, classifications, lineage | DataHub: GMS + Kafka MCP stream + Elasticsearch; Collibra: Data Intelligence Cloud with operating-model templates |
| Technical metadata ingestion | DataHub recipes / Collibra Edge | Crawl source schemas, owners, stats, usage | Read-only service identity per source; scheduled + change-triggered; stateful ingestion for soft-deletes |
| Column-level lineage | OpenLineage + dbt manifest + SQLGlot | Compute field-to-field derivation across the estate | dbt manifest as ground truth; OpenLineage Spark/Airflow agents; SQL parse for ad-hoc warehouse queries |
| Business glossary & stewardship | DataHub / Collibra workflows | Canonical terms, definitions, owners, proposal/approval | Term ↔ column links; onboarding workflow for new assets; tiered steward escalation |
| Classification | Native tags + classifier | Tag PII/PCI/Restricted; drive policy | Auto-suggest from name/value sampling; steward confirms; tags propagate down lineage |
| Identity & SSO | Okta / Entra ID | Federate users, groups, steward roles | SAML/OIDC to the catalog; SCIM for provisioning; group claims map to ownership |
| Access workflow | ServiceNow | Request → approval → provision → audit | Catalog item per asset class; approval routing by classification; fulfillment calls source grant APIs |
| Secrets | HashiCorp Vault | Connector credentials, source service accounts | Dynamic DB creds with short TTL; AppRole for ingestion jobs; no secrets in recipes |
| Data posture / CSPM | Wiz | Find ungoverned PII, public buckets, over-grants in the cloud estate | Data-scanning to flag sensitive stores; feed findings back as catalog tags |
| Runtime protection | CrowdStrike Falcon | Protect the catalog/ingestion hosts and detect anomalous metadata-store access | Sensor on GMS/connector nodes; correlate with audit log egress |
| Observability | Datadog / Dynatrace | Ingestion freshness, lineage-job latency, API health, SLOs | Trace each crawl; alert on stale connectors; dashboard steward-queue depth |
| IaC & CI | Terraform + GitHub Actions / Jenkins | Provision the platform and ship recipe/glossary changes as code | Catalog config in Git; PR review on glossary; plan/apply gated |
A few choices deserve the why, because they are the ones teams get wrong.
Why column-level, not just table-level, lineage. Table lineage tells you customer_360 depends on transactions. That is nearly useless for the questions that matter. The regulator does not ask “which tables feed the model,” they ask “which field drove this decision,” and the privacy office does not need “which tables hold customer data,” they need “which columns hold a date of birth so I can delete them.” Column lineage answers both, and it is the only thing that makes impact analysis real: change the type of one source column and the catalog lights up every downstream field, report, and ML feature that will break — before you ship the change, not after the dashboard goes red.
Why harvest metadata instead of curating it by hand. A catalog that depends on humans typing in table descriptions is stale within a sprint and abandoned within a quarter — this is the single most common way catalog initiatives die. The technical layer (schemas, columns, owners-from-source, usage stats, lineage) must be machine-harvested on a schedule so it tracks reality automatically; humans then add only the thin, high-value layer machines cannot infer — business meaning, sensitivity judgment, ownership intent. Get this split wrong and you are running a documentation project, not a governance platform.
Why access requests route through ServiceNow rather than living in the catalog. Two reasons. First, auditability: ServiceNow is already the enterprise system of record for change and access, with retention, reporting, and SOX controls the auditors already trust — reusing it means access grants land in the evidence trail they already sample. Second, separation of duties: the catalog describes and requests; the source system enforces; ServiceNow approves and records. No single component both decides and grants, which is exactly the control an auditor wants to see. The catalog never holds a standing privilege to mutate production permissions on its own.
Why DataHub vs Collibra. They solve the same problem from opposite ends.
| Dimension | DataHub | Collibra |
|---|---|---|
| Model | Open source, self-hosted (or Acryl managed) | Commercial SaaS |
| Strength | Engineering-native, real-time lineage, extensible entity model, Kafka-driven | Business-governance-native, rich workflow engine, operating-model templates, regulatory accelerators |
| Lineage fidelity | Excellent for code/SQL/dbt/Spark via OpenLineage | Good; stronger on business lineage and policy mapping than raw column parsing |
| Best fit | Tech-forward orgs with strong data engineering and a build culture | Heavily regulated enterprises buying a governed operating model with stewards and committees |
| Cost shape | Infra + ops effort (you run it) | License per user/connector; lower ops, higher subscription |
The honest guidance: a bank with a mature data-engineering org and a desire to own its stack picks DataHub; a bank whose data office is led by risk-and-compliance and wants a turnkey stewardship operating model picks Collibra. Many large estates run both — DataHub for engineering-grade technical lineage, Collibra as the business-facing policy and glossary system — federated so a term defined in Collibra resolves to assets harvested by DataHub. The architecture below is the same either way; only the box in the center changes its label.
Implementation guidance
Provision with IaC, and treat connectors as code. Use Terraform for the platform footprint — the Kubernetes cluster (EKS/AKS) running DataHub’s GMS, frontend, Elasticsearch, and the Kafka it rides on; the managed Postgres/MySQL for the entity store; the Vault namespace and policies; the ServiceNow integration user. Critically, treat ingestion recipes and glossary definitions as version-controlled artifacts, not clicks in a UI — a recipe is a YAML file in Git, reviewed in a pull request, applied by CI. This is what makes the catalog reproducible and the change to it reviewable.
A DataHub ingestion recipe for Snowflake communicates the intent — note the credentials are referenced, never inline:
source:
type: snowflake
config:
account_id: "${SNOWFLAKE_ACCOUNT}"
username: "${SNOWFLAKE_USER}" # injected from Vault at runtime
password: "${SNOWFLAKE_PASSWORD}" # short-TTL dynamic credential
role: "DATAHUB_READONLY" # least-privilege harvest role
include_table_lineage: true
include_column_lineage: true # parse ACCESS_HISTORY for field lineage
include_usage_stats: true
profiling:
enabled: true
sample_size: 10000 # stats without scanning full tables
classification:
enabled: true # auto-suggest PII tags from values
sink:
type: datahub-rest
config:
server: "https://datahub.gms.internal:8080"
The Snowflake side gets a dedicated DATAHUB_READONLY role with IMPORTED PRIVILEGES on the metadata schemas and SELECT only where profiling needs samples — never OWNERSHIP, never write. The job pulls its short-lived Snowflake password from Vault via AppRole at start; the credential expires before the next crawl, so a leaked recipe leaks nothing.
Wire lineage at the source, not after the fact. Add the OpenLineage Airflow plugin and Spark listener so every orchestrated run emits lineage as it executes — this is far more reliable than reconstructing lineage from logs later. For the warehouse, point DataHub at Snowflake ACCESS_HISTORY (column lineage from actual executed queries) and ingest the dbt manifest.json every build through the GitHub Actions pipeline, so model-level and column-level derivations land in the catalog the moment a transformation ships. The result is lineage that is current with the code, not a quarterly reconstruction.
Build the access workflow as an integration, not a feature. A DataHub Action (or Collibra workflow) listens for RequestAccess events and creates a ServiceNow request via the Table or Service Catalog API, stamping the asset URN, classification, and requested role into custom fields. ServiceNow’s flow engine routes approvals by classification (steward for internal data; steward plus DPO for PII/Restricted) and, on approval, calls a fulfillment script that hits the source’s grant API. Make the grant idempotent and reversible — record an expiry so time-boxed access auto-revokes, and write the grant back onto the asset in the catalog so “who can read this” is answerable from one place.
Federate identity once. Connect the catalog to Okta/Entra over OIDC for SSO and SCIM for user/group provisioning, so steward roles and ownership derive from IdP groups. A user removed from the risk-analytics group in Okta loses their catalog role and — because the access workflow keyed grants to that group — their downstream data access, in one revocation. This is the join that makes joiner/mover/leaver actually work across the data estate.
Enterprise considerations
Security & Zero Trust. The catalog is a map of where your sensitive data lives — which makes it a high-value target in its own right, and a component you must govern as carefully as the data. Apply: (a) least-privilege harvest identities — every connector is read-only, scoped to metadata and sampling, with credentials in Vault on short TTLs; a compromised connector cannot exfiltrate rows or alter grants. (b) No standing write to production permissions — the catalog requests; only ServiceNow-approved fulfillment grants, so there is no path from “catalog compromised” to “permissions changed” without an approval record. © Runtime protection — CrowdStrike Falcon sensors on the GMS, Elasticsearch, and connector nodes, watching for anomalous access to the metadata store and unexpected egress of the audit log. (d) Posture management — Wiz scans the cloud data estate for the gap the catalog cannot see itself: a PII-laden S3 bucket nobody catalogued, a Snowflake share granted too broadly, a public storage account; its findings feed back as catalog tags and steward tasks, closing the loop between what is governed and what exists. (e) Classification drives policy — a column tagged PII automatically tightens its access-approval chain and can trigger masking policies in the warehouse (Snowflake dynamic data masking, Unity Catalog column masks) bound to the catalog tag, so sensitivity is enforced, not just labeled.
Cost optimization. The spend here is operational, not token-metered, and the traps are different. (1) Right-size the graph store — Elasticsearch and the entity DB grow with asset count and lineage edge count; column lineage on a wide estate is the heavy part, so retain lineage history with a sensible window rather than forever, and archive stale assets via stateful soft-delete. (2) Crawl on change, not on a tight timer — polling every source every hour burns compute and source-system query credits (Snowflake bills the harvest queries); trigger on job-completion events and fall back to a daily full crawl. (3) Profile by sampling — never full-scan tables for statistics; a 10k-row sample gives distribution and PII signal at a fraction of the warehouse cost. (4) Buy vs build honestly — DataHub’s license is “free” but you pay in a platform team and on-call; Collibra’s subscription is real money but absorbs that ops load. For a mid-size estate the total-cost crossover usually favors managed (Acryl/Collibra) until you have engineers who would run it anyway.
Scalability. Each plane scales independently. Ingestion is horizontally parallel — connectors are stateless containers; add workers to crawl more sources concurrently, and DataHub’s Kafka-based MCP stream absorbs bursts so a big crawl does not block the API. The read path (search, lineage queries, the UI) scales by adding Elasticsearch nodes and GMS replicas; lineage graph traversal is the query to watch — deep column-lineage walks across hundreds of thousands of edges want a tuned graph index and pagination, or they become the slow query that pages someone. The governance plane scales with people, not machines: the bottleneck at large estates is steward capacity, which is why the operating model (tiered stewards, auto-classification to cut manual tagging, onboarding workflows that distribute the load to source owners) matters more than any server.
Reliability & DR (RTO/RPO). The catalog is a derived system — its great reliability advantage is that almost everything in it can be rebuilt by re-crawling the sources. The durable state you must protect is the human-added layer: glossary, ownership, classifications, and the access-approval history. Back the entity DB and Elasticsearch with regular snapshots and ship glossary/recipe definitions in Git (so they are inherently versioned and recoverable). A pragmatic target: RTO of a few hours (stand up the platform from IaC, restore the entity store, re-run ingestion to refresh technical metadata) and RPO measured in the snapshot interval for curated metadata — with the understanding that technical metadata and lineage self-heal on the next crawl. The access audit trail lives in ServiceNow, which carries its own enterprise DR posture, so the compliance record survives even a total catalog loss.
Observability. Instrument the harvest plane hard, because silent connector failure is the catalog’s worst failure mode — a crawl that quietly stops makes the catalog confidently wrong, which is more dangerous than obviously empty. Emit per-connector freshness (time since last successful crawl) and completeness metrics to Datadog/Dynatrace, and alert when a source goes stale past its SLA. Trace each ingestion run end to end (extract → validate → emit → index) with timing, and watch lineage-job latency and steward-queue depth — the latter is the leading indicator of governance debt. A dashboard the data office actually reads: assets uncatalogued, PII columns unclassified, stale connectors, and access requests aging past SLA.
Governance (of the platform itself). Version the entity model and glossary in Git; review glossary changes in a pull request like code, because a definition is a contract. Pin connector versions so an upgrade does not silently change how lineage is parsed. Apply Terraform-enforced policy so no source can be onboarded without an owner and a classification step, and so every new dataset enters the onboarding workflow rather than appearing untagged. Log every access grant and every classification change for audit — and, because the catalog itself records who can see what, treat its access (who can read the map of your PII) as privileged and review it on the same cadence as the data it describes.
Reference enterprise example
Sterling Federal, a fictional retail and commercial bank (~6,000 employees, ~140 branches), built this platform after a regulatory exam flagged that it could not produce data lineage for its capital-adequacy reports within the required window. Their estate: a Snowflake warehouse (~9,000 tables), a Databricks lakehouse with ~600 ML features, Kafka with ~400 topics, Power BI and Looker with ~1,200 reports, and a long tail of Oracle/Postgres behind core-banking microservices.
Decisions they made. They chose DataHub (self-hosted on EKS) for engineering-grade lineage, with Collibra layered as the business-facing policy and committee system for the risk-and-compliance org — federated so Collibra glossary terms resolved to DataHub-harvested assets. They drove column lineage from dbt manifests (their warehouse was ~80% dbt) plus Snowflake ACCESS_HISTORY for the hand-written remainder, and OpenLineage agents on Airflow and Spark. Every connector ran read-only with Vault AppRole dynamic credentials. Access went through ServiceNow: a Request Access in DataHub opened a catalog item, routed to the dataset steward and — for the ~1,900 columns auto-classified PII — additionally to the DPO, with fulfillment firing Snowflake GRANTs and Unity Catalog permissions, time-boxed and written back to the catalog. Okta federated identity and SCIM-provisioned steward roles. Wiz flagged ~30 ungoverned sensitive data stores in the first scan, each raised as a steward onboarding task. Datadog watched connector freshness; CrowdStrike Falcon protected the GMS and metadata store.
The numbers. ~11,200 assets under management, ~1.4 million column-lineage edges. Median lineage-query response for a five-hop column trace ~600 ms. The platform ran on EKS at roughly ₹6.2 lakh (~$7,400)/month — DataHub compute + Elasticsearch ~$3,200, the Collibra subscription the largest single line (business-facing, ~120 named users), Datadog/Falcon/Wiz allocations and Snowflake harvest query credits the rest. The one-time back-catalog classification ran auto-suggest across the estate and a six-week steward sprint to confirm.
The outcome. The capital-report lineage that previously took a team a week now rendered as a column-lineage graph in under a minute, and the bank cleared the exam finding. A right-to-be-forgotten request that once meant emailing every team for “where is this customer’s data” became a single classification query returning every PII column and its owning system, cutting fulfillment from weeks to a day. Access reviews — previously a quarterly fire drill — became a standing report: every grant in the catalog carried an approver, a justification, and an expiry. And when a steward proposed changing a source column’s type, impact analysis showed the eleven downstream reports and three ML features it would break before the change shipped, turning a would-be incident into a planned migration.
When to use it
Use this architecture when you have data spread across multiple systems, regulated or sensitive data you must account for, lineage or provenance obligations (BCBS 239, SOX, GDPR/DPDP, model-risk governance), and access decisions that need an audit trail. That covers most enterprises past their first warehouse — banking, insurance, healthcare, and any business where “where did this number come from” and “who approved this access” are questions someone will eventually be required to answer under oath.
Trade-offs to accept. A catalog is a parallel system you must keep faithful to a moving estate — harvest has to be relentless or the map drifts from the territory, and a stale catalog is worse than none because people trust it. Column lineage is genuinely hard for hand-written SQL, stored procedures, and black-box ETL tools; coverage will be excellent where you use dbt/Spark/OpenLineage and patchy where you don’t, and closing the gap is ongoing work. And governance is fundamentally a people program — the tooling enables stewards, but without a real operating model (named owners, escalation, incentives to classify) the prettiest catalog stays empty.
Anti-patterns. (1) Hand-curating technical metadata — it goes stale and the project dies; harvest it. (2) Table-only lineage — it cannot answer the impact and privacy questions that justify the platform; invest in column-level. (3) Granting access in the catalog directly — you lose separation of duties and the audit trail; route through ServiceNow to the source system. (4) Connectors with write or broad-read privileges — the catalog becomes an exfiltration path; read-only, sampled, Vault-credentialed. (5) Classification as a one-time project — new sensitive data appears constantly; make onboarding a standing workflow with Wiz catching what slips through. (6) No freshness alerting — a silently dead connector makes the catalog confidently wrong.
Alternatives, and when they win. If your whole estate is one cloud, the native catalog — AWS Glue Data Catalog + DataZone, Azure Purview / Microsoft Fabric, or Google Dataplex — is the path of least resistance and integrates deepest with that cloud’s permissions; you give up the multi-cloud, vendor-neutral federation DataHub/Collibra provide. If you are small and dbt-centric, dbt’s own docs and exposures plus a lightweight lineage viewer may be all the governance you need until scale or regulation forces the step up. And if your only pain is discovery — people can’t find datasets — a catalog without the access-workflow and policy planes is a fine starting point; add the governance and access machinery when an auditor, a regulator, or a breach makes provenance and approval non-negotiable. The architecture here is the destination for a regulated multi-cloud estate, not always the first step.