feat(ingest): reconcile upstream lineage URN casing against DataHub #18004
Draft
puneetagarwal-datahub wants to merge 7 commits into
…casing

Add AutoNormalizeLineageUrnsProcessor, a framework work-unit processor that reconciles the casing of upstream warehouse references in lineage against the casing DataHub already stores, so casing mismatches between sources (e.g. an uppercase Snowflake table referenced as lowercase by a BI tool, or vice versa) no longer produce two disconnected lineage nodes.

- Config-driven via FlagsConfig.normalize_lineage_urn_casing (enabled + upstream_platforms list); opt-in, default off; no-op without a backend. Intended for BI-tool ingestions, not the warehouse ingestion itself.
- Reuses SchemaResolverProvider to bulk-load each upstream platform once, then resolves locally. New SchemaResolver.resolve_urn_casing() resolves bidirectionally via a normalized-URN index, preferring exact matches and leaving ambiguous collisions unchanged.
- Fixes table-level (UpstreamLineage, dashboardInfo) and column-level (FineGrainedLineage field paths, via match_columns_to_schema) references. Only upstream references are touched; the entity and downstream fields are not.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a LineageMatchType enum (EXACT/NORMALIZED) and an optional matchType field to the Upstream and FineGrainedLineage aspects, so the UI can explain whether an upstream reference matched an existing entity exactly or was healed via casing normalization. Schema versions of the embedding aspects (upstreamLineage, dataJobInputOutput) are bumped to v2 accordingly.

The lineage URN casing processor now populates matchType: EXACT when a reference already matched an existing entity, NORMALIZED when it was rewritten to heal a casing mismatch, and leaves it unset when no reconciliation was performed. Fine-grained entries aggregate to NORMALIZED if any field was rewritten, else EXACT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
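The aggregation rule for fine-grained entries can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `aggregate_match_type` helper is hypothetical, the enum here is a plain Python stand-in for the generated aspect class, and treating "no field was reconciled" as unset is an assumption drawn from the description above.

```python
from enum import Enum
from typing import Iterable, Optional


class LineageMatchType(Enum):
    """Stand-in for the EXACT/NORMALIZED enum described in the commit."""

    EXACT = "EXACT"
    NORMALIZED = "NORMALIZED"


def aggregate_match_type(
    field_results: Iterable[Optional[LineageMatchType]],
) -> Optional[LineageMatchType]:
    """Hypothetical helper: collapse per-field results into one matchType.

    None means "no reconciliation was performed" for that field.
    """
    resolved = [r for r in field_results if r is not None]
    if not resolved:
        # Assumption: nothing was reconciled, so matchType stays unset.
        return None
    if any(r is LineageMatchType.NORMALIZED for r in resolved):
        # Any rewritten field marks the whole entry as NORMALIZED.
        return LineageMatchType.NORMALIZED
    return LineageMatchType.EXACT
```

How a mix of reconciled and unreconciled fields should be reported is not stated in the commit; the sketch simply ignores the unreconciled ones.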
❌ 35 Tests Failed:
Document the lineage URN casing normalization feature: the problem it solves, how resolution works (bidirectional, exact-wins, collision-safe), what it fixes (table + column level across upstreamLineage/fineGrainedLineage/dashboardInfo), configuration reference, where to enable it (BI ingestions, not warehouse), the matchType explainability field, and its requirements/limitations. Registered in the ingestion guides sidebar.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_lowercase

The guide implied no mitigation exists. Acknowledge the current convert_urns_to_lowercase workaround, explain its limits (cross-source coordination, unavailable on Looker/Tableau, lowercasing flattens display casing and can merge distinct case-sensitive tables), and frame this feature as resolving to the existing entity's casing rather than flattening identities.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion

Keep the normalize_lineage_urn_casing flag description coherent with the PR/guide framing: clarify it resolves references to the existing entity's casing rather than lowercasing every URN like convert_urns_to_lowercase.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add tests for real mixed-case identifiers (e.g. `DataHub` vs `datahub`): heal lower->mixed and mixed->lower (both directions), upper->mixed, exact match wins without mis-routing when both casings exist, and ambiguous third-casing left unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The guide claimed a Snowflake-uppercase + BI-lowercase setup 'stays broken regardless' because Looker/Tableau lack the flag. That's overstated: you can lowercase the warehouse side to connect them. Clarify the real limitation: the only lever is forcing lowercase, so you cannot connect them while preserving the warehouse's original casing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
DataHub compares URNs as exact, case-sensitive strings. When two sources refer to the same physical table with different casing, e.g. a warehouse emits `urn:li:dataset:(urn:li:dataPlatform:snowflake,DB.SCHEMA.TABLE,PROD)` (uppercase) while a BI tool references `...,db.schema.table,PROD)` (lowercase), DataHub treats them as two different entities, so the lineage edge between them is not drawn and downstream multi-hop lineage silently breaks.

The only existing mitigation is the per-source `convert_urns_to_lowercase` flag, which keeps lineage connected by lowercasing every URN. That requires coordinating the flag across all sources, isn't available on some BI connectors (Looker, Tableau), and on case-sensitive platforms risks merging genuinely distinct tables (`MyTable` vs `mytable`) while also losing the warehouse's display casing.

This PR adds an opt-in ingestion-framework work unit processor that takes a different approach: instead of flattening identities to lowercase, it resolves each upstream warehouse reference to the casing of the entity that already exists in DataHub. It does so per ingestion, with no global coordination, preserving the warehouse's original casing, and only when the match is unambiguous.
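The resolution idea (exact match wins, otherwise look up a case-normalized index, and leave ambiguous or unknown references alone) can be illustrated with a small standalone sketch. `CasingIndex` and its methods are hypothetical names for illustration and are not the `SchemaResolver` API from this PR; lowercasing the whole URN string as the index key is a simplifying assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class CasingIndex:
    """Hypothetical index of known URNs keyed by their lowercased form."""

    def __init__(self, known_urns: Iterable[str]) -> None:
        self._exact: Set[str] = set(known_urns)
        self._by_normalized: Dict[str, Set[str]] = defaultdict(set)
        for urn in self._exact:
            # Simplification: normalize by lowercasing the full URN string.
            self._by_normalized[urn.lower()].add(urn)

    def resolve(self, urn: str) -> str:
        """Return the stored casing for `urn`, or `urn` itself if unsure."""
        if urn in self._exact:
            return urn  # exact match wins, nothing to heal
        candidates = self._by_normalized.get(urn.lower(), set())
        if len(candidates) == 1:
            return next(iter(candidates))  # unambiguous: adopt stored casing
        return urn  # no match, or ambiguous collision: leave unchanged


PREFIX = "urn:li:dataset:(urn:li:dataPlatform:snowflake,"
index = CasingIndex([PREFIX + "DB.SCHEMA.TABLE,PROD)"])
healed = index.resolve(PREFIX + "db.schema.table,PROD)")
```

Here `healed` carries the uppercase warehouse casing, while a reference with two differently cased candidates in the index would be returned untouched.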
What it does
- `AutoNormalizeLineageUrnsProcessor` (work unit processor). For each configured upstream platform it bulk-loads that platform's URNs + schemas once via the existing `SchemaResolverProvider`, then resolves every lineage reference locally (no per-URN round trips).
- Ambiguous collisions (`orders` vs `Orders` on a case-sensitive platform) and no-match cases are left unchanged.
- Fixes table-level references (`UpstreamLineage`, `DashboardInfo`) and column-level references (`FineGrainedLineage` field paths, using the resolver's schema info to correct column casing).
- Adds a `matchType` (`EXACT`/`NORMALIZED`) discriminator to the `Upstream` and `FineGrainedLineage` aspects, populated by the processor, so the UI can later explain whether a reference was matched exactly or healed via normalization.

Configuration
Opt-in, disabled by default. Enable on a BI-tool ingestion recipe and point it at the upstream warehouse platform(s):
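A rough sketch of what such a recipe might look like, written as a Python dict rather than YAML. Only `normalize_lineage_urn_casing`, `enabled`, and `upstream_platforms` come from this PR; nesting them under a top-level `flags` key, and the `looker` source and `snowflake` platform values, are assumptions chosen for illustration.

```python
# Hypothetical recipe shape; connector-specific settings are left out.
recipe = {
    "source": {
        "type": "looker",  # assumed BI-tool source
        "config": {},  # connector options omitted on purpose
    },
    "flags": {  # assumed location of the FlagsConfig fields
        "normalize_lineage_urn_casing": {
            "enabled": True,  # opt-in; the feature is off by default
            "upstream_platforms": ["snowflake"],  # warehouse(s) to reconcile against
        }
    },
}

casing_flag = recipe["flags"]["normalize_lineage_urn_casing"]
```

The flag belongs on the BI-tool recipe, not on the warehouse recipe, since only upstream references are rewritten.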
No-op without a DataHub backend connection. Only reconciles against entities that already exist at ingestion time.
Testing
- … `dashboardInfo` refs, `matchType` population, and the enable/disable gate.
- … `SchemaResolver` test suite passes unchanged.
- … `matchType` persisting through GMS.

Notes
Checklist
metadata-ingestion/docs/dev_guides/lineage_urn_casing.md

🤖 Generated with Claude Code