fix(ingestion/oracle): fix profiling crashes and silent table exclusions by acrylJonny · Pull Request #16396 · datahub-project/datahub

acrylJonny · 2026-03-02T10:28:31Z

Summary

Fixes three latent bugs that crash or silently break Oracle ingestion when profiling is enabled (all integration test configs have profiling disabled, so none of these were caught). Also fixes a class of URN construction bugs that cause view lineage and usage lineage to produce non-matching entity references when connecting via service_name in Oracle Multitenant (PDB) environments.

Bug 1 — Invalid SQL for sample value queries

AND ROWNUM <= N was appended to a bare SELECT * FROM table with no WHERE clause, producing invalid SQL and crashing all column sample value collection for Oracle.
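A minimal sketch of this failure mode, assuming a hypothetical helper named `build_sample_query` (not the actual function in `sqlalchemy_data_reader`). The row limit has to open the `WHERE` clause when the base query has no predicate, and may use `AND` only when a predicate already exists.

```python
from typing import Optional


def build_sample_query(table: str, limit: int, predicate: Optional[str] = None) -> str:
    """Build an Oracle row-limited sample query (illustrative helper only)."""
    base = f"SELECT * FROM {table}"
    if predicate:
        # An existing WHERE clause can safely be extended with AND.
        return f"{base} WHERE {predicate} AND ROWNUM <= {limit}"
    # With no WHERE clause, the limit itself must introduce one.
    return f"{base} WHERE ROWNUM <= {limit}"


def build_sample_query_buggy(table: str, limit: int) -> str:
    """Reproduces the bug: AND appended to a query that has no WHERE."""
    return f"SELECT * FROM {table} AND ROWNUM <= {limit}"
```

The buggy variant yields `SELECT * FROM t AND ROWNUM <= 3`, which Oracle rejects because `AND` has no preceding condition.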

Bug 2 — `AttributeError: self.report` in DBA mode

OracleInspectorObjectWrapper called self.report.failure() / self.report.warning() in get_db_name, get_pk_constraint, and get_foreign_keys, but the class had no report attribute. Fixed by adding report: SQLSourceReport to the constructor and passing self.report at the single instantiation site. Only affects data_dictionary_mode: DBA.
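The shape of this bug and its fix can be sketched with stand-in classes. `FakeReport`, `WrapperWithoutReport`, and `WrapperWithReport` are hypothetical names used only for illustration; the real `SQLSourceReport.failure()` signature is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeReport:
    """Stand-in for SQLSourceReport; it only records failure messages."""

    failures: List[str] = field(default_factory=list)

    def failure(self, message: str) -> None:
        self.failures.append(message)


class WrapperWithoutReport:
    """Mirrors the bug: the error path uses self.report, which is never set."""

    def get_db_name(self) -> str:
        try:
            raise RuntimeError("query failed")
        except RuntimeError as e:
            self.report.failure(str(e))  # raises AttributeError
            return ""


class WrapperWithReport:
    """Mirrors the fix: the report is injected through the constructor."""

    def __init__(self, report: FakeReport) -> None:
        self.report = report

    def get_db_name(self) -> str:
        try:
            raise RuntimeError("query failed")
        except RuntimeError as e:
            self.report.failure(str(e))
            return ""
```

Because the attribute is only touched on the error path, the missing `report` goes unnoticed until a query actually fails, which is consistent with the bug staying latent.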

Bug 3 — Profiling silently excludes all tables when limits are `null`

In SQL, NULL < NULL evaluates to NULL not TRUE, so setting profile_table_row_limit or profile_table_size_limit to null caused the entire WHERE clause to evaluate to NULL, silently excluding every table from profiling. Fixed with IS NULL short-circuit checks.
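The three-valued logic behind this bug can be reproduced with an in-memory SQLite database as a stand-in for Oracle. The table `all_tables_demo`, its columns, and the helper `tables_to_profile` are assumptions made for this sketch; the connector's real profiling query is not shown in this PR description.

```python
import sqlite3
from typing import List, Optional


def tables_to_profile(
    conn: sqlite3.Connection, row_limit: Optional[int], guarded: bool
) -> List[str]:
    """Return table names that pass the row-limit filter."""
    if guarded:
        # Fixed form: a NULL limit short-circuits to "no limit".
        where = "(:row_limit IS NULL OR num_rows < :row_limit)"
    else:
        # Buggy form: comparing against NULL yields NULL, so no row matches.
        where = "num_rows < :row_limit"
    rows = conn.execute(
        f"SELECT table_name FROM all_tables_demo WHERE {where} ORDER BY table_name",
        {"row_limit": row_limit},
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_tables_demo (table_name TEXT, num_rows INTEGER)")
conn.executemany(
    "INSERT INTO all_tables_demo VALUES (?, ?)",
    [("ORDERS", 1000), ("CUSTOMERS", 50)],
)
```

With `row_limit=None`, the unguarded filter returns no tables at all, while the guarded filter returns both; with a concrete limit the two forms behave identically.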

View lineage and usage URN construction for `service_name` connections

When using service_name (Oracle Multitenant / PDB) with add_database_name_to_urn: true, several issues caused entity URNs, view lineage URNs, and V$SQL usage lineage URNs to diverge:

DB_NAME_QUERY now prefers CON_NAME over DB_NAME: In multitenant Oracle, sys_context('USERENV','DB_NAME') returns the CDB name (often NONE), while sys_context('USERENV','CON_NAME') returns the PDB name. The query now uses CON_NAME unless connected to the CDB root, eliminating the none.edw.table URN problem in most setups without any config changes.
New urn_db_name config field: An explicit override for the rare cases where auto-detection still returns the wrong value (e.g. service name does not route directly to the target PDB).
get_identifier now uses urn_db_name: Entity URNs now include the DB prefix from urn_db_name when add_database_name_to_urn: true, so all three URN construction paths (entities, view lineage, V$SQL lineage) produce the same result.
Casing normalisation across all paths: urn_db_name values that are ALL_UPPERCASE are lowercased (matching Oracle's normalize_name convention) so entity URNs and lineage URNs share the same DB component regardless of whether convert_urns_to_lowercase is set.
get_db_schema "None" string bug: The fallback for 2-part identifiers was calling str(self.config.database) which produced the literal string "None" when database was unset, causing the SQL parser to construct None.EDW.TABLE URNs for view definitions.
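The casing rule and its effect on the identifier can be sketched as follows. Both functions are hypothetical illustrations, not the connector's code: `normalize_db_name_sketch` approximates the "ALL_UPPERCASE becomes lowercase" rule described above, and `build_identifier` is an assumed stand-in for how the database prefix joins the schema and table.

```python
from typing import Optional


def normalize_db_name_sketch(name: str) -> str:
    """Lowercase names that are entirely uppercase; leave mixed case untouched."""
    return name.lower() if name == name.upper() else name


def build_identifier(schema: str, table: str, db_name: Optional[str]) -> str:
    """Join the parts of a dataset name, skipping a missing database."""
    parts = [schema, table]
    if db_name:
        parts.insert(0, normalize_db_name_sketch(db_name))
    return ".".join(parts)
```

Applying the same normalisation in every path is what keeps entity URNs and lineage URNs aligned: `MYPDB` and `mypdb` both end up as `mypdb.edw.table`, whereas a mixed-case value is passed through unchanged.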

`eager_graph_load` for restricted ingestion runs

Oracle now enables eager_graph_load when not ingesting both tables and views (e.g. when running with a restricted table_pattern). This pre-fetches all known schemas from DataHub before processing view definitions and V$SQL queries, so column-level lineage can resolve references to tables outside the current scope. Matches the pattern used by Snowflake.

Ingestion stage reporting

Oracle now reports METADATA_EXTRACTION, QUERIES_EXTRACTION, and LINEAGE_EXTRACTION using the shared constants from ingestion_stage.py, consistent with Snowflake and Redshift.

Testing

Added regression test for the WHERE ROWNUM fix (no integration coverage since profiling is disabled everywhere in integration configs)
Added test that report.failure() is called in the DBA mode error path
Added tests for urn_db_name in get_db_name, get_identifier, and get_db_schema
Added tests for the get_db_schema "None" regression (returns None not "None")
Added tests for eager_graph_load being True/False based on ingestion config
Updated DB_NAME_QUERY test to assert presence of both CON_NAME and DB_NAME
Fixed broken OracleInspectorObjectWrapper test setup (missing report constructor arg)
Fixed mypy errors in integration tests (tuple[str] message, .impact attribute, missing constructor arg)
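The "None" regression covered by these tests comes down to calling `str()` on an unset optional. A minimal sketch, assuming simplified stand-ins for `get_db_schema` that take the configured database as an argument (the real method reads it from the source config):

```python
from typing import Optional, Tuple


def get_db_schema_buggy(identifier: str, database: Optional[str]) -> Tuple[Optional[str], str]:
    """Old behaviour: str(None) turns an unset database into the text 'None'."""
    parts = identifier.split(".")
    if len(parts) == 2:
        return str(database), parts[0]
    return parts[0], parts[1]


def get_db_schema_fixed(identifier: str, database: Optional[str]) -> Tuple[Optional[str], str]:
    """Fixed behaviour: an unset database stays None for 2-part identifiers."""
    parts = identifier.split(".")
    if len(parts) == 2:
        return database, parts[0]
    return parts[0], parts[1]
```

A regression test in this style asserts on identity with `None` rather than on truthiness, since the string `"None"` is truthy and would slip through a looser check.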

github-actions · 2026-03-02T10:28:43Z

Linear: ING-1793

codecov · 2026-03-02T10:32:45Z

Codecov Report

❌ Patch coverage is 94.59459% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...gestion/src/datahub/ingestion/source/sql/oracle.py	94.44%	2 Missing ⚠️


rajatoss · 2026-03-02T11:01:56Z

Connector Tests Results

Connector tests failed for commit 1f2cc00


Autogenerated by the connector-tests CI pipeline.

kyungsoo-datahub · 2026-03-03T23:21:39Z

-            return regular
+            db = self.database
+            if not db and self.urn_db_name:
+                # Replicate Oracle's normalize_name: ALL_UPPERCASE → lowercase.


It would be nice to have a common function that normalizes urn_db_name, shared by this function and get_db_schema.

Done — extracted _normalize_db_name() at module level. Both get_identifier and get_db_schema now call it instead of duplicating the inline expression.

kyungsoo-datahub · 2026-03-03T23:22:23Z



+def test_oracle_sample_query_uses_where_rownum():
+    from datahub.ingestion.source.sql.sqlalchemy_data_reader import (


Move all imports to the top of the file.

kyungsoo-datahub · 2026-03-03T23:36:14Z

+            generate_usage_statistics=self.config.include_usage_stats,
+            generate_operations=self.config.include_operational_stats,
+            usage_config=self.config if self.config.include_usage_stats else None,
+            eager_graph_load=not (


Users with restricted table_pattern now get eager_graph_load=true automatically. On large DataHub instances, this could cause memory/performance problems. Consider making this configurable.

Added lazy_schema_resolver: bool = Field(default=False) to OracleConfig, mirroring Snowflake's implementation exactly. When set, it suppresses the automatic eager load even when table_pattern is restricted.
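How the opt-out might combine with the automatic behaviour can be sketched as below. The diff excerpt above is truncated at `eager_graph_load=not (`, so the exact expression in the PR is not known; `should_eager_load` and its `include_tables` / `include_views` parameters are assumptions based on the description "when not ingesting both tables and views".

```python
def should_eager_load(
    include_tables: bool,
    include_views: bool,
    lazy_schema_resolver: bool = False,
) -> bool:
    """Decide whether to pre-fetch known schemas from DataHub (illustrative only)."""
    # A full run already sees every table and view, so nothing needs pre-fetching.
    ingesting_everything = include_tables and include_views
    # The opt-out flag suppresses the eager load even for partial runs.
    return not ingesting_everything and not lazy_schema_resolver
```

Defaulting `lazy_schema_resolver` to `False` keeps the automatic behaviour for existing recipes, while letting operators of large instances trade lineage completeness for lower memory use.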

kyungsoo-datahub · 2026-03-03T23:42:29Z

+        # V$SQL second: observed queries are added to the aggregator after the
+        # schema resolver is populated. _generate_aggregator_workunits is a no-op
+        # in the parent call above (overridden below) so lineage is not emitted yet.
+        with self.report.new_stage(QUERIES_EXTRACTION):


Remove the `with self.report.new_stage(QUERIES_EXTRACTION):` wrapper. The inner _populate_aggregator_from_queries() already opens the same stage.

Done — get_workunits_internal now calls _populate_aggregator_from_queries() directly. That method owns the stage itself, so there's no duplicate.
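The duplicate-stage problem raised in this thread can be illustrated with a stand-in report. `StageReport`, the placeholder value of `QUERIES_EXTRACTION`, and the three functions are hypothetical; only the pattern (the callee owns the stage, so the caller must not wrap it again) is taken from the discussion.

```python
from contextlib import contextmanager
from typing import Iterator, List

QUERIES_EXTRACTION = "Queries Extraction"  # placeholder value for this sketch


class StageReport:
    """Stand-in for the source report; records every stage that is opened."""

    def __init__(self) -> None:
        self.stages: List[str] = []

    @contextmanager
    def new_stage(self, name: str) -> Iterator[None]:
        self.stages.append(name)
        yield


def populate_aggregator_from_queries(report: StageReport) -> None:
    # The callee owns the stage.
    with report.new_stage(QUERIES_EXTRACTION):
        pass  # query processing would happen here


def workunits_with_duplicate_stage(report: StageReport) -> None:
    # Before: the caller opens the same stage a second time.
    with report.new_stage(QUERIES_EXTRACTION):
        populate_aggregator_from_queries(report)


def workunits_fixed(report: StageReport) -> None:
    # After: the caller delegates and the stage is opened once.
    populate_aggregator_from_queries(report)
```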

kyungsoo-datahub · 2026-03-04T03:34:57Z

@@ -350,7 +354,16 @@ class ProcedureDependencies(BaseModel):
 )

 DB_NAME_QUERY = """


Multitenant Oracle users with service_name will get new URNs (PDB name instead of CDB). Add entry to docs/how/updating-datahub.md with migration path and mention the urn_db_name workaround.

Added an entry to docs/how/updating-datahub.md explaining the CDB/PDB distinction, why service_name connections now correctly use the PDB name, and how to use urn_db_name to pin the old value if needed to avoid re-creating existing entities.

kyungsoo-datahub · 2026-03-04T03:37:52Z

        default=None,
        description="If using, omit `service_name`.",
    )
+    urn_db_name: Optional[str] = Field(


Clarify when urn_db_name should be used. Currently if both database and urn_db_name are set, entity and lineage URNs diverge.

Updated the field description to make clear it only applies when service_name is used (i.e. database is not set), and added an explicit warning: "Do not set this alongside database; only one should be used."

kyungsoo-datahub · 2026-03-04T12:57:31Z

        default=None,
        description="If using, omit `service_name`.",
    )
+    urn_db_name: Optional[str] = Field(


Could you use a model validator?
If both database and urn_db_name are set, this PR uses urn_db_name and silently ignores database. It would be better to raise an error, so that users learn the two settings conflict instead of having to guess which value wins.

Example:

datahub/metadata-ingestion/src/datahub/ingestion/source/snowplow/snowplow_config.py

Lines 478 to 495 in 755f0bf

    @model_validator(mode="after")
    def validate_connections(self) -> "SnowplowSourceConfig":
        """Validate that at least one connection type is configured."""
        if self.bdp_connection is None and self.iglu_connection is None:
            raise ValueError(
                "Either bdp_connection or iglu_connection must be configured. "
                "BDP connection is required for managed Snowplow deployments. "
                "Iglu connection is required for open-source deployments."
            )

        # Iglu-only mode: automatic discovery via /api/schemas endpoint
        if self.bdp_connection is None and self.iglu_connection is not None:
            logging.getLogger(__name__).info(
                "Iglu-only mode: will use automatic schema discovery via /api/schemas endpoint. "
                "Requires Iglu Server 0.6+ with list schemas support."
            )

        return self
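Applied to the Oracle case, the requested check could look roughly like the sketch below. It uses a stdlib dataclass with `__post_init__` as a stand-in for a Pydantic `model_validator(mode="after")`, so that it runs without extra dependencies. `OracleConfigSketch` is a hypothetical class with only the three fields under discussion, and the error message wording is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OracleConfigSketch:
    """Illustrative subset of the connector config (not the real OracleConfig)."""

    database: Optional[str] = None
    service_name: Optional[str] = None
    urn_db_name: Optional[str] = None

    def __post_init__(self) -> None:
        # Fail fast instead of silently preferring one value over the other.
        if self.database is not None and self.urn_db_name is not None:
            raise ValueError(
                "database and urn_db_name are mutually exclusive; "
                "set urn_db_name only when connecting via service_name."
            )
```

Raising at construction time means a conflicting recipe fails before any metadata is emitted, rather than producing entity and lineage URNs that quietly diverge.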

kyungsoo-datahub · 2026-03-04T13:50:59Z


 ### Breaking Changes

+- #16396: Oracle connector: When connecting via `service_name` to a multitenant Oracle database with `add_database_name_to_urn: true`, the database component of URNs will now reflect the Pluggable Database (PDB) name instead of the Container Database (CDB) name. In Oracle Multitenant architecture, a CDB is the top-level container (e.g. `cdb`) and a PDB is an individual tenant database within it (e.g. `mypdb`); `service_name` typically routes to the PDB, so the PDB name is the correct identifier for your datasets. If your existing metadata was ingested with the old CDB-based URNs, re-ingesting will create new dataset entities under the corrected URNs. To preserve the old URN shape and avoid re-creating entities, set `urn_db_name` explicitly in your recipe to match your previous CDB name.


This entry should mention that container URNs are also affected, not just dataset URNs. Container creation uses get_db_name() regardless of add_database_name_to_urn.

Sure, that's added now

kyungsoo-datahub

LGTM

- fix(ingest/teradata): set DATABASE context for view HELP commands (#16208)
- fix(redshift): use boundary-aware segment stitching for query reconstruction (#16253)
- fix(ingestion): update save button style (#16427)
- improvement(ui): design review changes for dataset summary and ingestion page (#16429)
- fix(ingestion/oracle): fix profiling crashes and silent table exclusions (#16396)
- docs(release): v0.3.16.5-acryl (#16428)
- fix(ui): Remove filter we don't support from run results tab (#16433)
- feat(agent-context): support sql search filters in mcp tools (#16403)

Commit 5d09402: fix(ingestion/oracle): fix profiling crashes and silent table exclusions

github-actions Bot added the `ingestion` label, Mar 2, 2026
github-actions Bot deployed to datahub-wheels (Preview), March 2, 2026 10:30
datahub-cyborg Bot added the `needs-review` label, Mar 2, 2026
vercel Bot deployed to Preview, March 2, 2026 10:43
datahub-cyborg Bot added `pending-submitter-response` and removed `needs-review`, Mar 2, 2026

Commit 277b89d: further fixes

github-actions Bot deployed to datahub-wheels (Preview), March 3, 2026 10:52
vercel Bot deployed to Preview, March 3, 2026 11:06
maggiehays added `needs-review` and removed `pending-submitter-response`, Mar 3, 2026

Commit 97a28ba: Merge branch 'master' into oracle-bug-fixes

github-actions Bot deployed to datahub-wheels (Preview), March 3, 2026 11:47
vercel Bot deployed to Preview, March 3, 2026 12:02
kyungsoo-datahub reviewed, Mar 4, 2026
maggiehays added `pending-submitter-response` and removed `needs-review`, Mar 4, 2026

Commit 4d19b62: addressing feedback

github-actions Bot deployed to datahub-wheels (Preview), March 4, 2026 11:03
maggiehays added `needs-review` and removed `pending-submitter-response`, Mar 4, 2026
vercel Bot deployed to Preview, March 4, 2026 11:17
kyungsoo-datahub reviewed, Mar 4, 2026
maggiehays added `pending-submitter-response` and removed `needs-review`, Mar 4, 2026

Commit 1f2cc00: addressing feedback v2

maggiehays added `needs-review` and removed `pending-submitter-response`, Mar 4, 2026
vercel Bot deployed to Preview, March 4, 2026 16:16
github-actions Bot deployed to datahub-wheels (Preview), March 4, 2026 16:24
kyungsoo-datahub approved these changes, Mar 4, 2026

acrylJonny merged commit a1b5edb into master, Mar 4, 2026 (61 of 62 checks passed)
acrylJonny deleted the oracle-bug-fixes branch, March 4, 2026 17:08



