fix(ingestion/oracle): fix profiling crashes and silent table exclusions#16396

Merged
acrylJonny merged 5 commits into master from oracle-bug-fixes
Mar 4, 2026

Conversation

@acrylJonny

@acrylJonny acrylJonny commented Mar 2, 2026

Collaborator

Summary

Fixes three latent bugs that crash or silently break Oracle ingestion when profiling is enabled (all integration test configs have profiling disabled, so none of these were caught). Also fixes a class of URN construction bugs that cause view lineage and usage lineage to produce non-matching entity references when connecting via service_name in Oracle Multitenant (PDB) environments.

Bug 1 — Invalid SQL for sample value queries

AND ROWNUM <= N was appended to a bare SELECT * FROM table with no WHERE clause, producing invalid SQL and crashing all column sample value collection for Oracle.
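The shape of the fix can be sketched as follows. This is a minimal illustration, not the code from this PR: `build_sample_query` and its parameters are hypothetical names, and the real logic lives in the connector's data reader.

```python
from typing import Optional


def build_sample_query(table: str, limit: int, where: Optional[str] = None) -> str:
    """Build an Oracle-style sample query limited by ROWNUM (hypothetical helper)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    base = f"SELECT * FROM {table}"
    if where:
        # A predicate already exists, so the ROWNUM filter can be chained with AND.
        return f"{base} WHERE ({where}) AND ROWNUM <= {limit}"
    # No predicate yet: the ROWNUM filter has to open the WHERE clause itself.
    return f"{base} WHERE ROWNUM <= {limit}"
```

With no existing predicate, the helper emits `WHERE ROWNUM <= N` rather than a dangling `AND ROWNUM <= N`, which is the invalid form described above.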

Bug 2 — AttributeError: self.report in DBA mode

OracleInspectorObjectWrapper called self.report.failure() / self.report.warning() in get_db_name, get_pk_constraint, and get_foreign_keys, but the class had no report attribute. Fixed by adding report: SQLSourceReport to the constructor and passing self.report at the single instantiation site. Only affects data_dictionary_mode: DBA.
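A reduced sketch of the pattern, assuming stand-in classes: `FakeReport` and `InspectorWrapper` are hypothetical and much simpler than `SQLSourceReport` and `OracleInspectorObjectWrapper`, whose real signatures are not shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeReport:
    """Stand-in for the source report; it only records failure messages."""

    failures: List[str] = field(default_factory=list)

    def failure(self, message: str) -> None:
        self.failures.append(message)


class InspectorWrapper:
    """Hypothetical stand-in for the inspector wrapper class."""

    def __init__(self, report: FakeReport) -> None:
        # Without this assignment, self.report.failure(...) raises AttributeError.
        self.report = report

    def get_db_name(self) -> str:
        try:
            raise RuntimeError("simulated query error")
        except RuntimeError as e:
            # The error path now reports the problem instead of crashing.
            self.report.failure(f"get_db_name failed: {e}")
            return ""
```

Injecting the report through the constructor also makes the error path easy to assert on in a unit test, which is what the regression test listed under Testing checks.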

Bug 3 — Profiling silently excludes all tables when limits are null

In SQL, NULL < NULL evaluates to NULL not TRUE, so setting profile_table_row_limit or profile_table_size_limit to null caused the entire WHERE clause to evaluate to NULL, silently excluding every table from profiling. Fixed with IS NULL short-circuit checks.
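The NULL behaviour can be reproduced in a few lines. The sketch below uses SQLite from the Python standard library as a stand-in for Oracle, since a comparison with NULL is never TRUE in either; the table `tables_meta`, its columns, and the `:row_limit` parameter are hypothetical.

```python
import sqlite3
from typing import List, Optional

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tables_meta (name TEXT, num_rows INTEGER)")
conn.executemany(
    "INSERT INTO tables_meta VALUES (?, ?)",
    [("small", 10), ("large", 1_000_000)],
)


def eligible(where_clause: str, row_limit: Optional[int]) -> List[str]:
    """Return table names passing the filter; row_limit=None binds SQL NULL."""
    rows = conn.execute(
        f"SELECT name FROM tables_meta WHERE {where_clause} ORDER BY name",
        {"row_limit": row_limit},
    )
    return [r[0] for r in rows]


naive = "num_rows < :row_limit"
guarded = "(:row_limit IS NULL OR num_rows < :row_limit)"

no_limit_naive = eligible(naive, None)      # comparison with NULL is never TRUE
no_limit_guarded = eligible(guarded, None)  # IS NULL short-circuits the check
with_limit = eligible(guarded, 100)         # a real limit still filters
```

Here `no_limit_naive` is empty, `no_limit_guarded` contains both tables, and `with_limit` contains only `small`.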

View lineage and usage URN construction for service_name connections

When using service_name (Oracle Multitenant / PDB) with add_database_name_to_urn: true, several issues caused entity URNs, view lineage URNs, and V$SQL usage lineage URNs to diverge:

  • DB_NAME_QUERY now prefers CON_NAME over DB_NAME: In multitenant Oracle, sys_context('USERENV','DB_NAME') returns the CDB name (often NONE), while sys_context('USERENV','CON_NAME') returns the PDB name. The query now uses CON_NAME unless connected to the CDB root, eliminating the none.edw.table URN problem in most setups without any config changes.

  • New urn_db_name config field: An explicit override for the rare cases where auto-detection still returns the wrong value (e.g. service name does not route directly to the target PDB).

  • get_identifier now uses urn_db_name: Entity URNs now include the DB prefix from urn_db_name when add_database_name_to_urn: true, so all three URN construction paths (entities, view lineage, V$SQL lineage) produce the same result.

  • Casing normalisation across all paths: urn_db_name values that are ALL_UPPERCASE are lowercased (matching Oracle's normalize_name convention) so entity URNs and lineage URNs share the same DB component regardless of whether convert_urns_to_lowercase is set.

  • get_db_schema "None" string bug: The fallback for 2-part identifiers was calling str(self.config.database) which produced the literal string "None" when database was unset, causing the SQL parser to construct None.EDW.TABLE URNs for view definitions.
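The last two bullets can be illustrated with a short sketch. The function names and the precedence of `database` over `urn_db_name` are assumptions made for illustration; the helper actually extracted in this PR is `_normalize_db_name()`, whose body is not shown in this conversation.

```python
from typing import Optional


def normalize_db_name(name: Optional[str]) -> Optional[str]:
    """Lowercase names that are entirely uppercase; leave other names untouched."""
    if name is None:
        return None
    # Assumed rule, following the description above: ALL_UPPERCASE -> lowercase.
    return name.lower() if name.isupper() else name


def default_db(database: Optional[str], urn_db_name: Optional[str]) -> Optional[str]:
    """Pick the DB component for 2-part identifiers without stringifying None."""
    # str(None) == "None", which is how the literal "None" leaked into URNs.
    return database or normalize_db_name(urn_db_name)
```

Returning `None` rather than `"None"` lets the caller omit the database component entirely when nothing is configured.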

eager_graph_load for restricted ingestion runs

Oracle now enables eager_graph_load when not ingesting both tables and views (e.g. when running with a restricted table_pattern). This pre-fetches all known schemas from DataHub before processing view definitions and V$SQL queries, so column-level lineage can resolve references to tables outside the current scope. Matches the pattern used by Snowflake.

Ingestion stage reporting

Oracle now reports METADATA_EXTRACTION, QUERIES_EXTRACTION, and LINEAGE_EXTRACTION using the shared constants from ingestion_stage.py, consistent with Snowflake and Redshift.

Testing

  • Added regression test for the WHERE ROWNUM fix (no integration coverage since profiling is disabled everywhere in integration configs)
  • Added test that report.failure() is called in the DBA mode error path
  • Added tests for urn_db_name in get_db_name, get_identifier, and get_db_schema
  • Added tests for the get_db_schema "None" regression (returns None not "None")
  • Added tests for eager_graph_load being True/False based on ingestion config
  • Updated DB_NAME_QUERY test to assert presence of both CON_NAME and DB_NAME
  • Fixed broken OracleInspectorObjectWrapper test setup (missing report constructor arg)
  • Fixed mypy errors in integration tests (tuple[str] message, .impact attribute, missing constructor arg)

@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 2, 2026
@github-actions

github-actions Bot commented Mar 2, 2026

Contributor

Linear: ING-1793

@datahub-cyborg datahub-cyborg Bot added the needs-review Label for PRs that need review from a maintainer. label Mar 2, 2026
@codecov

codecov Bot commented Mar 2, 2026


Codecov Report

❌ Patch coverage is 94.59459% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...gestion/src/datahub/ingestion/source/sql/oracle.py | 94.44% | 2 Missing ⚠️ |

@rajatoss

rajatoss commented Mar 2, 2026

Member

Connector Tests Results

Connector tests failed for commit 1f2cc00

View full test logs →

Autogenerated by the connector-tests CI pipeline.

@datahub-cyborg datahub-cyborg Bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 2, 2026
@maggiehays maggiehays added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 3, 2026
return regular
db = self.database
if not db and self.urn_db_name:
# Replicate Oracle's normalize_name: ALL_UPPERCASE → lowercase.

Contributor

It would be nice to have a common function to normalize urn_db_name, shared by this function and get_db_schema.

Collaborator Author

Done — extracted _normalize_db_name() at module level. Both get_identifier and get_db_schema now call it instead of duplicating the inline expression.



def test_oracle_sample_query_uses_where_rownum():
from datahub.ingestion.source.sql.sqlalchemy_data_reader import (

Contributor

Move all imports to the top of the file.

Collaborator Author

done

generate_usage_statistics=self.config.include_usage_stats,
generate_operations=self.config.include_operational_stats,
usage_config=self.config if self.config.include_usage_stats else None,
eager_graph_load=not (

Contributor

Users with restricted table_pattern now get eager_graph_load=true automatically. On large DataHub instances, this could cause memory/performance problems. Consider making this configurable.

Collaborator Author

Added lazy_schema_resolver: bool = Field(default=False) to OracleConfig, mirroring Snowflake's implementation exactly. When set, it suppresses the automatic eager load even when table_pattern is restricted.
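The resulting decision can be sketched roughly as below. This is an assumption-laden illustration based only on the prose in this thread: `ScopeConfig` is a hypothetical reduced config, the `include_tables` / `include_views` field names are assumed, and the restricted table_pattern case is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ScopeConfig:
    """Hypothetical subset of the connector config used for this sketch."""

    include_tables: bool = True
    include_views: bool = True
    lazy_schema_resolver: bool = False


def should_eager_load(config: ScopeConfig) -> bool:
    """Decide whether to pre-fetch known schemas from DataHub before parsing SQL."""
    if config.lazy_schema_resolver:
        # Explicit opt-out: never pre-fetch, even for restricted runs.
        return False
    # Pre-fetch only when the run does not ingest both tables and views itself.
    return not (config.include_tables and config.include_views)
```

Defaulting `lazy_schema_resolver` to `False` keeps the automatic behaviour, while giving large instances a way to avoid the memory cost raised in the review comment.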

# V$SQL second: observed queries are added to the aggregator after the
# schema resolver is populated. _generate_aggregator_workunits is a no-op
# in the parent call above (overridden below) so lineage is not emitted yet.
with self.report.new_stage(QUERIES_EXTRACTION):

Contributor

Remove the with self.report.new_stage(QUERIES_EXTRACTION): wrapper. The inner _populate_aggregator_from_queries() already opens the same stage.

Collaborator Author

Done — get_workunits_internal now calls _populate_aggregator_from_queries() directly. That method owns the stage itself, so there's no duplicate.

@@ -350,7 +354,16 @@ class ProcedureDependencies(BaseModel):
)

DB_NAME_QUERY = """

Contributor

Multitenant Oracle users with service_name will get new URNs (PDB name instead of CDB). Add entry to docs/how/updating-datahub.md with migration path and mention the urn_db_name workaround.

Collaborator Author

Added an entry to docs/how/updating-datahub.md explaining the CDB/PDB distinction, why service_name connections now correctly use the PDB name, and how to use urn_db_name to pin the old value if needed to avoid re-creating existing entities.

default=None,
description="If using, omit `service_name`.",
)
urn_db_name: Optional[str] = Field(

Contributor

Clarify when urn_db_name should be used. Currently if both database and urn_db_name are set, entity and lineage URNs diverge.

Collaborator Author

Updated the field description to make clear it only applies when service_name is used (i.e. database is not set), and added an explicit warning: "Do not set this alongside database; only one should be used."

Contributor

Could you use a model validator?
If both database and urn_db_name are set, this PR uses urn_db_name and silently ignores database. It would be better to raise an error, so that users are not left guessing which value applies when both are set.

Example:

@model_validator(mode="after")
def validate_connections(self) -> "SnowplowSourceConfig":
    """Validate that at least one connection type is configured."""
    if self.bdp_connection is None and self.iglu_connection is None:
        raise ValueError(
            "Either bdp_connection or iglu_connection must be configured. "
            "BDP connection is required for managed Snowplow deployments. "
            "Iglu connection is required for open-source deployments."
        )
    # Iglu-only mode: automatic discovery via /api/schemas endpoint
    if self.bdp_connection is None and self.iglu_connection is not None:
        logging.getLogger(__name__).info(
            "Iglu-only mode: will use automatic schema discovery via /api/schemas endpoint. "
            "Requires Iglu Server 0.6+ with list schemas support."
        )
    return self
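
Applied to the Oracle case, the suggestion might look like the sketch below. It assumes pydantic v2 and uses a hypothetical, reduced `OracleLikeConfig`; it is not the validator that was merged, and the error wording is invented for illustration.

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class OracleLikeConfig(BaseModel):
    """Hypothetical reduced config; not the real OracleConfig."""

    database: Optional[str] = None
    service_name: Optional[str] = None
    urn_db_name: Optional[str] = None

    @model_validator(mode="after")
    def check_db_name_fields(self) -> "OracleLikeConfig":
        # Reject the ambiguous combination instead of silently picking one value.
        if self.database is not None and self.urn_db_name is not None:
            raise ValueError(
                "Set only one of `database` or `urn_db_name`, not both."
            )
        return self
```

Failing at config-parse time surfaces the conflict before any URNs are emitted, so entity and lineage URNs cannot diverge because of it.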

@maggiehays maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 4, 2026
@maggiehays maggiehays added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 4, 2026
Comment thread metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py Outdated
Comment thread docs/how/updating-datahub.md Outdated

### Breaking Changes

- #16396: Oracle connector: When connecting via `service_name` to a multitenant Oracle database with `add_database_name_to_urn: true`, the database component of URNs will now reflect the Pluggable Database (PDB) name instead of the Container Database (CDB) name. In Oracle Multitenant architecture, a CDB is the top-level container (e.g. `cdb`) and a PDB is an individual tenant database within it (e.g. `mypdb`); `service_name` typically routes to the PDB, so the PDB name is the correct identifier for your datasets. If your existing metadata was ingested with the old CDB-based URNs, re-ingesting will create new dataset entities under the corrected URNs. To preserve the old URN shape and avoid re-creating entities, set `urn_db_name` explicitly in your recipe to match your previous CDB name.

Contributor

This entry should mention that container URNs are also affected, not just dataset URNs. Container creation uses get_db_name() regardless of add_database_name_to_urn.

Collaborator Author

Sure, that's added now

@maggiehays maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 4, 2026
@maggiehays maggiehays added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 4, 2026

@kyungsoo-datahub kyungsoo-datahub left a comment

Contributor

LGTM

@acrylJonny acrylJonny merged commit a1b5edb into master Mar 4, 2026
61 of 62 checks passed
@acrylJonny acrylJonny deleted the oracle-bug-fixes branch March 4, 2026 17:08
david-leifker pushed a commit that referenced this pull request May 27, 2026
- fix(ingest/teradata): set DATABASE context for view HELP commands (#16208)
- fix(redshift): use boundary-aware segment stitching for query reconstruction (#16253)
- fix(ingestion): update save button style (#16427)
- improvement(ui): design review changes for dataset summary and ingestion page (#16429)
- fix(ingestion/oracle): fix profiling crashes and silent table exclusions (#16396)
- docs(release): v0.3.16.5-acryl (#16428)
- fix(ui): Remove filter we don't support from run results tab (#16433)
- feat(agent-context): support sql search filters in mcp tools (#16403)

Labels

ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer.

4 participants