fix(ingestion/oracle): fix profiling crashes and silent table exclusions#16396
Conversation
|
Linear: ING-1793 |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Connector Tests ResultsConnector tests failed for commit Autogenerated by the connector-tests CI pipeline. |
| return regular | ||
| db = self.database | ||
| if not db and self.urn_db_name: | ||
| # Replicate Oracle's normalize_name: ALL_UPPERCASE → lowercase. |
There was a problem hiding this comment.
Would be nice to have common function to nomalize urn_db_name for this function and get_db_schema function.
There was a problem hiding this comment.
Done — extracted _normalize_db_name() at module level. Both get_identifier and get_db_schema now call it instead of duplicating the inline expression.
|
|
||
|
|
||
| def test_oracle_sample_query_uses_where_rownum(): | ||
| from datahub.ingestion.source.sql.sqlalchemy_data_reader import ( |
There was a problem hiding this comment.
Move all imports on the top of the file.
| generate_usage_statistics=self.config.include_usage_stats, | ||
| generate_operations=self.config.include_operational_stats, | ||
| usage_config=self.config if self.config.include_usage_stats else None, | ||
| eager_graph_load=not ( |
There was a problem hiding this comment.
Users with restricted table_pattern now get eager_graph_load=true automatically. On large DataHub instances, this could cause memory/performance problems. Consider making this configurable.
There was a problem hiding this comment.
Added lazy_schema_resolver: bool = Field(default=False) to OracleConfig, mirroring Snowflake's implementation exactly. When set, it suppresses the automatic eager load even when table_pattern is restricted.
| # V$SQL second: observed queries are added to the aggregator after the | ||
| # schema resolver is populated. _generate_aggregator_workunits is a no-op | ||
| # in the parent call above (overridden below) so lineage is not emitted yet. | ||
| with self.report.new_stage(QUERIES_EXTRACTION): |
There was a problem hiding this comment.
Remove self.report.new_stage(QUERIES_EXTRACTION): . The inner _populate_aggregator_from_queries() already has the same stage.
There was a problem hiding this comment.
Done — get_workunits_internal now calls _populate_aggregator_from_queries() directly. That method owns the stage itself, so there's no duplicate.
| @@ -350,7 +354,16 @@ class ProcedureDependencies(BaseModel): | |||
| ) | |||
|
|
|||
| DB_NAME_QUERY = """ | |||
There was a problem hiding this comment.
Multitenant Oracle users with service_name will get new URNs (PDB name instead of CDB). Add entry to docs/how/updating-datahub.md with migration path and mention the urn_db_name workaround.
There was a problem hiding this comment.
Added an entry to docs/how/updating-datahub.md explaining the CDB/PDB distinction, why service_name connections now correctly use the PDB name, and how to use urn_db_name to pin the old value if needed to avoid re-creating existing entities.
| default=None, | ||
| description="If using, omit `service_name`.", | ||
| ) | ||
| urn_db_name: Optional[str] = Field( |
There was a problem hiding this comment.
Clarify when urn_db_name should be used. Currently if both database and urn_db_name are set, entity and lineage URNs diverge.
There was a problem hiding this comment.
Updated the field description to make clear it only applies when service_name is used (i.e. database is not set), and added an explicit warning: "Do not set this alongside database; only one should be used."
There was a problem hiding this comment.
Could you use model validator?
If both database and urn_db_name are set, this PR takes urn_db_name and database will be silently ignored. Instead, it's good to raise error so that a customer recognizes which value would be used when both values are set.
Example:
| default=None, | ||
| description="If using, omit `service_name`.", | ||
| ) | ||
| urn_db_name: Optional[str] = Field( |
There was a problem hiding this comment.
Could you use model validator?
If both database and urn_db_name are set, this PR takes urn_db_name and database will be silently ignored. Instead, it's good to raise error so that a customer recognizes which value would be used when both values are set.
Example:
|
|
||
| ### Breaking Changes | ||
|
|
||
| - #16396: Oracle connector: When connecting via `service_name` to a multitenant Oracle database with `add_database_name_to_urn: true`, the database component of URNs will now reflect the Pluggable Database (PDB) name instead of the Container Database (CDB) name. In Oracle Multitenant architecture, a CDB is the top-level container (e.g. `cdb`) and a PDB is an individual tenant database within it (e.g. `mypdb`); `service_name` typically routes to the PDB, so the PDB name is the correct identifier for your datasets. If your existing metadata was ingested with the old CDB-based URNs, re-ingesting will create new dataset entities under the corrected URNs. To preserve the old URN shape and avoid re-creating entities, set `urn_db_name` explicitly in your recipe to match your previous CDB name. |
There was a problem hiding this comment.
This entry should mention that container URNs are also affected, not just dataset URNs. Container creation uses get_db_name() regardless of add_database_name_to_urn.
There was a problem hiding this comment.
Sure, that's added now
- fix(ingest/teradata): set DATABASE context for view HELP commands (#16208) - fix(redshift): use boundary-aware segment stitching for query reconstruction (#16253) - fix(ingestion): update save button style (#16427) - improvement(ui): design review changes for dataset summary and ingestion page (#16429) - fix(ingestion/oracle): fix profiling crashes and silent table exclusions (#16396) - docs(release): v0.3.16.5-acryl (#16428) - fix(ui): Remove filter we don't support from run results tab (#16433) - feat(agent-context): support sql search filters in mcp tools (#16403)
Summary
Fixes three latent bugs that crash or silently break Oracle ingestion when profiling is enabled (all integration test configs have profiling disabled, so none of these were caught). Also fixes a class of URN construction bugs that cause view lineage and usage lineage to produce non-matching entity references when connecting via
service_namein Oracle Multitenant (PDB) environments.Bug 1 — Invalid SQL for sample value queries
AND ROWNUM <= Nwas appended to a bareSELECT * FROM tablewith noWHEREclause, producing invalid SQL and crashing all column sample value collection for Oracle.Bug 2 —
AttributeError: self.reportin DBA modeOracleInspectorObjectWrappercalledself.report.failure()/self.report.warning()inget_db_name,get_pk_constraint, andget_foreign_keys, but the class had noreportattribute. Fixed by addingreport: SQLSourceReportto the constructor and passingself.reportat the single instantiation site. Only affectsdata_dictionary_mode: DBA.Bug 3 — Profiling silently excludes all tables when limits are
nullIn SQL,
NULL < NULLevaluates toNULLnotTRUE, so settingprofile_table_row_limitorprofile_table_size_limittonullcaused the entireWHEREclause to evaluate toNULL, silently excluding every table from profiling. Fixed withIS NULLshort-circuit checks.View lineage and usage URN construction for
service_nameconnectionsWhen using
service_name(Oracle Multitenant / PDB) withadd_database_name_to_urn: true, several issues caused entity URNs, view lineage URNs, and V$SQL usage lineage URNs to diverge:DB_NAME_QUERYnow prefersCON_NAMEoverDB_NAME: In multitenant Oracle,sys_context('USERENV','DB_NAME')returns the CDB name (oftenNONE), whilesys_context('USERENV','CON_NAME')returns the PDB name. The query now usesCON_NAMEunless connected to the CDB root, eliminating thenone.edw.tableURN problem in most setups without any config changes.New
urn_db_nameconfig field: An explicit override for the rare cases where auto-detection still returns the wrong value (e.g. service name does not route directly to the target PDB).get_identifiernow usesurn_db_name: Entity URNs now include the DB prefix fromurn_db_namewhenadd_database_name_to_urn: true, so all three URN construction paths (entities, view lineage, V$SQL lineage) produce the same result.Casing normalisation across all paths:
urn_db_namevalues that are ALL_UPPERCASE are lowercased (matching Oracle'snormalize_nameconvention) so entity URNs and lineage URNs share the same DB component regardless of whetherconvert_urns_to_lowercaseis set.get_db_schema"None"string bug: The fallback for 2-part identifiers was callingstr(self.config.database)which produced the literal string"None"whendatabasewas unset, causing the SQL parser to constructNone.EDW.TABLEURNs for view definitions.eager_graph_loadfor restricted ingestion runsOracle now enables
eager_graph_loadwhen not ingesting both tables and views (e.g. when running with a restrictedtable_pattern). This pre-fetches all known schemas from DataHub before processing view definitions and V$SQL queries, so column-level lineage can resolve references to tables outside the current scope. Matches the pattern used by Snowflake.Ingestion stage reporting
Oracle now reports
METADATA_EXTRACTION,QUERIES_EXTRACTION, andLINEAGE_EXTRACTIONusing the shared constants fromingestion_stage.py, consistent with Snowflake and Redshift.Testing
WHERE ROWNUMfix (no integration coverage since profiling is disabled everywhere in integration configs)report.failure()is called in the DBA mode error pathurn_db_nameinget_db_name,get_identifier, andget_db_schemaget_db_schema"None"regression (returnsNonenot"None")eager_graph_loadbeingTrue/Falsebased on ingestion configDB_NAME_QUERYtest to assert presence of bothCON_NAMEandDB_NAMEOracleInspectorObjectWrappertest setup (missingreportconstructor arg)tuple[str]message,.impactattribute, missing constructor arg)