refactor(ingestion): workunit processors #17852
Open
sgomezvillamor wants to merge 35 commits into
…urce overrides

- Add WorkunitProcessor ABC with NAME constant, should_enable(), create(), process()
- Add WorkunitProcessorContext and WorkunitProcessorReport dataclasses
- Add WorkunitProcessorReport tracking in SourceReport.workunit_processor_reports
- Add get_excluded_workunit_processors() and get_allowed_workunit_processors() hooks on Source
- Add get_stale_entity_state_type() hook for custom checkpoint state types
- Create 14 processor classes in workunit_processors/ package replacing all old auto_work_units/ functions and source_helpers functions
- Refactor StaleEntityRemovalHandler constructor to take direct dependencies instead of a source reference; remove create() classmethod
- Remove ~62 get_workunit_processors() overrides from sources (base class handles them)
- Iceberg: use get_excluded_workunit_processors() for parallelism-incompatible processors
- PowerBI: minimal get_workunit_processors() override for modified_since mode
- Utility sources (datahub_apply, datahub_gc, dataprocess_cleanup, lineage, rdf, sql_queries, datahub_source, file): use get_allowed_workunit_processors()

https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
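The lifecycle this commit describes (should_enable(), then create(), then process()) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: workunits are plain strings, the `flags` field on the context, the two example processors, `REGISTRY`, and `run()` are all hypothetical names introduced only for this sketch.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Type


@dataclass
class WorkunitProcessorContext:
    # Hypothetical field: the PR's real context carries source/pipeline state.
    flags: Dict[str, bool] = field(default_factory=dict)


class WorkunitProcessor(ABC):
    def __init__(self, ctx: WorkunitProcessorContext) -> None:
        self.ctx = ctx

    @classmethod
    def should_enable(cls, ctx: WorkunitProcessorContext) -> bool:
        return True  # enabled unless a subclass says otherwise

    @classmethod
    def create(cls, ctx: WorkunitProcessorContext) -> "WorkunitProcessor":
        return cls(ctx)

    @abstractmethod
    def process(self, stream: Iterable[str]) -> Iterator[str]:
        """Transform a stream of workunits (plain strings in this sketch)."""


class DropEmptyProcessor(WorkunitProcessor):  # hypothetical processor
    def process(self, stream: Iterable[str]) -> Iterator[str]:
        return (wu for wu in stream if wu)


class UppercaseProcessor(WorkunitProcessor):  # hypothetical processor
    @classmethod
    def should_enable(cls, ctx: WorkunitProcessorContext) -> bool:
        return ctx.flags.get("uppercase", False)

    def process(self, stream: Iterable[str]) -> Iterator[str]:
        return (wu.upper() for wu in stream)


REGISTRY: List[Type[WorkunitProcessor]] = [DropEmptyProcessor, UppercaseProcessor]


def run(ctx: WorkunitProcessorContext, stream: Iterable[str]) -> List[str]:
    # Lifecycle: should_enable() -> create() -> chained process() calls.
    chain = [p.create(ctx) for p in REGISTRY if p.should_enable(ctx)]
    for processor in chain:
        stream = processor.process(stream)
    return list(stream)
```

With an empty context only `DropEmptyProcessor` runs; setting `flags={"uppercase": True}` adds the second processor to the chain without the caller touching the chain itself.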
- test_stateful_ingestion: remove DummySource.get_workunit_processors() override and StaleEntityRemovalHandler.create() call (base class handles it automatically)
- test_sql_queries: update processor assertions to use new NAME constants instead of partial function identity checks
- test_auto_validate_input_fields: update import to workunit_processors package
- test_ensure_aspect_size: update import to workunit_processors package

https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
…essor API

Replace the manual get_workunit_processors() + StaleEntityRemovalHandler.create() pattern with the new automatic wiring approach. Document the new hook methods:
- get_excluded_workunit_processors() for parallelism-incompatible processors
- get_allowed_workunit_processors() for utility/minimal sources
- get_stale_entity_state_type() for custom checkpoint state types

https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
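One way the allow/exclude hooks could narrow the default processor set is sketched below. The processor names appear elsewhere in this PR, but their ordering, the `select_processors()` helper, and the precedence rule (exclusions win over the allow list) are assumptions made for illustration only.

```python
from typing import List, Optional, Sequence

# Names taken from this PR; the ordering here is illustrative only.
DEFAULT_PROCESSORS: List[str] = [
    "AutoBrowsePathV2Processor",
    "ValidateInputFieldsProcessor",
    "EnsureAspectSizeProcessor",
    "StaleEntityRemovalProcessor",
]


def select_processors(
    defaults: Sequence[str],
    allowed: Optional[Sequence[str]] = None,
    excluded: Sequence[str] = (),
) -> List[str]:
    """Assumed semantics: allowed=None keeps every default; exclusions win."""
    return [
        name
        for name in defaults
        if (allowed is None or name in allowed) and name not in excluded
    ]
```

Under these assumptions, a parallelism-sensitive source would pass a short `excluded` list, while a utility source would pass a short `allowed` list and drop everything else.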
❌ 1 Tests Failed:
Update test files and the old compatibility shim to use the new WorkunitProcessorContext-based API introduced in the processor refactoring:
- test_auto_validate_input_fields.py: use ValidateInputFieldsProcessor.create(ctx) and processor.process(stream); check processor.report.num_input_fields_filtered
- test_ensure_aspect_size.py: use EnsureAspectSizeProcessor.create(ctx) via WorkunitProcessorContext; set payload_constraint directly for the constraint test
- auto_validate_input_fields.py: remove reference to SourceReport.num_input_fields_filtered, which was moved to ValidateInputFieldsReport

https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
…ass type Use a TypeVar bound on the classmethod so that mypy infers the concrete subclass type when calling ProcessorSubclass.create(ctx), rather than the base WorkunitProcessor. This fixes mypy attr-defined errors when tests access subclass-specific attributes (payload_constraint, ensure_view_properties_size, etc.) on the result of create(). https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
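The TypeVar-bound classmethod pattern mentioned in this commit looks roughly like the sketch below. The class bodies are simplified stand-ins, and the default value given to `payload_constraint` is a hypothetical placeholder; only the typing pattern itself is the point.

```python
from typing import Type, TypeVar

# Bound TypeVar: create() returns whatever subclass it was called on.
_P = TypeVar("_P", bound="WorkunitProcessor")


class WorkunitProcessor:
    def __init__(self, ctx: object) -> None:
        self.ctx = ctx

    @classmethod
    def create(cls: Type[_P], ctx: object) -> _P:
        # mypy infers the concrete subclass here, not the base class.
        return cls(ctx)


class EnsureAspectSizeProcessor(WorkunitProcessor):
    payload_constraint: int = 1000  # hypothetical default for this sketch


processor = EnsureAspectSizeProcessor.create(ctx=None)
# No attr-defined error: the static type is EnsureAspectSizeProcessor.
limit = processor.payload_constraint
```

Without the `cls: Type[_P]` annotation, `create()` would be typed as returning `WorkunitProcessor`, and accessing `payload_constraint` on the result would need a cast.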
- test_ensure_aspect_size: fix patch paths (auto_work_units → workunit_processors), rename ensure_aspect_size() → process(), fix report.warnings → ctx.source_report.warnings
- test_thoughtspot_source: update _references_stale_handler to accept StaleEntityRemovalProcessor (new wrapper) alongside StaleEntityRemovalHandler
- test_notion_source: replace hasattr(source, stale_entity_removal_handler) check with a processor-chain check using stateful_ingestion config
- test_informatica: replace StaleEntityRemovalHandler.create() mock (removed) with a direct check for StaleEntityRemovalProcessor in the processor chain
- test_dataplex: replace deleted get_workunit_processors() override test with an assertion that the override no longer exists (base class handles it now)

https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
- source.py: use getattr(self, "platform", None) instead of infer_platform() to preserve pre-refactor behavior of getattr(source, "platform", "default"). Sources without a platform attribute now correctly fall through to "default" in StaleEntityRemovalHandler._init_job_id(), fixing the golden file failure for the file source (job ID was changing from default_stale_entity_removal to metadata-file_stale_entity_removal).
- test_thoughtspot_source.py: fix I001 ruff violation by moving the workunit_processors.stale_entity_removal import to after all source.* imports (alphabetically workunit_processors > source).
- test_dataplex_source.py: fix F401 ruff violation by removing the unused StatefulIngestionSourceBase import that was added inside the test function body.

https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
…backward-compat stale removal job IDs Add `source_platform` field to `WorkunitProcessorContext` to carry the fully-inferred platform (including @platform_name decorator fallback). `AutoBrowsePathV2Processor` uses this for `_prepend_platform_instance`, fixing the Dremio platform-instance browse path regression. The existing `platform` field (raw `self.platform` attribute) is kept unchanged so `StaleEntityRemovalHandler._init_job_id()` still produces `"default_stale_entity_removal"` for sources without an explicit `platform` attribute (file, Dremio, etc.), matching pre-refactor behavior. Also applies ruff format fix to test_thoughtspot_source.py (long lines). https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
…stion tests StatefulIngestionSourceBase requires pipeline_name when stateful ingestion is enabled. The new tests (test_stale_entity_removal_handler_registered and test_stateful_ingestion_processor_wired_up) created PipelineContext without pipeline_name, triggering ConfigurationError at source instantiation. https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
DatahubIngestionCheckpointingProvider.create() raises ValueError when ctx.graph is None. Tests that construct sources with stateful_ingestion enabled need both pipeline_name and a mock graph in PipelineContext, matching the pattern used by other stateful source tests (e.g. ThoughtSpot). Fixes: tests.unit.informatica.test_source.TestSourceLifecycle::test_stale_entity_removal_handler_registered tests.unit.ingestion.source.notion.test_notion_source::test_stateful_ingestion_processor_wired_up https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
Line 1485 in test_stale_entity_removal_handler_registered was 90 chars (2 over the 88-char limit), causing ruff format --check to fail. Split PipelineContext args across lines to stay within the limit. https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy
Remove the NAME constant and get_name() method from WorkunitProcessor and all processor implementations. Processor names are now derived directly from class names (__name__). Updated get_allowed_workunit_processors() and get_excluded_workunit_processors() to accept Union[str, Type[WorkunitProcessor]], allowing sources to pass processor classes directly for better type safety and IDE support.

Changes:
- Remove NAME constant from 14 processor classes
- Remove get_name() from WorkunitProcessor base class
- Update Source._get_source_workunit_processors() with _to_name() helper
- Update type signatures to Union[str, Type[WorkunitProcessor]]
- Add WorkunitProcessor to TYPE_CHECKING imports in source.py
- Update 12 source files to use processor classes instead of .NAME
- Update 2 test files to use .__name__ instead of .NAME
- Add explicit List[Type[WorkunitProcessor]] type annotations where needed

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement proper generic typing for WorkunitProcessor using the established
codebase pattern (same as Sink class). This provides type-safe report access
without boilerplate while extracting report types at runtime.
Benefits:
- self.report is properly typed throughout processor methods
- No need for _report property casting
- No redundant report_class attribute
- Single source of truth: WorkunitProcessor[MyReport]
- Uses existing get_class_from_annotation() utility (robust, tested)
Changes:
- Made WorkunitProcessor generic over _ReportT
- Created empty report classes for all 14 processors
- Updated create() to extract report class via get_class_from_annotation()
- Removed _report properties from ValidateInputFieldsProcessor and EnsureAspectSizeProcessor
- Replaced all self._report with self.report (now properly typed)
All processors now follow a consistent pattern:

    @dataclass
    class MyProcessorReport(WorkunitProcessorReport):
        num_processed: int = 0

    class MyProcessor(WorkunitProcessor[MyProcessorReport]):
        def process(self, stream):
            self.report.num_processed += 1  # ✓ Properly typed!
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
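To make the runtime report extraction concrete, here is a self-contained sketch. It does not call DataHub's get_class_from_annotation(); the hypothetical `report_class_of()` helper below stands in for it and reads the type argument from `__orig_bases__` using the standard `typing` introspection functions.

```python
from dataclasses import dataclass
from typing import Generic, Iterable, Iterator, Type, TypeVar, get_args, get_origin


@dataclass
class WorkunitProcessorReport:
    pass


_ReportT = TypeVar("_ReportT", bound=WorkunitProcessorReport)
_P = TypeVar("_P", bound="WorkunitProcessor")


def report_class_of(cls: type) -> Type[WorkunitProcessorReport]:
    # Stand-in helper: find the WorkunitProcessor[...] base and take its argument.
    for base in getattr(cls, "__orig_bases__", ()):
        if get_origin(base) is WorkunitProcessor:
            (report_cls,) = get_args(base)
            return report_cls
    raise TypeError(f"{cls.__name__} does not parameterize WorkunitProcessor")


class WorkunitProcessor(Generic[_ReportT]):
    report: _ReportT

    @classmethod
    def create(cls: Type[_P]) -> _P:
        instance = cls()
        # The report type comes from the annotation, not a separate attribute.
        instance.report = report_class_of(cls)()
        return instance


@dataclass
class MyProcessorReport(WorkunitProcessorReport):
    num_processed: int = 0


class MyProcessor(WorkunitProcessor[MyProcessorReport]):
    def process(self, stream: Iterable[str]) -> Iterator[str]:
        for workunit in stream:
            self.report.num_processed += 1
            yield workunit
```

Because the report is attached inside `create()`, constructing `MyProcessor()` directly would leave `report` unset, which matches the later commit that switches tests to `create()`.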
treff7es
approved these changes
Jun 12, 2026
Add comprehensive metrics tracking to AutoBrowsePathV2ProcessorReport:

Invariant violations:
- num_out_of_batch: URN seen in multiple batches
- num_out_of_order: Child container processed before parent

Browse path emission by source:
- num_browse_path_v2_emitted: From source-generated BrowsePathsV2
- num_container_or_legacy_emitted: Derived from Container or legacy BrowsePaths
- num_fallback_emitted: Fallback for root containers/dataFlow/dataJob

Implementation uses local variables for per-invocation tracking (telemetry) and accumulates totals in report for overall observability.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…terializeReferencedTagsTermsProcessor

- Track invalid URNs that couldn't be materialized
- Change invalid URN message from info to warning level

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Track number of datasets patched with lastModified from operation timestamps Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Track unexpected metadata types and status aspects emitted Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…sProcessorReport Move total_schema_aspects, schemas_with_duplicates, and duplicated_field_paths from instance variables to report class fields Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…cessorReport Move total_schema_aspects, schemas_with_empty_fields, and empty_field_paths from instance variables to report class fields Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…InputFieldsReport Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove unnecessary local variables and increment report fields directly since telemetry is sent only once after processing the entire stream Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update tests to use the create() classmethod instead of direct instantiation to properly initialize the report attribute Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary
This PR standardizes workunit processing across all DataHub ingestion sources by introducing a common processor interface with automatic discovery and pluggable observability.
Main Contributions
1. Standardized Processor Interface
All processors now inherit from a common WorkunitProcessor base class with:
- A shared set of lifecycle methods (create(), should_enable(), process())
- Dependencies supplied through WorkunitProcessorContext
2. Pluggable Sub-Reports for Observability
Each processor can define its own typed report class for metrics:
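As a rough illustration of per-processor sub-reports, the sketch below nests a typed report under the source-level report. ValidateInputFieldsReport and num_input_fields_filtered are named in this PR's commits; modelling workunit_processor_reports as a dict keyed by processor class name is an assumption, and the `SourceReport` here is a stripped-down stand-in.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class WorkunitProcessorReport:
    pass


@dataclass
class ValidateInputFieldsReport(WorkunitProcessorReport):
    num_input_fields_filtered: int = 0


@dataclass
class SourceReport:  # stand-in; the real class has many more fields
    # Assumed shape: one sub-report per processor, keyed by class name.
    workunit_processor_reports: Dict[str, WorkunitProcessorReport] = field(
        default_factory=dict
    )


source_report = SourceReport()
sub_report = ValidateInputFieldsReport()
source_report.workunit_processor_reports["ValidateInputFieldsProcessor"] = sub_report

# The processor updates only its own counters.
sub_report.num_input_fields_filtered += 3

# Serializing the source report picks up every nested sub-report.
serialized = asdict(source_report)
```

Keeping counters on the sub-report means adding a metric to one processor does not require touching the shared source report class.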
3. Automatic Processor Discovery
Eliminated ~72 source overrides of get_workunit_processors(). Sources no longer need to manually instantiate processors; they are auto-discovered and enabled based on:
- should_enable() conditions (e.g., config flags)
Before:
After:
Observability
Startup logs show which processors are enabled:
Report structure with processor-specific metrics:
Architecture
New module structure:
- datahub.ingestion.api.workunit_processor - Base class and context
- datahub.ingestion.workunit_processors/ - All processor implementations:
  - auto_* - Enrichment processors (add metadata)
  - validate_* - Validation processors (remove invalid data)
  - ensure_* - Enforcement processors (enforce constraints)
Processor lifecycle:
1. Evaluate should_enable() conditions
2. Instantiate via create() with dependency injection
3. Run the process() pipeline
Implementation Details
- Processors are generic over their report type (WorkunitProcessor[ReportT])
- Report classes are resolved with the get_class_from_annotation() utility
- Documented in docs/how/updating-datahub.md
Testing
https://claude.ai/code/session_0171jAojWBsvNMDz7havHgDy