Open
Labels
- feature-request: Request for a new feature to be added
- ingestion: PR or Issue related to the ingestion of metadata
Description
Summary
Add native support for DataFlow-to-DataFlow (pipeline-to-pipeline) lineage in the DataHub model, enabling direct lineage edges between orchestration pipelines without requiring intermediate DataJob entities.
Background
In PR #15499 (Azure Data Factory connector), we implemented pipeline-to-pipeline lineage using the existing DataJobInputOutput.inputDatajobs field. This approach works by:
- Creating a DataJob for the `ExecutePipeline` activity (parent)
- Creating a DataJob for the first activity in the child pipeline
- Connecting them via the `inputDatajobs` relationship
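The three steps above can be sketched roughly as follows. This is a dependency-free illustration, not the connector's actual code: the two helper functions are local stand-ins that only reproduce the DataHub URN layout, and the platform id `azure-data-factory`, the pipeline and activity names, and the choice to record the parent job on the child's `inputDatajobs` are all assumptions made for the example.

```python
# Local stand-ins that build URN strings in the DataHub layout.
def make_data_flow_urn(orchestrator: str, flow_id: str, cluster: str) -> str:
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def make_data_job_urn(flow_urn: str, job_id: str) -> str:
    return f"urn:li:dataJob:({flow_urn},{job_id})"


ORCHESTRATOR = "azure-data-factory"  # assumed platform id

# Hypothetical pipelines (DataFlows).
parent_flow_urn = make_data_flow_urn(ORCHESTRATOR, "ParentPipeline", "PROD")
child_flow_urn = make_data_flow_urn(ORCHESTRATOR, "ChildPipeline", "PROD")

# Step 1: DataJob for the ExecutePipeline activity in the parent.
parent_job_urn = make_data_job_urn(parent_flow_urn, "ExecuteChild")

# Step 2: DataJob for the first activity in the child pipeline.
child_first_job_urn = make_data_job_urn(child_flow_urn, "CopyData")

# Step 3: connect them. Assumed direction: the child's first activity
# lists the parent's ExecutePipeline job as an upstream DataJob.
child_input_output = {
    "inputDatasets": [],
    "outputDatasets": [],
    "inputDatajobs": [parent_job_urn],
}
```

Note that neither DataFlow URN appears in the edge itself; the pipeline-level relationship is only implied by the flow URNs embedded in the two DataJob URNs.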
While this approach is functional and displays correctly in the lineage graph UI, it has limitations:
- Requires routing through DataJob entities even when the logical relationship is between DataFlows (pipelines)
- Users searching for "which pipelines call this pipeline" must understand the underlying DataJob representation
- Other orchestrators (Airflow, Dagster, Prefect) may have similar parent-child DAG/pipeline relationships
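To illustrate the second limitation, here is a minimal sketch of what a consumer has to do today to answer "which pipelines call this pipeline": walk the DataJob-level edges and project them back onto their owning DataFlows. The function names and sample URNs are hypothetical, and the parsing assumes job ids contain no commas.

```python
from collections import defaultdict

JOB_PREFIX = "urn:li:dataJob:("


def flow_of(job_urn: str) -> str:
    """Extract the owning DataFlow URN from a DataJob URN."""
    if not (job_urn.startswith(JOB_PREFIX) and job_urn.endswith(")")):
        raise ValueError(f"not a DataJob URN: {job_urn}")
    inner = job_urn[len(JOB_PREFIX):-1]
    # The job id is the last comma-separated component (assumes no commas in it).
    flow_urn, _job_id = inner.rsplit(",", 1)
    return flow_urn


def callers_by_flow(input_datajobs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each DataFlow to the set of other DataFlows whose jobs feed into it."""
    callers: dict[str, set[str]] = defaultdict(set)
    for job_urn, upstream_jobs in input_datajobs.items():
        child_flow = flow_of(job_urn)
        for upstream in upstream_jobs:
            parent_flow = flow_of(upstream)
            if parent_flow != child_flow:  # skip edges inside one pipeline
                callers[child_flow].add(parent_flow)
    return dict(callers)


PARENT = "urn:li:dataFlow:(azure-data-factory,ParentPipeline,PROD)"
CHILD = "urn:li:dataFlow:(azure-data-factory,ChildPipeline,PROD)"

# Job-level edges: downstream job URN -> upstream job URNs.
edges = {
    f"urn:li:dataJob:({CHILD},CopyData)": [f"urn:li:dataJob:({PARENT},ExecuteChild)"],
    f"urn:li:dataJob:({CHILD},Transform)": [f"urn:li:dataJob:({CHILD},CopyData)"],
}

flow_callers = callers_by_flow(edges)
```

With a native DataFlow-to-DataFlow edge, this projection step would not be needed.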
Proposed Enhancement
Add a new relationship type to the DataHub model:
DataFlow → calls → DataFlow
This could be modeled as:
- A new `inputDataFlows`/`outputDataFlows` field on the `DataFlowInfo` aspect
- Or a new `DataFlowLineage` aspect similar to `UpstreamLineage` for datasets
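As a rough illustration of the second option, the sketch below models a possible `DataFlowLineage` aspect with plain dataclasses. Nothing here exists in DataHub today: the class names, the `upstreams` field, and the `relationship` label are hypothetical and only meant to show the shape such an aspect might take.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UpstreamDataFlow:
    """Hypothetical entry: one DataFlow that calls the owning DataFlow."""
    data_flow: str  # URN of the calling (parent) DataFlow
    relationship: str = "calls"


@dataclass
class DataFlowLineage:
    """Hypothetical aspect attached to the called (child) DataFlow."""
    upstreams: list[UpstreamDataFlow] = field(default_factory=list)

    def upstream_urns(self) -> list[str]:
        return [u.data_flow for u in self.upstreams]


# Hypothetical example: ParentPipeline calls ChildPipeline.
parent_urn = "urn:li:dataFlow:(azure-data-factory,ParentPipeline,PROD)"
child_lineage = DataFlowLineage(upstreams=[UpstreamDataFlow(data_flow=parent_urn)])
```

The first option would carry the same information as two URN lists on `DataFlowInfo`; a separate aspect keeps lineage independently writable, which is one reason it might be preferred.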
Use Cases
- Azure Data Factory: `ExecutePipeline` activity calls child pipelines
- Airflow: `TriggerDagRunOperator` triggers other DAGs
- Dagster: Jobs can invoke other jobs
- Prefect: Flows can call sub-flows
- dbt: Projects can reference other projects
Current Workaround
The ADF connector uses DataJobInputOutput.inputDatajobs between:
- Parent: `ExecutePipeline` activity DataJob
- Child: First activity of the child pipeline DataJob
Custom properties provide additional context:
- `calls_pipeline`: Name of the child pipeline
- `child_pipeline_urn`: URN of the child DataFlow
- `child_first_activity`: Name of the first activity in the child
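For reference, a small sketch of how these three properties might be assembled. The helper name and the sample values are hypothetical; only the three keys come from the list above.

```python
def call_properties(
    child_pipeline_name: str,
    child_flow_urn: str,
    child_first_activity: str,
) -> dict[str, str]:
    """Build the custom properties that describe a pipeline call."""
    return {
        "calls_pipeline": child_pipeline_name,
        "child_pipeline_urn": child_flow_urn,
        "child_first_activity": child_first_activity,
    }


# Hypothetical values for illustration only.
props = call_properties(
    child_pipeline_name="ChildPipeline",
    child_flow_urn="urn:li:dataFlow:(azure-data-factory,ChildPipeline,PROD)",
    child_first_activity="CopyData",
)
```

Because these are free-form string properties rather than typed relationships, they add context for readers but are not traversable as lineage edges.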
Related
- PR #15499 (feat(azure-data-factory): add Azure Data Factory connector): Azure Data Factory connector implementation
- Potential impact on other orchestrator connectors