[Feature Request] First-class DataFlow-to-DataFlow lineage support for orchestrators #15534

@askumar27

Summary

Add native support for DataFlow-to-DataFlow (pipeline-to-pipeline) lineage in the DataHub model, enabling direct lineage edges between orchestration pipelines without requiring intermediate DataJob entities.

Background

In PR #15499 (Azure Data Factory connector), we implemented pipeline-to-pipeline lineage using the existing DataJobInputOutput.inputDatajobs field. This approach works by:

  1. Creating a DataJob for the ExecutePipeline activity (parent)
  2. Creating a DataJob for the first activity in the child pipeline
  3. Connecting them via the inputDatajobs relationship
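
As a rough illustration of the three steps above, here is a minimal, SDK-free sketch. It only assumes the standard DataHub URN layouts for dataFlow and dataJob. The helper names (`flow_urn`, `job_urn`, `link_child_pipeline`), the `JobInputOutput` stand-in for the DataJobInputOutput aspect, the platform id `azure-data-factory`, the cluster `PROD`, and the pipeline/activity names are all hypothetical. It also assumes the child's first activity lists the parent activity as its upstream; the connector in PR #15499 may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def flow_urn(orchestrator: str, flow_id: str, cluster: str) -> str:
    # Standard DataHub dataFlow URN layout.
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def job_urn(parent_flow_urn: str, job_id: str) -> str:
    # A dataJob URN embeds the URN of the flow it belongs to.
    return f"urn:li:dataJob:({parent_flow_urn},{job_id})"


@dataclass
class JobInputOutput:
    # Hypothetical stand-in for the inputDatajobs field of DataJobInputOutput.
    input_datajobs: List[str] = field(default_factory=list)


def link_child_pipeline(
    parent_pipeline: str,
    execute_activity: str,
    child_pipeline: str,
    child_first_activity: str,
    orchestrator: str = "azure-data-factory",  # assumed platform id
    cluster: str = "PROD",                      # assumed cluster/env
) -> Dict[str, JobInputOutput]:
    # Step 1: DataJob for the ExecutePipeline activity in the parent.
    parent_job = job_urn(flow_urn(orchestrator, parent_pipeline, cluster), execute_activity)
    # Step 2: DataJob for the first activity in the child pipeline.
    child_job = job_urn(flow_urn(orchestrator, child_pipeline, cluster), child_first_activity)
    # Step 3: connect them; the child job lists the parent job as an input.
    return {
        parent_job: JobInputOutput(),
        child_job: JobInputOutput(input_datajobs=[parent_job]),
    }


aspects = link_child_pipeline("ingest_all", "RunTransform", "transform", "CopyRaw")
```

Note that the pipeline-level relationship (ingest_all calls transform) never appears directly; it is only implied by the flow URNs embedded in the two job URNs.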

While this approach is functional and displays correctly in the lineage graph UI, it has limitations:

  • Requires routing through DataJob entities even when the logical relationship is between DataFlows (pipelines)
  • Users searching for "which pipelines call this pipeline" must understand the underlying DataJob representation
  • Other orchestrators (Airflow, Dagster, Prefect) may have similar parent-child DAG/pipeline relationships
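
To make the second limitation concrete, the sketch below shows the extra hop a consumer has to perform today: answering "which pipelines call this pipeline" means walking job-level edges and mapping each job back to its flow. The function names, the regex-based URN parsing, and the sample URNs are illustrative assumptions, not part of DataHub; the parser only handles job ids without commas or parentheses.

```python
import re
from typing import Dict, List, Set

# dataJob URN layout: urn:li:dataJob:(<dataFlow urn>,<job id>)
_JOB_URN = re.compile(r"^urn:li:dataJob:\((urn:li:dataFlow:\(.*\)),([^,()]+)\)$")


def flow_of(job_urn: str) -> str:
    """Extract the owning dataFlow URN from a dataJob URN."""
    match = _JOB_URN.match(job_urn)
    if match is None:
        raise ValueError(f"not a dataJob URN: {job_urn}")
    return match.group(1)


def callers_of(target_flow: str, input_datajobs: Dict[str, List[str]]) -> Set[str]:
    """Return flows owning a job that is upstream of a job in target_flow.

    input_datajobs maps each job URN to the job URNs it lists as inputs.
    """
    callers: Set[str] = set()
    for job, upstream_jobs in input_datajobs.items():
        if flow_of(job) != target_flow:
            continue
        for upstream in upstream_jobs:
            upstream_flow = flow_of(upstream)
            # Edges inside the same pipeline are ordinary task ordering.
            if upstream_flow != target_flow:
                callers.add(upstream_flow)
    return callers


parent_flow = "urn:li:dataFlow:(azure-data-factory,ingest_all,PROD)"
child_flow = "urn:li:dataFlow:(azure-data-factory,transform,PROD)"
edges = {
    f"urn:li:dataJob:({child_flow},CopyRaw)": [f"urn:li:dataJob:({parent_flow},RunTransform)"],
    f"urn:li:dataJob:({child_flow},Aggregate)": [f"urn:li:dataJob:({child_flow},CopyRaw)"],
}
print(callers_of(child_flow, edges))
```

With a first-class DataFlow-to-DataFlow edge, this derivation would presumably collapse into a single relationship lookup.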

Proposed Enhancement

Add a new relationship type to the DataHub model:

DataFlow → calls → DataFlow

This could be modeled as:

  1. A new inputDataFlows / outputDataFlows field on the DataFlowInfo aspect
  2. Or a new DataFlowLineage aspect similar to UpstreamLineage for datasets
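
For discussion, the two options might look roughly like the following when sketched as Python dataclasses. Neither shape exists in DataHub today: every class and field name here (including `UpstreamDataFlow` and the `type` value `CALLS`) is hypothetical, and the real aspects would be defined in the metadata model rather than in Python.

```python
from dataclasses import dataclass, field
from typing import List


# Option 1 (hypothetical): extra fields on the existing DataFlowInfo aspect.
@dataclass
class DataFlowInfoWithLineage:
    name: str
    inputDataFlows: List[str] = field(default_factory=list)   # flows that call this one
    outputDataFlows: List[str] = field(default_factory=list)  # flows this one calls


# Option 2 (hypothetical): a dedicated aspect, loosely mirroring UpstreamLineage.
@dataclass
class UpstreamDataFlow:
    dataFlow: str        # URN of the upstream (calling) flow
    type: str = "CALLS"  # assumed edge type label


@dataclass
class DataFlowLineage:
    upstreams: List[UpstreamDataFlow] = field(default_factory=list)


parent = "urn:li:dataFlow:(azure-data-factory,ingest_all,PROD)"

# The same "ingest_all calls transform" edge, attached to the child flow.
option_1 = DataFlowInfoWithLineage(name="transform", inputDataFlows=[parent])
option_2 = DataFlowLineage(upstreams=[UpstreamDataFlow(dataFlow=parent)])
```

Option 1 keeps everything in one aspect but mixes descriptive and lineage data; option 2 leaves room for per-edge metadata such as an edge type.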

Use Cases

  1. Azure Data Factory: ExecutePipeline activity calls child pipelines
  2. Airflow: TriggerDagRunOperator triggers other DAGs
  3. Dagster: Jobs can invoke other jobs
  4. Prefect: Flows can call sub-flows
  5. dbt: Projects can reference other projects

Current Workaround

The ADF connector uses DataJobInputOutput.inputDatajobs between:

  • Parent: ExecutePipeline activity DataJob
  • Child: First activity of the child pipeline DataJob

Custom properties provide additional context:

  • calls_pipeline: Name of the child pipeline
  • child_pipeline_urn: URN of the child DataFlow
  • child_first_activity: Name of the first activity in the child
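
A minimal sketch of how these custom properties could be assembled and read back is shown below. The three keys come from the list above; the helper names and sample values are hypothetical, and the actual connector code may build them differently.

```python
from typing import Dict, Optional


def execute_pipeline_properties(
    child_pipeline_name: str,
    child_flow_urn: str,
    child_first_activity: str,
) -> Dict[str, str]:
    # Custom properties attached to the parent's ExecutePipeline DataJob.
    return {
        "calls_pipeline": child_pipeline_name,
        "child_pipeline_urn": child_flow_urn,
        "child_first_activity": child_first_activity,
    }


def called_flow(custom_properties: Dict[str, str]) -> Optional[str]:
    # Returns the child DataFlow URN, or None if the job calls no pipeline.
    return custom_properties.get("child_pipeline_urn")


props = execute_pipeline_properties(
    "transform",
    "urn:li:dataFlow:(azure-data-factory,transform,PROD)",
    "CopyRaw",
)
print(called_flow(props))
```

Because custom properties are plain strings, the child URN stored here is not a traversable relationship, which is part of the motivation for a first-class edge.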

Labels: feature-request, ingestion
