MobileTeleSystems/data-rentgen

What is Data.Rentgen?

Data.Rentgen is a Data Motion Lineage service, compatible with the OpenLineage specification.

Note: the service is under active development and is not ready for use yet.

Goals

  • Collect lineage events produced by OpenLineage clients and integrations.
  • Store operation-grained events for finer detail (instead of job-grained events, as in Marquez).
  • Provide an API for fetching job/run ↔ dataset lineage, rather than the dataset ↔ dataset lineage offered by Datahub and OpenMetadata.
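To make the job/run ↔ dataset idea concrete, here is a minimal sketch in plain Python. It builds a dict shaped like an OpenLineage run event (field names follow the OpenLineage specification) and derives run-centric edges from it. The helper names `build_run_event` and `event_to_edges`, the producer URL, and the dataset names are hypothetical and chosen for illustration; this is not Data.Rentgen's own code or API.

```python
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Build a minimal OpenLineage-shaped run event as a plain dict.

    `inputs` and `outputs` are lists of (namespace, name) tuples.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URL, used only for illustration.
        "producer": "https://example.com/hypothetical-producer",
        # Illustrative spec version; use the one your client actually emits.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


def event_to_edges(event):
    """Turn one event into run-centric edges: dataset -> run and run -> dataset."""
    run = event["run"]["runId"]
    edges = []
    for ds in event["inputs"]:
        edges.append((f'{ds["namespace"]}/{ds["name"]}', run))
    for ds in event["outputs"]:
        edges.append((run, f'{ds["namespace"]}/{ds["name"]}'))
    return edges


event = build_run_event(
    "COMPLETE",
    job_namespace="spark://cluster",
    job_name="daily_load",
    inputs=[("hdfs://nn", "/raw/orders")],
    outputs=[("hive://metastore", "mart.orders")],
    run_id="run-1",
)
edges = event_to_edges(event)
```

Every edge has a run on one side, so two datasets are only ever connected through the run that read one and wrote the other.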

Features

  • Supports consuming large volumes of lineage events, using Apache Kafka as an event buffer.
  • Stores data in tables partitioned by event timestamp, to speed up lineage graph resolution.
  • The lineage graph is built within user-specified time boundaries (unlike Marquez, where lineage is built only for the last job run).
  • The lineage graph can be built with different granularity, e.g. by merging all individual Spark operations into a Spark applicationId or Spark applicationName.
  • Column-level lineage support.
  • Authentication support.
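The partitioning, time-boundary and granularity features can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: monthly partitions, the `OperationStore` class, the `lineage_edges` function and the record fields are all hypothetical, and none of it reflects Data.Rentgen's real storage schema or API.

```python
from collections import defaultdict
from datetime import datetime


def months_between(since, until):
    """Yield (year, month) partition keys covering [since, until]."""
    year, month = since.year, since.month
    while (year, month) <= (until.year, until.month):
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


class OperationStore:
    """Toy store that keeps operations in per-month partitions (an assumption)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def add(self, op):
        self._partitions[(op["at"].year, op["at"].month)].append(op)

    def between(self, since, until):
        # Only partitions overlapping the requested window are scanned.
        for key in months_between(since, until):
            for op in self._partitions.get(key, []):
                if since <= op["at"] < until:
                    yield op


def lineage_edges(store, since, until, granularity="operation"):
    """Build edges within the time window, naming nodes by the chosen field."""
    edges = set()
    for op in store.between(since, until):
        node = op[granularity]
        edges.update((ds, node) for ds in op["inputs"])
        edges.update((node, ds) for ds in op["outputs"])
    return edges


store = OperationStore()
store.add({"operation": "op-1", "application_id": "app-1", "at": datetime(2024, 1, 5),
           "inputs": ["hdfs://raw"], "outputs": ["hive://stage"]})
store.add({"operation": "op-2", "application_id": "app-1", "at": datetime(2024, 1, 6),
           "inputs": ["hive://stage"], "outputs": ["hive://mart"]})
store.add({"operation": "op-3", "application_id": "app-2", "at": datetime(2024, 3, 1),
           "inputs": ["hive://mart"], "outputs": ["hive://report"]})

window = (datetime(2024, 1, 1), datetime(2024, 2, 1))
by_operation = lineage_edges(store, *window)
by_application = lineage_edges(store, *window, granularity="application_id")
```

Here the March operation is never touched by the January query, and switching `granularity` collapses `op-1` and `op-2` into the single node `app-1` without changing which datasets appear.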

Non-goals

  • This is not a Data Catalog. Data.Rentgen doesn't track dataset schema changes, owners and so on. Use Datahub or OpenMetadata instead.
  • Static data lineage, such as view → table, is not supported.

Limitations

  • For now, only Apache Spark, Apache Airflow, Apache Flink and DBT are supported as lineage event sources. OpenLineage also supports Hive, Trino and other lineage sources; Data.Rentgen support for them may be added later.
  • Unlike Marquez, Data.Rentgen parses only a limited set of the facets sent by OpenLineage, and doesn't store custom facets. This may change in the future.

Documentation

See https://data-rentgen.readthedocs.io/

Screenshots

Lineage graph

Dataset downstream lineage

Dataset downstream lineage graph

Dataset upstream lineage

Dataset upstream lineage graph

Direct column-level lineage

Dataset direct column-level lineage graph

Indirect column-level lineage

Dataset indirect column-level lineage graph

Datasets

Datasets list

Runs

Runs list

Spark application

Spark application details

Spark run

Spark run details

Spark operation

Spark operation details

Airflow DagRun

Airflow DagRun details

Airflow TaskInstance

Airflow TaskInstance details
