Data.Rentgen is a Data Motion Lineage service, compatible with the OpenLineage specification.
Note: the service is under active development and is not ready for use yet.
- Collect lineage events produced by OpenLineage clients & integrations.
- Store operation-level events for finer detail (unlike Marquez, which is job-grained).
- Provide an API for fetching job/run ↔ dataset lineage, rather than the dataset ↔ dataset lineage offered by Datahub and OpenMetadata.
- Support consuming large amounts of lineage events, using Apache Kafka as an event buffer.
- Store data in tables partitioned by event timestamp to speed up lineage graph resolution.
- Lineage graph is built within user-specified time boundaries (unlike Marquez, where lineage is built only for the last job run).
- Lineage graph can be built with different granularity, e.g. by merging all individual Spark operations into a Spark applicationId or Spark applicationName.
- Column-level lineage support.
- Authentication support.
- This is not a Data Catalog. DataRentgen doesn't track dataset schema changes, owners and so on. Use Datahub or OpenMetadata instead.
- Static data lineage, like view → table, is not supported.
- For now, only Apache Spark, Apache Airflow, Apache Flink and DBT are supported as lineage event sources. OpenLineage also supports Hive, Trino and other lineage sources; DataRentgen support for them may be added later.
- Unlike Marquez, DataRentgen parses only a limited set of the facets sent by OpenLineage and doesn't store custom facets. This may change in the future.
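The time boundaries and granularity options listed above can be illustrated with a small, self-contained sketch. It is not DataRentgen's implementation or API: the `OperationEvent` record, the `build_lineage` function and the granularity names are hypothetical, chosen only to show how operation-level events could be filtered by a time window and merged into coarser nodes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OperationEvent:
    """One operation-level lineage record (hypothetical shape, for illustration only)."""

    application_name: str  # e.g. Spark applicationName
    application_id: str    # e.g. Spark applicationId
    operation: str         # individual operation inside the application
    dataset: str
    direction: str         # "INPUT" or "OUTPUT"
    event_time: datetime


def build_lineage(events, since, until, granularity="operation"):
    """Return the set of (node, direction, dataset) edges with since <= event_time < until.

    Coarser granularity merges several operations into one node, so duplicate
    edges collapse automatically because the result is a set.
    """
    node_key = {
        "operation": lambda e: f"{e.application_id}/{e.operation}",
        "application_id": lambda e: e.application_id,
        "application_name": lambda e: e.application_name,
    }[granularity]
    return {
        (node_key(e), e.direction, e.dataset)
        for e in events
        if since <= e.event_time < until
    }


def ts(day, hour):
    """Helper: a UTC timestamp in January 2024 (made-up sample dates)."""
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


# Made-up sample data: the same application name runs twice, on two days.
events = [
    OperationEvent("etl", "app-001", "op-1", "hdfs://raw/orders", "INPUT", ts(1, 10)),
    OperationEvent("etl", "app-001", "op-1", "hive://dwh.orders", "OUTPUT", ts(1, 10)),
    OperationEvent("etl", "app-001", "op-2", "hdfs://raw/orders", "INPUT", ts(1, 11)),
    OperationEvent("etl", "app-001", "op-2", "hive://dwh.totals", "OUTPUT", ts(1, 11)),
    OperationEvent("etl", "app-002", "op-1", "hdfs://raw/orders", "INPUT", ts(2, 10)),
    OperationEvent("etl", "app-002", "op-1", "hive://dwh.orders", "OUTPUT", ts(2, 10)),
]

by_operation = build_lineage(events, ts(1, 0), ts(3, 0), "operation")
by_app_id = build_lineage(events, ts(1, 0), ts(3, 0), "application_id")
by_app_name = build_lineage(events, ts(1, 0), ts(3, 0), "application_name")
first_day_only = build_lineage(events, ts(1, 0), ts(2, 0), "application_id")
```

With this sample data the graph shrinks as granularity gets coarser, and narrowing the window to the first day leaves only the edges of `app-001`. Filtering on `event_time` is also why partitioning storage by event timestamp helps: a bounded query only needs the partitions that overlap the requested window.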
See https://data-rentgen.readthedocs.io/
Dataset downstream lineage
Dataset upstream lineage
Direct column-level lineage
Indirect column-level lineage
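As a rough illustration of what "upstream" and "downstream" mean for a dataset in a job ↔ dataset graph, here is a minimal sketch. The edge list, node names and the `walk` function are hypothetical and do not reflect DataRentgen's API or storage model; an edge `dataset → job` means the job reads the dataset, and `job → dataset` means the job writes it.

```python
from collections import defaultdict, deque

# Made-up lineage edges (source, target), alternating datasets and jobs.
EDGES = [
    ("dataset:raw.orders", "job:clean_orders"),
    ("job:clean_orders", "dataset:dwh.orders"),
    ("dataset:dwh.orders", "job:daily_report"),
    ("job:daily_report", "dataset:mart.report"),
]


def walk(start, edges, direction):
    """Breadth-first traversal returning nodes reachable from `start`.

    direction="downstream" follows edges forward (who consumes this data),
    direction="upstream" follows them backward (where this data came from).
    """
    if direction not in ("downstream", "upstream"):
        raise ValueError(f"unknown direction: {direction!r}")

    adjacency = defaultdict(list)
    for source, target in edges:
        if direction == "downstream":
            adjacency[source].append(target)
        else:
            adjacency[target].append(source)

    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


downstream = walk("dataset:dwh.orders", EDGES, "downstream")
upstream = walk("dataset:dwh.orders", EDGES, "upstream")
```

Here `downstream` contains the report job and the dataset it writes, while `upstream` contains the cleaning job and the raw dataset it reads. The `visited` set keeps the traversal finite even if the graph contains cycles.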