
Proposal: Version dataset schemas separately to datasets #2676

Open
@davidjgoss

Description


Problem

Many real-world jobs share two characteristics:

  • They run very frequently
  • Their output dataset schema(s) change infrequently

Because Marquez's philosophy is to version every output dataset on every run, a large amount of data that is for all intents and purposes duplicate accumulates in the dataset_versions_field_mapping table.

Taking an oversimplified example: a dataset with a stable 20-column schema, written to by a job that runs every 10 minutes for 30 days, will result in 86,400 rows in dataset_versions_field_mapping (4,320 runs × 20 columns) when arguably there are only 20 meaningful rows of data. This is regardless of how many of those columns are actually touched by the job.
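The arithmetic behind that figure can be checked directly; the run cadence and column count are taken from the example above:

```python
# Rows written to dataset_versions_field_mapping for the example job:
# one mapping row per column, per dataset version, per run.
runs_per_day = 24 * 60 // 10   # a run every 10 minutes -> 144 runs/day
runs = runs_per_day * 30       # over 30 days -> 4,320 runs
columns = 20                   # stable 20-column schema

mapping_rows = runs * columns
print(mapping_rows)            # 86400 rows, vs. 20 meaningful ones
```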

A dataset must still be versioned each time it occurs as an output, since it may contain facets that relate to the state of its structure and/or data at that moment in time beyond just the schema - including column-level lineage. But the schema itself doesn't need to be re-versioned unless it has actually changed.

Solution

Introduce the concept of a dataset_schema_version, so that:

  • When a dataset is first created (or, the next time it's versioned after we introduce this change), a dataset_schema_version is created which is referenced by the dataset_version in a new column.
  • A new mapping table dataset_schema_versions_field_mapping links a dataset_schema_version to each dataset_field.
  • When the dataset is next versioned, we compare the current version's schema to the incoming schema - if they are equal, then we re-use the same dataset_schema_version for this new dataset_version and don't create any new field mappings.
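The write path described above might look roughly like this. This is a sketch, not Marquez's actual code: the helper name and the in-memory stand-ins for the tables are hypothetical, but the table names follow the proposal.

```python
import uuid

# Hypothetical in-memory stand-ins for the proposed tables.
dataset_schema_versions = {}                 # schema_version_id -> tuple of (name, type)
dataset_schema_versions_field_mapping = []   # (schema_version_id, field_name)

def get_or_create_schema_version(current_schema_version_id, incoming_fields):
    """Re-use the current dataset_schema_version when the incoming schema
    equals it; otherwise create a new one with fresh field mappings."""
    incoming_key = tuple((f["name"], f["type"]) for f in incoming_fields)
    current = dataset_schema_versions.get(current_schema_version_id)
    if current == incoming_key:
        # Schemas match: no new dataset_schema_version, no new mapping rows.
        return current_schema_version_id
    new_id = str(uuid.uuid4())
    dataset_schema_versions[new_id] = incoming_key
    for name, _ in incoming_key:
        dataset_schema_versions_field_mapping.append((new_id, name))
    return new_id
```

Each new dataset_version would then reference the returned id in its new column, so the mapping table only grows when the schema actually changes.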

Considerations:

  • How do we determine whether two schemas are "equal"? Certainly the same fields must be present (by name and type), and in the same order - but what about description? Looking at the way dataset_fields is keyed, a description change isn't treated as a new field, so I don't think it should factor into the equality check.
  • As it stands, this won't change anything for column_lineage which suffers from a similar problem to dataset_versions_field_mapping in getting new rows every run even when the lineage doesn't change. We could look at this separately.
  • Which read queries will we need to update for this? How can we keep it backwards compatible and (ideally) avoid copying data between tables? Would we eventually have a plan to retire the old table?
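Under the equality rule suggested in the first consideration (same names and types, in the same order, descriptions ignored), the comparison could be as simple as the following sketch; the dict representation of a field is an assumption:

```python
def schemas_equal(a, b):
    """Compare two schemas field-by-field: names and types must match in
    order; descriptions are deliberately ignored, mirroring how
    dataset_fields is keyed (a description change is not a new field)."""
    key = lambda schema: [(f["name"], f["type"]) for f in schema]
    return key(a) == key(b)
```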

cc @wslulciuc @phixMe

Metadata

Labels

db, db.perf (This issue or pull request improves DB performance), proposal

Status

In Progress
