Description
Problem
Many real-world jobs have these characteristics:
- Runs very frequently
- Output dataset schema(s) change infrequently
Because Marquez versions every output dataset on every run, a lot of data that is for all intents and purposes duplicate is generated in the `dataset_versions_field_mapping` table.
Taking an oversimplified example, a dataset with a stable 20-column schema that is written to by a job that runs every 10 minutes for 30 days will result in 86,400 rows in `dataset_versions_field_mapping` (4,320 runs × 20 columns), when arguably there are only 20 meaningful rows of data. This is regardless of how many of those columns are actually touched by the job.
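For reference, the row count in this example can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic under the assumptions stated above (20 columns, one run every 10 minutes, 30 days); the function name is made up for illustration and is not part of Marquez.

```python
def field_mapping_rows(columns: int, interval_minutes: int, days: int) -> int:
    """Rows written when every run re-maps every column of the output dataset."""
    runs = days * 24 * 60 // interval_minutes  # number of job runs in the period
    return runs * columns


rows_per_run_versioning = field_mapping_rows(columns=20, interval_minutes=10, days=30)
rows_if_schema_reused = 20  # one mapping row per column while the schema is unchanged

print(rows_per_run_versioning)  # 4,320 runs x 20 columns = 86400
print(rows_per_run_versioning // rows_if_schema_reused)  # 4320x more rows than needed
```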
A dataset must still be versioned each time it occurs as an output, since it may contain facets that relate to the state of its structure and/or data at that moment in time beyond just the schema - including column-level lineage. But the schema itself doesn't need to be re-versioned unless it has actually changed.
Solution
Introduce the concept of a `dataset_schema_version`, so that:
- When a dataset is first created (or, the next time it's versioned after we introduce this change), a `dataset_schema_version` is created which is referenced by the `dataset_version` in a new column.
- A new mapping table `dataset_schema_versions_field_mapping` links a `dataset_schema_version` to each `dataset_field`.
- When the dataset is next versioned, we compare the current version's schema to the incoming schema. If they are equal, then we re-use the same `dataset_schema_version` for this new `dataset_version` and don't create any new field mappings.
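The flow described above could be sketched roughly as follows. This is an in-memory toy model, not Marquez code: the `Store` class, its attributes, and `version_dataset` are hypothetical names standing in for the tables and write path discussed here, and a field is reduced to a `(name, type)` pair.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

SchemaField = Tuple[str, str]  # (name, type); hypothetical simplification


@dataclass
class Store:
    """Hypothetical in-memory stand-in for the tables named in this proposal."""
    # schema_version_id -> ordered fields
    schema_versions: Dict[str, Tuple[SchemaField, ...]] = field(default_factory=dict)
    # rows of (schema_version_id, field), like dataset_schema_versions_field_mapping
    schema_field_mapping: List[Tuple[str, SchemaField]] = field(default_factory=list)
    # rows of (dataset_version_id, dataset_name, schema_version_id)
    dataset_versions: List[Tuple[str, str, str]] = field(default_factory=list)
    # dataset_name -> schema_version_id of its latest version
    current_schema: Dict[str, str] = field(default_factory=dict)

    def version_dataset(self, dataset: str, fields: List[SchemaField]) -> str:
        incoming = tuple(fields)
        current_id = self.current_schema.get(dataset)
        if current_id is not None and self.schema_versions[current_id] == incoming:
            # Schema unchanged: re-use the schema version, add no field mappings.
            schema_id = current_id
        else:
            # First version or changed schema: create a schema version and its mappings.
            schema_id = str(uuid.uuid4())
            self.schema_versions[schema_id] = incoming
            self.schema_field_mapping.extend((schema_id, f) for f in incoming)
            self.current_schema[dataset] = schema_id
        # A dataset version is still recorded on every run.
        self.dataset_versions.append((str(uuid.uuid4()), dataset, schema_id))
        return schema_id


store = Store()
schema = [("id", "INTEGER"), ("name", "VARCHAR")]
first = store.version_dataset("orders", schema)
second = store.version_dataset("orders", schema)  # same schema, re-used
third = store.version_dataset("orders", schema + [("ts", "TIMESTAMP")])  # changed
```

In this sketch three dataset versions are recorded, but only two schema versions and five field-mapping rows exist (two for the first schema, three for the second).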
Considerations:
- How do we determine if the schemas are "equal"? Certainly the same fields must be there (by name and type), and in the same order. What about description? Looking at the way `dataset_fields` is keyed, a description change isn't treated as a new field, so I don't think this should factor into the `equals`.
- As it stands, this won't change anything for `column_lineage`, which suffers from a similar problem to `dataset_versions_field_mapping` in getting new rows every run even when the lineage doesn't change. We could look at this separately.
- Which read queries will we need to update for this? How can we keep it backwards compatible and (ideally) avoid copying data between tables? Would we eventually have a plan to retire the old table?
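To make the first consideration concrete, here is one possible way to express schema equality: a fingerprint computed over the ordered field names and types, ignoring descriptions. Using a hash is an assumption made for this sketch; the issue itself only says what should and shouldn't count towards equality. `DatasetField` and `schema_fingerprint` are hypothetical names.

```python
import hashlib
from typing import Iterable, NamedTuple, Optional


class DatasetField(NamedTuple):
    name: str
    type: str
    description: Optional[str] = None  # deliberately excluded from equality


def schema_fingerprint(fields: Iterable[DatasetField]) -> str:
    """Hash the ordered (name, type) pairs; two schemas are 'equal' if these match."""
    digest = hashlib.sha256()
    for f in fields:
        for part in (f.name, f.type):
            data = part.encode("utf-8")
            # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
            digest.update(len(data).to_bytes(4, "big"))
            digest.update(data)
    return digest.hexdigest()


base = [DatasetField("id", "INTEGER"), DatasetField("name", "VARCHAR")]
redescribed = [DatasetField("id", "INTEGER", "primary key"), DatasetField("name", "VARCHAR")]
reordered = list(reversed(base))
retyped = [DatasetField("id", "BIGINT"), DatasetField("name", "VARCHAR")]

same_despite_description = schema_fingerprint(base) == schema_fingerprint(redescribed)
differs_on_order = schema_fingerprint(base) != schema_fingerprint(reordered)
differs_on_type = schema_fingerprint(base) != schema_fingerprint(retyped)
```

With this definition a description change re-uses the existing schema version, while a reordered or retyped field produces a new one.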
Status: In Progress