Proposal: Version dataset schemas separately to datasets #2676

Open
@davidjgoss

Description

Problem

Many real-world jobs have these characteristics:

  • Runs very frequently
  • Output dataset schema(s) change infrequently

Because Marquez's philosophy is to version every output dataset on every run, jobs like this generate a lot of data in the dataset_versions_field_mapping table that is, for all intents and purposes, duplicate.

Taking an oversimplified example: a dataset with a stable 20-column schema, written to by a job that runs every 10 minutes for 30 days, will accumulate 86,400 rows in dataset_versions_field_mapping (4,320 runs × 20 fields) when arguably there are only 20 meaningful rows of data. This is regardless of how many of those columns are actually touched by the job.
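The arithmetic behind that figure:

```python
# Rows written to dataset_versions_field_mapping in the example above.
runs = (30 * 24 * 60) // 10   # one run every 10 minutes for 30 days -> 4,320 runs
columns = 20                  # stable schema width
rows = runs * columns
print(rows)                   # 86400
```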

A dataset must still be versioned each time it appears as an output, since it may carry facets describing the state of its structure and/or data at that moment in time beyond just the schema - including column-level lineage. But the schema itself doesn't need to be re-versioned unless it has actually changed.

Solution

Introduce the concept of a dataset_schema_version, so that:

  • When a dataset is first created (or, the next time it's versioned after we introduce this change), a dataset_schema_version is created which is referenced by the dataset_version in a new column.
  • A new mapping table dataset_schema_versions_field_mapping links a dataset_schema_version to each dataset_field.
  • When the dataset is next versioned, we compare the current version's schema to the incoming schema - if they are equal, then we re-use the same dataset_schema_version for this new dataset_version and don't create any new field mappings.
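The write path above could be sketched roughly as follows. This is a minimal in-memory illustration of the proposed behavior; all names (Store, SchemaField, version_dataset) are hypothetical, not actual Marquez code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaField:
    name: str
    type: str  # description intentionally excluded from equality (see considerations)

class Store:
    def __init__(self):
        self.schema_versions = []        # (schema_version_id, fields)
        self.schema_field_mappings = []  # dataset_schema_versions_field_mapping rows
        self.dataset_versions = []       # (dataset_version_id, schema_version_id)

    def version_dataset(self, incoming_fields):
        incoming = tuple(incoming_fields)
        if self.schema_versions and self.schema_versions[-1][1] == incoming:
            # Schema unchanged: re-use the current schema version and
            # write no new field mappings.
            sv_id = self.schema_versions[-1][0]
        else:
            sv_id = len(self.schema_versions) + 1
            self.schema_versions.append((sv_id, incoming))
            self.schema_field_mappings += [(sv_id, f) for f in incoming]
        # A dataset version is still created on every run, referencing
        # the schema version via the proposed new column.
        dv_id = len(self.dataset_versions) + 1
        self.dataset_versions.append((dv_id, sv_id))
        return dv_id, sv_id
```

Running this for three identical 20-field schemas yields three dataset versions but a single schema version and only 20 mapping rows, rather than 60.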

Considerations:

  • How do we determine if the schemas are "equal"? Certainly the same fields must be present (by name and type), and in the same order - what about description? Looking at the way dataset_fields is keyed, a description change isn't treated as a new field, so I don't think description should factor into the equality check.
  • As it stands, this won't change anything for column_lineage, which suffers from a similar problem to dataset_versions_field_mapping: new rows every run even when the lineage doesn't change. We could look at this separately.
  • Which read queries will we need to update for this? How can we keep it backwards compatible and (ideally) avoid copying data between tables? Would we eventually have a plan to retire the old table?
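One possible reading of the equality rule from the first bullet - fields must match by name and type, in the same order, with description differences ignored - could be sketched as a hypothetical helper (not Marquez code; the dict shape loosely mirrors a schema facet's fields):

```python
def schemas_equal(a, b):
    """Order-sensitive comparison of two field lists on (name, type) only.

    a and b are lists of dicts with 'name', 'type', and optionally
    'description' keys; description changes do not trigger a new
    dataset_schema_version.
    """
    key = lambda fields: [(f["name"], f["type"]) for f in fields]
    return key(a) == key(b)
```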

cc @wslulciuc @phixMe

Metadata

Labels

db, db.perf (This issue or pull request improves DB performance), proposal

Projects

  • Status: In Progress
