
Proposal: Version dataset schemas separately to datasets #2676

Open
@davidjgoss

Description


Problem

Many real-world jobs share two characteristics:

  • They run very frequently
  • Their output dataset schema(s) change infrequently

Because Marquez's philosophy is to version every output dataset on every run, a large amount of data that is for all intents and purposes duplicate accumulates in the dataset_versions_field_mapping table.

Taking an oversimplified example: a dataset with a stable 20-column schema, written to by a job that runs every 10 minutes for 30 days, will result in 86,400 rows in dataset_versions_field_mapping (4,320 runs × 20 columns) when arguably there are only 20 meaningful rows of data. This is regardless of how many of those columns are actually touched by the job.
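The arithmetic behind that figure can be checked directly; the run cadence and column count are taken from the example above:

```python
# Rows written to dataset_versions_field_mapping for the example job:
# one mapping row per column, per dataset version, per run.
runs_per_day = 24 * 60 // 10   # a run every 10 minutes -> 144 runs/day
runs = runs_per_day * 30       # over 30 days -> 4,320 runs
columns = 20                   # stable 20-column schema

mapping_rows = runs * columns
print(mapping_rows)            # 86400 rows, vs. 20 meaningful ones
```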

A dataset must still be versioned each time it occurs as an output, since it may contain facets that relate to the state of its structure and/or data at that moment in time beyond just the schema - including column-level lineage. But the schema itself doesn't need to be re-versioned unless it has actually changed.

Solution

Introduce the concept of a dataset_schema_version, so that:

  • When a dataset is first created (or, the next time it's versioned after we introduce this change), a dataset_schema_version is created which is referenced by the dataset_version in a new column.
  • A new mapping table dataset_schema_versions_field_mapping links a dataset_schema_version to each dataset_field.
  • When the dataset is next versioned, we compare the current version's schema to the incoming schema - if they are equal, then we re-use the same dataset_schema_version for this new dataset_version and don't create any new field mappings.
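The write path described above might look roughly like this. This is a sketch, not Marquez's actual code: the helper name and the in-memory stand-ins for the tables are hypothetical, but the table names follow the proposal.

```python
import uuid

# Hypothetical in-memory stand-ins for the proposed tables.
dataset_schema_versions = {}                 # schema_version_id -> tuple of (name, type)
dataset_schema_versions_field_mapping = []   # (schema_version_id, field_name)

def get_or_create_schema_version(current_schema_version_id, incoming_fields):
    """Re-use the current dataset_schema_version when the incoming schema
    equals it; otherwise create a new one with fresh field mappings."""
    incoming_key = tuple((f["name"], f["type"]) for f in incoming_fields)
    current = dataset_schema_versions.get(current_schema_version_id)
    if current == incoming_key:
        # Schemas match: no new dataset_schema_version, no new mapping rows.
        return current_schema_version_id
    new_id = str(uuid.uuid4())
    dataset_schema_versions[new_id] = incoming_key
    for name, _ in incoming_key:
        dataset_schema_versions_field_mapping.append((new_id, name))
    return new_id
```

Each new dataset_version would then reference the returned id in its new column, so the mapping table only grows when the schema actually changes.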

Considerations:

  • How do we determine whether two schemas are "equal"? Certainly the same fields must be present (by name and type), and in the same order - but what about description? Looking at the way dataset_fields is keyed, a description change isn't treated as a new field, so I don't think it should factor into the equality check.
  • As it stands, this won't change anything for column_lineage which suffers from a similar problem to dataset_versions_field_mapping in getting new rows every run even when the lineage doesn't change. We could look at this separately.
  • Which read queries will we need to update for this? How can we keep it backwards compatible and (ideally) avoid copying data between tables? Would we eventually have a plan to retire the old table?
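Under the equality rule suggested in the first consideration (same names and types, in the same order, descriptions ignored), the comparison could be as simple as the following sketch; the dict representation of a field is an assumption:

```python
def schemas_equal(a, b):
    """Compare two schemas field-by-field: names and types must match in
    order; descriptions are deliberately ignored, mirroring how
    dataset_fields is keyed (a description change is not a new field)."""
    key = lambda schema: [(f["name"], f["type"]) for f in schema]
    return key(a) == key(b)
```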

cc @wslulciuc @phixMe

Metadata

Labels

db, db.perf (This issue or pull request improves DB performance), proposal

Status

In Progress
