Description
There has been some discussion (mostly in #1977) about reworking the versioning system for DatasetVersion.
Motivation
The current DatasetVersion versioning system leads to confusion (e.g. #1883). DatasetVersion has a uuid field (of type UUID) and a version field (also of type UUID). In a practical sense, I think these fields are redundant.
Additionally, external data systems might already support dataset versioning (e.g. Delta, Iceberg). It'd make sense for Marquez to record the versions those systems assign.
Proposal
I propose that a Version's uuid field assume the functionality currently provided by its version field, and that we add a new field, external_version, to hold dataset versions provided by external applications. This would have a downstream impact on JobVersion.
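To make the proposed shape concrete, here is a minimal sketch, assuming Java 16+ records. ProposedDatasetVersion and its factory methods are hypothetical names used for illustration only; they are not Marquez's existing classes.

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical shape, not Marquez's actual class: a single identifier (uuid)
// plus an optional version string supplied by an external system.
record ProposedDatasetVersion(UUID uuid, Optional<String> externalVersion) {

  // A version known only to Marquez (no external versioning system involved).
  static ProposedDatasetVersion internal(UUID uuid) {
    return new ProposedDatasetVersion(uuid, Optional.empty());
  }

  // A version that also carries the identifier assigned by the external system.
  static ProposedDatasetVersion external(UUID uuid, String externalVersion) {
    return new ProposedDatasetVersion(uuid, Optional.of(externalVersion));
  }
}
```

With this shape, uuid is the only identifier Marquez itself needs, and external_version stays empty unless the source system reports one.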
Work required
1. Update Version.getValue() to be of type String
2. Drop DatasetVersion's version field
3. Add a field to DatasetVersion: external_version (String)
4. Drop JobVersion's version field
5. Add a field to JobVersion: external_version (String).
   - I'm not sure if this is currently necessary, but it seems reasonable to assume that data applications might support job versions tied to code in the future if they don't already.
6. Use OpenLineage's DatasetVersionDatasetFacet facet to support external dataset versions.
7. Upstream/downstream code changes to support 1-6 (e.g. updating queries to use dv.uuid instead of dv.version)
8. Database migrations
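As a rough illustration of the OpenLineage facet step, the sketch below reads the external version out of a dataset's facets once they have been deserialized into a plain map. It assumes the facet sits under the "version" key with a "datasetVersion" string field, as in the OpenLineage spec. ExternalVersionSketch and its method are hypothetical, and Marquez's real facet handling may look quite different.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Marquez or the OpenLineage client.
final class ExternalVersionSketch {

  // Reads facets["version"]["datasetVersion"] (assumed layout of
  // DatasetVersionDatasetFacet); returns empty if the facet or field is absent.
  static Optional<String> externalVersion(Map<String, Object> facets) {
    Object facet = facets.get("version");
    if (!(facet instanceof Map)) {
      return Optional.empty();
    }
    Object value = ((Map<?, ?>) facet).get("datasetVersion");
    if (value instanceof String) {
      return Optional.of((String) value);
    }
    return Optional.empty();
  }
}
```

The returned value would populate external_version, while uuid remains Marquez's own identifier for the dataset version.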
If this proposal is accepted, I'll open an official proposal.