Proposal: Update DatasetVersion versioning #2071

@RNHTTR

There has been some discussion (mostly in #1977) about reworking the versioning system for DatasetVersion.

Motivation

The current DatasetVersion versioning system leads to confusion (e.g. #1883). DatasetVersion has a uuid field (of type UUID) and a version field (also of type UUID). In a practical sense, I think these fields are redundant.

Additionally, external data systems might already version their datasets themselves (e.g. Delta Lake, Apache Iceberg). It'd make sense for Marquez to be able to record those external versions.

Proposal

I propose that a Version's uuid field assume the functionality currently provided by its version field, and that a new field, external_version, be added to support dataset versions provided by external applications. This would also have a downstream impact on JobVersion.
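As a rough illustration of the shape being proposed, here is a minimal sketch of a version model with a single internal identifier plus an optional external one. The class `DatasetVersionSketch` and its accessors are hypothetical names chosen for this example and are not Marquez's actual classes.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.UUID;

public class DatasetVersionSketch {

  /** Hypothetical model for illustration; not Marquez's actual DatasetVersion. */
  public static final class DatasetVersion {
    // The uuid is the only internal version identifier (no separate version field).
    private final UUID uuid;
    // Version string assigned by an external system; null when none is supplied.
    private final String externalVersion;

    public DatasetVersion(UUID uuid, String externalVersion) {
      this.uuid = Objects.requireNonNull(uuid, "uuid");
      this.externalVersion = externalVersion;
    }

    public UUID getUuid() {
      return uuid;
    }

    public Optional<String> getExternalVersion() {
      return Optional.ofNullable(externalVersion);
    }
  }

  public static void main(String[] args) {
    // A dataset whose source system has no versioning of its own.
    DatasetVersion internalOnly = new DatasetVersion(UUID.randomUUID(), null);
    // A dataset whose source system reports its own version string.
    DatasetVersion external = new DatasetVersion(UUID.randomUUID(), "42");

    System.out.println(internalOnly.getExternalVersion().isPresent());
    System.out.println(external.getExternalVersion().orElse("<none>"));
  }
}
```

Keeping external_version as a plain String (rather than a UUID or number) is an assumption that follows the proposal's wording, since external systems do not share a common version format.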

Work required

  1. Update Version.getValue() to be of type String
  2. Drop DatasetVersion's version field
  3. Add a field to DatasetVersion: external_version (String)
  4. Drop JobVersion's version field
  5. Add a field to JobVersion: external_version (String).
    1. I'm not sure if this is currently necessary, but it seems reasonable to assume that data applications might support job versions tied to code in the future if they don't already.
  6. Use OpenLineage's DatasetVersionDatasetFacet to support external dataset versions.
  7. Upstream/downstream code changes to support 1-6 (e.g. updating queries to use dv.uuid instead of dv.version)
  8. Database migrations
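Step 6 could look roughly like the sketch below, which reads the external version out of a dataset's facets. It assumes the facets have already been parsed into nested maps, and that the DatasetVersionDatasetFacet sits under the `version` key with a `datasetVersion` string property, as in the OpenLineage spec. The class and method names are hypothetical and not part of Marquez.

```java
import java.util.Map;
import java.util.Optional;

public class ExternalVersionSketch {

  /**
   * Returns the external dataset version carried by the facets, if any.
   * Assumes the facet key "version" and property "datasetVersion".
   */
  public static Optional<String> externalVersion(Map<String, Object> facets) {
    if (facets == null) {
      return Optional.empty();
    }
    Object facet = facets.get("version");
    if (!(facet instanceof Map)) {
      return Optional.empty(); // facet absent or malformed
    }
    Object value = ((Map<?, ?>) facet).get("datasetVersion");
    return value instanceof String ? Optional.of((String) value) : Optional.empty();
  }

  public static void main(String[] args) {
    // Facets as they might look after JSON parsing (values are made up).
    Map<String, Object> withVersion =
        Map.of("version", Map.of("datasetVersion", "123"));
    Map<String, Object> withoutVersion = Map.of();

    System.out.println(externalVersion(withVersion).orElse("<none>"));
    System.out.println(externalVersion(withoutVersion).orElse("<none>"));
  }
}
```

When the facet is missing, external_version would simply stay empty and the uuid would remain the only version identifier.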

If this idea is accepted, I'll open a formal proposal.
