Dataset fields with the same name allowed with different types #2379

@wslulciuc

The Marquez DB model currently allows dataset fields of the same name, but of different types, to be associated with a given dataset (introduced in DB migration V12). For example, suppose we post an OpenLineage event for job my-job with an output dataset my-output whose schema declares field a twice, once with type VARCHAR and once with type INT:

$ curl -X POST http://localhost:5000/api/v1/lineage \
  -H 'Content-Type: application/json' \
  -d '{
        "eventType": "COMPLETE",
        "eventTime": "2020-12-28T20:52:00.001+10:00",
        "run": {
          "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
        },
        "job": {
          "namespace": "my-namespace",
          "name": "my-job"
        },
        "outputs": [{
          "namespace": "my-namespace",
          "name": "my-output",
          "facets": {
            "schema": {
              "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
              "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/spec/OpenLineage.json#/definitions/SchemaDatasetFacet",
              "fields": [
                { "name": "a", "type": "VARCHAR"},
                { "name": "a", "type": "INT"}
              ]
            }
          }
        }],
        "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
      }'

In the UI, we now see the metadata for dataset my-output:

[Screenshot: Marquez UI, metadata for dataset my-output]

What's expected?

The Marquez model should treat dataset fields with the same name as duplicates, regardless of data type. We don't want to drop the metadata for the OL event completely, so the backend will insert dataset fields on a first-come, first-served basis. In other words, using the example job my-job above, the output dataset my-output with fields a of types VARCHAR and INT will register only field a of type VARCHAR; field a of type INT will be ignored.
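
The first-come, first-served rule described above could be sketched roughly as follows. This is only an illustration in Python, not the Marquez backend code: the function name dedupe_fields_by_name is hypothetical, and it assumes field names are compared as exact, case-sensitive strings, which the issue does not specify.

```python
def dedupe_fields_by_name(fields):
    """Keep the first field seen for each name; later fields with the
    same name are ignored, whatever their type.

    Hypothetical helper for illustration only. Names are assumed to be
    compared as exact, case-sensitive strings.
    """
    seen_names = set()
    kept = []
    for field in fields:
        name = field["name"]
        if name in seen_names:
            # Duplicate name: skip it without looking at the type.
            continue
        seen_names.add(name)
        kept.append(field)
    return kept


# The schema facet fields from the example event above.
fields = [
    {"name": "a", "type": "VARCHAR"},
    {"name": "a", "type": "INT"},
]

print(dedupe_fields_by_name(fields))
# prints [{'name': 'a', 'type': 'VARCHAR'}]
```

Because the input order is preserved, whichever type appears first in the event's fields array is the one that gets registered.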

Labels: bug
