Description
The Marquez DB model currently allows dataset fields with the same name but different types to be associated with a given dataset (introduced in DB migration V12). For example, suppose we post an OpenLineage event for job my-job with an output dataset my-output that declares field a twice, once with type VARCHAR and once with type INT:
$ curl -X POST http://localhost:5000/api/v1/lineage \
  -H 'Content-Type: application/json' \
  -d '{
        "eventType": "COMPLETE",
        "eventTime": "2020-12-28T20:52:00.001+10:00",
        "run": {
          "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
        },
        "job": {
          "namespace": "my-namespace",
          "name": "my-job"
        },
        "outputs": [{
          "namespace": "my-namespace",
          "name": "my-output",
          "facets": {
            "schema": {
              "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
              "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/spec/OpenLineage.json#/definitions/SchemaDatasetFacet",
              "fields": [
                { "name": "a", "type": "VARCHAR"},
                { "name": "a", "type": "INT"}
              ]
            }
          }
        }],
        "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
      }'

In the UI, we now see the metadata for dataset my-output:
What's expected?
The Marquez model should treat dataset fields with the same name as duplicates, regardless of data type. We don't want to drop the metadata for the OL event completely, so the backend will insert dataset fields on a first-come, first-served basis. In other words, using the example job my-job above, the output dataset my-output with fields a of types VARCHAR and INT will only register field a of type VARCHAR; field a of type INT will be ignored.
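The expected behavior could be sketched roughly as follows. This is only an illustration in Python, not Marquez's actual backend code; the function name dedupe_fields_by_name is hypothetical, and it assumes field names are compared by exact, case-sensitive match.

```python
from typing import Dict, Iterable, List


def dedupe_fields_by_name(fields: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first field seen for each name, ignoring the type.

    Hypothetical helper: names are compared exactly (case-sensitive),
    which is an assumption and not something the issue specifies.
    """
    seen_names = set()
    kept = []
    for field in fields:
        name = field["name"]
        if name in seen_names:
            # A field with this name was already registered; ignore this one.
            continue
        seen_names.add(name)
        kept.append(field)
    return kept


# Fields from the schema facet of output dataset my-output in the example event.
fields = [
    {"name": "a", "type": "VARCHAR"},
    {"name": "a", "type": "INT"},
]

deduped = dedupe_fields_by_name(fields)
print(deduped)
```

With the example event, only field a of type VARCHAR would be kept, because it appears first in the fields list.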
