Description
In OpenLineage, it's valid to declare a field in a dataset schema facet without a type, like this:
{
  "namespace": "default",
  "name": "SomeTable",
  "facets": {
    "schema": {
      "_producer": "https://openlineage.acme.com",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
      "fields": [
        {
          "name": "Id"
        },
        {
          "name": "Name"
        }
      ]
    }
  }
}
If you actually do this and push it into Marquez, the performance of the column-lineage endpoint degrades rapidly - even with just a few thousand runs, I got to ~50s when running locally. When loading the same data into a fresh Marquez instance but with types included, performance was snappy (<100ms) again.
It may have something to do with this unique constraint on the dataset_fields table:
ALTER TABLE dataset_fields ADD UNIQUE (dataset_uuid, name, type);
Postgres treats null values as though they are always distinct by default, so this constraint never fires for rows with a null type and duplicates can mount up. Postgres 15 adds UNIQUE NULLS NOT DISTINCT, but that constraint would fail to create in cases where nulls have already been pushed in multiple times. In the shorter term we could consider defaulting the type to UNKNOWN when processing the message in Marquez.
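To illustrate the suspected mechanism, here is a minimal sketch. It does not use Postgres or the real Marquez schema: it uses an in-memory SQLite database (which also treats nulls as distinct in unique constraints), a simplified dataset_fields table, and `INSERT OR IGNORE` as a stand-in for whatever upsert Marquez actually performs. The `default_type` helper is hypothetical and only shows the idea of defaulting a missing type to UNKNOWN.

```python
import sqlite3


def count_rows_after_repeated_inserts(field_type, repeats=3):
    """Insert the same (dataset_uuid, name, type) row `repeats` times
    and return how many rows end up stored."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for the dataset_fields table (assumption, not the real schema).
    conn.execute(
        "CREATE TABLE dataset_fields ("
        " dataset_uuid TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " type TEXT,"
        " UNIQUE (dataset_uuid, name, type))"
    )
    for _ in range(repeats):
        # OR IGNORE skips the insert only when the unique constraint is violated.
        conn.execute(
            "INSERT OR IGNORE INTO dataset_fields (dataset_uuid, name, type)"
            " VALUES (?, ?, ?)",
            ("ds-1", "Id", field_type),
        )
    (count,) = conn.execute("SELECT COUNT(*) FROM dataset_fields").fetchone()
    conn.close()
    return count


def default_type(field_type):
    """Hypothetical helper: replace a missing type with the literal 'UNKNOWN'."""
    return field_type if field_type is not None else "UNKNOWN"


# With a null type, the unique constraint never matches, so every insert adds a row.
null_rows = count_rows_after_repeated_inserts(None)
# With the type defaulted, the second and third inserts are rejected as duplicates.
unknown_rows = count_rows_after_repeated_inserts(default_type(None))
print(null_rows, unknown_rows)
```

If the same thing is happening in Marquez, each run that reports a typeless field would add another row for it, which would fit the slowdown growing with the number of runs.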