
Omitted type for schema fields causes issues #2651

Open
@davidjgoss


In OpenLineage, it's valid to declare a field in a dataset schema facet without a type, like this:

{
    "namespace": "default",
    "name": "SomeTable",
    "facets": {
        "schema": {
            "_producer": "https://openlineage.acme.com",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
            "fields": [
                {
                    "name": "Id"
                },
                {
                    "name": "Name"
                }
            ]
        }
    }
}

If you do this and push the events into Marquez, the performance of the column-lineage endpoint degrades rapidly: with just a few thousand runs, responses took ~50s when running locally. Loading the same data into a fresh Marquez instance with types included brought responses back to under 100ms.

It may have something to do with this unique constraint on the dataset_fields table:

ALTER TABLE dataset_fields ADD UNIQUE (dataset_uuid, name, type);

By default, Postgres treats null values as though they are always distinct, so the constraint never matches rows whose type is null and duplicates accumulate. Postgres 15 adds UNIQUE NULLS NOT DISTINCT, but creating that constraint would fail wherever a field with a null type has already been inserted more than once. In the shorter term, we could consider defaulting the type to UNKNOWN when processing the message in Marquez.
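To illustrate the suspected mechanism and the proposed short-term fix, here is a minimal sketch. It makes several assumptions: SQLite (via Python's standard library) stands in for Postgres, since both treat nulls as distinct in a UNIQUE constraint by default; the table is reduced to three columns; and `normalize_fields`, `count_rows_after_runs`, and the `"ds-1"` identifier are hypothetical names that do not come from the Marquez codebase. It does not reproduce Marquez's actual write path.

```python
import sqlite3

# Assumption: "UNKNOWN" as the default type, as suggested in this issue.
UNKNOWN_TYPE = "UNKNOWN"


def normalize_fields(fields):
    """Return copies of schema fields with a missing, null, or empty type set to UNKNOWN."""
    return [{**f, "type": f.get("type") or UNKNOWN_TYPE} for f in fields]


def count_rows_after_runs(fields, runs):
    """Insert the same fields once per run and return how many rows end up stored."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for the dataset_fields table and its unique constraint.
    conn.execute(
        "CREATE TABLE dataset_fields ("
        " dataset_uuid TEXT, name TEXT, type TEXT,"
        " UNIQUE (dataset_uuid, name, type))"
    )
    for _ in range(runs):
        for f in fields:
            # INSERT OR IGNORE skips rows that would violate the unique constraint.
            conn.execute(
                "INSERT OR IGNORE INTO dataset_fields VALUES (?, ?, ?)",
                ("ds-1", f["name"], f.get("type")),
            )
    (count,) = conn.execute("SELECT COUNT(*) FROM dataset_fields").fetchone()
    conn.close()
    return count


fields = [{"name": "Id"}, {"name": "Name"}]

# Null types: the constraint never matches, so every run adds two more rows.
untyped_rows = count_rows_after_runs(fields, runs=1000)

# Defaulted types: the constraint matches, so only the first run adds rows.
typed_rows = count_rows_after_runs(normalize_fields(fields), runs=1000)

print(untyped_rows, typed_rows)
```

In this sketch, the untyped fields produce one row per field per run, while the normalized fields stay at one row per field. Whether this row growth fully explains the slow column-lineage queries would still need to be confirmed against a real Marquez database.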


Labels: db, db.perf (This issue or pull request improves DB performance)
