Omitted type for schema fields causes issues #2651

In OpenLineage, it's valid to declare a field in a dataset schema facet without a `type`, like this:

```json
{
    "namespace": "default",
    "name": "SomeTable",
    "facets": {
        "schema": {
            "_producer": "https://openlineage.acme.com",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
            "fields": [
                {
                    "name": "Id"
                },
                {
                    "name": "Name"
                }
            ]
        }
    }
}
```

If you actually do this and push the events into Marquez, the performance of the `column-lineage` endpoint degrades rapidly. With just a few thousand runs, I saw response times of ~50s when running locally. When I loaded the same data into a fresh Marquez instance, but with `type`s included, performance was snappy (<100ms) again.

It may have something to do with this unique constraint on the `dataset_fields` table:

```sql
ALTER TABLE dataset_fields ADD UNIQUE (dataset_uuid, name, type);
```

By default, Postgres treats null values as though they are always distinct, so the constraint never rejects a row whose `type` is null and duplicates accumulate. [Postgres 15](https://www.postgresql.org/docs/15/release-15.html#id-1.11.6.5.5.3.4) adds `UNIQUE NULLS NOT DISTINCT`, but creating such a constraint would fail wherever the same null-typed field has already been inserted more than once. In the shorter term, we could consider defaulting the type to `UNKNOWN` when processing the message in Marquez.
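To illustrate the suspected mechanism, here is a minimal sketch that replays the two untyped fields from the example above against a table with the same three-column unique constraint. It rests on several assumptions. It uses an in-memory SQLite database as a stand-in, because SQLite also treats nulls as distinct in unique constraints. The table is reduced to the three constrained columns. `INSERT OR IGNORE` stands in for whatever upsert Marquez really performs. The helpers `with_default_type` and `count_rows_after_replay` are hypothetical names, not Marquez code.

```python
import sqlite3


def with_default_type(field, default="UNKNOWN"):
    """Return a copy of a schema field with a missing or null type filled in."""
    normalized = dict(field)
    if normalized.get("type") is None:
        normalized["type"] = default
    return normalized


def count_rows_after_replay(fields, runs, normalize):
    """Insert the same fields once per run and return the resulting row count."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for the dataset_fields table (assumed column types).
    conn.execute(
        "CREATE TABLE dataset_fields ("
        " dataset_uuid TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " type TEXT,"
        " UNIQUE (dataset_uuid, name, type))"
    )
    for _ in range(runs):
        for field in fields:
            if normalize:
                field = with_default_type(field)
            # OR IGNORE skips rows that violate the unique constraint.
            conn.execute(
                "INSERT OR IGNORE INTO dataset_fields (dataset_uuid, name, type)"
                " VALUES (?, ?, ?)",
                ("ds-1", field["name"], field.get("type")),
            )
    (count,) = conn.execute("SELECT COUNT(*) FROM dataset_fields").fetchone()
    conn.close()
    return count


fields = [{"name": "Id"}, {"name": "Name"}]
print(count_rows_after_replay(fields, runs=3, normalize=False))  # 6
print(count_rows_after_replay(fields, runs=3, normalize=True))   # 2
```

With null types, every run adds a fresh pair of rows, so the table grows linearly with the number of runs. With the `UNKNOWN` default, the constraint deduplicates as intended. Whether this row growth fully explains the slow `column-lineage` queries is still unconfirmed.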
