Optimization for ColumnLineageDao.getLineageRowsForDataset

Marquez version: `0.42.0`
Related issue: https://github.com/MarquezProject/marquez/issues/2418
Related PR: https://github.com/MarquezProject/marquez/pull/2419 for the previous merge.
Discussion thread: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1701766646176309?thread_ts=1676320609.289789&cid=C01E8MQGJP7


Current query:

```
WITH selected_column_lineage AS (
    SELECT DISTINCT ON (cl.output_dataset_field_uuid, cl.input_dataset_field_uuid) cl.*
    FROM column_lineage cl
    JOIN dataset_fields df ON df.uuid = cl.output_dataset_field_uuid
    JOIN datasets_view dv ON dv.uuid = df.dataset_uuid
    WHERE ARRAY[($1,$2)]::DATASET_NAME[] && dv.dataset_symlinks -- array of string pairs is cast onto array of DATASET_NAME types to be checked if it has non-empty intersection with dataset symlinks
    ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
  ),
  dataset_fields_view AS (
    SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
  )
  SELECT
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type,
    ARRAY_AGG(DISTINCT ARRAY[
      input_fields.namespace_name,
      input_fields.dataset_name,
      CAST(c.input_dataset_version_uuid AS VARCHAR),
      input_fields.field_name,
      c.transformation_description,
      c.transformation_type
    ]) AS inputFields,
    null as dataset_version_uuid
  FROM selected_column_lineage c
  INNER JOIN dataset_fields_view output_fields ON c.output_dataset_field_uuid = output_fields.uuid
  LEFT JOIN dataset_fields_view input_fields ON c.input_dataset_field_uuid = input_fields.uuid
  GROUP BY
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type
```

I suspect that the query has joins with views at multiple places and as views can't and don't have indexes which is causing the issues. `datasets_view` while creating derived table `selected_column_lineage` and `dataset_fields_view` in the main select query.

Also, the derived table (`selected_column_lineage`) without indexes. It is getting joined against `dataset_fields_view` in the main select query.

We have one database that has `62981` rows in `dataset_view` and one that has `549987`. Things work just fine in the former case with less rows.

I believe the issue happens at scale.

With recent test with larger data sizes queries don't tend to finish. We have seen examples of query running for `3+` days.

```
+-----+----------------------------------------------------+
|pid       | query_duration                                      |
+-----+----------------------------------------------------+
|18258 | 3 days 8 hours 6 mins 11.680846 secs |
|14029 | 3 days 7 hours 55 mins 57.796614 secs|
|25136 | 3 days 7 hours 55 mins 57.796616 secs|
|12053 | 3 days 8 hours 6 mins 11.680867 secs |
|12054 | 3 days 8 hours 6 mins 11.680869 secs |
|14028 | 3 days 7 hours 55 mins 57.796674 secs|
+-----+----------------------------------------------------+
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions