Skip to content

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

Open
@ddave09

Description

@ddave09

Marquez version: 0.42.0
Related issue: #2418
Related PR: #2419 for the previous merge.
Discussion thread: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1701766646176309?thread_ts=1676320609.289789&cid=C01E8MQGJP7

Current query:

WITH selected_column_lineage AS (
    SELECT DISTINCT ON (cl.output_dataset_field_uuid, cl.input_dataset_field_uuid) cl.*
    FROM column_lineage cl
    JOIN dataset_fields df ON df.uuid = cl.output_dataset_field_uuid
    JOIN datasets_view dv ON dv.uuid = df.dataset_uuid
    WHERE ARRAY[($1,$2)]::DATASET_NAME[] && dv.dataset_symlinks -- array of string pairs is cast onto array of DATASET_NAME types to be checked if it has non-empty intersection with dataset symlinks
    ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
  ),
  dataset_fields_view AS (
    SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
  )
  SELECT
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type,
    ARRAY_AGG(DISTINCT ARRAY[
      input_fields.namespace_name,
      input_fields.dataset_name,
      CAST(c.input_dataset_version_uuid AS VARCHAR),
      input_fields.field_name,
      c.transformation_description,
      c.transformation_type
    ]) AS inputFields,
    null as dataset_version_uuid
  FROM selected_column_lineage c
  INNER JOIN dataset_fields_view output_fields ON c.output_dataset_field_uuid = output_fields.uuid
  LEFT JOIN dataset_fields_view input_fields ON c.input_dataset_field_uuid = input_fields.uuid
  GROUP BY
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type

I suspect that the query has joins with views at multiple places and as views can't and don't have indexes which is causing the issues. datasets_view while creating derived table selected_column_lineage and dataset_fields_view in the main select query.

Also, the derived table (selected_column_lineage) without indexes. It is getting joined against dataset_fields_view in the main select query.

We have one database that has 62981 rows in dataset_view and one that has 549987. Things work just fine in the former case with less rows.

I believe the issue happens at scale.

With recent test with larger data sizes queries don't tend to finish. We have seen examples of query running for 3+ days.

+-----+----------------------------------------------------+
|pid       | query_duration                                      |
+-----+----------------------------------------------------+
|18258 | 3 days 8 hours 6 mins 11.680846 secs |
|14029 | 3 days 7 hours 55 mins 57.796614 secs|
|25136 | 3 days 7 hours 55 mins 57.796616 secs|
|12053 | 3 days 8 hours 6 mins 11.680867 secs |
|12054 | 3 days 8 hours 6 mins 11.680869 secs |
|14028 | 3 days 7 hours 55 mins 57.796674 secs|
+-----+----------------------------------------------------+

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    • Status

      Todo

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions