Description
Marquez version: 0.42.0
Related issue: #2418
Related PR: #2419 for the previous merge.
Discussion thread: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1701766646176309?thread_ts=1676320609.289789&cid=C01E8MQGJP7
Current query:
WITH selected_column_lineage AS (
SELECT DISTINCT ON (cl.output_dataset_field_uuid, cl.input_dataset_field_uuid) cl.*
FROM column_lineage cl
JOIN dataset_fields df ON df.uuid = cl.output_dataset_field_uuid
JOIN datasets_view dv ON dv.uuid = df.dataset_uuid
WHERE ARRAY[($1,$2)]::DATASET_NAME[] && dv.dataset_symlinks -- array of string pairs is cast onto array of DATASET_NAME types to be checked if it has non-empty intersection with dataset symlinks
ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
),
dataset_fields_view AS (
SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
FROM dataset_fields df
INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
)
SELECT
output_fields.namespace_name,
output_fields.dataset_name,
output_fields.field_name,
output_fields.type,
ARRAY_AGG(DISTINCT ARRAY[
input_fields.namespace_name,
input_fields.dataset_name,
CAST(c.input_dataset_version_uuid AS VARCHAR),
input_fields.field_name,
c.transformation_description,
c.transformation_type
]) AS inputFields,
null as dataset_version_uuid
FROM selected_column_lineage c
INNER JOIN dataset_fields_view output_fields ON c.output_dataset_field_uuid = output_fields.uuid
LEFT JOIN dataset_fields_view input_fields ON c.input_dataset_field_uuid = input_fields.uuid
GROUP BY
output_fields.namespace_name,
output_fields.dataset_name,
output_fields.field_name,
output_fields.type
I suspect that the query has joins with views at multiple places and as views can't and don't have indexes which is causing the issues. datasets_view
while creating derived table selected_column_lineage
and dataset_fields_view
in the main select query.
Also, the derived table (selected_column_lineage
) without indexes. It is getting joined against dataset_fields_view
in the main select query.
We have one database that has 62981
rows in dataset_view
and one that has 549987
. Things work just fine in the former case with less rows.
I believe the issue happens at scale.
With recent test with larger data sizes queries don't tend to finish. We have seen examples of query running for 3+
days.
+-----+----------------------------------------------------+
|pid | query_duration |
+-----+----------------------------------------------------+
|18258 | 3 days 8 hours 6 mins 11.680846 secs |
|14029 | 3 days 7 hours 55 mins 57.796614 secs|
|25136 | 3 days 7 hours 55 mins 57.796616 secs|
|12053 | 3 days 8 hours 6 mins 11.680867 secs |
|12054 | 3 days 8 hours 6 mins 11.680869 secs |
|14028 | 3 days 7 hours 55 mins 57.796674 secs|
+-----+----------------------------------------------------+
Metadata
Metadata
Assignees
Type
Projects
Status
Todo