Add Lineage Information to EXPLAIN (TYPE IO) #24952

Open
wants to merge 1 commit into
base: master

predator4ann commented Feb 7, 2025

Description

This PR introduces the following changes to support column lineage information in EXPLAIN (TYPE IO):

  • Adds a new outputColumns field to the EXPLAIN (TYPE IO) output. For each output column it reports the column name, the column type, and the source columns (catalog, schema, table, and column name) it is derived from, so users can trace the origin of each column that is written.
  • Supported for INSERT and CREATE TABLE ... AS SELECT statements.

Additional context and related issues

#22606

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

Examples

  • INSERT
trino> EXPLAIN (TYPE IO) INSERT INTO hive.tmp.dst_table SELECT c1, c2 FROM tmp.src_table;
 {
   "operationType" : "INSERT",
   "inputTableColumnInfos" : [ {
     "table" : {
       "catalog" : "hive",
       "schemaTable" : {
         "schema" : "tmp",
         "table" : "src_table"
       }
     },
     "tableDetail" : "{\n  \"schemaName\" : \"tmp\",\n  \"tableName\" : \"src_table\",\n  \"partitionColumns\" : [ ],\n  \"dataColumns\" : [ {\n    \"baseColumnName\" : \"c1\",\n    \"baseHiveColumnIndex\" : 0,\n    \"baseHiveType\" : \">
     "constraint" : {
       "none" : false,
       "columnConstraints" : [ ]
     },
     "estimate" : {
       "outputRowCount" : "NaN",
       "outputSizeInBytes" : "NaN",
       "cpuCost" : "NaN",
       "maxMemory" : 0.0,
       "networkCost" : 0.0
     }
   } ],
   "outputTable" : {
     "catalog" : "hive",
     "schemaTable" : {
       "schema" : "tmp",
       "table" : "dst_table"
     }
   },
   "estimate" : {
     "outputRowCount" : "NaN",
     "outputSizeInBytes" : "NaN",
     "cpuCost" : "NaN",
     "maxMemory" : "NaN",
     "networkCost" : "NaN"
   },
   "outputColumns" : [ {
     "columnName" : "c1",
     "columnType" : "varchar",
     "sourceColumns" : [ {
       "catalog" : "hive",
       "schema" : "tmp",
       "table" : "src_table",
       "columnName" : "c1"
     } ]
   }, {
     "columnName" : "c2",
     "columnType" : "varchar",
     "sourceColumns" : [ {
       "catalog" : "hive",
       "schema" : "tmp",
       "table" : "src_table",
       "columnName" : "c2"
     } ]
   } ]
 }
(1 row)
  • CREATE TABLE


trino> EXPLAIN (TYPE IO) CREATE TABLE hive.tmp.test_explain_io AS SELECT c1, c2 FROM hive.tmp.src_table;
 {
   "operationType" : "CREATE",
   "inputTableColumnInfos" : [ {
     "table" : {
       "catalog" : "hive",
       "schemaTable" : {
         "schema" : "tmp",
         "table" : "src_table"
       }
     },
     "tableDetail" : "{\n  \"schemaName\" : \"tmp\",\n  \"tableName\" : \"src_table\",\n  \"partitionColumns\" : [ ],\n  \"dataColumns\" : [ {\n    \"baseColumnName\" : \"c1\",\n    \"baseHiveColumnIndex\" : 0,\n    \"baseHiveType\" : \">
     "constraint" : {
       "none" : false,
       "columnConstraints" : [ ]
     },
     "estimate" : {
       "outputRowCount" : "NaN",
       "outputSizeInBytes" : "NaN",
       "cpuCost" : "NaN",
       "maxMemory" : 0.0,
       "networkCost" : 0.0
     }
   } ],
   "outputTable" : {
     "catalog" : "hive",
     "schemaTable" : {
       "schema" : "tmp",
       "table" : "test_explain_io"
     }
   },
   "estimate" : {
     "outputRowCount" : "NaN",
     "outputSizeInBytes" : "NaN",
     "cpuCost" : "NaN",
     "maxMemory" : "NaN",
     "networkCost" : "NaN"
   },
   "outputColumns" : [ {
     "columnName" : "c1",
     "columnType" : "varchar",
     "sourceColumns" : [ {
       "catalog" : "hive",
       "schema" : "tmp",
       "table" : "src_table",
       "columnName" : "c1"
     } ]
   }, {
     "columnName" : "c2",
     "columnType" : "varchar",
     "sourceColumns" : [ {
       "catalog" : "hive",
       "schema" : "tmp",
       "table" : "src_table",
       "columnName" : "c2"
     } ]
   } ]
 }
(1 row)
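
A consumer could turn the new outputColumns field into column-level lineage edges along the following lines. This is only a minimal sketch in Python, not part of the PR: the lineage_edges helper and the PLAN_JSON constant are hypothetical names, and the JSON is a trimmed copy of the INSERT example above that keeps only the fields the sketch reads.

```python
import json

# Trimmed copy of the INSERT example output above (assumption: only the
# fields read below are kept; the real output carries more).
PLAN_JSON = """
{
  "operationType": "INSERT",
  "outputTable": {
    "catalog": "hive",
    "schemaTable": {"schema": "tmp", "table": "dst_table"}
  },
  "outputColumns": [
    {
      "columnName": "c1",
      "columnType": "varchar",
      "sourceColumns": [
        {"catalog": "hive", "schema": "tmp", "table": "src_table", "columnName": "c1"}
      ]
    },
    {
      "columnName": "c2",
      "columnType": "varchar",
      "sourceColumns": [
        {"catalog": "hive", "schema": "tmp", "table": "src_table", "columnName": "c2"}
      ]
    }
  ]
}
"""


def lineage_edges(plan):
    """Hypothetical helper: flatten outputColumns into (source, target) pairs."""
    out = plan["outputTable"]
    target_table = "{}.{}.{}".format(
        out["catalog"], out["schemaTable"]["schema"], out["schemaTable"]["table"]
    )
    edges = []
    # Treat a missing outputColumns or sourceColumns field as "no lineage".
    for column in plan.get("outputColumns", []):
        target = "{}.{}".format(target_table, column["columnName"])
        for src in column.get("sourceColumns", []):
            source = "{}.{}.{}.{}".format(
                src["catalog"], src["schema"], src["table"], src["columnName"]
            )
            edges.append((source, target))
    return edges


edges = lineage_edges(json.loads(PLAN_JSON))
for source, target in edges:
    print(source, "->", target)
```

An output column derived from several inputs would simply yield one edge per entry in its sourceColumns list.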

cla-bot bot added the cla-signed label on Feb 7, 2025
github-actions bot added the hive (Hive connector) label on Feb 7, 2025
predator4ann force-pushed the feature/add_lineage_info_to_explain_typeio branch from 4766517 to ee119fa on February 8, 2025 07:13
predator4ann (Author) commented

@Praveen2112 PTAL

predator4ann force-pushed the feature/add_lineage_info_to_explain_typeio branch 3 times, most recently from abee3ea to aef0a7f on February 24, 2025 14:44
predator4ann (Author) commented Feb 25, 2025

@Praveen2112 Hi Praveen, the earlier concurrency issue has been fixed.

RedEminence commented Feb 25, 2025

@predator4ann Hello, just wanted to thank you for this feature. It'd be very useful for me, so I fetched it locally to try on my data. In most cases it works great, but I found a problem with complex queries with UNIONs.
Consider this example:

CREATE SCHEMA memory."test_schema";

CREATE TABLE memory."test_schema".table_1 (
    id varchar,
    col_1 varchar,
    col_2 varchar,
    col_3 varchar,
    col_4 varchar,
    col_5 varchar,
    col_6 varchar,
    partition_date date
);

CREATE TABLE memory."test_schema".table_2 (
    rules array(ROW(segments ROW(col_12 varchar, col_13 varchar)))
);

CREATE TABLE memory."test_schema".table_3 (
    rules array(ROW(segments ROW(col_12 varchar, col_13 varchar)))
);

CREATE TABLE memory."test_schema".table_4 (
    id bigint,
    col_7 bigint,
    col_8 integer,
    col_9 varchar,
    col_10 varchar,
    col_11 varchar
);

CREATE TABLE memory."test_schema".table_1_2 AS SELECT * FROM memory."test_schema".table_1;

CREATE TABLE memory."test_schema".table_1_3 AS SELECT * FROM memory."test_schema".table_1;

EXPLAIN (TYPE IO) CREATE TABLE memory."test_schema".test AS 
WITH
    segments_0 AS
        (
SELECT
	COALESCE(LOWER(col_3),
	'') AS col_14,
	COALESCE(LOWER(col_4),
	'') AS col_15,
	REPLACE(REPLACE(COALESCE(LOWER(col_5),
	''),
	'\\',
	'|'),
	'/',
	'|') AS col_16
FROM
	memory."test_schema".table_1
GROUP BY
	1,
	2,
	3
        ),
    segments_1 AS
        (
SELECT
	segments_0.col_14,
	segments_0.col_15,
	segments_0.col_16,
	result_old.segments AS col_12_old,
	result_old.segments AS col_13_old,
	result_old.segments AS segment_old,
	result_new.segments AS col_12_new,
	result_new.segments AS col_13_new,
	result_new.segments AS segment_new
FROM
	segments_0,
	memory."test_schema".table_2,
	memory."test_schema".table_3,
	UNNEST(memory."test_schema".table_2.rules) AS result_old,
	UNNEST(memory."test_schema".table_3.rules) AS result_new
        ),
    prep AS
        (
SELECT
	id,
	col_1,
	col_2,
	COALESCE(LOWER(col_3),
	'') AS col_14,
	COALESCE(LOWER(col_4),
	'') AS col_15,
	REPLACE(REPLACE(COALESCE(LOWER(col_5),
	''),
	'\\',
	'|'),
	'/',
	'|') AS col_16,
	col_6,
	col_5 AS col_9_id,
	col_5 AS col_7,
	col_5 AS col_8,
	partition_date,
	CASE
		WHEN col_5 = '01' THEN '1'
		ELSE col_5
	END AS col_18,
	CASE
		WHEN col_5 = '02' THEN '2'
		ELSE col_5
	END
                    AS col_17
FROM
	memory."test_schema".table_1
UNION ALL
SELECT
	'test' AS id,
	CONCAT(col_2,
	col_1) AS col_1,
	col_2,
	COALESCE(LOWER(col_5),
	'') AS col_14,
	COALESCE(LOWER(col_5),
	'') AS col_15,
	REPLACE(REPLACE(COALESCE(LOWER(col_5),
	''),
	'\\',
	'|'),
	'/',
	'|') AS col_16,
	col_6,
	col_5 AS col_9_id,
	col_5 AS col_7,
	col_5 AS col_8,
	partition_date,
	'test' AS col_18,
	CASE
		WHEN col_5 = '02' THEN '2'
		ELSE col_5
	END
                    AS col_17
FROM
	memory."test_schema".table_1_2
UNION ALL
SELECT
	id,
	CONCAT(col_2,
	col_1) AS col_1,
	col_2,
	COALESCE(LOWER(col_5),
	'') AS col_14,
	COALESCE(LOWER(col_5),
	'') AS col_15,
	REPLACE(REPLACE(COALESCE(LOWER(col_5),
	''),
	'\\',
	'|'),
	'/',
	'|') AS col_16,
	col_6,
	col_5 AS col_9_id,
	COALESCE(CAST(col_5 AS VARCHAR),
	CAST(col_6 AS VARCHAR),
	CAST(partition_date AS VARCHAR)) AS col_7,
	col_5 AS col_8,
	partition_date,
	'test' AS col_18,
	CASE
		WHEN col_5 = '02' THEN '2'
		ELSE col_5
	END
                    AS col_17
FROM
	memory."test_schema".table_1_3
        ),
    table_6 AS
        (
SELECT
	prep.id,
	prep.col_1,
	prep.col_2,
	prep.col_6,
	prep.col_14,
	prep.col_15,
	prep.col_16,
	prep.col_18,
	prep.col_17,
	prep.col_9_id,
	geo_1.col_9,
	partition_date,
	CASE
		WHEN prep.col_9_id IS NOT NULL
		AND prep.col_7 IS NULL THEN CAST(geo_1.col_7 AS VARCHAR)
		ELSE prep.col_7
	END AS col_7,
	prep.col_8
FROM
	prep
LEFT JOIN memory."test_schema".table_4 geo_1 ON
	prep.col_9_id = CAST(geo_1.id AS VARCHAR)
        ),
    table_5 AS
        (
SELECT
	table_6.id,
	table_6.col_1,
	table_6.col_2,
	table_6.col_6,
	table_6.col_14,
	table_6.col_15,
	table_6.col_16,
	table_6.col_18,
	table_6.col_17,
	table_6.col_9_id,
	table_6.col_9,
	table_6.col_7,
	table_6.partition_date,
	CASE
		WHEN table_6.col_8 IS NOT NULL THEN table_6.col_8
		WHEN table_6.col_8 IS NULL
			AND table_6.col_7 IS NOT NULL THEN CAST(geo_2.col_8 AS VARCHAR)
			WHEN table_6.col_8 IS NULL
				AND table_6.col_7 IS NULL
				AND table_6.col_9_id IS NOT NULL THEN CAST(geo_3.col_8 AS VARCHAR)
			END AS col_8
		FROM
			table_6
		LEFT JOIN (
			SELECT
				col_7,
				col_8
			FROM
				memory."test_schema".table_4
			GROUP BY
				1,
				2) AS geo_2 ON
			table_6.col_7 = CAST(geo_2.col_7 AS VARCHAR)
		LEFT JOIN memory."test_schema".table_4 AS geo_3 ON
			table_6.col_9_id = CAST(geo_3.id AS VARCHAR)
        ),
    test_6 AS
        (
SELECT
	table_5.id,
	table_5.col_1,
	table_5.col_2,
	table_5.col_6,
	table_5.col_14 AS col_14,
	table_5.col_15,
	table_5.col_16,
	table_5.col_18,
	table_5.col_17,
	table_5.col_9_id,
	table_5.col_9,
	table_5.col_7,
	geo_4.col_10,
	table_5.col_8,
	table_5.partition_date,
	geo_5.col_11
FROM
	table_5
LEFT JOIN (
	SELECT
		col_7,
		col_10
	FROM
		memory."test_schema".table_4
	GROUP BY
		1,
		2) AS geo_4 ON
	table_5.col_7 = CAST(geo_4.col_7 AS VARCHAR)
LEFT JOIN (
	SELECT
		col_8,
		col_11
	FROM
		memory."test_schema".table_4
	GROUP BY
		1,
		2) AS geo_5 ON
	table_5.col_8 = CAST(geo_5.col_8 AS VARCHAR)
        )
SELECT
	g.id,
	g.col_1,
	g.col_2,
	g.col_6,
	g.col_14 AS col_3,
	g.col_15 AS col_4,
	g.col_16 AS col_5,
	g.col_18 AS col_18_utm_col_5,
	g.col_17 AS col_17_utm_col_5,
	g.col_9_id AS col_9_id_utm_col_5,
	g.col_9 AS city_utm_col_5,
	g.col_7 AS col_7_utm_col_5,
	g.col_10 AS col_10_utm_col_5,
	g.col_8 AS col_8_utm_col_5,
	g.col_11 AS col_11_utm_col_5,
	s.col_12_old,
	s.col_13_old,
	s.segment_old,
	s.col_12_new,
	s.col_13_new,
	s.segment_new,
	g.partition_date
FROM
	test_6 g
LEFT JOIN segments_1 s
                   ON
	s.col_14 = g.col_14
	AND s.col_15 = g.col_15
	AND s.col_16 = g.col_16;

The resulting output contains this entry:

"outputColumns" : [ {
    "columnName" : "id",
    "columnType" : "varchar",
    "sourceColumns" : [ {
      "catalog" : "memory",
      "schema" : "test_schema",
      "table" : "table_1",
      "columnName" : "id"
    } ]
  }, 

As you can see, sourceColumns for the "id" column is missing a second source: "id" from "table_1_3". Have you encountered this issue? It is probably due to

public Set<SourceColumn> getSourceColumns()
which does not return all UNION'd columns. But I might be mistaken, of course.
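To make the expected behavior concrete, here is a minimal sketch of what I would expect for a UNION: the sources of output column i are the union of the sources of column i in every branch. This is not Trino code. The class name, the mergeUnionBranches helper, and the use of plain strings for source columns are all hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UnionLineageSketch {
    // Hypothetical helper, not part of Trino. Each branch is a list with one
    // entry per output position; each entry holds the fully qualified source
    // columns feeding that position (empty for literals such as 'test').
    public static List<Set<String>> mergeUnionBranches(List<List<Set<String>>> branches) {
        if (branches.isEmpty()) {
            return List.of();
        }
        int width = branches.get(0).size();
        List<Set<String>> merged = new ArrayList<>();
        for (int position = 0; position < width; position++) {
            // LinkedHashSet keeps branch order and drops duplicate sources.
            Set<String> sources = new LinkedHashSet<>();
            for (List<Set<String>> branch : branches) {
                if (branch.size() != width) {
                    throw new IllegalArgumentException("UNION branches must have the same number of columns");
                }
                sources.addAll(branch.get(position));
            }
            merged.add(sources);
        }
        return merged;
    }

    public static void main(String[] args) {
        // The "id" position of the three branches of the prep CTE above:
        // table_1.id, the literal 'test' (no source), and table_1_3.id.
        List<List<Set<String>>> idBranches = List.of(
                List.of(Set.of("memory.test_schema.table_1.id")),
                List.of(Set.<String>of()),
                List.of(Set.of("memory.test_schema.table_1_3.id")));
        System.out.println(mergeUnionBranches(idBranches).get(0));
    }
}
```

With the three branches of prep as input, the merged set for "id" contains both table_1.id and table_1_3.id, which is what I would expect to see in sourceColumns.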

@predator4ann
Author

predator4ann commented Feb 26, 2025

@RedEminence Hi, thank you for trying it out and for the feedback. I think the issue is that lineage capture probably still needs to be implemented in the visitUnion() method of StatementAnalyzer to cover this situation. I will try to fix this.

@predator4ann
Author

predator4ann commented Feb 26, 2025

@RedEminence Hi, I opened PR #25149 to fix the missing UNION lineage information; maybe you can take a look at it.

@RedEminence

Thank you so much! I'll take a look when I have a chance.

@Praveen2112 Praveen2112 self-requested a review February 27, 2025 05:51
@predator4ann
Author

@Praveen2112 Could you please help review this PR?


github-actions bot commented Apr 9, 2025

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Apr 9, 2025
@predator4ann predator4ann force-pushed the feature/add_lineage_info_to_explain_typeio branch from aef0a7f to 19271c1 Compare April 16, 2025 09:12
@github-actions github-actions bot removed the stale label Apr 16, 2025