-
Notifications
You must be signed in to change notification settings - Fork 273
Description
Describe the bug
First we want to get the column lineage with insert sql(spark sql, running in hadoop Hadoop ).But In some cases, the results of field lineage analysis based on Spark SQL are incorrect,such as when no target table field information is specified after insert table xxx expression:
SQL
Paste the SQL text here. For example:
INSERT OVERWRITE table data_ds.data_metric_di PARTITION(ds=20251013)
SELECT uin,
device,
play_cnt,
sum(if(play_progress_cnt>90,
1,
0)) play_progress90_cnt,
FROM data_dwd.data_metric_ori
WHERE ds=20251013To Reproduce
Note here we refer to SQL provided in prior step as stored in a file named test.sql
API (Application Programming Interface): provide the python code you're using and the output.
For example:
from sqllineage.runner import LineageRunner
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
if __name__ == '__main__':
sql ='''
INSERT OVERWRITE table data_ds.data_metric_di PARTITION(ds=20251013)
SELECT uin,
device,
play_cnt,
sum(if(play_progress_cnt>90,
1,
0)) play_progress90_cnt,
FROM data_dwd.data_metric_ori
WHERE ds=20251013
'''
metadata = dict()
metadata = {"data_ds.data_metric_di":["ds","uin","device","pcr_cnt","pcr90_cnt"], "data_dwd.data_metric_ori":["ds","uin","device","play_cnt","play_progress_cnt"]}
provider = DummyMetaDataProvider(metadata)
rs = LineageRunner(sql=sql, dialect='sparksql', metadata_provider=provider,verbose=True)
columns = rs.get_column_lineage(exclude_subquery_columns=True)
print("column_lineage:", columns)[Table: <default>.analyze]
the output is as follows:
[(Column: data_dwd.data_metric_ori.device, Column: data_ds.data_metric_di.device), (Column: data_dwd.data_metric_ori.play_cnt, Column: data_ds.data_metric_di.play_cnt), (Column: data_dwd.data_metric_ori.play_progress_cnt, Column: data_ds.data_metric_di.play_progress90_cnt), (Column: data_dwd.data_metric_ori.uin, Column: data_ds.data_metric_di.uin)]
Expected behavior
A clear and concise description of what you expected to happen, and the output in accordance with the To Reproduce section.
The sql is legal and can be used to insert data correctly in the Hadoop environment .But we can see that (Column: data_dwd.data_metric_ori.play_progress_cnt, Column: data_ds.data_metric_di.play_progress90_cnt) is wrong,it should be (Column: data_dwd.data_metric_ori.play_progress_cnt, Column: data_ds.data_metric_di.pcr90_cnt) ,as we have already declared the columns of the target table data_ds.data_metric_di.Is there any way to solve this problem?
Python version (available via python --version)
- 3.10
SQLLineage version (available via sqllineage --version):
- 1.5.3
Additional context
Add any other context about the problem here.