Skip to content

[column lineage] Something wrong with column lineage(spark sql) #733

@kingbaiyulong

Description

@kingbaiyulong

Describe the bug
First we want to get the column lineage with insert sql(spark sql, running in hadoop Hadoop ).But In some cases, the results of field lineage analysis based on Spark SQL are incorrect,such as when no target table field information is specified after insert table xxx expression:

SQL
Paste the SQL text here. For example:

INSERT OVERWRITE table data_ds.data_metric_di PARTITION(ds=20251013)
SELECT uin,
         device,
         play_cnt,
         sum(if(play_progress_cnt>90,
         1,
         0)) play_progress90_cnt,
        
FROM data_dwd.data_metric_ori
WHERE ds=20251013

To Reproduce
Note here we refer to SQL provided in prior step as stored in a file named test.sql

API (Application Programming Interface): provide the python code you're using and the output.
For example:

from sqllineage.runner import LineageRunner
from sqllineage.core.metadata.dummy import DummyMetaDataProvider

if __name__ == '__main__':
    sql ='''
    INSERT OVERWRITE table data_ds.data_metric_di PARTITION(ds=20251013)
    SELECT uin,
            device,
            play_cnt,
            sum(if(play_progress_cnt>90,
            1,
            0)) play_progress90_cnt,
        
    FROM data_dwd.data_metric_ori
    WHERE ds=20251013
    '''
    metadata = dict()
    metadata = {"data_ds.data_metric_di":["ds","uin","device","pcr_cnt","pcr90_cnt"], "data_dwd.data_metric_ori":["ds","uin","device","play_cnt","play_progress_cnt"]}
    provider = DummyMetaDataProvider(metadata)
    rs = LineageRunner(sql=sql, dialect='sparksql', metadata_provider=provider,verbose=True)
    columns = rs.get_column_lineage(exclude_subquery_columns=True)
    print("column_lineage:", columns)
[Table: <default>.analyze]

the output is as follows:
[(Column: data_dwd.data_metric_ori.device, Column: data_ds.data_metric_di.device), (Column: data_dwd.data_metric_ori.play_cnt, Column: data_ds.data_metric_di.play_cnt), (Column: data_dwd.data_metric_ori.play_progress_cnt, Column: data_ds.data_metric_di.play_progress90_cnt), (Column: data_dwd.data_metric_ori.uin, Column: data_ds.data_metric_di.uin)]

Expected behavior
A clear and concise description of what you expected to happen, and the output in accordance with the To Reproduce section.

The sql is legal and can be used to insert data correctly in the Hadoop environment .But we can see that (Column: data_dwd.data_metric_ori.play_progress_cnt, Column: data_ds.data_metric_di.play_progress90_cnt) is wrong,it should be (Column: data_dwd.data_metric_ori.play_progress_cnt, Column: data_ds.data_metric_di.pcr90_cnt) ,as we have already declared the columns of the target table data_ds.data_metric_di.Is there any way to solve this problem?

Python version (available via python --version)

  • 3.10

SQLLineage version (available via sqllineage --version):

  • 1.5.3

Additional context
Add any other context about the problem here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions