Skip to content

[BUG]: Datastage : Lookup Stages Generate Incomplete .select() Statements #2115

@Fatiine

Description

@Fatiine

Is there an existing issue for this?

  • I have searched the existing issues

Category of Bug / Issue

Converter bug

Current Behavior

Description

Lookup stages (DataStage joins with reference data) generate .select() statements with missing source column references, causing immediate AnalysisException on execution.

Example: Multi-Table Lookup

DataStage: Lookup joining main stream with 2 reference tables

Current Transpilation (INCORRECT):

result_df = main_df \
    .join(ref_table1, [...], 'LEFT_OUTER') \
    .join(ref_table2, [...], 'LEFT_OUTER') \
    .select(
        main_df.COL1.alias('COL1'),
        main_df.COL2.alias('COL2'),
        .alias('REF_COL1'),      # ❌ Missing source DataFrame
        .alias('REF_COL2')       # ❌ Missing source DataFrame
    )

Expected Behavior

Expected Transpilation :

TGT_TABLE = dbutils.widgets.get("TGT_TABLE")
spark.sql(f"""TRUNCATE TABLE {TGT_TABLE}""") 'left') \
    .join(ref_table2, [...], 'left') \
    .select(
        main_df.COL1.alias('COL1'),
        main_df.COL2.alias('COL2'),
        main_df.JOIN_KEY.alias('JOIN_KEY'),
        ref_table1.REF_COL1.alias('REF_COL1'),
        ref_table2.REF_COL2.alias('REF_COL2')
    )

Steps To Reproduce

1. Create a job with lookup stage:

Save as test_job_lookup.xml:

<?xml version="1.0" encoding="UTF-8"?>
<DSExport>
  <Job Identifier="TestLookup">
    <Record Identifier="V1" Type="PxLookup">
      <Property Name="Name">LKP_1</Property>
      <Collection Name="InputPins">
        <SubRecord>
          <Property Name="Name">MainInput</Property>
          <Collection Name="Columns">
            <SubRecord>
              <Property Name="Name">ID</Property>
            </SubRecord>
            <SubRecord>
              <Property Name="Name">JOIN_KEY</Property>
            </SubRecord>
          </Collection>
        </SubRecord>
        <SubRecord>
          <Property Name="Name">RefInput</Property>
          <Property Name="LookupKey">REF_KEY</Property>
          <Collection Name="Columns">
            <SubRecord>
              <Property Name="Name">REF_KEY</Property>
            </SubRecord>
            <SubRecord>
              <Property Name="Name">REF_VALUE</Property>
            </SubRecord>
          </Collection>
        </SubRecord>
      </Collection>
      <Collection Name="OutputPins">
        <SubRecord>
          <Property Name="Name">Output</Property>
          <Collection Name="Columns">
            <SubRecord>
              <Property Name="Name">ID</Property>
              <Property Name="Source">MainInput.ID</Property>
            </SubRecord>
            <SubRecord>
              <Property Name="Name">JOIN_KEY</Property>
              <Property Name="Source">MainInput.JOIN_KEY</Property>
            </SubRecord>
            <SubRecord>
              <Property Name="Name">REF_VALUE</Property>
              <Property Name="Source">RefInput.REF_VALUE</Property>
            </SubRecord>
          </Collection>
        </SubRecord>
      </Collection>
    </Record>
  </Job>
</DSExport>

2. Run transpilation:

databricks labs lakebridge transpile --input-source test_job_lookup.xml --output-folder transpiled --debug 

3. Observe output

Relevant log output or Exception details

Logs Confirmation

  • I ran the command line with --debug
  • I have attached the lsp-server.log under USER_HOME/.databricks/labs/remorph-transpilers/<converter_name>/lib/lsp-server.log

Sample Query

Operating System

macOS

Version

latest via Databricks CLI

Metadata

Metadata

Assignees

No one assigned

    Labels

    bb converterIssues related to BB converter

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions