Skip to content

[BUG]: DataStage System Variables Not Converted #2114

@Fatiine

Description

@Fatiine

Is there an existing issue for this?

  • I have searched the existing issues

Category of Bug / Issue

Converter bug

Current Behavior

Description

DataStage system variables like @NUMPARTITIONS, @PARTITIONNUM, @INROWNUM are not converted to PySpark equivalents, resulting in undefined variable errors.

Example: Parallel-Safe ID Generation

TGT_TABLE = dbutils.widgets.get("TGT_TABLE")
spark.sql(rf"""TRUNCATE TABLE #TGT_TABLE#""")(litrow_number() * lit(NUMPARTITIONS)) + PARTITIONNUM)) - 
    (lit(NUMPARTITIONS) - 1)
)

Issues:

  • NUMPARTITIONS is undefined → should be spark.sparkContext.defaultParallelism or configurable
  • PARTITIONNUM is undefined → should be spark_partition_id()
  • litrow_number() is invalid syntax → should be row_number().over(window_spec)
  • Missing Window import

Expected Behavior

Expected Transpilation:

from pyspark.sql.window import Window

window_spec = Window.orderBy(monotonically_increasing_id())

df = df \
    .withColumn('ROW_NUM', row_number().over(window_spec)) \
    .withColumn('partition_id', spark_partition_id()) \
    .withColumn('ID',
        (col('MAX_ID') + 
         ((col('ROW_NUM') * lit(spark.sparkContext.defaultParallelism)) + col('partition_id'))) - 
        (lit(spark.sparkContext.defaultParallelism) - 1)
    )

Steps To Reproduce

1. Create a job with transformer using system variables:

Save as test_job_sysvar.xml:

<?xml version="1.0" encoding="UTF-8"?>
<DSExport>
  <Job Identifier="TestSysVar">
    <Record Identifier="V1" Type="CTransformerStage">
      <Property Name="Name">TRF_1</Property>
      <Collection Name="OutputColumns">
        <SubRecord>
          <Property Name="Name">PARTITION_ID</Property>
          <Property Name="Derivation">@PARTITIONNUM</Property>
        </SubRecord>
        <SubRecord>
          <Property Name="Name">TOTAL_PARTITIONS</Property>
          <Property Name="Derivation">@NUMPARTITIONS</Property>
        </SubRecord>
        <SubRecord>
          <Property Name="Name">ROW_NUMBER</Property>
          <Property Name="Derivation">@INROWNUM</Property>
        </SubRecord>
        <SubRecord>
          <Property Name="Name">UNIQUE_ID</Property>
          <Property Name="Derivation">MAX_ID + ((@INROWNUM * @NUMPARTITIONS) + @PARTITIONNUM)</Property>
        </SubRecord>
      </Collection>
    </Record>
  </Job>
</DSExport>

2. Run transpilation:

databricks labs lakebridge transpile --input-source test_job_sysvar.xml --output-folder transpiled --debug 

3. Observe output:

df = df.select(
    col('PARTITION_ID').alias('PARTITION_ID'),  # References undefined PARTITIONNUM
    lit(NUMPARTITIONS).alias('TOTAL_PARTITIONS'),  # ❌ Undefined variable
    litrow_number().alias('ROW_NUMBER'),  # ❌ Invalid syntax
    (col('MAX_ID') + ((litrow_number() * lit(NUMPARTITIONS)) + PARTITIONNUM)).alias('UNIQUE_ID')  # ❌ Multiple errors
)

Relevant log output or Exception details

Logs Confirmation

  • I ran the command line with --debug
  • I have attached the lsp-server.log under USER_HOME/.databricks/labs/remorph-transpilers/<converter_name>/lib/lsp-server.log

Sample Query

Operating System

macOS

Version

latest via Databricks CLI

Metadata

Metadata

Labels

bb converterIssues related to BB converter

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions