-
Notifications
You must be signed in to change notification settings - Fork 78
Open
Labels
bb converterIssues related to BB converterIssues related to BB converter
Description
Is there an existing issue for this?
- I have searched the existing issues
Category of Bug / Issue
Converter bug
Current Behavior
Description
DataStage system variables like @NUMPARTITIONS, @PARTITIONNUM, @INROWNUM are not converted to PySpark equivalents, resulting in undefined variable errors.
Example: Parallel-Safe ID Generation
TGT_TABLE = dbutils.widgets.get("TGT_TABLE")
spark.sql(rf"""TRUNCATE TABLE #TGT_TABLE#""")(litrow_number() * lit(NUMPARTITIONS)) + PARTITIONNUM)) -
(lit(NUMPARTITIONS) - 1)
)Issues:
NUMPARTITIONSis undefined → should bespark.sparkContext.defaultParallelismor configurablePARTITIONNUMis undefined → should bespark_partition_id()litrow_number()is invalid syntax → should berow_number().over(window_spec)- Missing
Windowimport
Expected Behavior
Expected Transpilation:
from pyspark.sql.window import Window
window_spec = Window.orderBy(monotonically_increasing_id())
df = df \
.withColumn('ROW_NUM', row_number().over(window_spec)) \
.withColumn('partition_id', spark_partition_id()) \
.withColumn('ID',
(col('MAX_ID') +
((col('ROW_NUM') * lit(spark.sparkContext.defaultParallelism)) + col('partition_id'))) -
(lit(spark.sparkContext.defaultParallelism) - 1)
)Steps To Reproduce
1. Create a job with transformer using system variables:
Save as test_job_sysvar.xml:
<?xml version="1.0" encoding="UTF-8"?>
<DSExport>
<Job Identifier="TestSysVar">
<Record Identifier="V1" Type="CTransformerStage">
<Property Name="Name">TRF_1</Property>
<Collection Name="OutputColumns">
<SubRecord>
<Property Name="Name">PARTITION_ID</Property>
<Property Name="Derivation">@PARTITIONNUM</Property>
</SubRecord>
<SubRecord>
<Property Name="Name">TOTAL_PARTITIONS</Property>
<Property Name="Derivation">@NUMPARTITIONS</Property>
</SubRecord>
<SubRecord>
<Property Name="Name">ROW_NUMBER</Property>
<Property Name="Derivation">@INROWNUM</Property>
</SubRecord>
<SubRecord>
<Property Name="Name">UNIQUE_ID</Property>
<Property Name="Derivation">MAX_ID + ((@INROWNUM * @NUMPARTITIONS) + @PARTITIONNUM)</Property>
</SubRecord>
</Collection>
</Record>
</Job>
</DSExport>2. Run transpilation:
databricks labs lakebridge transpile --input-source test_job_sysvar.xml --output-folder transpiled --debug 3. Observe output:
df = df.select(
col('PARTITION_ID').alias('PARTITION_ID'), # References undefined PARTITIONNUM
lit(NUMPARTITIONS).alias('TOTAL_PARTITIONS'), # ❌ Undefined variable
litrow_number().alias('ROW_NUMBER'), # ❌ Invalid syntax
(col('MAX_ID') + ((litrow_number() * lit(NUMPARTITIONS)) + PARTITIONNUM)).alias('UNIQUE_ID') # ❌ Multiple errors
)Relevant log output or Exception details
Logs Confirmation
- I ran the command line with
--debug - I have attached the
lsp-server.logunder USER_HOME/.databricks/labs/remorph-transpilers/<converter_name>/lib/lsp-server.log
Sample Query
Operating System
macOS
Version
latest via Databricks CLI
Metadata
Metadata
Assignees
Labels
bb converterIssues related to BB converterIssues related to BB converter