Description
Is there an existing issue for this?
- I have searched the existing issues
Category of Bug / Issue
Converter bug
Current Behavior
Description
When transpiling DataStage jobs to Databricks/PySpark, DataStage-specific functions are left unconverted in the generated Python code. These functions are embedded within expr(f"""...""") blocks and will cause runtime errors because they are not valid Spark SQL or PySpark syntax.
Affected Functions: IF(), TRANSLATE(), Char(), Len(), SUBSTRING(), UStringToString(), Convert(), IIF(), TimestampToString()
Impact: Generated notebooks will fail immediately upon execution with syntax errors.
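To locate the affected spots in a transpiled file, a small scanner along the following lines may help. This is only a sketch: `find_unconverted_calls` and the sample string are hypothetical and not part of Lakebridge, the function names come from the list above, and a hit is a candidate for manual review rather than proof of a failure.

```python
import re

# Function names taken from the "Affected Functions" list in this issue.
DATASTAGE_FUNCTIONS = (
    "IF", "TRANSLATE", "Char", "Len", "SUBSTRING",
    "UStringToString", "Convert", "IIF", "TimestampToString",
)

# A call is a listed name followed by optional whitespace and "(".
_CALL = re.compile(r"\b(" + "|".join(DATASTAGE_FUNCTIONS) + r")\s*\(")

# Body of an expr("""...""") or expr(f"""...""") block in the generated code.
_EXPR_BLOCK = re.compile(r'expr\(\s*f?"""(.*?)"""', re.DOTALL)


def find_unconverted_calls(generated_source):
    """Return (line_number, function_name) pairs found inside expr blocks."""
    hits = []
    for block in _EXPR_BLOCK.finditer(generated_source):
        for call in _CALL.finditer(block.group(1)):
            offset = block.start(1) + call.start()
            line_no = generated_source.count("\n", 0, offset) + 1
            hits.append((line_no, call.group(1)))
    return hits


# Hypothetical sample shaped like the generated code in this report.
sample = "\n".join([
    "df = src.select(",
    '    expr(f"""IF ( trim(TRANSLATE(CODE,Char ( 00 ),Char ( 32 ))) = \'\' ,',
    "    'DEFAULT_KEY' , CODE ) as OUTPUT_CODE\"\"\")",
    ")",
])

for line_no, name in find_unconverted_calls(sample):
    print(f"line {line_no}: {name}")
```

The match is case-sensitive on purpose, so lowercase calls such as `trim(` that the transpiler already emitted are not flagged.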
Environment
Lakebridge Version: v0.10.12
Source: IBM InfoSphere DataStage 11.7
Target: Databricks/PySpark
Actual Behavior
Generated PySpark code contains DataStage-specific functions that are not valid in Spark SQL:
Generated code (line 50 in transpiled file)
DataTransformJob = SourceDataFrame.select(
    expr(f"""IF ( trim(TRANSLATE(OFFICE_CODE,Char ( 00 ),Char ( 32 ))) = '' ,
        'DEFAULT_KEY' ,
        IF ( trim(TRANSLATE(SUBSTRING ( OFFICE_CODE , 6 , 4 ),Char ( 00 ),Char ( 32 ))) = '' ,
            'UNKNOWN' ,
            trim(TRANSLATE(OFFICE_CODE,Char ( 00 ),Char ( 32 )))
        )
    ) as OUTPUT_CODE""")
)
Expected Behavior
from pyspark.sql.functions import col, when, trim, regexp_replace, substring, length, lit

DataTransformJob = SourceDataFrame.select(
    when(
        trim(regexp_replace(col('OFFICE_CODE'), '\x00', ' ')) == '',
        lit('DEFAULT_KEY')
    ).when(
        trim(regexp_replace(substring(col('OFFICE_CODE'), 6, 4), '\x00', ' ')) == '',
        lit('UNKNOWN')
    ).when(
        length(trim(regexp_replace(col('OFFICE_CODE'), '\x00', ' '))) < 9,
        lit('UNKNOWN')
    ).otherwise(
        trim(regexp_replace(col('OFFICE_CODE'), '\x00', ' '))
    ).alias('OUTPUT_CODE')
)
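To check a converted column against the expected logic without a Spark session, the `when` chain above can be mirrored on plain Python strings. The sketch below is a hypothetical test oracle, not Lakebridge output. It assumes that Spark's `substring` is 1-based, that `trim` strips only space characters, and that a NULL input falls through to the `otherwise` branch and stays NULL.

```python
def expected_output_code(office_code):
    """Plain-Python mirror of the expected OUTPUT_CODE derivation."""
    if office_code is None:
        # Every comparison against NULL is NULL in Spark, so the chain
        # reaches otherwise(), which also evaluates to NULL.
        return None

    # trim(regexp_replace(col, '\x00', ' '))
    cleaned = office_code.replace("\x00", " ").strip(" ")
    if cleaned == "":
        return "DEFAULT_KEY"

    # substring(col, 6, 4) is 1-based: characters 6..9, i.e. [5:9] in Python.
    segment = office_code[5:9].replace("\x00", " ").strip(" ")
    if segment == "":
        return "UNKNOWN"

    # Third branch of the expected code: cleaned value shorter than 9 chars.
    if len(cleaned) < 9:
        return "UNKNOWN"

    return cleaned


for value in ["\x00\x00", "ABCDE", "ABCDEFGH", "ABCDEFGHI", None]:
    print(repr(value), "->", repr(expected_output_code(value)))
```

Comparing these values with the output of the converted notebook on the same inputs gives a quick regression check for this derivation.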
Steps To Reproduce
1: Create Test DataStage Job with Transformation
Create a DataStage job (TestTransformJob.dsx) with a Transformer stage containing these expressions:
<Job Identifier="TestTransformJob">
  <Stage Type="CTransformerStage" Name="Transformer_1">
    <!-- Test Case 1: Simple IF function -->
    <Column Name="STATUS_CODE">
      <Derivation>IF(TRIM(STATUS_INPUT) = '', 'UNKNOWN', STATUS_INPUT)</Derivation>
    </Column>
    <!-- Test Case 2: TRANSLATE with Char function -->
    <Column Name="CLEAN_TEXT">
      <Derivation>TRANSLATE(TEXT_FIELD, Char(0), Char(32))</Derivation>
    </Column>
    <!-- Test Case 3: Nested functions -->
    <Column Name="PROCESSED_VALUE">
      <Derivation>
        IF(Len(TRIM(TRANSLATE(RAW_VALUE, Char(0), Char(32)))) = 0,
          'EMPTY',
          SUBSTRING(RAW_VALUE, 1, 10))
      </Derivation>
    </Column>
    <!-- Test Case 4: Complex nested IF -->
    <Column Name="CATEGORY">
      <Derivation>
        IF(TYPE_A = 'Y', 'CATEGORY_A',
          IF(TYPE_B = 'Y', 'CATEGORY_B',
            IF(TYPE_C = 'Y', 'CATEGORY_C', 'OTHER')))
      </Derivation>
    </Column>
    <!-- Test Case 5: UStringToString -->
    <Column Name="STRING_VALUE">
      <Derivation>UStringToString(UNICODE_FIELD)</Derivation>
    </Column>
  </Stage>
</Job>
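For the IF test cases above, one possible target shape is a Spark SQL `CASE WHEN` expression. The sketch below rewrites only `IF(cond, a, b)` calls and leaves every other function untouched. `rewrite_if` is a hypothetical helper written for this report, not the converter's implementation, and it assumes single-quoted string literals and no `IF(` text inside a literal.

```python
import re

_IF_CALL = re.compile(r"\bIF\s*\(")


def _matching_paren(text, open_idx):
    """Index of the ')' that closes the '(' at open_idx, skipping literals."""
    depth, in_string = 0, False
    for i in range(open_idx, len(text)):
        ch = text[i]
        if ch == "'":
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    return i
    raise ValueError("unbalanced parentheses")


def _split_args(text):
    """Split on commas that are outside parentheses and string literals."""
    args, depth, in_string, start = [], 0, False, 0
    for i, ch in enumerate(text):
        if ch == "'":
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                args.append(text[start:i].strip())
                start = i + 1
    args.append(text[start:].strip())
    return args


def rewrite_if(expression):
    """Rewrite each IF(cond, a, b) as CASE WHEN cond THEN a ELSE b END."""
    match = _IF_CALL.search(expression)
    if match is None:
        return expression
    open_idx = match.end() - 1
    close_idx = _matching_paren(expression, open_idx)
    args = _split_args(expression[open_idx + 1:close_idx])
    if len(args) != 3:
        raise ValueError(f"IF expects 3 arguments, got {len(args)}")
    cond, then_value, else_value = (rewrite_if(arg) for arg in args)
    rewritten = f"CASE WHEN {cond} THEN {then_value} ELSE {else_value} END"
    # Keep the text before the call and continue after its closing paren.
    return expression[:match.start()] + rewritten + rewrite_if(expression[close_idx + 1:])


test_case_1 = "IF(TRIM(STATUS_INPUT) = '', 'UNKNOWN', STATUS_INPUT)"
test_case_4 = """IF(TYPE_A = 'Y', 'CATEGORY_A',
IF(TYPE_B = 'Y', 'CATEGORY_B',
IF(TYPE_C = 'Y', 'CATEGORY_C', 'OTHER')))"""

print(rewrite_if(test_case_1))
print(rewrite_if(test_case_4))
```

Nested calls are handled by recursing into each argument, so Test Case 4 becomes a chain of nested `CASE` expressions. The DataFrame `when(...).otherwise(...)` form shown under Expected Behavior is an equivalent alternative.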
2: Transpile the file
Relevant log output or Exception details
Logs Confirmation
- I ran the command line with --debug
- I have attached the lsp-server.log under USER_HOME/.databricks/labs/remorph-transpilers/<converter_name>/lib/lsp-server.log
Sample Query
Operating System
macOS
Version
latest via Databricks CLI