Description
Overwatch Version: 0.8.2.0
In raw Spark event logs, and therefore in Overwatch table spark_events_bronze, the field ExecutorID is usually a number but occasionally it gets the value 'driver'.
In Overwatch deployments where this special value is present in the first run for a given target storage location, Spark infers STRING type for that column and this issue never occurs.
In Overwatch deployments where no such special value is present in the first run for a given target storage location, Spark infers BIGINT/Long for that column and creates the spark_*_silver target tables with that type. In a later Overwatch ETL run, when 'driver' shows up in that column in spark_events_bronze, one of two things happens when persisting results to the spark_*_silver tables, depending on the DBR version:
- DBR 11.3: Spark silently converts 'driver' to NULL while implicitly casting values like '0' and '105' to BIGINTs. This behavior is available in later DBRs by setting configuration property spark.sql.storeAssignmentPolicy to legacy, but this is not explicitly set anywhere in the Overwatch code as of release 0.8.2.0.
- DBR 13.3: by default, spark.sql.storeAssignmentPolicy is set to ANSI, which causes a runtime exception when attempting to implicitly cast 'driver' to BIGINT. See "Safe casts enabled by default for Delta Lake operations" in the DBR 13.3 release notes and "ANSI compliance in Databricks Runtime" in the Databricks SQL language reference for details.
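To make the difference between the two outcomes concrete, here is a minimal plain-Python sketch. It does not call Spark; the function name `store_as_bigint` and the policy strings are illustrative assumptions that only mimic the behavior described above for a STRING value written into a BIGINT column.

```python
from typing import Optional

# Illustrative policy names, mirroring the two values of
# spark.sql.storeAssignmentPolicy discussed above.
POLICIES = ("legacy", "ansi")


def store_as_bigint(value: str, policy: str) -> Optional[int]:
    """Mimic storing a string into a BIGINT column (simulation, not Spark)."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    try:
        # Numeric strings such as '0' and '105' convert under both policies.
        return int(value)
    except ValueError:
        if policy == "legacy":
            # Legacy: the uncastable value silently becomes NULL (None here).
            return None
        # ANSI: the write fails instead of silently dropping the value.
        raise ValueError(f"cannot cast {value!r} to BIGINT") from None


# Under the simulated legacy policy, 'driver' is lost without any error.
for raw in ["0", "105", "driver"]:
    print(raw, "->", store_as_bigint(raw, "legacy"))
```

Calling `store_as_bigint("driver", "ansi")` raises instead of returning `None`, which corresponds to the runtime exception seen on DBR 13.3.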
The Silver Spark modules should be future-proofed for DBR > 11.3 by explicitly designating STRING type for ExecutorID columns or some equivalent solution.
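One possible shape for such a fix, sketched in PySpark rather than in Overwatch's own Scala code: cast the column to STRING before the DataFrame is written, so the target type no longer depends on what the first run happened to contain. The helper name `with_string_executor_id` is hypothetical, and the sketch assumes ExecutorID is a top-level column of the DataFrame being persisted.

```python
def with_string_executor_id(df, column="ExecutorID"):
    """Return df with `column` explicitly cast to STRING.

    `df` is expected to be a pyspark.sql.DataFrame; all other columns
    are left untouched. Hypothetical helper, not part of Overwatch.
    """
    # df[column] yields a Column; Column.cast("string") sets the type explicitly.
    return df.withColumn(column, df[column].cast("string"))
```

With the cast applied on every run, both numeric IDs and 'driver' are stored as strings, and no implicit STRING to BIGINT conversion is attempted at write time.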
Two workarounds are available in the meantime:
- Use DBR 11.3 for Overwatch ETL runs and accept this minor data loss in spark_*_silver tables, i.e. ExecutorID will be NULL unlike its upstream source column of type STRING in spark_events_bronze.
- OR manually adjust the target schemas according to this guidance: Explicitly update schema to change column type or name.
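The second workaround could look roughly like the following PySpark sketch, which follows the read, cast, and overwrite-with-`overwriteSchema` pattern for Delta tables. The function name `rewrite_column_as_string` is hypothetical, the sketch assumes ExecutorID is a top-level column, and which spark_*_silver tables actually need the change should be checked per deployment.

```python
def rewrite_column_as_string(spark, table_name, column="ExecutorID"):
    """Rewrite a Delta table so that `column` is stored as STRING.

    `spark` is an active SparkSession and `table_name` a Delta table.
    Hypothetical helper; this overwrites the table, so back it up first.
    """
    df = spark.read.table(table_name)
    (
        df.withColumn(column, df[column].cast("string"))
        .write.format("delta")
        .mode("overwrite")
        # Allow the overwrite to replace the existing BIGINT column type.
        .option("overwriteSchema", "true")
        .saveAsTable(table_name)
    )
```

In a Databricks notebook, where `spark` is predefined, this might be called once per affected table with a placeholder name such as `rewrite_column_as_string(spark, "<etl_db>.spark_tasks_silver")`, while no Overwatch ETL run is writing to that table.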