Implicit cast change in DBR 13.3 can cause failures in Silver Spark modules  #1311

@neilbest-db

Description

Overwatch Version: 0.8.2.0

In raw Spark event logs, and therefore in the Overwatch table spark_events_bronze, the ExecutorID field is usually numeric but occasionally takes the value 'driver'.

In Overwatch deployments where this special value is present in the first run for a given target storage location, Spark infers STRING type for that column and this issue never occurs.

In Overwatch deployments where no such special value is present in the first run for a given target storage location, Spark infers BIGINT/Long for that column and creates the spark_*_silver target tables with that type. In a later Overwatch ETL run, when 'driver' shows up in that column in spark_events_bronze, one of two things happens when persisting results to the spark_*_silver tables, depending on the DBR version:

  • DBR 11.3: Spark silently converts 'driver' to NULL while implicitly casting values like '0' and '105' to BIGINTs. This behavior is available in later DBRs by setting configuration property spark.sql.storeAssignmentPolicy to legacy, but this is not explicitly set anywhere in the Overwatch code as of release 0.8.2.0.

  • DBR 13.3: by default, spark.sql.storeAssignmentPolicy is set to ANSI, which causes a runtime exception when attempting to implicitly cast 'driver' to BIGINT. See Safe casts enabled by default for Delta Lake operations in the DBR 13.3 release notes and ANSI compliance in Databricks Runtime in the Databricks SQL language reference for details.
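The difference between the two behaviors can be modeled in plain Python, without Spark. This is only an illustration of the semantics described above, not Spark's implementation; the function names `legacy_store_cast` and `ansi_store_cast` are hypothetical.

```python
from typing import Optional


def legacy_store_cast(value: str) -> Optional[int]:
    """Model of the DBR 11.3 behavior: an uncastable value silently becomes NULL (None)."""
    try:
        return int(value)
    except ValueError:
        return None


def ansi_store_cast(value: str) -> int:
    """Model of the DBR 13.3 default: an uncastable value raises instead of becoming NULL."""
    try:
        return int(value)
    except ValueError:
        raise ValueError(f"cannot safely cast {value!r} to BIGINT") from None


# Sample ExecutorID values taken from the description above.
executor_ids = ["0", "105", "driver"]

# Under the legacy model, 'driver' is lost but the write succeeds.
legacy_result = [legacy_store_cast(v) for v in executor_ids]

# Under the ANSI model, the same input aborts at the first uncastable value.
try:
    ansi_result = [ansi_store_cast(v) for v in executor_ids]
except ValueError as err:
    ansi_error = str(err)
```

On a real cluster, the first behavior corresponds to setting `spark.sql.storeAssignmentPolicy` to `legacy` as described in the DBR 11.3 bullet; the second is the DBR 13.3 default.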

The Silver Spark modules should be future-proofed for DBR > 11.3 by explicitly designating STRING type for ExecutorID columns or some equivalent solution.
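One possible shape for such a fix, sketched here rather than taken from the Overwatch code base, is to build the select list so that ExecutorID is always cast to STRING before a silver table is first created. The helper name `cast_exprs` and the sample column names other than ExecutorID are hypothetical.

```python
from typing import Iterable, List


def cast_exprs(
    columns: Iterable[str],
    string_columns: Iterable[str] = ("ExecutorID",),
) -> List[str]:
    """Build selectExpr-style expressions that force the given columns to STRING."""
    force = set(string_columns)
    return [
        f"CAST(`{c}` AS STRING) AS `{c}`" if c in force else f"`{c}`"
        for c in columns
    ]


# Hypothetical column list; only ExecutorID comes from the issue text.
exprs = cast_exprs(["clusterId", "ExecutorID", "TaskInfo"])

# With a live DataFrame `df`, the result could be applied as:
#   df.selectExpr(*cast_exprs(df.columns))
```

A cast like this only helps at table creation time. Target tables that already exist with a BIGINT column would still need the schema change described in the workarounds.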

Two workarounds are available in the meantime:

  • Use DBR 11.3 for Overwatch ETL runs and accept the minor data loss in spark_*_silver tables, i.e. ExecutorID will be NULL wherever the upstream STRING column in spark_events_bronze holds 'driver'.
  • Or manually adjust the target schemas according to this guidance: Explicitly update schema to change column type or name.
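The second workaround could look roughly like the following PySpark sketch, which rewrites a table with ExecutorID cast to STRING and the schema overwritten. The function name `rewrite_executor_id_as_string` is hypothetical, the table is assumed to be a Delta table registered in the metastore, and `spark` is assumed to be an active SparkSession passed in by the caller. Back up the table or test on a copy first.

```python
def rewrite_executor_id_as_string(spark, table_name: str) -> None:
    """Rewrite `table_name` in place so that its ExecutorID column is STRING.

    Assumes `spark` is an active SparkSession and `table_name` is a Delta table
    with a top-level ExecutorID column.
    """
    df = spark.table(table_name)

    # Existing BIGINT values become their string form, e.g. 105 -> '105'.
    fixed = df.withColumn("ExecutorID", df["ExecutorID"].cast("string"))

    # overwriteSchema lets the overwrite replace the column type in the table schema.
    (
        fixed.write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(table_name)
    )


# Example call (database and table names are hypothetical):
#   rewrite_executor_id_as_string(spark, "overwatch_etl.spark_tasks_silver")
```

Rows that were already written as NULL under DBR 11.3 stay NULL after the rewrite, since the original 'driver' value is no longer present in the silver table.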

Metadata

Labels

bug (Something isn't working), data quality (There is a data quality issue here), schema change (Requires a schema change)
