
BigQuery writes with GenericRecord format don't support overridden non-String types #5644

@clairemcginty

Description

Any @BigQueryType class that uses an OverrideTypeProvider to wrap a non-String type will fail when the GenericRecord format is used to write records to BigQuery. For example:

// sample of override type provider
import scala.collection.mutable
import scala.reflect.macros.blackbox
import scala.reflect.runtime.{universe => ru}

case class NonNegativeInt(data: Int)

object NonNegativeInt {
  def parse(data: String): NonNegativeInt = NonNegativeInt(data.toInt)

  def stringType: String = "NONNEGATIVEINT"

  def bigQueryType: String = "INTEGER"
}

object Index {
  // compile-time mapping used by the annotation macro
  def getIndexCompileTimeTypes(c: blackbox.Context): mutable.Map[c.Type, Class[_]] = {
    import c.universe._
    mutable.Map[Type, Class[_]](
      typeOf[NonNegativeInt] -> classOf[NonNegativeInt]
    )
  }

  // lookup from the override's BQ type name to the wrapper class
  def getIndexClass: mutable.Map[String, Class[_]] =
    mutable.Map[String, Class[_]](
      NonNegativeInt.stringType -> classOf[NonNegativeInt]
    )

  // runtime mapping used when converting records
  def getIndexRuntimeTypes: mutable.Map[ru.Type, Class[_]] =
    mutable.Map[ru.Type, Class[_]](
      ru.typeOf[NonNegativeInt] -> classOf[NonNegativeInt]
    )
}

// sample of job
@BigQueryType.toTable
case class MyRecord(i: NonNegativeInt)

sc
  .parallelize(1 to 10)
  .map(i => MyRecord(NonNegativeInt(i)))
  .saveAsTypedBigQueryTable(...)

will fail with a ClassCastException like:

[info]   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:326)
[info]   at org.apache.beam.sdk.io.gcp.bigquery.AvroRowWriter.write(AvroRowWriter.java:58)
[info]   at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles.processElement(WriteBundlesToFiles.java:247)
[info]   ...
[info]   Cause: java.lang.ClassCastException: value 31 (a com.spotify.scio.example.NonNegativeInt) cannot be cast to expected type long at MyRecord.i
[info]   at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:79)
[info]   at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:30)
[info]   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:84)
[info]   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:323)
[info]   at org.apache.beam.sdk.io.gcp.bigquery.AvroRowWriter.write(AvroRowWriter.java:58)
[info]   at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles.processElement(WriteBundlesToFiles.java:247)
[info]   at org.apache.beam.sdk.io.gcp.bigquery.WriteBundlesToFiles$DoFnInvoker.invokeProcessElement(Unknown Source)
[info]   at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:212)
[info]   at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:186)
[info]   at org.apache.beam.runners.core.SimplePushbackSideInputDoFnRunner.processElementInReadyWindows(SimplePushbackSideInputDoFnRunner.java:88)

This is because toAvroInternal converts all overridden types to String: https://github.com/spotify/scio/blob/v0.14.14/scio-google-cloud-platform/src/main/scala/com/spotify/scio/bigquery/types/ConverterProvider.scala#L174 . I think this was just copied over from the toTableRow behavior, where stringifying works fine because the JSON-based TableRow format accepts a stringified value for any type. Avro is stricter: the converted avroSchema correctly expects a long (BigQuery INTEGER) value, so the write fails with the cast error above.
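
To see the strictness directly, here's a minimal standalone sketch using plain Avro (no Scio involved; the schema and value are illustrative approximations of what the generated converter produces): a record whose schema declares a long field rejects a stringified value at serialization time, failing at the same place as the stack trace above.

import java.io.ByteArrayOutputStream

import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object AvroStrictnessDemo extends App {
  // Roughly the avroSchema generated for MyRecord: BQ INTEGER maps to Avro long
  val schema = SchemaBuilder.record("MyRecord").fields().requiredLong("i").endRecord()

  // Put a stringified value where the schema expects a long, as the
  // String-converting branch of toAvroInternal would
  val record = new GenericData.Record(schema)
  record.put("i", "31")

  val writer  = new GenericDatumWriter[GenericRecord](schema)
  val encoder = EncoderFactory.get().binaryEncoder(new ByteArrayOutputStream(), null)

  // Throws ClassCastException at GenericDatumWriter.write, like the trace above;
  // a TableRow (a JSON map) would have happily carried the String
  writer.write(record, encoder)
}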

The workaround is to fall back to TableRow format.
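
For reference, one way to spell that fallback out explicitly is to convert through the generated toTableRow and use the TableRow-based writer instead of the typed one. A sketch only: the table spec is a placeholder, and the exact saveAsBigQueryTable parameters vary across Scio versions, so check the API for the version you're on.

import com.spotify.scio.bigquery._

val bqt = BigQueryType[MyRecord]

sc
  .parallelize(1 to 10)
  .map(i => MyRecord(NonNegativeInt(i)))
  .map(bqt.toTableRow) // overridden types stringify fine in JSON
  .saveAsBigQueryTable(
    Table.Spec("project:dataset.my_table"), // placeholder table spec
    schema = bqt.schema
  )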
