Expose BigQuery schema autodetect in Java SDK #20515

Open
@damccorm

The Beam Java SDK's BigQueryIO transform does not currently expose the schema autodetect option of the BigQuery load job configuration, although the Python SDK already does.

Although Java is stricter about types and schemas, the BigQueryIO transform supports writing TableRow objects, which don't inherently have a schema. This provides a convenient path for loading JSON data into BigQuery, but its usefulness is severely limited because a schema must still be supplied in order to use the SchemaUpdateOption values ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION.

The BigQuery schema autodetection feature must be enabled at the JobConfigurationLoad level. BigQueryIO creates the JobConfigurationLoad in only one place: WriteTables.java. Exposing the autodetection option would mean adding it there, then propagating the change upwards until it's exposed at the BigQueryIO.Write level.
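
To make the "propagate upwards" idea concrete, here is a minimal, dependency-free sketch of how a user-facing option could flow down into the load-job configuration. Every name in it (`Write`, `LoadConfig`, `withAutodetect`, `toLoadConfig`) is a hypothetical stand-in invented for illustration; none of them is Beam's actual `BigQueryIO.Write`, `WriteTables`, or `JobConfigurationLoad`, and the real change would likely touch several intermediate classes.

```java
import java.util.Optional;

public class AutodetectPropagationSketch {

    /** Hypothetical stand-in for the load-job configuration (not the real JobConfigurationLoad). */
    static final class LoadConfig {
        final Optional<String> schemaJson;
        final boolean autodetect;

        LoadConfig(Optional<String> schemaJson, boolean autodetect) {
            this.schemaJson = schemaJson;
            this.autodetect = autodetect;
        }
    }

    /** Hypothetical stand-in for the user-facing write transform (not the real BigQueryIO.Write). */
    static final class Write {
        final Optional<String> schemaJson;
        final boolean autodetect;

        Write(Optional<String> schemaJson, boolean autodetect) {
            this.schemaJson = schemaJson;
            this.autodetect = autodetect;
        }

        /** Starts with no schema and autodetect disabled. */
        static Write create() {
            return new Write(Optional.empty(), false);
        }

        /** Returns a copy with an explicit schema, as callers must provide today. */
        Write withSchema(String json) {
            return new Write(Optional.of(json), autodetect);
        }

        /** Hypothetical new option: returns a copy with the autodetect flag set. */
        Write withAutodetect(boolean enabled) {
            return new Write(schemaJson, enabled);
        }

        /** The single place where the load configuration is built, so the flag is copied here. */
        LoadConfig toLoadConfig() {
            return new LoadConfig(schemaJson, autodetect);
        }
    }

    public static void main(String[] args) {
        LoadConfig config = Write.create().withAutodetect(true).toLoadConfig();
        System.out.println("autodetect=" + config.autodetect
                + ", schema present=" + config.schemaJson.isPresent());
    }
}
```

The point of the sketch is only the shape of the change: one new builder-style option at the top, one new field copied at the point where the load job is configured.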

A bit of context on this issue:

  • Google Cloud's blog has an article on handling mutating JSON schemas in Dataflow using a black-box "Validate and Mutate BQ Schema" step.
  • Suggested workarounds include creating a stateful DoFn to dynamically generate a schema, loading it as a side input to create a PCollectionView, and then passing it to BigQueryIO using withSchemaFromView: https://stackoverflow.com/a/58809875/477563
  • Entire projects have been created to try and work around this issue.
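
For a sense of what the "dynamically generate a schema" workaround involves, the following is a rough, dependency-free sketch of the core inference step: scan rows, map each value to a BigQuery type name, and widen on conflict. It is not taken from the linked answer or from Beam; the class name, the method names, and the widening rules (INTEGER and FLOAT merge to FLOAT, any other mismatch falls back to STRING) are assumptions made for illustration. Nested records, repeated fields, and the stateful DoFn and side-input plumbing are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaInferenceSketch {

    /** Maps a parsed JSON value to a BigQuery type name (assumed mapping, flat fields only). */
    static String bigQueryType(Object value) {
        if (value instanceof Boolean) return "BOOLEAN";
        if (value instanceof Integer || value instanceof Long) return "INTEGER";
        if (value instanceof Number) return "FLOAT";
        return "STRING"; // fallback for strings and anything unrecognized
    }

    /** Reconciles two observed types for the same field (assumed widening rules). */
    static String widen(String a, String b) {
        if (a.equals(b)) return a;
        boolean bothNumeric = (a.equals("INTEGER") || a.equals("FLOAT"))
                && (b.equals("INTEGER") || b.equals("FLOAT"));
        return bothNumeric ? "FLOAT" : "STRING";
    }

    /** Folds one row's fields into the running schema (field name to type name). */
    static void mergeRow(Map<String, String> schema, Map<String, Object> row) {
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            if (entry.getValue() == null) continue; // a null carries no type information
            schema.merge(entry.getKey(), bigQueryType(entry.getValue()), SchemaInferenceSketch::widen);
        }
    }

    /** Infers a schema covering every field seen in any row. */
    static Map<String, String> inferSchema(List<Map<String, Object>> rows) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            mergeRow(schema, row);
        }
        return schema;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("id", 1);
        first.put("name", "x");
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("id", 2.5);
        second.put("active", true);
        System.out.println(inferSchema(List.of(first, second)));
    }
}
```

Even this simplified version has to make type-widening decisions that BigQuery's own autodetection would otherwise make, which is part of why exposing the built-in option is attractive.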

All of the above would be rendered moot (and many headaches spared!) if only the schema autodetection were exposed in the Java SDK like it already is in the Python SDK.

Imported from Jira BEAM-10826. Original Jira may contain additional context.
Reported by: mtruscello.
