[Bug][Go SDK]: Native bigqueryio package doesn't have scalable writes.  #27513

Open
@lostluck

Description

What happened?

The Go SDK native BigQueryIO package (https://github.com/apache/beam/blob/sdks/v2.48.2/sdks/go/pkg/beam/io/bigqueryio/bigquery.go#L236) doesn't scale writes.

It adds a fixed key and then groups by it, which serializes all writes to BigQuery onto a single worker. This prevents writing larger datasets to BigQuery.
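
For illustration, here is a minimal sketch (not the actual bigqueryio source) of the pattern described above; the `Row` type and the inline DoFn are assumptions made for the example:

```go
package example

import (
	"context"
	"fmt"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
)

// Row stands in for whatever schema the pipeline writes (assumption for this sketch).
type Row struct {
	Name  string
	Value int
}

// writeAllOnOneWorker mirrors the problematic pattern: a fixed key plus
// GroupByKey funnels the whole PCollection into a single grouped value,
// so one worker ends up performing every BigQuery write.
func writeAllOnOneWorker(s beam.Scope, rows beam.PCollection) {
	keyed := beam.AddFixedKey(s, rows)   // every element gets the same key
	grouped := beam.GroupByKey(s, keyed) // one key => one group => one worker
	beam.ParDo0(s, func(ctx context.Context, _ int, iter func(*Row) bool) error {
		var r Row
		for iter(&r) {
			// This is where the real transform issues the BigQuery write.
			fmt.Println(r)
		}
		return nil
	}, grouped)
}
```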

An acceptable fix would be to add a Load method that writes all data to files in a temporary directory in GCS, in a format that BigQuery can then be tasked to ingest as a load job.
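
As a rough illustration of the load-job half of that approach, here is a hedged sketch using the cloud.google.com/go/bigquery client directly; the `gcsURI` wildcard and the newline-delimited JSON staging format are assumptions, and the actual fix would also need to write the staged files from the pipeline and clean them up afterwards:

```go
package example

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// runLoadJob starts a BigQuery load job over files already staged in GCS
// (e.g. "gs://my-temp-bucket/staging/*.json") and waits for it to finish.
func runLoadJob(ctx context.Context, project, dataset, table, gcsURI string) error {
	client, err := bigquery.NewClient(ctx, project)
	if err != nil {
		return err
	}
	defer client.Close()

	src := bigquery.NewGCSReference(gcsURI)
	src.SourceFormat = bigquery.JSON // newline-delimited JSON staged by the pipeline (assumption)

	loader := client.Dataset(dataset).Table(table).LoaderFrom(src)
	loader.WriteDisposition = bigquery.WriteAppend

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if err := status.Err(); err != nil {
		return fmt.Errorf("load job completed with error: %w", err)
	}
	return nil
}
```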


The Java SDK (and, I believe, the Python SDK) uses complex logic to choose between streaming batched RPCs for the data load and writing to files, but in practice it should be obvious to pipeline authors which approach their jobs require for data loads.

Other issues that should be taken care of as well:

  • Not vetted for streaming writes, though it likely fares better in that case due to the smaller data sizes within streaming windows.
  • Doesn't retry failed retryable RPCs (though it should be checked whether the client already does this under the hood before adding additional scaffolding; a minimal retry-wrapper sketch follows this list).
  • No tests.
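
On the retry point, and assuming the client does not already retry internally, the scaffolding could be as small as the following hedged sketch; `isRetryable` is a hypothetical predicate that would inspect googleapi/gRPC status codes:

```go
package example

import (
	"context"
	"time"
)

// withRetries retries call up to attempts times with exponential backoff,
// stopping early on success, a non-retryable error, or context cancellation.
func withRetries(ctx context.Context, attempts int, call func(context.Context) error, isRetryable func(error) bool) error {
	backoff := 500 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(ctx); err == nil || !isRetryable(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return err
}
```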

Issue Priority

Priority: 3 (minor)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
