[Bug]: WriteToBigQuery doesn't do WRITE_TRUNCATE properly with identical table names but in different datasets #34247

Open
@portikCoder

What happened?

Describe the bug

When the WriteToBigQuery transform runs a batch load with the write disposition set to WRITE_TRUNCATE, it does not truncate every destination table as intended; it truncates only the first one.
This happens only when a single batch job writes to tables that have identical table IDs but are located in different BQ datasets.

To Reproduce

Steps to reproduce the behavior:

  1. Prepare a JSON file whose records target identically named BQ tables in different datasets
  2. Start a BQ load job through the WriteToBigQuery transform
  3. Set the write disposition to BigQueryDisposition.WRITE_TRUNCATE
  4. Run it several times
  5. Observe that only the first table is truncated correctly; none of the others are.

E.g.:

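from apache_beam import Pipeline
from apache_beam.io import BigQueryDisposition, ReadFromText, WriteToBigQuery

with Pipeline(options=pipeline_options) as pipeline:
  # Read the JSON input lines from the configured source path.
  data = pipeline | "ReadAll" >> ReadFromText(user_options.source_path)
  # Batch-load into BigQuery, truncating each destination table first.
  data | "Load data into BQ" >> WriteToBigQuery(
      ..., write_disposition=BigQueryDisposition.WRITE_TRUNCATE)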
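
The snippet above elides the input file and the way rows are routed to tables. As a minimal sketch of what such input could look like, assume each JSON line carries hypothetical `dataset` and `table` fields and that the project is called `my-project`; none of these names come from the report, and `table_spec_for` is a made-up helper.

```python
import json

# Hypothetical input lines: the same table ID ("events") in two datasets.
SAMPLE_LINES = [
    '{"dataset": "dataset_a", "table": "events", "value": 1}',
    '{"dataset": "dataset_b", "table": "events", "value": 2}',
]

PROJECT = "my-project"  # assumption, not taken from the issue


def table_spec_for(row: dict) -> str:
    """Build a 'project:dataset.table' spec from one parsed row."""
    return f"{PROJECT}:{row['dataset']}.{row['table']}"


rows = [json.loads(line) for line in SAMPLE_LINES]
specs = [table_spec_for(row) for row in rows]

# The table IDs collide, but the full table specs are distinct.
table_ids = {row["table"] for row in rows}
```

WriteToBigQuery accepts a callable for its `table` argument, so a function like this could be passed there once the lines have been parsed into dicts.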

Expected behavior

All identically named tables in the different datasets must be truncated.

Actual behavior

Only the first table is truncated (whatever "first" means in a heavily distributed system).
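
The report does not state the root cause. Purely as an illustration of how this symptom could arise, the sketch below deduplicates "already truncated" destinations by a caller-supplied key. It is not taken from Beam's source; `TableRef` and `tables_to_truncate` are hypothetical names.

```python
from typing import Callable, Hashable, List, NamedTuple


class TableRef(NamedTuple):
    # Hypothetical stand-in for a fully qualified BigQuery table reference.
    project: str
    dataset: str
    table: str


def tables_to_truncate(
    refs: List[TableRef], key: Callable[[TableRef], Hashable]
) -> List[TableRef]:
    """Return the refs that get truncated when each distinct key is truncated once."""
    seen = set()
    truncated = []
    for ref in refs:
        k = key(ref)
        if k not in seen:
            seen.add(k)
            truncated.append(ref)
    return truncated


refs = [
    TableRef("p", "dataset_a", "events"),
    TableRef("p", "dataset_b", "events"),
]

# Keying by table ID alone treats both destinations as the same table.
by_table_id = tables_to_truncate(refs, key=lambda r: r.table)
# Keying by the full reference keeps them distinct.
by_full_ref = tables_to_truncate(refs, key=lambda r: r)
```

With the table-ID key only one of the two destinations is selected, which matches the observed behavior; with the full reference both are.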

Environment (tested on)

  • Apache Beam version: 2.63.0
  • Runner: DirectRunner, DataflowRunner
  • OS: macOS 15.3.1; build: 24D70
  • Python version: Python 3.11.9

Additional context

I already have a solution; I just need to add a test, if that is even possible. I have not validated that yet.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
