What happened?
Using the `WriteToBigQuery` transform for a batch load with the write disposition set to truncate doesn't work as intended: instead of truncating all target tables, it truncates only the first one.
This happens only when the table IDs are identical within a single batch job but the tables live in different BQ datasets.
To Reproduce
Steps to reproduce the behavior:
- Prepare a JSON file with several inputs targeting identically named BQ tables in different dataset locations
- Initiate a BQ load job through the `WriteToBigQuery` transform
- Set the write disposition to `BigQueryDisposition.WRITE_TRUNCATE`
- Run the pipeline several times
- Observe that only the first table is truncated correctly, and none of the others
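To make the first step concrete, here is one way such an input file could look. The field names (`dataset`, `table`, `value`) and the file name are hypothetical and only illustrate the shape: newline-delimited JSON rows that resolve to the same table ID in two different datasets.

```python
import json

# Hypothetical newline-delimited JSON input: every row names a destination
# dataset and table; the table ID is the same, but the datasets differ.
rows = [
    {"dataset": "dataset_a", "table": "events", "value": 1},
    {"dataset": "dataset_b", "table": "events", "value": 2},
]

with open("source.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```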
E.g.:

```python
with Pipeline(options=pipeline_options) as pipeline:
    data = (pipeline
            | "ReadAll" >> ReadFromText(user_options.source_path))
    (data
     | "Load data into BQ" >> WriteToBigQuery(..., write_disposition=BigQueryDisposition.WRITE_TRUNCATE))
```
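One plausible mechanism for this behavior (an assumption on my part, not verified against the Beam source): if the set of tables to truncate is keyed by the table ID alone rather than by the full `project.dataset.table` reference, all destinations sharing a table ID collapse into a single entry. A minimal sketch of that failure mode, with hypothetical destination tuples:

```python
# Hypothetical destination references produced by one batch job.
destinations = [
    ("my-project", "dataset_a", "events"),
    ("my-project", "dataset_b", "events"),
]

# Buggy grouping: keyed by table ID only -- dataset_a and dataset_b
# collapse into one entry, so only one table would get truncated.
truncate_targets_buggy = {table: (project, dataset, table)
                          for project, dataset, table in destinations}

# Correct grouping: keyed by the full table reference, so the
# identically named table in each dataset is truncated.
truncate_targets_fixed = {(project, dataset, table)
                          for project, dataset, table in destinations}

print(len(truncate_targets_buggy))  # 1
print(len(truncate_targets_fixed))  # 2
```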
Expected behavior
All identically named tables across the different datasets must be truncated.
Actual behavior
Only the first table is truncated (whatever "first" means in a heavily distributed system).
Environment (tested on)
- Apache Beam version: 2.63.0
- Runner: DirectRunner, DataflowRunner
- OS: MacOS 15.3.1; build: 24D70
- Python version: 3.11.9
Additional context
I already have a fix; I still need to add a test, if that's even possible, which I haven't validated yet.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner