
feat: Support directly writing from BigQuery to parquet (dir) (or other formats e.g. CSV) via EXPORT #9643

Open
@p-a-a-a-trick

Description

Is your feature request related to a problem?

The problem is memory management for large query results in BigQuery. I understand pyarrow record batches are a possible solution, but ideally I could keep all of the work, processing, and memory consumption in GCP.

What is the motivation behind your request?

I realize this is kind of a weird feature request and pretty specific to BQ (or other cloud providers w/ storage), but I think it would help with ETL orchestration. I'd like to avoid yanking things into memory if I can.

Export BQ result to GCS: https://cloud.google.com/bigquery/docs/exporting-data#sql

A workflow might look a bit like this:

# Nothing is evaluated locally; I think BigQuery only runs the query when it executes the export,
# so nothing substantial enters local memory.
q = con.table("table").join(...)  # con is an Ibis BigQuery connection
q.to_parquet_dir("gs://bucket/path/to/file-*.parquet", ...)  # proposed behavior; include args to EXPORT?
... # do stuff
# In my specific case I want to take a parquet file and load it into postgres via ADBC (probably in another job):
import pyarrow as pa
import pyarrow.parquet as parquet
from pyarrow.fs import GcsFileSystem

fs, path = GcsFileSystem.from_uri("gs://bucket/path/to/file.parquet")

reader = parquet.ParquetFile(path, filesystem=fs)
# adbc_ingest takes a Table, RecordBatch, or RecordBatchReader, so wrap the batch iterator in a reader
batches = pa.RecordBatchReader.from_batches(reader.schema_arrow, reader.iter_batches())
with adbc_conn.cursor() as cur:
    cur.adbc_ingest("parquet_file_table", batches, mode="create")

Not urgent; I don't have an immediate need for this, but I imagine it'd be useful and that it aligns with the project's goals.

Describe the solution you'd like

Maybe some option or automatic handling in ibis-bigquery's to_parquet_dir (doesn't exist yet) to auto-export to GCS if a GCS URI is provided?
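
As a rough illustration of what that handling could do under the hood, the sketch below wraps an already-compiled SELECT statement in a BigQuery `EXPORT DATA` statement. The helper name `build_export_statement` and its parameters are hypothetical and not part of Ibis; the sketch covers only the `uri`, `format`, and `overwrite` options and assumes the destination is a `gs://` URI containing a single `*` wildcard, as the BigQuery export docs linked above describe.

```python
# Hypothetical helper, not an Ibis API: builds the SQL text only and does not run it.
ALLOWED_FORMATS = {"AVRO", "CSV", "JSON", "PARQUET"}


def build_export_statement(select_sql, uri, fmt="PARQUET", overwrite=False):
    """Wrap a SELECT statement in a BigQuery EXPORT DATA statement."""
    fmt = fmt.upper()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    # EXPORT DATA writes sharded files, so the URI needs exactly one wildcard.
    if not uri.startswith("gs://") or uri.count("*") != 1:
        raise ValueError("uri must be a gs:// URI with exactly one '*' wildcard")
    if "'" in uri:
        raise ValueError("uri must not contain single quotes")
    # Drop surrounding whitespace and a trailing semicolon before embedding the query.
    query = select_sql.strip().rstrip(";").strip()
    overwrite_sql = "true" if overwrite else "false"
    return (
        "EXPORT DATA OPTIONS (\n"
        f"  uri = '{uri}',\n"
        f"  format = '{fmt}',\n"
        f"  overwrite = {overwrite_sql}\n"
        ") AS\n"
        f"{query}"
    )


stmt = build_export_statement(
    "SELECT a, b FROM dataset.t;",
    "gs://bucket/path/to/file-*.parquet",
)
print(stmt)
```

With an Ibis expression, the SELECT text could presumably come from `ibis.to_sql(q, dialect="bigquery")` and the resulting statement be submitted with `con.raw_sql(stmt)`. I haven't verified that combination against the BigQuery backend, so treat it as an assumption.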

What version of ibis are you running?

9.1.0 (optionally v10 dev)

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

No one assigned

    Labels

    bigquery (The BigQuery backend), feature (Features or general enhancements), io (Issues related to input and/or output)

    Type

    No type

    Projects

    Status

    backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
