Description
Is your feature request related to a problem?
The problem is memory management for large query results in BigQuery. I understand pyarrow record batches are a possible solution, but ideally I could keep all of the work, processing, and memory consumption in GCP.
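For context, the record-batch route looks roughly like this (just a sketch; `q` is the ibis expression from the workflow further down and `adbc_conn` is an ADBC DBAPI connection, both placeholders):

```python
# Streaming the result as Arrow record batches bounds peak memory per batch,
# but all of the data still flows through the local client instead of staying in GCP.
reader = q.to_pyarrow_batches()  # pyarrow.RecordBatchReader
with adbc_conn.cursor() as cur:
    cur.adbc_ingest("some_table", reader, mode="create")  # placeholder table name
```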
What is the motivation behind your request?
I realize this is a somewhat unusual feature request and pretty specific to BQ (or other cloud providers with object storage), but I think it would help with ETL orchestration. I'd like to avoid pulling things into local memory if I can.
BigQuery can export query results to GCS via EXPORT DATA: https://cloud.google.com/bigquery/docs/exporting-data#sql
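For reference, the export is a single SQL statement, so it can already be done by hand today with the BigQuery client (a sketch; the bucket path and query are placeholders, and the URI needs a `*` wildcard):

```python
# Running EXPORT DATA directly keeps the query, the export, and the memory use
# on the BigQuery side; only job metadata comes back to the client.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    EXPORT DATA OPTIONS (
      uri = 'gs://bucket/path/to/file-*.parquet',
      format = 'PARQUET',
      overwrite = true
    ) AS
    SELECT * FROM mydataset.mytable
    """
).result()  # block until the export job finishes
```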
A workflow might look a bit like this:
```python
# Nothing is evaluated until BigQuery executes the export, and even then
# nothing substantial enters local memory.
q = con.table("table").join(...)  # con is an ibis BigQuery connection
q.to_parquet_dir("gs://bucket/path/to/file-*.parquet", ...)  # args to pass through to EXPORT DATA?
...  # do stuff
# In my specific case I want to take a parquet file and dump it into Postgres via ADBC (probably in another job):
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.fs import FileSystem

fs, path = FileSystem.from_uri("gs://bucket/path/to/file.parquet")
reader = pq.ParquetFile(path, filesystem=fs)
# wrap the batch stream in a RecordBatchReader so ADBC can ingest it
batches = pa.RecordBatchReader.from_batches(reader.schema_arrow, reader.iter_batches())
with adbc_conn.cursor() as cur:
    cur.adbc_ingest("parquet_file_table", batches, mode="create")
```
Not urgent; I don't have an immediate need for this, but I imagine it would be useful and would align with the project's goals.
Describe the solution you'd like
Maybe an option or automatic handling in the BigQuery backend's to_parquet_dir (which doesn't exist for BigQuery yet) to export directly to GCS when a GCS URI is provided?
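Roughly the shape of the backend hook I'm imagining (purely hypothetical; neither the override nor the class exists today, the import path and the fallback to a generic implementation are assumed, and compile/raw_sql are just assumed as the usual backend escape hatches):

```python
# Hypothetical sketch only -- not an existing ibis API.
from ibis.backends.bigquery import Backend as BigQueryBackend  # assumed import path

class GCSExportingBackend(BigQueryBackend):
    def to_parquet_dir(self, expr, path, *, overwrite=False, **kwargs):
        if str(path).startswith("gs://"):
            # A gs:// URI means the export can run entirely inside BigQuery
            # via EXPORT DATA; nothing is materialized on the client.
            self.raw_sql(
                f"EXPORT DATA OPTIONS (uri = '{path}', format = 'PARQUET', "
                f"overwrite = {str(overwrite).lower()}) AS {self.compile(expr)}"
            )
        else:
            # Anything else falls back to the (assumed) local implementation.
            super().to_parquet_dir(expr, path, **kwargs)
```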
What version of ibis are you running?
9.1.0 (optionally v10 dev)
What backend(s) are you using, if any?
BigQuery
Code of Conduct
- I agree to follow this project's Code of Conduct