Description
I want to merge multiple tables with the same into a single one.
Say I have a table Sales(transaction_id, date, amount)
and sharded files in my file system,
sales/
- 2024/
- 12/
- 30.csv
- 31.csv
- 2025/
- 01/
- 01.csv
- 02.csv
- 03.csv
Is there a convenient way to treat sales/**/*.csv
as a single table?
So far it seems that bdt query
supports 2 flags for input tables,
--table path/to/single_file.csv
- A single table is read with name
single_file
- A single table is read with name
--tables path/to/directory/
- Multiple tables are read, each one with it's own basename
- This imports N tables
- I don't see much value here, why not using the shell to expand something like
path/to/directory/*.csv
?
- Multiple tables are read, each one with it's own basename
I kind of want a new input file flag that expects a table name, and a set of (compatible) files,
bdt query \
--partitioned_table sales sales/**/*.csv \ # Shell will expand these globs
--sql "
select
count(*)
from
sales
"
Which would use a flag with 1+N arguments, --partitioned_table sales sales/2024/12/30.csv sales/2024/12/31.csv sales/2025/01/01.csv sales/2025/01/02.csv sales/2025/01/03.csv
, and make the table sales
available.
Is there a way to get this today? I tried the --tables
flag, but instead got N different tables that were hard to work with as a unit.
It's not hard to create a single file that concatenates all tables, but I'd nice not needing to create it as it'd allow writing queries from the shell, with a tiny rewrite --partitioned_table sales sales/2024/12/*.csv
would get me info about sales in December 2024 without any made-up disk writes.