Skip to content

Querying partitioned tables #31

Open
@Dietr1ch

Description

@Dietr1ch

I want to merge multiple tables with the same into a single one.

Say I have a table Sales(transaction_id, date, amount) and sharded files in my file system,

sales/
  - 2024/
    - 12/
      - 30.csv
      - 31.csv
  - 2025/
    - 01/
      - 01.csv
      - 02.csv
      - 03.csv

Is there a convenient way to treat sales/**/*.csv as a single table?

So far it seems that bdt query supports 2 flags for input tables,

  • --table path/to/single_file.csv
    • A single table is read with name single_file
  • --tables path/to/directory/
    • Multiple tables are read, each one with it's own basename
      • This imports N tables
      • I don't see much value here, why not using the shell to expand something like path/to/directory/*.csv?

I kind of want a new input file flag that expects a table name, and a set of (compatible) files,

bdt query \
  --partitioned_table sales sales/**/*.csv \  # Shell will expand these globs
  --sql "
    select
      count(*)
    from
      sales
  "

Which would use a flag with 1+N arguments, --partitioned_table sales sales/2024/12/30.csv sales/2024/12/31.csv sales/2025/01/01.csv sales/2025/01/02.csv sales/2025/01/03.csv, and make the table sales available.

Is there a way to get this today? I tried the --tables flag, but instead got N different tables that were hard to work with as a unit.

It's not hard to create a single file that concatenates all tables, but I'd nice not needing to create it as it'd allow writing queries from the shell, with a tiny rewrite --partitioned_table sales sales/2024/12/*.csv would get me info about sales in December 2024 without any made-up disk writes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions