perf: SELECT FROM a single-file URI triggers recursive directory listing instead of reading the file directly #19445

@bohutang

Problem

SELECT * FROM 'fs:///par/foods.parquet' (file_format => 'parquet')
-- Takes 2.28s in a directory with many files, should be instant

When the URI explicitly names a single file (foods.parquet), Databend still recursively lists every file under the parent directory before reading. This affects all storage backends (fs, S3, GCS, Azure Blob, OSS, COS) and all file formats.

On cloud storage the impact is worse: an S3 LIST request returns at most 1,000 objects per call, so a large directory requires many sequential network round-trips.
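
To put a number on the round-trip cost, here is a minimal sketch of the arithmetic. The function name `list_round_trips` is made up for illustration and is not part of Databend; the page size of 1,000 is the S3 limit mentioned above.

```rust
/// Number of paginated LIST requests needed to enumerate `object_count`
/// objects when each response carries at most `page_size` keys.
/// Hypothetical helper for illustration only.
fn list_round_trips(object_count: u64, page_size: u64) -> u64 {
    assert!(page_size > 0, "page_size must be positive");
    // Ceiling division: a partially filled last page still costs a request.
    let pages = (object_count + page_size - 1) / page_size;
    // Even an empty prefix costs one request to learn that it is empty.
    pages.max(1)
}
```

Under these assumptions, a prefix holding 250,000 objects costs 250 LIST requests before the single requested file is opened, where a direct read would cost none.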

Root Cause

bind_location.rs:64-68 splits the URI into a directory path and leaves StageFilesInfo.files unset. Downstream, both ParquetTable::do_read_partitions (partition.rs:65) and StageTable::read_partitions_simple (stage_table.rs:94) see files = None and fall back to StageFilesInfo::list(), which runs operator.lister_with(path).recursive(true) (stage.rs:263).

StageFilesInfo.files is only populated when using the explicit FILES => clause.
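
The branch described above can be modelled in a few lines. This is a simplified sketch, not Databend's code: `FilesInfoSketch`, `Plan`, and `plan_read` are hypothetical stand-ins that keep only the `path` and `files` fields needed to show why an unset `files` forces a listing.

```rust
/// Simplified stand-in for StageFilesInfo (hypothetical; only two fields).
struct FilesInfoSketch {
    path: String,
    files: Option<Vec<String>>,
}

/// What the reader has to do before it can open any file.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Enumerate everything under the path recursively.
    ListRecursively(String),
    /// Open exactly these objects; no listing needed.
    ReadDirect(Vec<String>),
}

fn plan_read(info: &FilesInfoSketch) -> Plan {
    match &info.files {
        // Explicit file names: join each onto the directory path and read.
        Some(files) if !files.is_empty() => Plan::ReadDirect(
            files.iter().map(|f| format!("{}{}", info.path, f)).collect(),
        ),
        // No names recorded: the only option is to list the whole prefix.
        _ => Plan::ListRecursively(info.path.clone()),
    }
}
```

In this model the single-file URI from the Problem section arrives with `files: None`, so it lands in the `ListRecursively` arm even though the target object is already known.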

Fix

When the last segment of the URI path has a file extension (i.e. the URI points to a specific file, not a directory), populate StageFilesInfo.files with that filename so the list step is skipped entirely.
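
A minimal sketch of that check, assuming the decision is made purely on the path string. The helper `split_single_file` is hypothetical and does not exist in Databend; the actual change in bind_location.rs may look quite different.

```rust
/// Split a URI path into (directory, Some(file name)) when the last segment
/// looks like a file, or (path, None) when it looks like a directory.
/// Hypothetical helper for illustration only.
fn split_single_file(path: &str) -> (String, Option<String>) {
    // A trailing slash always denotes a directory.
    if path.ends_with('/') {
        return (path.to_string(), None);
    }
    // Separate the last segment, keeping the trailing '/' on the directory.
    let (dir, last) = match path.rfind('/') {
        Some(idx) => (&path[..=idx], &path[idx + 1..]),
        None => ("", path),
    };
    // "Has an extension" here means a dot with text on both sides,
    // so ".hidden" and "name." are not treated as files.
    let looks_like_file = match last.rfind('.') {
        Some(dot) => dot > 0 && dot + 1 < last.len(),
        None => false,
    };
    if looks_like_file {
        (dir.to_string(), Some(last.to_string()))
    } else {
        (path.to_string(), None)
    }
}
```

With this sketch, `/par/foods.parquet` splits into the directory `/par/` and the file `foods.parquet`, while `/par/` is left as a directory to list. A directory whose name contains a dot (for example `v1.0`) would also match the heuristic, so a real fix may need to fall back to listing when the direct read finds no such object.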

Workaround

SELECT * FROM 'fs:///par/' (files => ('foods.parquet'), file_format => 'parquet')