Skip to content

perf(file): add buffer_size to File.open() to reduce wasteful pre-reads#6876

Merged
madvart merged 4 commits into
Eventual-Inc:mainfrom
chenghuichen:buffer_size
May 7, 2026
Merged

perf(file): add buffer_size to File.open() to reduce wasteful pre-reads#6876
madvart merged 4 commits into
Eventual-Inc:mainfrom
chenghuichen:buffer_size

Conversation

@chenghuichen
Copy link
Copy Markdown
Contributor

Changes Made

DaftFile::load()` hardcoded a 16MB BufReader buffer for all calls, so metadata reads (needing ~1KB) and MIME sniffing (needing 16 bytes) each prefetched 16MB.

Added a buffer_size parameter through the full Rust→Python chain, letting each call site specify an appropriate size — internal callers now use 4KB for sniffing and 64KB for metadata. Fully backward-compatible — None preserves existing 16MB default.

Related Issues

@chenghuichen chenghuichen changed the title feat(file): add buffer_size parameter to File.open() to reduce wasteful pre-reads feat(file): add buffer_size to File.open() to reduce wasteful pre-reads May 4, 2026
@github-actions github-actions Bot added the feat label May 4, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR adds a buffer_size parameter to DaftFile::load() and threads it through the full Rust→Python call chain, replacing the hardcoded 16 MB BufReader with caller-appropriate sizes: 4 KB for MIME sniffing and file-size lookups, 64 KB for metadata reads (image/audio/video), and None (preserving 16 MB) for full-file decode paths. The change is backward-compatible and well-scoped with new tests covering the relevant code paths.

Confidence Score: 4/5

Safe to merge — all findings are P2 style/design concerns with no correctness impact on current callers.

Only P2 findings: a design coupling where the same buffer_size value controls both BufReader capacity and the small-file eager-download threshold, and a redundant double-BufReader in image_file_metadata.rs. No P0/P1 issues. All new callers correctly use download_small_files=false so the threshold coupling is benign today.

src/daft-file/src/file.rs — the dual use of buf_size on line 75 is worth revisiting if future callers need download_small_files=true with a non-default buffer.

Important Files Changed

Filename Overview
src/daft-file/src/file.rs Core change: adds buffer_size: Option to load/load_blocking; the value is reused for both BufReader capacity and the small-file eager-download threshold, which are logically independent concerns.
src/daft-image/src/functions/image_file_metadata.rs Uses BUFFER_SIZE_METADATA (64 KB) correctly, but wraps the already-buffered DaftFile in an extra BufReader, creating redundant double-buffering.
daft/file/file.py Adds BUFFER_SNIFF/BUFFER_METADATA constants and threads buffer_size through File.open(), size(), and mime_type() appropriately.
src/daft-file/src/python.rs Exposes buffer_size as a keyword-only PyO3 argument in _from_file_reference; enter correctly retains None (16 MB default).
src/daft-file/src/functions.rs All verify_file helpers and Size UDF now pass BUFFER_SIZE_SNIFF; appropriate since only MIME sniffing / size metadata is needed.
daft/file/audio.py metadata() now uses BUFFER_METADATA (64 KB); correct for header-only reads via soundfile.
daft/file/image.py metadata() now uses BUFFER_METADATA (64 KB) for PIL header reads; appropriate for image metadata extraction.
daft/file/video.py metadata() now uses BUFFER_METADATA (64 KB); MP4 files with trailing moov atoms may require seeking, which ObjectSourceReader handles via range requests.
tests/file/test_file_basics.py New tests cover buffer_size correctness for read, mime_type, and size; all local-only and well scoped.
tests/file/test_image_file.py New tests confirm metadata and decode work correctly after buffer_size changes.
daft/daft/init.pyi Stub correctly updated to reflect keyword-only buffer_size parameter on _from_file_reference.
src/daft-file/src/lib.rs Re-exports the three new buffer size constants; straightforward change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["File.open(buffer_size?)"] --> B["PyDaftFile._from_file_reference(buffer_size?)"]
    B --> C["DaftFile::load_blocking(download_small_files=false, buffer_size?)"]
    D["PyFileReference.__enter__()"] --> E["DaftFile::load_blocking(download_small_files=true, buffer_size=None→16MB)"]
    C --> F{"supports_range?"}
    E --> F
    F -- "No" --> G["read_full_content() → Memory cursor"]
    F -- "Yes, file_size ≤ buf_size & download_small_files" --> G
    F -- "Yes, streaming" --> H["BufReader::with_capacity(buf_size) → ObjectReader cursor"]
    subgraph call_sites["buffer_size call sites"]
        I["MIME sniff / size() → BUFFER_SIZE_SNIFF 4KB"]
        J["Image/Audio/Video metadata → BUFFER_SIZE_METADATA 64KB"]
        K["Full decode / from_path → None → BUFFER_SIZE_FULL 16MB"]
    end
    I --> C
    J --> C
    K --> E
Loading

Comments Outside Diff (1)

  1. src/daft-image/src/functions/image_file_metadata.rs, line 88-94 (link)

    P2 Double-buffering: outer BufReader wraps an already-buffered DaftFile

    DaftFile internally holds a BufReader<ObjectSourceReader> with BUFFER_SIZE_METADATA (64 KB) capacity. Wrapping it in a second BufReader::new(file) adds a default 8 KB outer buffer. The outer buffer is satisfied by the inner one (no extra network I/O), but the double-layer is unnecessary overhead. Passing the DaftFile directly to ImageReader::new would suffice, since the inner BufReader already provides the read-ahead needed by with_guessed_format().

Reviews (1): Last reviewed commit: "buffer_size parameter to File open" | Re-trigger Greptile

Comment thread src/daft-file/src/file.rs
@chenghuichen chenghuichen changed the title feat(file): add buffer_size to File.open() to reduce wasteful pre-reads perf(file): add buffer_size to File.open() to reduce wasteful pre-reads May 4, 2026
@github-actions github-actions Bot added perf and removed feat labels May 4, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 4, 2026

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks1


Comparing chenghuichen:buffer_size (8f4904a) with main (044b833)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@chenghuichen
Copy link
Copy Markdown
Contributor Author

integration-test-catalogs failure seems unrelated to this PR.

@madvart madvart merged commit b9290fc into Eventual-Inc:main May 7, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants