
Partial parquet reads#698

Open
pgarrison wants to merge 4 commits into main from feature/694-partial-parquet-reads

Conversation

@pgarrison
Contributor

@pgarrison pgarrison commented Mar 23, 2026

Context

The initial load of the test file sample_file_large.parquet from S3 previously took 90s (over the AICS VPN) and downloaded the full 1GB of data.

Now, loading this test file takes 10-12s and loads just 100MB (the first row group).

Resolves #694.

Changes

  • Enable partial reads of parquet files by setting forceFullHTTPReads: false
  • Use parquet table statistics to replace the following full-table scan triggered by checkDataSourceForErrors when preparing a data source. See the detailed implementation notes in the source code comments.
SELECT A.row
FROM (
    SELECT ROW_NUMBER() OVER () AS row, "${column}"
    FROM "${dataSource}"
) AS A
WHERE TRIM(A."${column}") IS NULL
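
As a rough sketch of how per-row-group Parquet statistics can stand in for that scan (the `RowGroupStats` shape and function name here are hypothetical, loosely modeled on DuckDB's `parquet_metadata()` output; the PR's actual implementation is in the source code comments):

```typescript
// Hypothetical sketch: decide from Parquet row-group statistics whether a
// string column could contain blank (NULL or whitespace-only) values.
interface RowGroupStats {
    nullCount: number;       // e.g. stats_null_count from parquet_metadata()
    minValue: string | null; // e.g. stats_min_value (lexicographic minimum)
}

// Returns false when the stats *prove* there are no blank values, and true
// (meaning "maybe") when a full scan is still required to be sure.
function mayContainBlankValues(stats: RowGroupStats[]): boolean {
    return stats.some((rg) => {
        if (rg.nullCount > 0) return true;     // NULLs count as blank
        if (rg.minValue === null) return true; // stats missing: inconclusive
        // Empty and whitespace-only strings sort before non-blank strings,
        // so a non-blank minimum proves the whole row group is non-blank.
        return rg.minValue.trim().length === 0;
    });
}
```

This only touches file metadata, so it avoids downloading any data pages; when it returns true, the caller still has to fall back to a scan.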

Testing

So far my manual testing has only been in the browser, with the following test files:

Reviewing

For notes on follow-up performance improvements (which could bring load times down to 2s for sample_file_large), see my comment on the parent issue.

@pgarrison pgarrison changed the title Feature/694 partial parquet reads Partial parquet reads Mar 23, 2026
const logger = new duckdb.ConsoleLogger(logLevel);
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
await db.open({
    // remaining options truncated in this excerpt; per the PR description,
    // the key addition here is forceFullHTTPReads: false
Contributor Author

This db.open call is the only change relative to the code it replaces in DatabaseServiceWeb and DatabaseServiceElectron.

errors.push(
`"${PreDefinedColumn.FILE_PATH}" column contains ${blankFilePathRows.length} empty or purely whitespace paths at rows ${rowNumbers}.`
);
return [
Contributor

What's the significance of wrapping errors here? Will this suppress the type?

Contributor Author

@pgarrison pgarrison Mar 24, 2026

You mean, why return [error] instead of error? I just kept that to be consistent with the previous code. The return type remains Promise<string[]>.

I replaced errors.push with return statements to keep the logic simple for exiting early when parquetHasBlankValues returns a useful result.
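
The early-exit shape described here could be sketched roughly as follows (all names and signatures are hypothetical stand-ins, not the PR's actual code):

```typescript
// Sketch of the early-exit flow: consult cheap Parquet statistics first and
// only fall back to the expensive full-table scan when they are inconclusive.
// Both callbacks are hypothetical stand-ins for the real service methods.
async function checkColumnForBlanks(
    column: string,
    // resolves true/false when stats are conclusive, undefined when not
    statsCheck: (column: string) => Promise<boolean | undefined>,
    fullScan: (column: string) => Promise<string[]>
): Promise<string[]> {
    const hasBlanks = await statsCheck(column);
    if (hasBlanks === false) return []; // proven clean: exit early, no scan
    if (hasBlanks === true) {
        return [`"${column}" column contains empty or purely whitespace values.`];
    }
    return fullScan(column); // inconclusive stats: run the full scan
}
```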

Contributor

I think the original returned multiple errors in some cases; it looks like that isn't true anymore.


// Similar to getColumnsOnDataSource below, but suitable for use during the
// data source preparation step.
private async getRawParquetColumns(name: string): Promise<string[]> {
Contributor

Is this quick? Should we cache it?

Contributor Author

I don't think it makes sense to cache this result. The second time this query runs, it is very fast (under 10ms), which I believe is due to DuckDB's cache for portions of the parquet file it has already read.
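
For illustration, a minimal version of such a helper might read only the file's schema rather than its data (a sketch with a hypothetical connection interface; the PR's actual implementation may differ):

```typescript
// Sketch only: read column names from a Parquet file's schema. DESCRIBE needs
// only the schema (recovered from the Parquet footer), not the data pages,
// which is why re-running this is cheap. ParquetConn is a hypothetical
// stand-in for a duckdb-wasm AsyncDuckDBConnection wrapper.
interface ParquetConn {
    query(sql: string): Promise<Array<{ column_name: string }>>;
}

async function getRawParquetColumns(
    conn: ParquetConn,
    file: string
): Promise<string[]> {
    const escaped = file.replace(/'/g, "''"); // guard quotes in the path
    const rows = await conn.query(`DESCRIBE SELECT * FROM '${escaped}'`);
    return rows.map((row) => row.column_name);
}
```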



Development

Successfully merging this pull request may close these issues.

Improve initial load performance for parquets (Direct From Parquet mode)
