
Partial parquet reads#698

Open
pgarrison wants to merge 4 commits into main from feature/694-partial-parquet-reads

Conversation

@pgarrison
Contributor

@pgarrison pgarrison commented Mar 23, 2026

Context

The initial load of the test file sample_file_large.parquet from S3 previously took 90s (over the AICS VPN) and downloaded the full 1GB of data.

Now, loading this test file takes 10-12s and loads just 100MB (the first row group).

Resolves #694.

Changes

  • Enable partial reads of parquet files by setting forceFullHTTPReads: false
  • Use parquet table statistics to replace the following full-table scan triggered by checkDataSourceForErrors when preparing a data source. See the detailed implementation notes in the source code comments.
SELECT A.row
FROM (
    SELECT ROW_NUMBER() OVER () AS row, "${column}"
    FROM "${dataSource}"
) AS A
WHERE TRIM(A."${column}") IS NULL
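
As a rough sketch of how per-row-group Parquet statistics can stand in for that scan (the `RowGroupStats` shape and function name here are hypothetical, loosely modeled on DuckDB's `parquet_metadata()` output; the PR's actual implementation is in the source code comments):

```typescript
// Hypothetical sketch: decide from Parquet row-group statistics whether a
// string column could contain blank (NULL or whitespace-only) values.
interface RowGroupStats {
    nullCount: number;       // e.g. stats_null_count from parquet_metadata()
    minValue: string | null; // e.g. stats_min_value (lexicographic minimum)
}

// Returns false when the stats *prove* there are no blank values, and true
// (meaning "maybe") when a full scan is still required to be sure.
function mayContainBlankValues(stats: RowGroupStats[]): boolean {
    return stats.some((rg) => {
        if (rg.nullCount > 0) return true;     // NULLs count as blank
        if (rg.minValue === null) return true; // stats missing: inconclusive
        // Empty and whitespace-only strings sort before non-blank strings,
        // so a non-blank minimum proves the whole row group is non-blank.
        return rg.minValue.trim().length === 0;
    });
}
```

This only touches file metadata, so it avoids downloading any data pages; when it returns true, the caller still has to fall back to a scan.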

Testing

So far my manual testing has only been in the browser, with the following test files:

Reviewing

For notes on follow-up performance improvements (which could bring load times down to 2s for sample_file_large), see my comment on the parent issue.

@pgarrison pgarrison changed the title Feature/694 partial parquet reads Partial parquet reads Mar 23, 2026
const logger = new duckdb.ConsoleLogger(logLevel);
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
await db.open({
    // remaining options truncated in this excerpt; per the PR description,
    // the key addition here is forceFullHTTPReads: false
Contributor Author

This db.open call is the only change relative to the code it replaces in DatabaseServiceWeb and DatabaseServiceElectron.

errors.push(
`"${PreDefinedColumn.FILE_PATH}" column contains ${blankFilePathRows.length} empty or purely whitespace paths at rows ${rowNumbers}.`
);
return [
Contributor

What's the significance of wrapping errors here? Will this suppress the type?

Contributor Author

@pgarrison pgarrison Mar 24, 2026

You mean, why return [error] instead of error? I just kept that to be consistent with the previous code. The return type remains Promise<string[]>.

I replaced errors.push with return statements to keep the logic simple for exiting early when parquetHasBlankValues returns a useful result.
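
The early-exit shape described here could be sketched roughly as follows (all names and signatures are hypothetical stand-ins, not the PR's actual code):

```typescript
// Sketch of the early-exit flow: consult cheap Parquet statistics first and
// only fall back to the expensive full-table scan when they are inconclusive.
// Both callbacks are hypothetical stand-ins for the real service methods.
async function checkColumnForBlanks(
    column: string,
    // resolves true/false when stats are conclusive, undefined when not
    statsCheck: (column: string) => Promise<boolean | undefined>,
    fullScan: (column: string) => Promise<string[]>
): Promise<string[]> {
    const hasBlanks = await statsCheck(column);
    if (hasBlanks === false) return []; // proven clean: exit early, no scan
    if (hasBlanks === true) {
        return [`"${column}" column contains empty or purely whitespace values.`];
    }
    return fullScan(column); // inconclusive stats: run the full scan
}
```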

Contributor

I think the original returned multiple errors in some cases; it looks like that isn't true anymore.


// Similar to getColumnsOnDataSource below, but suitable for use during the
// data source preparation step.
private async getRawParquetColumns(name: string): Promise<string[]> {
Contributor

Is this quick? Should we cache it?

Contributor Author

I don't think it makes sense to cache this result. The second time this query runs, it is very fast (under 10ms), which I believe is due to DuckDB's cache for portions of the parquet file it has already read.
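
For illustration, a minimal version of such a helper might read only the file's schema rather than its data (a sketch with a hypothetical connection interface; the PR's actual implementation may differ):

```typescript
// Sketch only: read column names from a Parquet file's schema. DESCRIBE needs
// only the schema (recovered from the Parquet footer), not the data pages,
// which is why re-running this is cheap. ParquetConn is a hypothetical
// stand-in for a duckdb-wasm AsyncDuckDBConnection wrapper.
interface ParquetConn {
    query(sql: string): Promise<Array<{ column_name: string }>>;
}

async function getRawParquetColumns(
    conn: ParquetConn,
    file: string
): Promise<string[]> {
    const escaped = file.replace(/'/g, "''"); // guard quotes in the path
    const rows = await conn.query(`DESCRIBE SELECT * FROM '${escaped}'`);
    return rows.map((row) => row.column_name);
}
```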



Development

Successfully merging this pull request may close these issues.

Improve initial load performance for parquets (Direct From Parquet mode)
