Conversation
| const logger = new duckdb.ConsoleLogger(logLevel); | ||
| const db = new duckdb.AsyncDuckDB(logger, worker); | ||
| await db.instantiate(bundle.mainModule, bundle.pthreadWorker); | ||
| await db.open({ |
There was a problem hiding this comment.
This db.open call is the only change from the code this replaces from DatabaseServiceWeb and DatabaseServiceElectron.
| errors.push( | ||
| `"${PreDefinedColumn.FILE_PATH}" column contains ${blankFilePathRows.length} empty or purely whitespace paths at rows ${rowNumbers}.` | ||
| ); | ||
| return [ |
There was a problem hiding this comment.
Whats the significance of wrapping errors here? will this suppress the type?
There was a problem hiding this comment.
You mean, why return [error] instead of error? I just kept that to be consistent with the previous code. The return type remains Promise<string[]>.
I replaced errors.push with return statements to have simple logic for exiting early if we get a useful result from parquetHasBlankValues.
There was a problem hiding this comment.
I think the original returned multiple errors in some cases - looks like that isn't true anymore
|
|
||
| // Similar to getColumnsOnDataSource below, but suitable for use during the | ||
| // data source preparation step. | ||
| private async getRawParquetColumns(name: string): Promise<string[]> { |
There was a problem hiding this comment.
Is this quick? Should we cache it?
There was a problem hiding this comment.
I don't think it makes sense to cache this result. The second time this query runs, it is very fast (under 10ms), which I believe is due to DuckDB's cache for portions of the parquet file it has already read.
Context
The initial load of test file
sample_file_large.parquetfrom S3 previously took 90s (AICS VPN) and downloaded the full 1GB of data.Now, loading this test file takes 10-12s and loads just 100MB (the first row group).
Resolves #694.
Changes
forceFullHTTPReads: falsecheckDataSourceForErrorswhen preparing a data source. See the detailed implementation notes in the source code comments.Testing
So far my manual testing has only been in the browser, with the following test files:
Reviewing
For notes on follow-up performance improvements (could get load times down to 2s for
sample_file_large), see my comment on the parent issue.