Describe the bug
A large Parquet file (50M records, ~4 GB) is written with snappy compression and stored in S3:
writer = pq.ParquetWriter(output_file, schema, compression='snappy')
When read_parquet is invoked to select from this file in S3, DuckDB throws a TProtocolException: Invalid data error (occasionally a "don't know what type" error) whenever cache_httpfs is loaded.
To Reproduce
- Create a Parquet file using snappy compression (50M records, size > 4 GB; see the generation sketch after these steps):
writer = pq.ParquetWriter(output_file, schema, compression='snappy')
- Install and load cache_httpfs:
INSTALL cache_httpfs FROM community;
LOAD cache_httpfs;
- Set the S3 credentials:
D SET s3_region='***';
D SET s3_access_key_id='***';
D SET s3_secret_access_key='***';
- SELECT using read_parquet:
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
TProtocolException: Invalid data
Retrying the same query returns the same error on every attempt, occasionally varying to:
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
don't know what type:
Even after multiple retries, the query does NOT succeed.
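Generation sketch referenced in step 1. This is a minimal, self-contained script under assumed details: the two-column schema, file name, and payload width are illustrative (the report does not show the real schema); pseudo-random payloads are used so snappy cannot compress the file below ~4 GB.

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative schema; the actual schema from the report is not shown.
schema = pa.schema([
    ("id", pa.int64()),
    ("payload", pa.string()),
])

output_file = "large_snappy.parquet"  # upload to S3 after writing
writer = pq.ParquetWriter(output_file, schema, compression="snappy")

num_rows = 50_000_000
batch_size = 1_000_000
for start in range(0, num_rows, batch_size):
    ids = pa.array(range(start, start + batch_size), type=pa.int64())
    # 100-char pseudo-random hex strings keep the compressed size above 4 GB
    payload = pa.array([os.urandom(50).hex() for _ in range(batch_size)])
    writer.write_table(pa.Table.from_arrays([ids, payload], schema=schema))

writer.close()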
Expected behavior
The query should execute successfully.
Screenshots
NA
Desktop (please complete the following information):
macOS arm64
Smartphone (please complete the following information):
NA
DuckDB Version:
v1.4.3 (Andium) d1dc88f950
DuckDB Client:
CLI
Additional context
Initially reported on the DuckDB side: duckdb/duckdb#20167
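For completeness, the same reproduction driven from DuckDB's Python client, as a sketch: the s3:// path is a placeholder for the file generated above, and the credentials are redacted as in the CLI transcript.

import duckdb

con = duckdb.connect()
con.execute("INSTALL cache_httpfs FROM community;")
con.execute("LOAD cache_httpfs;")
con.execute("SET s3_region='***';")
con.execute("SET s3_access_key_id='***';")
con.execute("SET s3_secret_access_key='***';")

# Fails with "TProtocolException: Invalid data" once cache_httpfs is loaded
con.execute("SELECT * FROM read_parquet('s3://your-bucket/large_snappy.parquet')").fetchall()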