
DuckDB throws Invalid data or Snappy decompression failure while reading parquet from S3 after loading cache_httpfs #331

@ahmed-shameem

Describe the bug
A large Parquet file (50M records, size: 4GB) is created with snappy compression and stored in S3:
writer = pq.ParquetWriter(output_file, schema, compression='snappy')

When read_parquet is used to select from this file in S3, DuckDB throws either an Invalid data error or a Snappy decompression failure if cache_httpfs is loaded.

To Reproduce

  1. Create parquet using snappy compression (50M records and size > 4GB):
writer = pq.ParquetWriter(output_file, schema, compression='snappy')
  2. Install and Load cache_httpfs:
INSTALL cache_httpfs FROM community;
LOAD cache_httpfs;
  3. Set the S3 credentials:
D SET s3_region='***';
D SET s3_access_key_id='***';
D SET s3_secret_access_key='***';
  4. SELECT using read_parquet:
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
TProtocolException: Invalid data
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
TProtocolException: Invalid data
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
TProtocolException: Invalid data
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
TProtocolException: Invalid data
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
TProtocolException: Invalid data
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
don't know what type:
D explain analyze SELECT * FROM read_parquet('your-s3-parquet-path');
Invalid Error:
TProtocolException: Invalid data

Even after multiple retries, the query does NOT succeed.
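
To help rule out a damaged source file, one option is to download a copy locally and check its Parquet framing: a Parquet file starts with the 4-byte magic `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1`. The sketch below uses only the standard library; the `check_parquet_framing` helper is hypothetical and is not part of DuckDB or cache_httpfs. Passing this check does not prove the file is fully valid, it only confirms that the header and footer markers are intact.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def check_parquet_framing(path):
    """Return the footer metadata length if the file has valid Parquet framing.

    Raises ValueError when the leading or trailing magic bytes are missing
    or the recorded footer length cannot fit inside the file.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)  # seek to end of file to learn its size
        size = f.tell()
        # Minimum: 4-byte header magic + 4-byte footer length + 4-byte trailer magic.
        if size < 12:
            raise ValueError("file too small to be a Parquet file")
        f.seek(0)
        head = f.read(4)
        f.seek(size - 8)
        tail = f.read(8)

    if head != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at start of file")
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")

    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

If a local copy passes this check and also reads correctly without cache_httpfs loaded, that would point toward the caching layer rather than the file itself.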

Expected behavior
The query should execute successfully.

Screenshots
NA

Desktop (please complete the following information):
MacOS arm64

Smartphone (please complete the following information):
NA

DuckDB Version:
v1.4.3 (Andium) d1dc88f950

DuckDB Client:
CLI

Additional context
This was initially reported on the DuckDB side: duckdb/duckdb#20167
