High memory consumption when reading big files #173

@rsarili

Description

Version used: 1.8.6

I have a snappy-compressed Parquet file that is ~60 MB and has ~1,000,000 rows and ~300 columns. I use the following code snippet to read the data.

const reader = await parquet.ParquetReader.openFile('file.snappy.parquet')
const cursor = reader.getCursor()
// Drain every record; the row bodies are discarded.
while (await cursor.next()) {}

With a 10 GB memory limit I get a heap allocation error; with the limit raised to ~23 GB it completes, but takes ~4 minutes to process.
I also tried setting the maxSpan and maxLength options to fix the issue, or at least to get a batch-processing effect, but it didn't help.
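For reference, heap limits like the ones above are the kind typically set with Node's --max-old-space-size flag (value in MB); the exact invocation and script name below are only an assumed example, not taken from my actual setup:

# assumed invocation: allow ~23 GB of heap; read.js stands in for the snippet above
node --max-old-space-size=23000 read.js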

I can read the same file in ~30 seconds and with ~170 MB of memory in Python (pyarrow/pandas) using the following code snippet:

import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.snappy.parquet")

# Read in batches of 1,000 rows and iterate over every row.
for batch in pf.iter_batches(batch_size=1000):
    df = pa.Table.from_batches([batch]).to_pandas()
    for row in df.itertuples(index=False):
        pass

I am not sure whether this is a bug or whether I am using the library incorrectly.
Could you please share your comments?
