Version used: 1.8.6

I have a Snappy-compressed Parquet file that is ~60 MB with ~1,000,000 rows and ~300 columns. I use the following code snippet to read the data:
const reader = await parquet.ParquetReader.openFile('file.snappy.parquet')
const cursor = reader.getCursor()
while (await cursor.next()) {}
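For completeness, here is a runnable sketch of the same repro (the require name is a placeholder for however the fork is installed locally, and the row counter is only there for logging):

const parquet = require('parquetjs') // placeholder for however the fork is installed

async function main() {
  const reader = await parquet.ParquetReader.openFile('file.snappy.parquet')
  const cursor = reader.getCursor()

  // cursor.next() resolves to null once every row has been read
  let rows = 0
  while (await cursor.next()) {
    rows++
  }

  console.log(`read ${rows} rows`)
  await reader.close()
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})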
With a 10 GB memory limit I get a heap allocation error; if I raise the limit to ~23 GB, reading the file takes ~4 minutes.
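For reference, I raise the heap limit with Node's standard V8 flag (the script name is just a placeholder):

node --max-old-space-size=10240 read.js   # ~10 GB
node --max-old-space-size=23552 read.js   # ~23 GB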
I also tried setting the maxSpan and maxLength options to work around the issue or at least get a batch-processing effect, but that didn't help.
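This is roughly how I tried to pass them; the second argument to openFile is my assumption about where these options belong, so I may simply be using them in the wrong place:

const reader = await parquet.ParquetReader.openFile('file.snappy.parquet', {
  maxSpan: 100000,   // values picked arbitrarily while experimenting
  maxLength: 100000,
})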
I can read the same file in ~30 seconds with ~170 MB of memory using Python (pandas/pyarrow) with the following code snippet:
import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.snappy.parquet")
# only one 1000-row batch is materialised at a time, so memory stays bounded
for batch in pf.iter_batches(batch_size=1000):
    df = pa.Table.from_batches([batch]).to_pandas()
    for row in df.itertuples(index=False):
        pass
I am not sure whether this is a bug or whether I am using the library incorrectly.
Could you please share your thoughts?