This repository was archived by the owner on Nov 17, 2023. It is now read-only.
How to train with parquet files? #20341
Unanswered · MikkelWorkF asked this question in Q&A
Replies: 2 comments 1 reply
-
@szha Do you have someone to help with this question?
-
If you are already familiar with petastorm, you can use its plain Python reader and wrap the data it yields into MXNet arrays with mx.nd.array.
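A minimal sketch of that approach is below; the dataset URL and the column names ("features", "label") are assumptions for illustration, not taken from this discussion:

```python
from petastorm import make_batch_reader
import mxnet as mx

# make_batch_reader streams plain Parquet files row group by row group,
# so the full dataset never has to fit in memory.
with make_batch_reader("file:///data/train/") as reader:   # hypothetical path
    for batch in reader:
        # Each batch is a namedtuple of numpy arrays, one field per Parquet column.
        x = mx.nd.array(batch.features)  # wrap into MXNet NDArrays
        y = mx.nd.array(batch.label)
        # ... run the usual forward/backward pass on (x, y) here ...
```

With this pattern the Parquet files stay on disk (or S3/HDFS) and only one row group at a time is materialized before being handed to MXNet.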
-
Hello
How do I train MXNet with Parquet files?
I have the training data stored in a bunch of Parquet files (hundreds of them) and they cannot fit in memory (2TB+). Until now we have been able to avoid the issue because we could keep the training data in memory (we ran on SageMaker instances with 728GB of memory, but that is no longer sufficient).
We have been looking for a solution for a long time, but nothing seems to work. We are considering switching to PyTorch, since it can consume a petastorm reader, which should work with Parquet files. However, we feel there has to be a solution we are not seeing.