TransformSpec using Pandas causes incompatibilities with other libraries for make_batch_reader #603

Hey guys, I'm trying to create a compatibility interface between Petastorm and a few PyTorch-based libraries (PyTorch Lightning, Hugging Face Transformers and AllenNLP) that I want to use in a project. So far I've managed to get PyTorch Lightning working (it's pretty much a research-oriented Keras), but a few design choices within Petastorm seem to prevent usage with the NLP libraries.

My problem is that `TransformSpec` requires both its input and its output to be Pandas `DataFrame`s. At first this may seem reasonable, but commonplace NLP libraries like Hugging Face Transformers tokenize lists of strings (this transformation is easy) and directly output tensors. These tensors aren't flat, so they can't be converted to Pandas, meaning that processing textual data (despite being fairly straightforward) seems nearly impossible with Petastorm's built-in data loaders.

I've been working on this for a week, and these are the two approaches I can see for working around the problem:
1. Modifying Petastorm's existing `TransformSpec`/PyTorch `DataLoader`/PyArrow classes
2. Creating iterable PyTorch data loaders which just loop through the `Reader` object

I've been trying to work out how I'd do the first option (by stepping through the code in a debugger), but it looks extremely complicated and (I think) it would require modifying a number of classes (exactly which modifications are needed still eludes me). On the other hand, looping through a `Reader` seems reasonable, as we can still read in strings. But would doing this forfeit any optimisations/performance-boosting code from Petastorm's loader (although shouldn't these be in `Reader`)?
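To make the second option concrete, here is a minimal sketch of what I have in mind. It only assumes that the batch reader yields namedtuple-like batches with one sequence per column; `Batch`, `fake_reader`, `toy_tokenize` and `tokenized_batches` are hypothetical names made up for illustration, and the toy tokenizer stands in for a real one such as a Hugging Face tokenizer. The point is that tokenization happens after the `Reader`, so nothing nested ever has to pass through a `DataFrame`.

```python
from collections import namedtuple
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in for the namedtuple batches a batch reader yields.
Batch = namedtuple("Batch", ["id", "text"])

TokenBatch = Dict[str, List[List[int]]]


def toy_tokenize(texts: List[str]) -> TokenBatch:
    """Toy whitespace tokenizer (assumption: stands in for a real tokenizer).

    Returns nested, zero-padded lists, i.e. the 2-D shape that does not
    fit into flat DataFrame columns.
    """
    vocab: Dict[str, int] = {}
    rows = [[vocab.setdefault(tok, len(vocab) + 1) for tok in t.split()]
            for t in texts]
    width = max((len(r) for r in rows), default=0)
    return {
        "input_ids": [r + [0] * (width - len(r)) for r in rows],
        "attention_mask": [[1] * len(r) + [0] * (width - len(r)) for r in rows],
    }


def tokenized_batches(
    reader: Iterable,
    text_field: str,
    tokenize: Callable[[List[str]], TokenBatch],
) -> Iterator[TokenBatch]:
    """Loop over a reader-like iterable and tokenize one column per batch."""
    for batch in reader:
        # String columns may arrive as bytes or str depending on the dataset.
        texts = [t.decode("utf-8") if isinstance(t, bytes) else str(t)
                 for t in getattr(batch, text_field)]
        yield tokenize(texts)


# A plain list of batches plays the role of the Reader here.
fake_reader = [
    Batch(id=[0, 1], text=["hello world", "hello"]),
    Batch(id=[2], text=[b"petastorm reads parquet"]),
]
encoded = list(tokenized_batches(fake_reader, "text", toy_tokenize))
```

In a real setup the same generator could presumably be wrapped in a PyTorch `IterableDataset` and fed the actual `Reader` instead of `fake_reader`, but I haven't verified how that interacts with Petastorm's own loader.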

So, would anyone be able to provide some advice on what you believe to be the best approach/course of action (or just what might have to be coded/modified)?
Thanks so much in advance!
