Description
Hey guys, I'm trying to create a compatibility interface between Petastorm and a few PyTorch-based libraries (PyTorch Lightning, Hugging Face Transformers and AllenNLP) that I'm using in a project. So far I've managed to get PyTorch Lightning working (pretty much a research-oriented Keras), but a few design choices within Petastorm seem to prevent usage with the NLP libraries.
My problem is that `TransformSpec` requires its input and output to be Pandas `DataFrame`s. At first this may seem reasonable, but commonplace NLP libraries like Hugging Face Transformers tokenize lists of strings (this transformation is easy) and directly output tensors. These tensors aren't flat, so they can't be converted to Pandas, meaning that processing textual data (despite being fairly straightforward) seems nearly impossible with Petastorm's built-in data loaders.
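To make that concrete, here's roughly the closest I've got with `TransformSpec` alone: padding everything to a fixed length so the tokenizer output becomes flat, fixed-shape arrays that a `DataFrame` can hold (the `text` field name, the model, and `MAX_LEN` are just placeholders from my project):

```python
import numpy as np
from petastorm import TransformSpec
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 128

def _tokenize(df):
    # Pad/truncate to MAX_LEN so every row becomes a fixed-shape 1-D array,
    # the only form I've found that survives the round trip through pandas.
    enc = tokenizer(
        df["text"].tolist(),
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="np",
    )
    df["input_ids"] = list(enc["input_ids"].astype(np.int64))
    df["attention_mask"] = list(enc["attention_mask"].astype(np.int64))
    return df.drop(columns=["text"])

transform_spec = TransformSpec(
    _tokenize,
    edit_fields=[
        ("input_ids", np.int64, (MAX_LEN,), False),
        ("attention_mask", np.int64, (MAX_LEN,), False),
    ],
    removed_fields=["text"],
)
```

This "works", but it forces a fixed shape and truncation onto every field, which the NLP libraries' own collators normally handle dynamically.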
I've been working on this for a week and these are the methods I've found to mitigate the problem:

- Modifying Petastorm's existing `TransformSpec`/PyTorch `DataLoader`/PyArrow classes
- Creating an iterable PyTorch data loader which just loops through the `Reader` object (sketched below)
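Here's a rough sketch of the second option as I'm currently picturing it (again assuming a hypothetical `text` field in the dataset; everything else is standard Petastorm/PyTorch API as far as I can tell):

```python
from torch.utils.data import IterableDataset, DataLoader
from petastorm import make_reader

class PetastormTextDataset(IterableDataset):
    """Loops over a Petastorm Reader and tokenizes each row on the fly."""

    def __init__(self, dataset_url, tokenizer, max_length=128):
        self.dataset_url = dataset_url
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __iter__(self):
        # make_reader does the Parquet reading/shuffling; since we never
        # go through pandas, the string field survives untouched.
        with make_reader(self.dataset_url) as reader:
            for row in reader:
                enc = self.tokenizer(
                    row.text,
                    truncation=True,
                    max_length=self.max_length,
                    padding="max_length",
                    return_tensors="pt",
                )
                # Each value is shaped (1, max_length); squeeze so the
                # default collate_fn can stack rows into batches.
                yield {k: v.squeeze(0) for k, v in enc.items()}

# loader = DataLoader(PetastormTextDataset("file:///tmp/dataset", tokenizer),
#                     batch_size=32)
```

Feeding this into a vanilla `DataLoader` at least lets the default collate function batch the tokenized tensors.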
I've been trying to work out how I'd do the first option (by debugging the code), but it looks extremely complicated and (I think) it would require modifying a number of classes (exactly which modifications are needed still eludes me). On the other hand, looping through a `Reader` seems reasonable, as we can still read in strings. But would doing this forfeit any optimisations/performance-boosting code in Petastorm's loader (although shouldn't these be in `Reader` anyway)?
So, would anyone be able to offer some advice on the best approach/course of action here (or just on what might have to be coded/modified)?
Thanks so much in advance!