TransformSpec using Pandas causes incompatibilities with other libraries for make_batch_reader #603

@KamWithK

Hey guys, I'm trying to create a compatibility interface between Petastorm and a few PyTorch-based libraries (PyTorch Lightning, Hugging Face Transformers and AllenNLP) that I want to use in a project. So far I've managed to get PyTorch Lightning working (it's pretty much a research-oriented Keras), but a few design choices within Petastorm seem to prevent usage with the NLP libraries.

My problem is that TransformSpec requires both its input and its output to be Pandas DataFrames. This may seem reasonable at first, but common NLP libraries like Hugging Face Transformers tokenize lists of strings (this transformation is easy) and directly output tensors. These tensors aren't flat, so they can't be converted to Pandas, meaning that processing textual data (despite being fairly straightforward) seems nearly impossible with Petastorm's built-in data loaders.

I've been working on this for a week, and these are the two ways I can see to work around the problem:

  1. Modifying Petastorm's existing TransformSpec/PyTorch DataLoader/PyArrow classes
  2. Creating iterable PyTorch data loaders which just loop through the Reader object

I've been trying to work out how I'd do the first option (by debugging the code), but it looks extremely complicated and (I think) it would require modifying a number of classes (exactly which modifications are needed still eludes me). On the other hand, looping through a Reader seems reasonable, as we can still read in strings. But would doing this forfeit any optimisations/performance-boosting code from Petastorm's loader (although shouldn't these be in Reader)?
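To make the second option concrete, here's a rough sketch of what I have in mind. It only assumes that the reader is an iterable of batches, where each batch is a namedtuple with one field per column (which is what I understand `make_batch_reader` yields). The names `iter_tokenized_batches`, `toy_tokenize`, `Batch` and `fake_reader` are made up for illustration, and the fake reader and whitespace tokenizer are stand-ins so the snippet runs without Petastorm or Transformers installed.

```python
from collections import namedtuple
from typing import Any, Callable, Dict, Iterable, Iterator, List


def iter_tokenized_batches(
    reader: Iterable[Any],
    text_field: str,
    tokenize: Callable[[List[str]], Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Loop over a reader and tokenize the text column of every batch.

    Assumption: each batch exposes its columns as attributes (e.g. a
    namedtuple), and `tokenize` maps a list of strings to a dict of
    nested (non-flat) outputs, so no DataFrame is involved.
    """
    for batch in reader:
        texts = list(getattr(batch, text_field))
        yield tokenize(texts)


def toy_tokenize(texts: List[str]) -> Dict[str, List[List[int]]]:
    """Hypothetical stand-in tokenizer: word lengths as ids, zero-padded."""
    ids = [[len(word) for word in text.split()] for text in texts]
    width = max(len(row) for row in ids)
    return {"input_ids": [row + [0] * (width - len(row)) for row in ids]}


# Stand-in for a real reader: two batches with a single "text" column.
Batch = namedtuple("Batch", ["text"])
fake_reader = [
    Batch(text=["hello world", "hi"]),
    Batch(text=["one two three"]),
]

batches = list(iter_tokenized_batches(fake_reader, "text", toy_tokenize))
print(batches[0]["input_ids"])  # [[5, 5], [2, 0]]
print(batches[1]["input_ids"])  # [[3, 3, 5]]
```

In real use I'd expect `reader` to come from `make_batch_reader(...)` and `tokenize` to be a Hugging Face tokenizer call, with the generator wrapped in a PyTorch `IterableDataset`, but I haven't verified that this keeps whatever the built-in loader adds on top of Reader.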

So, would anyone be able to provide some advice on what you believe to be the best approach/course of action (or just what might have to be coded/modified)?
Thanks so much in advance!
