[BUG] LoadDataFromFileSystem gets stuck when used with fsspec #1110

@gabrielmbmb

Describe the bug

LoadDataFromFileSystem never finishes loading when used in a Pipeline because the call to its load method gets stuck. The root cause is that distilabel needs to know in advance which outputs a step will produce, and it reads them from the step's outputs property in the main process. In LoadDataFromFileSystem, the outputs property calls the load method to load the datasets.Dataset and read its column_names. When load is then called again from the child process, it gets stuck, probably for the same reason as in fsspec/s3fs#464.
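The access pattern described above can be illustrated with a toy model. This is only a sketch: `ToyLoadStep`, its `load_pids` attribute, and the column names are hypothetical and are not distilabel's actual code. It shows only that reading `outputs` forces a full `load()` in whichever process asks for it (here, the main process).

```python
import os


class ToyLoadStep:
    """Hypothetical stand-in for a loader step; not distilabel's real class."""

    def __init__(self, columns):
        self._columns = list(columns)
        self._data = None
        self.load_pids = []  # PID of every process that called load()

    def load(self):
        # In the real step, this is where the filesystem (e.g. S3 through
        # fsspec) would be touched. Here we only record the calling process.
        self.load_pids.append(os.getpid())
        self._data = {name: [] for name in self._columns}

    @property
    def outputs(self):
        # Asking for the column names triggers a full load() in the caller.
        self.load()
        return list(self._data.keys())


step = ToyLoadStep(["text", "label"])
columns = step.outputs  # accessed from the main process
main_process_loaded = step.load_pids == [os.getpid()]
```

In the pipeline, a second `load()` call would then happen in the child process, which is where the hang is observed.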

To reproduce

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromFileSystem

with Pipeline() as pipeline:
    load_data = LoadDataFromFileSystem(
        data_files="s3://my_path/*.parquet",
        num_examples=10,
        output_mappings={"text": "seed"},
    )

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)

Expected behavior

The step doesn't get stuck.

Screenshots

No response

Environment

  • Distilabel Version [e.g. 1.0.0]: 1.5.1
  • Python Version [e.g. 3.11]: 3.11

Additional context

No response
