Open
Labels
bug (Something isn't working)
Description
Describe the bug
LoadDataFromFileSystem never finishes loading when used in a Pipeline because the call to its load method gets stuck. The root cause is that distilabel needs to know in advance which outputs a step will produce, and it reads them from the outputs property, which is accessed from the main process. In LoadDataFromFileSystem, the outputs property calls the load method to load the datasets.Dataset and read its column_names. When the load method is then called again from the child process, it gets stuck, probably for the same reason as in fsspec/s3fs#464.
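To make the access pattern concrete, here is a minimal toy sketch. It is not distilabel's code: `ToyLoadStep`, its `columns` argument and `load_pids` are hypothetical names used only for illustration. It shows how reading `outputs` forces a `load()` in the calling (main) process, and how declaring the columns up front would avoid that call.

```python
import os


class ToyLoadStep:
    """Toy stand-in for a loader step; not distilabel's implementation."""

    def __init__(self, columns=None):
        # Hypothetical: optional column names declared up front by the user.
        self._declared_columns = columns
        self._data = None
        self.load_pids = []  # PID of every process that called load()

    def load(self):
        # Stand-in for opening the remote filesystem and reading the dataset.
        self.load_pids.append(os.getpid())
        self._data = {"text": ["a", "b"]}

    @property
    def outputs(self):
        # If the columns were declared, no I/O happens here.
        if self._declared_columns is not None:
            return list(self._declared_columns)
        # Otherwise the data must be loaded just to read its column names,
        # which is the access pattern described in this report.
        if self._data is None:
            self.load()
        return list(self._data)


eager = ToyLoadStep()
eager_outputs = eager.outputs        # triggers load() in the main process

declared = ToyLoadStep(columns=["text"])
declared_outputs = declared.outputs  # no load() call in the main process
```

In the eager case, `load_pids` already contains the main process PID before any child process starts. If the real loader opens an s3fs filesystem at that point, the later `load()` in the child would run after that state was created in the parent, which matches the scenario suspected above.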
To reproduce
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromFileSystem

with Pipeline() as pipeline:
    load_data = LoadDataFromFileSystem(
        data_files="s3://my_path/*.parquet",
        num_examples=10,
        output_mappings={"text": "seed"},
    )

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)

Expected behavior
The step doesn't get stuck.
Screenshots
No response
Environment
- Distilabel Version [e.g. 1.0.0]: 1.5.1
- Python Version [e.g. 3.11]: 3.11
Additional context
No response