Bug description
Hi, I've encountered a really strange problem occurring during data processing in our training pipelines.
I've managed to distill the problem to a single script:
import pytorch_lightning
from datasets import Dataset
import spacy


def main():
    dataset = Dataset.from_dict({
        "ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    })

    def _map_raw(examples_batch):
        spacy.load("en_core_web_sm")
        return examples_batch

    module = pytorch_lightning.LightningModule()
    dataset.map(_map_raw, batched=True, batch_size=2, num_proc=4)


if __name__ == '__main__':
    main()
This fails with the following error:
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/sazanovich/.cache/bazel/_bazel_sazanovich/b6bfd90e9c0267baf464defccc50e727/external/python3_9_x86_64-unknown-linux-gnu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/sazanovich/.cache/bazel/_bazel_sazanovich/b6bfd90e9c0267baf464defccc50e727/external/python3_9_x86_64-unknown-linux-gnu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/multiprocess/pool.py", line 576, in _handle_results
    task = get()
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/multiprocess/connection.py", line 259, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/dill/_dill.py", line 286, in loads
    return load(file, ignore, **kwds)
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/dill/_dill.py", line 272, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/dill/_dill.py", line 419, in load
    obj = StockUnpickler.load(self)
TypeError: __init__() takes 1 positional argument but 2 were given
Interestingly, any one of the following changes makes the error go away:
- Remove everything related to PL
- Remove the spacy.load call from _map_raw
- Move the pytorch_lightning import after the spacy and datasets imports
I understand that this may not be a PL issue, but could you advise me on how this is happening? Where should I look? Is there a workaround?
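For reference, the traceback shows the failure happening while the parent process unpickles something received from a worker. The same TypeError message can be reproduced in plain Python, without any of the libraries above, when an exception class defines an `__init__` that takes no arguments but passes a message to its parent. This is only a sketch of one possible mechanism, not a confirmed diagnosis of this issue. `NoArgError`, `roundtrip` and `unpickle_error_message` are hypothetical names, and the sketch uses the standard `pickle` module rather than dill.

```python
import pickle


class NoArgError(Exception):
    """Hypothetical exception whose __init__ accepts no arguments."""

    def __init__(self):
        # The message ends up in self.args, which pickle later passes
        # back to __init__ when reconstructing the exception.
        super().__init__("something went wrong")


def roundtrip(exc):
    """Pickle and unpickle an object, as a process pool does with results."""
    return pickle.loads(pickle.dumps(exc))


def unpickle_error_message():
    """Return the TypeError text raised while unpickling, or None."""
    try:
        roundtrip(NoArgError())
    except TypeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(unpickle_error_message())
```

If something similar is happening here, the TypeError would be hiding a different exception raised inside a worker, so running the same map call without num_proc might surface the underlying error directly.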
Environment
python==3.9
torch==1.13.1
pytorch_lightning==1.9.3
datasets==2.9.0
spacy==3.4.4
I use the Python interpreter provided by the Bazel rules; all requirements are installed with pip.
OS: Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-197-generic x86_64)
NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4