Fatal error training detection model #2018

@neojg

Description

Bug description

I created the labels.json file and the images.

I ran the command:

python references\detection\train.py db_resnet50 --epochs 20 --train_path C:\RBEE\DO\DetectionTrain --val_path C:\RBEE\DO\DetectionValidate --pretrained --name DtectDO --output_dir C:\RBEE\DO\DetectionTrain\models

Every attempt fails immediately with `AttributeError: Can't pickle local object 'main.<locals>.<lambda>'` (full output in the "Error traceback" section below).
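The error itself reproduces outside docTR. Windows has no `fork()`, so `DataLoader` workers are started with the `spawn` method, which pickles the loader's state (including its `collate_fn`) to send it to each worker process; a lambda defined inside `main()` has no importable qualified name and cannot be pickled. A minimal sketch, independent of docTR's actual `train.py` internals:

```python
import pickle

def main():
    # On Windows, DataLoader workers are created via "spawn", which pickles
    # the DataLoader state -- including its collate_fn. A lambda defined
    # inside main() is a local object, so pickling it fails exactly as in
    # the traceback above.
    collate_fn = lambda batch: batch
    try:
        pickle.dumps(collate_fn)
    except AttributeError as exc:
        print(exc)  # → Can't pickle local object 'main.<locals>.<lambda>'

if __name__ == "__main__":
    main()
```

The subsequent `EOFError: Ran out of input` in the child process is a consequence of the same failure: the parent crashes while serializing, so the spawned worker reads an empty pipe.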

Code snippet to reproduce the bug

I created the labels.json file and the images, then ran:

python references\detection\train.py db_resnet50 --epochs 20 --train_path C:\RBEE\DO\DetectionTrain --val_path C:\RBEE\DO\DetectionValidate --pretrained --name DtectDO --output_dir C:\RBEE\DO\DetectionTrain\models

Error traceback

(env12) C:\Work\doctr2\doctr>python references\detection\train.py db_resnet50 --epochs 20 --train_path C:\RBEE\DO\DetectionTrain --val_path C:\RBEE\DO\DetectionValidate --pretrained --name DtectDO --output_dir C:\RBEE\DO\DetectionTrain\models
Namespace(backend='nccl', device=None, arch='db_resnet50', output_dir='C:\\RBEE\\DO\\DetectionTrain\\models', train_path='C:\\RBEE\\DO\\DetectionTrain', val_path='C:\\RBEE\\DO\\DetectionValidate', name='DtectDO', epochs=20, batch_size=2, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, clearml=False, push_to_hub=False, pretrained=True, rotation=False, eval_straight=False, optim='adam', sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.01304s (78 samples in 39 batches)
Train set loaded in 0.002001s (8 samples in 4 batches)
  0%|                                                                                                                                                                 | 0/4 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Work\doctr2\doctr\references\detection\train.py", line 650, in <module>
    main(args)
  File "C:\Work\doctr2\doctr\references\detection\train.py", line 521, in main
    train_loss, actual_lr = fit_one_epoch(
                            ^^^^^^^^^^^^^^
  File "C:\Work\doctr2\doctr\references\detection\train.py", line 115, in fit_one_epoch
    pbar = tqdm(train_loader, dynamic_ncols=True, disable=(rank != 0))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Work\doctr2\env12\Lib\site-packages\tqdm\asyncio.py", line 33, in __init__
    self.iterable_iterator = iter(iterable)
                             ^^^^^^^^^^^^^^
  File "C:\Work\doctr2\env12\Lib\site-packages\torch\utils\data\dataloader.py", line 494, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Work\doctr2\env12\Lib\site-packages\torch\utils\data\dataloader.py", line 427, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Work\doctr2\env12\Lib\site-packages\torch\utils\data\dataloader.py", line 1172, in __init__
    w.start()
  File "C:\Users\Arhat\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\Arhat\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Arhat\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Arhat\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Arhat\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.<lambda>'
  0%|                                                                                                                                                                 | 0/4 [00:00<?, ?it/s]

(env12) C:\Work\doctr2\doctr>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Arhat\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Arhat\AppData\Local\Programs\Python\Python312\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input

[labels.json](https://github.com/user-attachments/files/22547178/labels.json)
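Not a confirmed fix for this repo, but the usual workarounds for this class of error are (a) passing `--workers 0` (the `Namespace` above shows a `workers` option, currently `None`) so no worker processes are spawned, or (b) hoisting the lambda out of `main()` to module level, where it pickles by reference. A sketch of (b), assuming nothing about `train.py`'s real collate logic:

```python
import pickle
from functools import partial

# Module-level callables pickle by reference (module name + qualified name),
# so spawn-based DataLoader workers can receive them. This is the generic
# workaround for "Can't pickle local object": hoist the lambda to module scope.

def collate_fn(batch):
    return batch  # placeholder body; the real collate logic would go here

# Round-trips fine, unlike a lambda defined inside main():
assert pickle.loads(pickle.dumps(collate_fn)) is collate_fn

# If the collate function needs extra parameters, a functools.partial of a
# module-level function is also picklable, unlike a closure or lambda:
configured = partial(collate_fn)
assert pickle.loads(pickle.dumps(configured)).func is collate_fn
```

Option (a) avoids the pickling path entirely, at the cost of single-process data loading.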

Environment

Under Windows 11:

Collecting environment information...

DocTR version: 1.0.1a0
PyTorch version: 2.8.0+cu129 (torchvision 0.23.0+cu129)
OpenCV version: 4.12.0
OS: Microsoft Windows 11 Pro
Python version: 3.12.4
Is CUDA available (PyTorch): Yes
CUDA runtime version: 12.9.41
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
