Issue in matcher - ValueError: matrix contains invalid numeric entries

Hi, I get the following error while training a model:



```
Epoch: [10/100]:  51%|█████████████████████████████████████████████████████████                                                      | 169/329 [09:41<09:10, 
 3.44s/it, lr=0.000100, class_loss=4.04, box_loss=0.09, loss=35.33, max_mem=24481 MB]                                                                        
Traceback (most recent call last):                                                                                                                           
  File "<frozen runpy>", line 198, in _run_module_as_main                                                                                                    
  File "<frozen runpy>", line 88, in _run_code                                                                                                               
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 158, in <module>                                                           
    main()                                                                                                                                                   
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 143, in main                                                               
    checkpoint_path = train(config=config, model_size=model_size, resume_training_weights=resume_training_weights)                                           
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                           
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 85, in train                                                               
    model.train(**train_kwargs)                                                                                                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/detr.py", line 105, in train                                    
    self.train_from_config(config, **kwargs)                                                                                                                 
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/detr.py", line 281, in train_from_config                        
    self.model.train(                                                                                                                                        
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/main.py", line 416, in train                                    
    train_stats = train_one_epoch(                                                                                                                           
                  ^^^^^^^^^^^^^^^^                                                                                                                           
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/engine.py", line 187, in train_one_epoch                        
    loss_dict = criterion(outputs, new_targets)                                                                                                              
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl          
    return self._call_impl(*args, **kwargs)                                                                                                                  
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                  
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl                  
    return forward_call(*args, **kwargs)                                                                                                                     
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                      
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/models/lwdetr.py", line 706, in forward
    indices = self.matcher(outputs_without_aux, targets, group_detr=group_detr)                                                                              
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                   
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                   
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)                                      
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                      
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)                                              
           ^^^^^^^^^^^^^^^^^^^^^                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/models/matcher.py", line 186, in forward
    indices_g = [linear_sum_assignment(c[i]) for i, c in enumerate(C_g.split(sizes, -1))]                                                                    
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                  
ValueError: matrix contains invalid numeric entries
```


I’m using RFDETRSegXLarge with rfdetr 1.5.2, CUDA 13.1, and PyTorch 2.8.0. You can find the full training configuration below. The same error consistently occurs around epoch 10.

Two days ago, I ran the same training configuration without augmentations on rf-detr 1.4.2, and everything worked smoothly. The only change I made was upgrading to version 1.5.2 so I could use augmentations.

I suspect the issue may be related to augmentations. I’ve already tried turning AMP on and off, and I checked my dataset multiple times - annotations are clean and there are no background-only images.

I also inspected the source code and found a potential bug in `rfdetr/models/matcher.py` lines 176-178:

```
max_cost = C.max() if C.numel() > 0 else 0
C[C.isinf() | C.isnan()] = max_cost * 2
```

`.max()` propagates NaN - if any element of C is NaN, `C.max()` returns NaN, so the replacement `C[mask] = NaN * 2` is a no-op. The cleanup was intended to handle non-finite values but silently fails in exactly the cases it's supposed to catch. I'm currently exploring further to resolve this but wanted to ask if somebody has the same problems.

```

AUGMENTATION_CONFIGS = {
    "HorizontalFlip": {"p": 0.5},
    "CLAHE": {"clip_limit": 2.0, "tile_grid_size": (8, 8), "p": 0.3},
    "GaussianBlur": {"blur_limit": (3, 7), "p": 0.2},
    "GaussNoise": {"std_range": (0.01, 0.05), "p": 0.2},
    "RandomBrightnessContrast": {
        "brightness_limit": 0.15,
        "contrast_limit": 0.15,
        "p": 0.3,
    },
    "Sharpen": {"alpha": (0.2, 0.5), "lightness": (0.5, 1.0), "p": 0.2},
}

@dataclass
class TrainingConfig:
    """
    Parameters passed directly to rfdetr model.train().
    """
    # data related paths
    dataset_dir: Path
    output_dir: Path
    # architecture related hyperparameters
    resolution: int = 624
    group_detr: int = 13
    # training related hyperparameters
    epochs: int = 100
    batch_size: int = 4
    grad_accum_steps: int = 4
    lr: float = 1e-4
    amp: bool = False
    gradient_checkpointing: bool = False
    num_workers: int = 2
    multi_scale: bool = True
    aug_config: dict = field(default_factory=lambda: AUGMENTATION_CONFIGS)
    # logging hyperparameters
    print_freq: int = 100
    tensorboard: bool = True
    run_test: bool = False
    progress_bar: bool = True
    # validation hyperparameters
    early_stopping: bool = True
    early_stopping_patience: int = 20
    use_ema: bool = True
    early_stopping_use_ema: bool = False
    num_select: int = 100
    eval_max_dets: int = 500
    fp16_eval: bool = False
    # reproducibility
    seed: int = 42
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue in matcher - ValueError: matrix contains invalid numeric entries #784

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue in matcher - ValueError: matrix contains invalid numeric entries #784

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions