Sanitize non-finite matcher costs before Hungarian assignment #787

Open
Copilot wants to merge 7 commits into develop from copilot/fix-valueerror-matrix-entries

Conversation

Copilot AI commented Mar 6, 2026

Training could fail in HungarianMatcher with ValueError: matrix contains invalid numeric entries when the cost matrix contained NaN/Inf values. The existing cleanup path used C.max(), which also became NaN in those cases and left invalid entries unsanitized.

  • Matcher cost sanitization

    • Replace non-finite entries using only finite costs as the reference.
    • Compute a finite fallback cost that is strictly larger than every valid entry, so invalid matches are always deprioritized instead of reaching linear_sum_assignment.
    • Handle the degenerate case where the entire cost matrix is non-finite.
  • Regression coverage

    • Add a focused matcher regression test for both NaN and Inf costs.
    • Verify matching still succeeds and selects the valid query/target pair.
finite_mask = torch.isfinite(C)
if not finite_mask.all():
    if finite_mask.any():
        finite_costs = C[finite_mask]
        max_cost = finite_costs.max()
        replacement_cost = max_cost + finite_costs.abs().max() + 1
    else:
        replacement_cost = C.new_tensor(1.0)
    C[~finite_mask] = replacement_cost
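
To make the arithmetic concrete, here is a minimal plain-Python sketch of the same rule applied to a nested list instead of a tensor. The helper name `sanitize_costs` is hypothetical and not part of this PR; it only mirrors the snippet above so the replacement value can be checked by hand.

```python
import math


def sanitize_costs(cost_rows):
    """Return a copy of cost_rows with NaN/Inf replaced by a finite sentinel.

    Hypothetical list-based analogue of the tensor snippet above.
    """
    finite = [c for row in cost_rows for c in row if math.isfinite(c)]
    if finite:
        # Use only finite entries as the reference, so NaN cannot poison the maximum.
        replacement = max(finite) + max(abs(c) for c in finite) + 1
    else:
        # Degenerate case: nothing finite to compare against.
        replacement = 1.0
    return [[c if math.isfinite(c) else replacement for c in row] for row in cost_rows]


costs = [[1.0, float("nan")], [float("inf"), 3.0]]
print(sanitize_costs(costs))  # [[1.0, 7.0], [7.0, 3.0]]
print(sanitize_costs([[float("nan"), float("-inf")]]))  # [[1.0, 1.0]]
```

In the first call the finite maximum is 3.0, so the sentinel is 3.0 + 3.0 + 1 = 7.0, which is above every valid entry.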

Original prompt

This section details the original issue you should resolve

<issue_title>Issue in matcher - ValueError: matrix contains invalid numeric entries</issue_title>
<issue_description>Hi, I get the following error while training a model:

Epoch: [10/100]:  51%|█████████████████████████████████████████████████████████                                                      | 169/329 [09:41<09:10,  3.44s/it, lr=0.000100, class_loss=4.04, box_loss=0.09, loss=35.33, max_mem=24481 MB]
Traceback (most recent call last):                                                                                                                           
  File "<frozen runpy>", line 198, in _run_module_as_main                                                                                                    
  File "<frozen runpy>", line 88, in _run_code                                                                                                               
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 158, in <module>                                                           
    main()                                                                                                                                                   
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 143, in main                                                               
    checkpoint_path = train(config=config, model_size=model_size, resume_training_weights=resume_training_weights)                                           
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                           
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 85, in train                                                               
    model.train(**train_kwargs)                                                                                                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/detr.py", line 105, in train                                    
    self.train_from_config(config, **kwargs)                                                                                                                 
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/detr.py", line 281, in train_from_config                        
    self.model.train(                                                                                                                                        
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/main.py", line 416, in train                                    
    train_stats = train_one_epoch(                                                                                                                           
                  ^^^^^^^^^^^^^^^^                                                                                                                           
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/engine.py", line 187, in train_one_epoch                        
    loss_dict = criterion(outputs, new_targets)                                                                                                              
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl          
    return self._call_impl(*args, **kwargs)                                                                                                                  
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                  
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl                  
    return forward_call(*args, **kwargs)                                                                                                                     
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                      
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/models/lwdetr.py", line 706, in forward
    indices = self.matcher(outputs_without_aux, targets, group_detr=group_detr)                                                                              
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                ...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes roboflow/rf-detr#784

Copilot AI and others added 2 commits March 6, 2026 21:57
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix ValueError in matcher when training model" to "Sanitize non-finite matcher costs before Hungarian assignment" on Mar 6, 2026
Copilot finished work on behalf of Borda March 6, 2026 22:00
codecov bot commented Mar 7, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73%. Comparing base (fa4b623) to head (89c4b52).
⚠️ Report is 1 commit behind head on develop.

❌ Your project check has failed because the head coverage (73%) is below the target coverage (95%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@          Coverage Diff           @@
##           develop   #787   +/-   ##
======================================
  Coverage       73%    73%           
======================================
  Files           69     69           
  Lines         8149   8162   +13     
======================================
+ Hits          5965   5976   +11     
- Misses        2184   2186    +2     
Borda requested a review from Copilot on March 10, 2026 22:11
Borda marked this pull request as ready for review on March 10, 2026 22:11
Copilot AI left a comment

Pull request overview

This PR hardens HungarianMatcher against NaN/Inf entries in the cost matrix so SciPy’s linear_sum_assignment doesn’t raise ValueError: matrix contains invalid numeric entries during training.

Changes:

  • Sanitize non-finite entries in the matcher cost matrix using only finite costs as reference, with a deterministic finite fallback when all entries are non-finite.
  • Ensure replacement costs are strictly larger than any valid finite cost so invalid pairs are deprioritized.
  • Add regression tests covering both NaN and Inf cost contamination cases.
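
The intent of such a regression test can be sketched without PyTorch or SciPy. The example below is an assumption-laden illustration, not the test added in `tests/models/test_matcher.py`: `sanitize` is a hypothetical list-based version of the replacement rule, and `best_assignment` is an exhaustive stand-in for the assignment solver that only suits tiny square matrices.

```python
import itertools
import math


def sanitize(cost_rows):
    # Hypothetical list-based version of the finite-only replacement rule.
    finite = [c for row in cost_rows for c in row if math.isfinite(c)]
    fill = (max(finite) + max(abs(c) for c in finite) + 1) if finite else 1.0
    return [[c if math.isfinite(c) else fill for c in row] for row in cost_rows]


def best_assignment(cost_rows):
    # Exhaustive search over column permutations; a stand-in solver for tiny
    # square matrices, not SciPy's linear_sum_assignment.
    n = len(cost_rows)
    return min(
        itertools.permutations(range(n)),
        key=lambda cols: sum(cost_rows[r][c] for r, c in enumerate(cols)),
    )


def check(bad):
    # Diagonal pairs are valid; the off-diagonal costs are corrupted.
    costs = [[1.0, bad], [bad, 2.0]]
    return best_assignment(sanitize(costs))


for bad in (float("nan"), float("inf"), float("-inf")):
    # Row 0 keeps column 0 and row 1 keeps column 1 in every case.
    assert check(bad) == (0, 1)
```

After sanitization the corrupted entries become 5.0, so the diagonal assignment costs 3.0 against 10.0 for the swapped one.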

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/rfdetr/models/matcher.py | Replaces the previous C.max()-based cleanup with finite-only sanitization and a safe fallback when all entries are non-finite. |
| tests/models/test_matcher.py | Adds a parametrized regression test ensuring matching succeeds and selects the valid query/target pair when non-finite costs are present. |

- Add logger.warning() when non-finite values are detected in the cost
  matrix so numerical instability surfaces early during training
- Add -inf as a third parametrize case alongside nan and inf
- Split all-nonfinite test into three focused assertions
- Add regression test for negative-cost + NaN (Bug 2: max_cost*2 amplification)
- Add batch_size>1 parametrized test exercising the C.split loop
- Extract matcher and standard_target fixtures to reduce duplication
- Wrap all tests in TestHungarianMatcherNonFiniteCosts class

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
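
The negative-cost case in the commit message above can be checked with a few lines of arithmetic. This sketch assumes that "max_cost*2" refers to using twice the maximum cost as the sentinel; the values are made up for illustration.

```python
finite_costs = [-5.0, -3.0]  # illustrative values: every valid cost is negative

max_cost = max(finite_costs)  # -3.0

# Doubling a negative maximum moves the sentinel below every valid cost,
# so a corrupted entry would look like the most attractive match.
doubled = max_cost * 2  # -6.0

# Adding the largest magnitude plus one stays above the maximum for any sign,
# because the largest magnitude is never negative.
offset = max_cost + max(abs(c) for c in finite_costs) + 1  # -3.0 + 5.0 + 1 = 3.0

assert doubled < min(finite_costs)
assert offset > max_cost
```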
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Comment on lines +183 to +187
logger.warning(
    "Non-finite values detected in matcher cost matrix; "
    "replacing with finite sentinel. "
    "Check for numerical instability."
)
Copilot AI Mar 10, 2026

This logger.warning(...) is in the matcher forward pass (runs every training step) and will emit once per batch whenever non-finite costs occur, potentially spamming logs (especially under DDP, where each rank logs). Consider throttling (e.g., warn once per process / once per epoch) or gating behind is_main_process() if available, while still keeping the sanitization behavior.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>