Sanitize non-finite matcher costs before Hungarian assignment by Copilot · Pull Request #787 · roboflow/rf-detr

Copilot · 2026-03-06T21:51:03Z

Training could fail in HungarianMatcher with ValueError: matrix contains invalid numeric entries when the cost matrix contained NaN/Inf values. The existing cleanup path used C.max(), which also became NaN in those cases and left invalid entries unsanitized.

Matcher cost sanitization
- Replace non-finite entries using only finite costs as the reference.
- Compute a finite fallback cost that is strictly larger than every valid entry, so invalid matches are always deprioritized instead of reaching linear_sum_assignment.
- Handle the degenerate case where the entire cost matrix is non-finite.
Regression coverage
- Add a focused matcher regression test for both NaN and Inf costs.
- Verify matching still succeeds and selects the valid query/target pair.

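# Mask of entries that linear_sum_assignment can accept.
finite_mask = torch.isfinite(C)
if not finite_mask.all():
    if finite_mask.any():
        finite_costs = C[finite_mask]
        max_cost = finite_costs.max()
        # abs().max() >= 0, so the sentinel exceeds max_cost even when all costs are negative.
        replacement_cost = max_cost + finite_costs.abs().max() + 1
    else:
        # Every entry is NaN/Inf: any finite constant works as the sentinel.
        replacement_cost = C.new_tensor(1.0)
    C[~finite_mask] = replacement_cost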
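
To make the arithmetic behind the replacement cost concrete, here is a minimal pure-Python sketch of the same rule. The helper name `replacement_cost` and the sample values are assumptions for illustration only; the PR itself operates on a torch tensor. The example uses negative costs because that is where a doubling rule such as `max_cost * 2` (referred to elsewhere in this thread as a bug) would produce a sentinel that is smaller than the valid entries.

```python
import math


def replacement_cost(costs):
    """Pure-Python mirror of the tensor logic above (hypothetical helper)."""
    finite = [c for c in costs if math.isfinite(c)]
    if not finite:
        # Degenerate case: nothing finite to compare against.
        return 1.0
    max_cost = max(finite)
    max_abs = max(abs(c) for c in finite)
    # max_abs >= 0, so the result is at least max_cost + 1.
    return max_cost + max_abs + 1


# Illustrative values: two valid negative costs plus two invalid entries.
costs = [-5.0, -3.0, float("nan"), float("-inf")]
finite_costs = [c for c in costs if math.isfinite(c)]

fallback = replacement_cost(costs)   # -3.0 + 5.0 + 1
doubled = 2 * max(finite_costs)      # what a max_cost * 2 rule would give

# Replace every non-finite entry with the fallback.
sanitized = [c if math.isfinite(c) else fallback for c in costs]
```

With these values the fallback sits above every valid cost, while the doubled value falls below all of them and would make the invalid entries the most attractive matches.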

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

images.cocodataset.org
- Triggering command: /home/REDACTED/work/rf-detr/rf-detr/.venv/bin/python /home/REDACTED/work/rf-detr/rf-detr/.venv/bin/python -u -c import sys;exec(eval(sys.stdin.readline())) e-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py (dns block)
- Triggering command: /home/REDACTED/work/rf-detr/rf-detr/.venv/bin/python /home/REDACTED/work/rf-detr/rf-detr/.venv/bin/python -u -c import sys;exec(eval(sys.stdin.readline())) -a che/pre-commit/repof3kz6y_w/.git (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

- Configure Actions setup steps to set up my environment, which run before the firewall is enabled
- Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

This section details the original issue you should resolve

<issue_title>Issue in matcher - ValueError: matrix contains invalid numeric entries</issue_title>
<issue_description>Hi, I get the following error while training a model:

Epoch: [10/100]:  51%|█████████████████████████████████████████████████████████                                                      | 169/329 [09:41<09:10, 
 3.44s/it, lr=0.000100, class_loss=4.04, box_loss=0.09, loss=35.33, max_mem=24481 MB]                                                                        
Traceback (most recent call last):                                                                                                                           
  File "<frozen runpy>", line 198, in _run_module_as_main                                                                                                    
  File "<frozen runpy>", line 88, in _run_code                                                                                                               
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 158, in <module>                                                           
    main()                                                                                                                                                   
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 143, in main                                                               
    checkpoint_path = train(config=config, model_size=model_size, resume_training_weights=resume_training_weights)                                           
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                           
  File "/home/ubuntu/sgoluza/instance_segmentation/src/training/trainer.py", line 85, in train                                                               
    model.train(**train_kwargs)                                                                                                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/detr.py", line 105, in train                                    
    self.train_from_config(config, **kwargs)                                                                                                                 
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/detr.py", line 281, in train_from_config                        
    self.model.train(                                                                                                                                        
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/main.py", line 416, in train                                    
    train_stats = train_one_epoch(                                                                                                                           
                  ^^^^^^^^^^^^^^^^                                                                                                                           
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/engine.py", line 187, in train_one_epoch                        
    loss_dict = criterion(outputs, new_targets)                                                                                                              
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                              
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl          
    return self._call_impl(*args, **kwargs)                                                                                                                  
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                  
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl                  
    return forward_call(*args, **kwargs)                                                                                                                     
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                      
  File "/home/ubuntu/sgoluza/instance_segmentation/.venv/lib/python3.12/site-packages/rfdetr/models/lwdetr.py", line 706, in forward
    indices = self.matcher(outputs_without_aux, targets, group_detr=group_detr)                                                                              
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                ...

</details>



- Fixes roboflow/rf-detr#784

Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>

codecov · 2026-03-07T16:35:20Z

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73%. Comparing base (fa4b623) to head (89c4b52).
⚠️ Report is 1 commit behind head on develop.

❌ Your project check has failed because the head coverage (73%) is below the target coverage (95%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@          Coverage Diff           @@
##           develop   #787   +/-   ##
======================================
  Coverage       73%    73%           
======================================
  Files           69     69           
  Lines         8149   8162   +13     
======================================
+ Hits          5965   5976   +11     
- Misses        2184   2186    +2

Copilot

Pull request overview

This PR hardens HungarianMatcher against NaN/Inf entries in the cost matrix so SciPy’s linear_sum_assignment doesn’t raise ValueError: matrix contains invalid numeric entries during training.

Changes:

- Sanitize non-finite entries in the matcher cost matrix using only finite costs as reference, with a deterministic finite fallback when all entries are non-finite.
- Ensure replacement costs are strictly larger than any valid finite cost so invalid pairs are deprioritized.
- Add regression tests covering both NaN and Inf cost contamination cases.
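
The behavior the regression tests check (matching succeeds and the valid query/target pair is selected) can be illustrated with a small standalone sketch. This is not the test added in the PR: `sanitize` and `brute_force_assign` are hypothetical helpers, and the exhaustive search is only a stand-in for SciPy's `linear_sum_assignment` on tiny inputs.

```python
import math
from itertools import permutations


def sanitize(matrix):
    """Replace non-finite entries with a cost above every finite one."""
    finite = [c for row in matrix for c in row if math.isfinite(c)]
    if finite:
        fallback = max(finite) + max(abs(c) for c in finite) + 1
    else:
        fallback = 1.0
    return [[c if math.isfinite(c) else fallback for c in row] for row in matrix]


def brute_force_assign(matrix):
    """Exhaustive stand-in for linear_sum_assignment (tiny inputs only).

    Rows are queries, columns are targets; assumes rows >= columns.
    Returns a sorted list of (query, target) pairs with minimal total cost.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    best = min(
        permutations(range(n_rows), n_cols),
        key=lambda rows: sum(matrix[r][t] for t, r in enumerate(rows)),
    )
    return sorted((r, t) for t, r in enumerate(best))


# Three queries, one target; only query 1 has a usable cost.
cost = [[float("nan")], [0.2], [float("inf")]]
clean = sanitize(cost)
pairs = brute_force_assign(clean)
```

Here `pairs` contains the single pair (query 1, target 0): the NaN and Inf entries have been raised above the one valid cost, so the assignment never prefers them.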

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/rfdetr/models/matcher.py` | Replaces the previous `C.max()`-based cleanup with finite-only sanitization and a safe fallback when all entries are non-finite. |
| `tests/models/test_matcher.py` | Adds a parametrized regression test ensuring matching succeeds and selects the valid query/target pair when non-finite costs are present. |

- Add logger.warning() when non-finite values are detected in the cost matrix so numerical instability surfaces early during training
- Add -inf as a third parametrize case alongside nan and inf
- Split all-nonfinite test into three focused assertions
- Add regression test for negative-cost + NaN (Bug 2: max_cost*2 amplification)
- Add batch_size>1 parametrized test exercising the C.split loop
- Extract matcher and standard_target fixtures to reduce duplication
- Wrap all tests in TestHungarianMatcherNonFiniteCosts class

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

src/rfdetr/models/matcher.py

Copilot · 2026-03-10T23:05:55Z

+            logger.warning(
+                "Non-finite values detected in matcher cost matrix; "
+                "replacing with finite sentinel. "
+                "Check for numerical instability."
+            )


This logger.warning(...) is in the matcher forward pass (runs every training step) and will emit once per batch whenever non-finite costs occur, potentially spamming logs (especially under DDP, where each rank logs). Consider throttling (e.g., warn once per process / once per epoch) or gating behind is_main_process() if available, while still keeping the sanitization behavior.
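
One way to throttle the warning as suggested is a warn-once guard keyed per process. The sketch below uses only the standard `logging` module; the logger name, the `warn_once` helper, and the key string are assumptions for illustration and are not part of the PR.

```python
import logging

logger = logging.getLogger("matcher_sketch")  # hypothetical logger name
_warned_keys = set()


def warn_once(key, message):
    """Emit `message` at WARNING level the first time `key` is seen.

    Returns True if the warning was emitted, False if it was suppressed.
    """
    if key in _warned_keys:
        return False
    _warned_keys.add(key)
    logger.warning(message)
    return True


# Simulate three training steps that all hit non-finite costs.
emitted = [
    warn_once("non_finite_cost", "Non-finite values detected in matcher cost matrix")
    for _ in range(3)
]
```

Only the first call logs; the later calls are suppressed while sanitization would still run on every step. Under DDP each process keeps its own set, so this gives one warning per rank. Restricting output to a single rank would need a rank check such as the `is_main_process()` mentioned above, if the codebase provides one.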

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Initial plan

731fef5

Copilot AI assigned Copilot and Borda Mar 6, 2026

Copilot started work on behalf of Borda March 6, 2026 21:51 View session

Copilot AI and others added 2 commits March 6, 2026 21:57

fix: sanitize non-finite matcher costs

c03f83b

Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>

test: document matcher replacement logic

dcb0609

Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix ValueError in matcher when training model~~ Sanitize non-finite matcher costs before Hungarian assignment Mar 6, 2026

Copilot finished work on behalf of Borda March 6, 2026 22:00

Merge branch 'develop' into copilot/fix-valueerror-matrix-entries

ea65053

Borda requested a review from Copilot March 10, 2026 22:11

Borda marked this pull request as ready for review March 10, 2026 22:11

Borda requested review from Borda, SkalskiP, isaacrob and probicheaux as code owners March 10, 2026 22:11

Copilot started reviewing on behalf of Borda March 10, 2026 22:11 View session

Copilot AI reviewed Mar 10, 2026

Borda requested a review from Copilot March 10, 2026 23:02

Copilot started reviewing on behalf of Borda March 10, 2026 23:02 View session

fix(pre-commit): 🎨 auto format pre-commit hooks

11ab8cd

Copilot AI reviewed Mar 10, 2026

Update src/rfdetr/models/matcher.py

89c4b52

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot wants to merge 7 commits into develop from copilot/fix-valueerror-matrix-entries