
@YonghaoZhao722 (Contributor) commented:

This PR fixes two issues in training and validation metrics logging that affect comparability and correctness.

Problems fixed

1. Training loss / metric scaled with batch size

In SamTrainer._compute_loss(), per-sample mask_loss and iou_regression_loss were accumulated across the batch without normalization.
As a result, increasing batch_size inflated the reported loss and metric values even when model quality was unchanged, making runs hard to compare.
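A minimal sketch of where the normalization applies, assuming a `_compute_loss`-style loop over per-sample outputs. The attribute and helper names used here (`self.loss`, `self.mse_loss`, `self._get_iou`) and the output dictionary keys are illustrative placeholders, not the exact `SamTrainer` implementation:

```python
# Sketch only: the loop structure and names below are illustrative.
def _compute_loss(self, batched_outputs, y):
    mask_loss, iou_regression_loss = 0.0, 0.0
    batch_size = len(batched_outputs)

    for output, target in zip(batched_outputs, y):
        predicted_masks = output["masks"]          # per-sample mask predictions
        predicted_iou = output["iou_predictions"]  # per-sample IoU predictions

        # Per-sample losses are summed over the batch ...
        mask_loss = mask_loss + self.loss(predicted_masks, target)
        iou_regression_loss = iou_regression_loss + self.mse_loss(
            predicted_iou, self._get_iou(predicted_masks, target)
        )

    # ... and then divided by the batch size, so the reported values no longer
    # grow when the batch size is increased.
    mask_loss = mask_loss / batch_size
    iou_regression_loss = iou_regression_loss / batch_size

    return mask_loss + iou_regression_loss, mask_loss, iou_regression_loss
```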

2. Validation mask_loss / iou_loss were not epoch averages

During validation, validation/metric and validation/loss were averaged over len(val_loader), but validation/mask_loss and validation/iou_loss effectively reflected only the last validation batch rather than epoch-level averages.
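A minimal sketch of the epoch-level averaging that addresses this, assuming a validation loop of roughly this shape. The method and logger names (`_validate_impl`, `_interactive_val_iteration`, `log_validation`) are illustrative placeholders rather than the exact trainer interface, and `torch` is assumed to be imported:

```python
# Sketch only: names are illustrative, not the exact trainer code.
def _validate_impl(self, forward_context):
    metric_val, loss_val, mask_loss_val, iou_loss_val = 0.0, 0.0, 0.0, 0.0

    with torch.no_grad():
        for x, y in self.val_loader:
            with forward_context():
                loss, mask_loss, iou_loss, metric = self._interactive_val_iteration(x, y)
            loss_val += loss.item()
            metric_val += metric.item()
            # Previously mask_loss / iou_loss were overwritten each iteration,
            # so only the last batch ended up being logged; accumulate instead.
            mask_loss_val += mask_loss.item()
            iou_loss_val += iou_loss.item()

    # Divide every accumulated scalar by the number of validation batches,
    # so all logged values are epoch-level averages.
    n_batches = len(self.val_loader)
    loss_val /= n_batches
    metric_val /= n_batches
    mask_loss_val /= n_batches
    iou_loss_val /= n_batches

    self.logger.log_validation(self._iteration, metric_val, loss_val, mask_loss_val, iou_loss_val)
    return metric_val
```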

Changes made

  • Normalize mask_loss and iou_regression_loss by batch_size inside _compute_loss() so training metrics are batch-size invariant.
  • Accumulate and log epoch-averaged validation/mask_loss and validation/iou_loss.
  • Apply the same aggregation logic to the joint trainer for consistency.

Why this matters

  • Makes training curves comparable across different batch sizes, which is important for scaling experiments.
  • Ensures TensorBoard validation scalars accurately represent epoch-level averages.

This change does not affect model behavior or optimization; it only changes how metrics are computed and reported.

@anwai98 changed the base branch from master to dev on December 17, 2025.
@anwai98 (Collaborator) commented:


Looks good to me. I'll merge this in a few minutes. Thanks for your contribution!

@anwai98 merged commit 9634951 into computational-cell-analytics:dev on Dec 17, 2025.
6 checks passed