
@YonghaoZhao722 (Contributor) commented:

This PR fixes two issues in training and validation metrics logging that affect comparability and correctness.

Problems fixed

1. Training loss / metric scaled with batch size

In SamTrainer._compute_loss(), per-sample mask_loss and iou_regression_loss were accumulated across the batch without normalization.
As a result, increasing batch_size inflated the reported loss and metric values even when model quality was unchanged, making runs hard to compare.
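A minimal sketch of where the normalization applies, assuming a `_compute_loss`-style loop over per-sample outputs. The attribute and helper names used here (`self.loss`, `self.mse_loss`, `self._get_iou`) and the output dictionary keys are illustrative placeholders, not the exact `SamTrainer` implementation:

```python
# Sketch only: the loop structure and names below are illustrative.
def _compute_loss(self, batched_outputs, y):
    mask_loss, iou_regression_loss = 0.0, 0.0
    batch_size = len(batched_outputs)

    for output, target in zip(batched_outputs, y):
        predicted_masks = output["masks"]          # per-sample mask predictions
        predicted_iou = output["iou_predictions"]  # per-sample IoU predictions

        # Per-sample losses are summed over the batch ...
        mask_loss = mask_loss + self.loss(predicted_masks, target)
        iou_regression_loss = iou_regression_loss + self.mse_loss(
            predicted_iou, self._get_iou(predicted_masks, target)
        )

    # ... and then divided by the batch size, so the reported values no longer
    # grow when the batch size is increased.
    mask_loss = mask_loss / batch_size
    iou_regression_loss = iou_regression_loss / batch_size

    return mask_loss + iou_regression_loss, mask_loss, iou_regression_loss
```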

2. Validation mask_loss / iou_loss were not epoch averages

During validation, validation/metric and validation/loss were averaged over len(val_loader), but validation/mask_loss and validation/iou_loss effectively reflected only the last validation batch rather than epoch-level averages.
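A minimal sketch of the epoch-level averaging that addresses this, assuming a validation loop of roughly this shape. The method and logger names (`_validate_impl`, `_interactive_val_iteration`, `log_validation`) are illustrative placeholders rather than the exact trainer interface, and `torch` is assumed to be imported:

```python
# Sketch only: names are illustrative, not the exact trainer code.
def _validate_impl(self, forward_context):
    metric_val, loss_val, mask_loss_val, iou_loss_val = 0.0, 0.0, 0.0, 0.0

    with torch.no_grad():
        for x, y in self.val_loader:
            with forward_context():
                loss, mask_loss, iou_loss, metric = self._interactive_val_iteration(x, y)
            loss_val += loss.item()
            metric_val += metric.item()
            # Previously mask_loss / iou_loss were overwritten each iteration,
            # so only the last batch ended up being logged; accumulate instead.
            mask_loss_val += mask_loss.item()
            iou_loss_val += iou_loss.item()

    # Divide every accumulated scalar by the number of validation batches,
    # so all logged values are epoch-level averages.
    n_batches = len(self.val_loader)
    loss_val /= n_batches
    metric_val /= n_batches
    mask_loss_val /= n_batches
    iou_loss_val /= n_batches

    self.logger.log_validation(self._iteration, metric_val, loss_val, mask_loss_val, iou_loss_val)
    return metric_val
```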

Changes made

  • Normalize mask_loss and iou_regression_loss by batch_size inside _compute_loss() so training metrics are batch-size invariant.
  • Accumulate and log epoch-averaged validation/mask_loss and validation/iou_loss.
  • Apply the same aggregation logic to the joint trainer for consistency.

Why this matters

  • Makes training curves comparable across different batch sizes, which is important for scaling experiments.
  • Ensures TensorBoard validation scalars accurately represent epoch-level averages.

This change does not affect model behavior or optimization; it only changes how metrics are computed and reported.

@anwai98 changed the base branch from master to dev on December 17, 2025.
@anwai98 (Collaborator) commented:


Looks good to me. I'll merge this in a few minutes. Thanks for your contribution!

@anwai98 merged commit 9634951 into computational-cell-analytics:dev on Dec 17, 2025.
6 checks passed