Fix canary validation artifact replication #4173

Merged
yonromai merged 2 commits into main from fix/canary-validation on Mar 26, 2026

@yonromai
Contributor

Summary

  • restore canary tracker_metrics.jsonl emission by setting replicate_path=this_output_path() on the canary W&B tracker
  • add a regression test that reloads the canary step and asserts metrics replication stays configured
  • fix the TPU canary post-run validation failure seen in scheduled runs such as 23581948912, where training succeeded but metrics validation failed because the artifact was never written
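The failure mode described above can be sketched as follows. This is a minimal illustration, not the repository's code: `TrackerConfig`, `finish_run`, and `validate_canary_metrics` are hypothetical stand-ins, and only the `replicate_path` field name and the `tracker_metrics.jsonl` filename come from this PR. It assumes the tracker mirrors metrics to `replicate_path` only when that field is set, so an unset field lets training succeed while the post-run validation fails on a missing artifact.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

METRICS_FILENAME = "tracker_metrics.jsonl"  # artifact name from the PR summary


@dataclass(frozen=True)
class TrackerConfig:
    """Hypothetical stand-in for the canary W&B tracker config."""

    # Directory where metrics are mirrored; None disables replication.
    replicate_path: Optional[str] = None


def finish_run(tracker: TrackerConfig, metrics: dict) -> None:
    """Append the final metrics to the replicated artifact, if configured."""
    if tracker.replicate_path is None:
        # Assumed failure mode: the run completes but no artifact is written.
        return
    out_dir = Path(tracker.replicate_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / METRICS_FILENAME, "a") as f:
        f.write(json.dumps(metrics) + "\n")


def validate_canary_metrics(output_path: str) -> dict:
    """Post-run check: raise if the artifact is missing, else return the last record."""
    path = Path(output_path) / METRICS_FILENAME
    if not path.exists():
        raise FileNotFoundError(f"missing metrics artifact: {path}")
    return json.loads(path.read_text().splitlines()[-1])


def assert_replication_configured(tracker: TrackerConfig) -> None:
    """Regression-style guard: replication must stay configured on the tracker."""
    assert tracker.replicate_path is not None, "replicate_path must be set"
```

A regression test in this spirit would build the canary step's tracker config and call `assert_replication_configured` on it, so the missing setting is caught at test time instead of in a scheduled run.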

Testing

  • ./infra/pre-commit.py experiments/ferries/canary_ferry.py tests/test_validate_canary_metrics.py
  • uv run --with pytest --with pytest-timeout --with pytest-xdist python -m pytest tests/test_validate_canary_metrics.py

@yonromai yonromai added the agent-generated Created by automation/agent label Mar 26, 2026
@yonromai yonromai merged commit a4ee33e into main Mar 26, 2026
37 of 39 checks passed
@yonromai yonromai deleted the fix/canary-validation branch March 26, 2026 14:56
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Helw150 pushed a commit that referenced this pull request Apr 8, 2026