Closed
Dear MLPerf Training WG,
We’re seeing a consistent off-by-one failure from the compliance checker on opt_learning_rate_decay_steps for the Llama3.1-405B benchmark under ruleset 5.1.0. Details below.
What we ran:
- Benchmark: LLAMA31_405B (closed)
- Ruleset: 5.1.0
- Effective global batch size (GBS): 1008
- Reported warmup steps: 9143
- Reported decay steps (logged): 1362287
- Checker expectation for decay steps: 1362286
- Result: checker FAIL (by exactly 1)
Error: INFO - Running compliance on file: /results/251006072002009929661_1.log
INFO - Compliance checks: training_5.1.0/common.yaml
INFO - Compliance checks: training_5.1.0/closed_common.yaml
INFO - Compliance checks: training_5.1.0/closed_llama31_405b.yaml
WARNING - Required AT_LEAST_ONE occurrence of 'cache_clear' but found 0
CHECK for 'opt_learning_rate_decay_steps' failed in line 22:
:::MLLOG {"namespace": "", "time_ms": 1759749837494, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_steps", "value": 1362287, "metadata": {"file": "/workspace/llm/pretrain.py", "lineno": 601}}
failed test: v['value'] == math.ceil(1_200_000 * 1152 / s['global_batch_size'] ) - s['opt_learning_rate_warmup_steps']
current context[s]={
"global_batch_size": 1008,
"opt_learning_rate_warmup_steps": 9143
}
current line[v]={
"metadata": {
"file": "/workspace/llm/pretrain.py",
"lineno": 601
},
"value": 1362287
}
ERROR - FAILED
The model training itself runs fine; could this be a rounding issue in how the decay steps are computed for logging?
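For reference, the checker's expectation can be reproduced from the values in the log above. The exact-by-one difference would be consistent with an inclusive (fencepost) step count on the logging side, though that is only a guess about where our number comes from:

```python
import math

# Values taken from the checker's context in the log above.
GBS = 1008        # global_batch_size
WARMUP = 9143     # opt_learning_rate_warmup_steps
LOGGED = 1362287  # opt_learning_rate_decay_steps as logged

# The 5.1.0 checker's test, as quoted in the failure message:
expected = math.ceil(1_200_000 * 1152 / GBS) - WARMUP
print(expected)   # 1362286 -- one less than the logged value

# Hypothetical source of the off-by-one: counting the decay range
# inclusively at both ends, i.e. total - warmup + 1.
total = math.ceil(1_200_000 * 1152 / GBS)
inclusive = total - WARMUP + 1
print(inclusive)  # 1362287 -- matches the logged value
```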
Thank you for your help!
Best,
Yang Hong
AI Support | Research Computing
UFIT | University of Florida