Closed
Dear MLPerf Training WG,
We’re seeing a consistent off-by-one failure from the compliance checker on opt_learning_rate_decay_steps for the Llama3.1-405B benchmark under ruleset 5.1.0. Details below.
What we ran:
- Benchmark: LLAMA31_405B (closed)
- Ruleset: 5.1.0
- Effective global batch size (GBS): 1008
- Reported warmup steps: 9143
- Reported decay steps (logged): 1362287
- Checker expectation for decay steps: 1362286
- Result: checker FAIL (by exactly 1)
Error: INFO - Running compliance on file: /results/251006072002009929661_1.log
INFO - Compliance checks: training_5.1.0/common.yaml
INFO - Compliance checks: training_5.1.0/closed_common.yaml
INFO - Compliance checks: training_5.1.0/closed_llama31_405b.yaml
WARNING - Required AT_LEAST_ONE occurrence of 'cache_clear' but found 0
CHECK for 'opt_learning_rate_decay_steps' failed in line 22:
:::MLLOG {"namespace": "", "time_ms": 1759749837494, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_steps", "value": 1362287, "metadata": {"file": "/workspace/llm/pretrain.py", "lineno": 601}}
failed test: v['value'] == math.ceil(1_200_000 * 1152 / s['global_batch_size'] ) - s['opt_learning_rate_warmup_steps']
current context[s]={
"global_batch_size": 1008,
"opt_learning_rate_warmup_steps": 9143
}
current line[v]={
"metadata": {
"file": "/workspace/llm/pretrain.py",
"lineno": 601
},
"value": 1362287
}
ERROR - FAILED
The model training itself runs fine; could this be a rounding issue in how the decay steps are computed for logging?
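For reference, the checker's expectation can be reproduced from the values in the log above. The exact-by-one difference would be consistent with an inclusive (fencepost) step count on the logging side, though that is only a guess about where our number comes from:

```python
import math

# Values taken from the checker's context in the log above.
GBS = 1008        # global_batch_size
WARMUP = 9143     # opt_learning_rate_warmup_steps
LOGGED = 1362287  # opt_learning_rate_decay_steps as logged

# The 5.1.0 checker's test, as quoted in the failure message:
expected = math.ceil(1_200_000 * 1152 / GBS) - WARMUP
print(expected)   # 1362286 -- one less than the logged value

# Hypothetical source of the off-by-one: counting the decay range
# inclusively at both ends, i.e. total - warmup + 1.
total = math.ceil(1_200_000 * 1152 / GBS)
inclusive = total - WARMUP + 1
print(inclusive)  # 1362287 -- matches the logged value
```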
Thank you for your help!
Best,
Yang Hong
AI Support | Research Computing
UFIT | University of Florida