
[Llama 3.1] Updates MLLOG tags#790

Merged
ShriyaRishab merged 3 commits into mlcommons:master from Elnifio:update-llama31
Apr 8, 2025

Conversation

Elnifio (Contributor) commented Apr 3, 2025

No description provided.

@Elnifio Elnifio requested a review from a team as a code owner April 3, 2025 15:58
github-actions bot commented Apr 3, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ShriyaRishab previously approved these changes Apr 3, 2025

ShriyaRishab (Contributor):

@Elnifio can you verify that the logs from this reference indeed pass the latest checkers?


    def log_validation_loss(self, metrics, step):
-       consumed_tokens = (step - self.init_global_step) * self.gbs * self.seq_len
+       consumed_tokens = step * self.gbs
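For context, the diff above distinguishes two quantities that are easy to conflate. A minimal sketch of the arithmetic, using the names from the diff (`gbs` for global batch size and `seq_len` for sequence length are assumptions based on the snippet, not verified against the reference code):

```python
# Illustrative only: each optimizer step consumes one global batch of samples,
# and each sample contributes seq_len tokens.

def consumed_samples(step: int, gbs: int) -> int:
    # Samples seen after `step` optimizer steps.
    return step * gbs

def consumed_tokens(step: int, gbs: int, seq_len: int) -> int:
    # Tokens seen after `step` optimizer steps.
    return step * gbs * seq_len

print(consumed_samples(10, 128))       # 1280
print(consumed_tokens(10, 128, 8192))  # 10485760
```

The renamed variable in the diff holds the first quantity, hence the reviewer's point that calling it `consumed_tokens` is misleading.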
Contributor:
shouldn't that be called consumed_samples?

Elnifio (Author):
Updated in the latest commit. Thanks for catching that!

    def on_validation_start(self, trainer, pl_module):
-       mllogger.end(key=constants.BLOCK_STOP, metadata={'epoch_num': self.consumed_tokens(trainer)})
        mllogger.start(key=constants.EVAL_START, metadata={'epoch_num': self.consumed_tokens(trainer)})
+       mllogger.end(key=constants.BLOCK_STOP, metadata={constants.SAMPLES_COUNT: self.consumed_tokens(trainer)})
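To make the tag change easier to see in isolation, here is a self-contained sketch of the pattern. It uses a stub logger rather than mlperf_logging's real `mllog` object, and the constant names mirror the diff but are illustrative, not the library's actual definitions:

```python
# Stub logger that records (event, key, metadata) tuples, standing in for
# the MLLOG logger used in the reference. Names below are assumptions.

SAMPLES_COUNT = "samples_count"  # assumed constant value, per the diff
BLOCK_STOP = "block_stop"
EVAL_START = "eval_start"

class StubLogger:
    def __init__(self):
        self.events = []

    def start(self, key, metadata):
        self.events.append(("start", key, metadata))

    def end(self, key, metadata):
        self.events.append(("end", key, metadata))

def on_validation_start(logger, samples):
    # Close the current training block and open an eval block, tagging
    # both with the number of samples consumed so far (not tokens).
    logger.end(key=BLOCK_STOP, metadata={SAMPLES_COUNT: samples})
    logger.start(key=EVAL_START, metadata={SAMPLES_COUNT: samples})

log = StubLogger()
on_validation_start(log, 1280)
print(log.events[0])  # ('end', 'block_stop', {'samples_count': 1280})
```

The substantive change in the diff is only the metadata key: `epoch_num` becomes `constants.SAMPLES_COUNT`, so the checker can match block boundaries to sample counts.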
Contributor:
Not sure where self.consumed_tokens() is defined, but does it return tokens or samples? If tokens, we need to switch it to samples. If samples, we need to rename it.

Elnifio (Author):
Updated in the latest commit.

Elnifio (Contributor, Author) commented Apr 4, 2025

@ShriyaPalsamudram @mmarcinkiewicz the latest run fails the RCP checker with the following issues:

  • common.yaml line 128, missing train_samples: fixed in my latest commit, where I renamed trained_samples to train_samples.
  • closed_llama31_405b.yaml line 35, incorrect opt_learning_rate_decay_schedule value: fixed in my latest commit, where I renamed "cosine with linear warmups" to "cosine with linear warmup".
  • common.yaml lines 97 and 103, missing epoch_num in the epoch_start and epoch_end tags: I'm wondering whether, similar to other tags, we should accept both epoch_num and samples_count in the metadata. What do you think?
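The relaxation proposed in the last bullet could be sketched as a check that accepts either metadata key. This is a hypothetical illustration, not the actual mlperf_logging compliance-checker code; `metadata_ok` and `ACCEPTED_KEYS` are invented names:

```python
# Hypothetical check: an epoch_start/epoch_stop tag passes if its metadata
# carries either 'epoch_num' or 'samples_count'.

ACCEPTED_KEYS = {"epoch_num", "samples_count"}

def metadata_ok(metadata: dict) -> bool:
    # True if at least one accepted progress key is present.
    return any(k in metadata for k in ACCEPTED_KEYS)

assert metadata_ok({"epoch_num": 1})
assert metadata_ok({"samples_count": 1280})
assert not metadata_ok({"step": 5})
```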

Elnifio (Contributor, Author) commented Apr 7, 2025

Further update on the comment above: I have run the RCP checker against the latest compliance checker from PR 414, and it passed without any errors or warnings, so I'd say both PRs are ready to merge.

@ShriyaRishab ShriyaRishab merged commit ece3d15 into mlcommons:master Apr 8, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 8, 2025
