Logging resolved config #2274

Merged: 5 commits into pytorch:main on Jan 21, 2025

Conversation

Ankur-singh (Contributor)

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses #1968

Changelog

What are the changes made in this PR?

  • Created save_config utility function
  • Implemented DiskLogger.log_config method
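
For context, a minimal sketch of what such a save_config utility could look like. The signature and error handling here are assumptions; only the destination (output_dir) and the torchtune_config.yaml filename are taken from the logs further down.

from pathlib import Path

from omegaconf import DictConfig, OmegaConf


def save_config(config: DictConfig) -> Path:
    """Sketch: write the config YAML to <output_dir>/torchtune_config.yaml."""
    output_dir = Path(config.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_config_path = output_dir / "torchtune_config.yaml"
    # resolve defaults to False, so interpolations like ${output_dir} stay verbatim,
    # matching the generated torchtune_config.yaml shown below.
    OmegaConf.save(config, output_config_path)
    return output_config_path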

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Stdout:

(tune) ankur@nuc:~/github/torchtune$ tune run full_finetune_single_device --config llama3_2/1B_full_single_device output_dir=./logs/ max_steps_per_epoch=10 device=cpu optimizer._component_=torch.optim.AdamW
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./logs/
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ./logs//logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./logs/
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./logs//profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3022706700. Local seed is seed + rank = 3022706700 + 0
Writing logs to logs/logs/log_1737066872.txt
INFO:torchtune.utils._logging:Writing resolved config to logs/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:33<00:00,  3.01s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to logs/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:41<00:00,  4.15s/it]

Generated config file at ./logs/torchtune_config.yaml:

output_dir: ./logs/
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
  max_seq_len: null
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
seed: null
shuffle: true
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3_2
resume_from_checkpoint: false
batch_size: 4
epochs: 1
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
gradient_accumulation_steps: 1
optimizer_in_bwd: true
clip_grad_norm: null
compile: false
device: cpu
enable_activation_checkpointing: false
enable_activation_offloading: false
dtype: bf16
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: true
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
  output_dir: ${output_dir}/profiling_outputs
  cpu: true
  cuda: true
  profile_memory: false
  with_stack: false
  record_shapes: true
  with_flops: false
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2274

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2a04a07 with merge base 779569e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jan 16, 2025.
felipemello1

This comment was marked as resolved.

@felipemello1 felipemello1 (Contributor) left a comment

This looks good, I just left a couple of questions. After you answer, I think it should be good to implement in every other logger.

Thanks for the PR!

torchtune/training/metric_logging.py (review thread, outdated, resolved)
@@ -23,6 +23,20 @@
log = get_logger("DEBUG")


def save_config(config):

Contributor:

question: in the wandb implementation, we resolve the config first

resolved = OmegaConf.to_container(config, resolve=True)

Is this necessary here too?

Ankur-singh (Author):

The to_container method converts an OmegaConf object into standard Python containers. Wandb expects standard Python objects like dictionaries and lists. But for saving the config, we don't need to convert it.

Furthermore, we can use resolve=True when calling OmegaConf.save to resolve the interpolations before saving the config to YAML. However, I didn't see any interpolations in our config files.

Hope this helps :D
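
For illustration, here is a toy example of the distinction being discussed, using a hypothetical two-key config with a single interpolation:

from omegaconf import OmegaConf

cfg = OmegaConf.create({"output_dir": "./logs/", "log_dir": "${output_dir}/logs"})

# wandb/comet want plain Python containers, so interpolations get resolved first.
resolved = OmegaConf.to_container(cfg, resolve=True)
print(resolved["log_dir"])  # ./logs//logs

# Saving to YAML needs no conversion; by default resolve=False keeps ${output_dir} literal.
OmegaConf.save(cfg, "unresolved.yaml")
OmegaConf.save(cfg, "resolved.yaml", resolve=True)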

Contributor:

Thanks! Is checkpointer.output_dir: ${output_dir} an example of interpolation? I think it's good to maintain the same pattern for everything, i.e. let's either change wandb/comet or change save_config.

@Ankur-singh Ankur-singh (Author), Jan 17, 2025:

Yes, checkpointer.output_dir: ${output_dir} is an interpolation.

Just to make it explicit, wandb/comet save the config in two formats:

  1. Dictionary: the resolved dictionary is pushed to the hub.
  2. YAML file: the config file is saved as an artifact. The YAML file is saved as-is (i.e. not using the resolved dictionary).

I'm pasting the log_config method from CometLogger below for reference:

def log_config(self, config: DictConfig) -> None:
    if self.experiment is not None:
        resolved = OmegaConf.to_container(config, resolve=True)
        self.experiment.log_parameters(resolved)

        # Also try to save the config as a file
        try:
            self._log_config_as_file(config)
        except Exception as e:
            log.warning(f"Error saving Config to disk.\nError: \n{e}.")

As you can see, the resolved variable is not used when saving the config as a file. WandBLogger follows the same pattern.

I believe we don't need the resolved dictionary, since we are only interested in saving in YAML format. However, if we wish to resolve the config before saving, we can simply do:

with open("config.yaml", "w") as f:
    f.write(OmegaConf.to_yaml(config, resolve=True))

Hope I didn't misunderstand your comment. Let me know what you think: should we resolve the config or not?

@felipemello1 (Contributor)

It's looking good. Let's please go ahead and implement it for every logger. We should use the same util inside of wandb and comet, if possible, plus whatever additional functions they require. Notice that comet is also saving it to the wrong directory (checkpoint_dir).

If you have bandwidth, running every logger once for a sanity check would be ideal.
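
One plausible shape for that reuse, sketched against the public wandb API; the import path for save_config and the surrounding torchtune wiring are assumptions, not the final implementation:

import wandb
from omegaconf import DictConfig, OmegaConf

from torchtune.training.metric_logging import save_config  # utility added in this PR (assumed import path)


def wandb_log_config(config: DictConfig) -> None:
    """Hypothetical sketch of a WandBLogger.log_config that reuses the shared save_config util."""
    if wandb.run is None:
        return
    # Dictionary view: W&B expects plain Python containers, so resolve interpolations first.
    wandb.config.update(OmegaConf.to_container(config, resolve=True))
    # File view: reuse the shared utility so the YAML lands in output_dir (not checkpoint_dir),
    # then upload it under the run's Files tab.
    saved_path = save_config(config)
    wandb.save(str(saved_path), base_path=str(saved_path.parent))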

@Ankur-singh (Author)

I have updated the metric_logging.py file. In addition to updating the log_config methods, I made some minor changes to the __init__ methods of WandBLogger and CometLogger to prevent errors that were previously hidden.

I'm adding the output from each logger below for reference.

Disk Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./DiskLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ./DiskLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./DiskLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./DiskLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1248908266. Local seed is seed + rank = 1248908266 + 0
INFO:torchtune.utils._logging:Writing config to DiskLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
Writing logs to DiskLoggerOut/logs/log_1737162084.txt

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|| 1/10 [00:04<00:38,  4.27s/it]
1|1|Loss: 2.9877049922943115:  10%|| 1/10 [00:04<00:38,  4.27s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:33,  4.24s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:33,  4.24s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:11<00:25,  3.60s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:11<00:25,  3.60s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:14<00:19,  3.32s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:14<00:19,  3.32s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:19,  3.81s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:19,  3.81s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.40s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.40s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:24<00:09,  3.21s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:24<00:09,  3.21s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:28<00:06,  3.37s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:28<00:06,  3.37s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:30<00:03,  3.16s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:30<00:03,  3.16s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:34<00:00,  3.32s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00,  3.32s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to DiskLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:36<00:00,  3.65s/it]

Stdout Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./StdoutLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.StdoutLogger
  log_dir: ./StdoutLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./StdoutLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./StdoutLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 625184984. Local seed is seed + rank = 625184984 + 0
INFO:torchtune.utils._logging:Writing config to StdoutLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|| 1/10 [00:04<00:38,  4.28s/it]
1|1|Loss: 2.9877049922943115:  10%|| 1/10 [00:04<00:38,  4.28s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:32,  4.07s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:32,  4.07s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:11<00:25,  3.57s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:11<00:25,  3.57s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:13<00:19,  3.25s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:13<00:19,  3.25s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:18,  3.75s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:18,  3.75s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:23<00:09,  3.17s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:23<00:09,  3.17s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:26<00:05,  2.99s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:26<00:05,  2.99s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:28<00:02,  2.76s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:28<00:02,  2.76s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:31<00:00,  2.77s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:31<00:00,  2.77s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to StdoutLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:33<00:00,  3.39s/it]
Step 1 | loss:2.9877049922943115 lr:2e-05 tokens_per_second_per_gpu:90.10209655761719 
Step 2 | loss:1.7919223308563232 lr:2e-05 tokens_per_second_per_gpu:90.45220947265625 
Step 3 | loss:1.48738694190979 lr:2e-05 tokens_per_second_per_gpu:99.32902526855469 
Step 4 | loss:1.3556486368179321 lr:2e-05 tokens_per_second_per_gpu:112.30082702636719 
Step 5 | loss:1.263526201248169 lr:2e-05 tokens_per_second_per_gpu:117.72256469726562 
Step 6 | loss:1.314545750617981 lr:2e-05 tokens_per_second_per_gpu:137.54611206054688 
Step 7 | loss:1.2146070003509521 lr:2e-05 tokens_per_second_per_gpu:143.7611083984375 
Step 8 | loss:1.197435736656189 lr:2e-05 tokens_per_second_per_gpu:120.44074249267578 
Step 9 | loss:1.1718554496765137 lr:2e-05 tokens_per_second_per_gpu:158.5180206298828 
Step 10 | loss:1.088627576828003 lr:2e-05 tokens_per_second_per_gpu:144.5830841064453 

WandB Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./WandBLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  log_dir: ./WandBLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./WandBLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./WandBLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1022127600. Local seed is seed + rank = 1022127600 + 0
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.3
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
INFO:torchtune.utils._logging:Writing config to WandBLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Uploading WandBLoggerOut/torchtune_config.yaml to W&B under Files
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:29<00:00,  2.87s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to WandBLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:32<00:00,  3.20s/it]
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:               global_step ▁▂▃▃▄▅▆▆▇█
wandb:                      loss █▄▂▂▂▂▁▁▁▁
wandb:                        lr ▁▁▁▁▁▁▁▁▁▁
wandb: tokens_per_second_per_gpu ▂▁▅▆▅▇▇▃█▄
wandb: 
wandb: Run summary:
wandb:               global_step 10
wandb:                      loss 1.08863
wandb:                        lr 2e-05
wandb: tokens_per_second_per_gpu 122.5525
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ./WandBLoggerOut/logs/wandb/offline-run-20250118_011118-tmk0azdt
wandb: Find logs at: ./WandBLoggerOut/logs/wandb/offline-run-20250118_011118-tmk0azdt/logs

TensorBoard Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./TensorBoardLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.TensorBoardLogger
  log_dir: ./TensorBoardLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./TensorBoardLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./TensorBoardLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1183309963. Local seed is seed + rank = 1183309963 + 0
INFO:torchtune.utils._logging:Writing config to TensorBoardLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|| 1/10 [00:04<00:37,  4.13s/it]
1|1|Loss: 2.9877049922943115:  10%|| 1/10 [00:04<00:37,  4.13s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:32,  4.03s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:32,  4.03s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:10<00:24,  3.47s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:10<00:24,  3.47s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:13<00:18,  3.12s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:13<00:18,  3.12s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:18,  3.72s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:18,  3.72s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.41s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.41s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:24<00:09,  3.30s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:24<00:09,  3.30s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:26<00:06,  3.16s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:26<00:06,  3.16s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:29<00:02,  3.00s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:29<00:02,  3.00s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:32<00:00,  3.07s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:32<00:00,  3.07s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to TensorBoardLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00,  3.49s/it]

Comet Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./CometLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.CometLogger
  log_dir: ./CometLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./CometLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./CometLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 4230621073. Local seed is seed + rank = 4230621073 + 0
COMET WARNING: To get all data logged automatically, import comet_ml before the following modules: torch.
COMET INFO: Experiment is live on comet.com https://www.comet.com/ankur-singh/general/9d26c9e13895442aab3965049061eee7

COMET INFO: Couldn't find a Git repository in '/home/demo/github' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
INFO:torchtune.utils._logging:Writing config to CometLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Uploading CometLoggerOut/torchtune_config.yaml to Comet as an asset.
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|█         | 1/10 [00:04<00:36,  4.05s/it]
1|1|Loss: 2.9877049922943115:  10%|█         | 1/10 [00:04<00:36,  4.05s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:33,  4.23s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:33,  4.23s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:11<00:25,  3.62s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:11<00:25,  3.62s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:14<00:19,  3.27s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:14<00:19,  3.27s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:18,  3.64s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:18,  3.64s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:24<00:10,  3.34s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:24<00:10,  3.34s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:27<00:06,  3.22s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:27<00:06,  3.22s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:30<00:03,  3.17s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:30<00:03,  3.17s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:34<00:00,  3.31s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00,  3.31s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to CometLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:36<00:00,  3.68s/it]
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     name                  : long_hip_409
COMET INFO:     url                   : https://www.comet.com/ankur-singh/general/9d26c9e13895442aab3965049061eee7
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     loss [10]                      : (1.088627576828003, 2.9877049922943115)
COMET INFO:     lr                             : 2e-05
COMET INFO:     tokens_per_second_per_gpu [10] : (81.59734344482422, 127.8671646118164)
COMET INFO:   Others:
COMET INFO:     hasNestedParams : True
COMET INFO:   Parameters:
COMET INFO:     batch_size                      : 4
COMET INFO:     checkpointer|_component_        : torchtune.training.FullModelHFCheckpointer
COMET INFO:     checkpointer|checkpoint_dir     : /tmp/Llama-3.2-1B-Instruct/
COMET INFO:     checkpointer|checkpoint_files   : ['model.safetensors']
COMET INFO:     checkpointer|model_type         : LLAMA3_2
COMET INFO:     checkpointer|output_dir         : ./CometLoggerOut
COMET INFO:     checkpointer|recipe_checkpoint  : None
COMET INFO:     clip_grad_norm                  : None
COMET INFO:     compile                         : False
COMET INFO:     dataset|_component_             : torchtune.datasets.alpaca_dataset
COMET INFO:     dataset|packed                  : False
COMET INFO:     dtype                           : bf16
COMET INFO:     enable_activation_checkpointing : False
COMET INFO:     enable_activation_offloading    : False
COMET INFO:     epochs                          : 1
COMET INFO:     gradient_accumulation_steps     : 1
COMET INFO:     log_every_n_steps               : 1
COMET INFO:     log_peak_memory_stats           : True
COMET INFO:     loss|_component_                : torchtune.modules.loss.CEWithChunkedOutputLoss
COMET INFO:     metric_logger|_component_       : torchtune.training.metric_logging.CometLogger
COMET INFO:     metric_logger|log_dir           : ./CometLoggerOut/logs
COMET INFO:     model|_component_               : torchtune.models.llama3_2.llama3_2_1b
COMET INFO:     optimizer_in_bwd                : True
COMET INFO:     optimizer|_component_           : torch.optim.AdamW
COMET INFO:     optimizer|lr                    : 2e-05
COMET INFO:     profiler|_component_            : torchtune.training.setup_torch_profiler
COMET INFO:     profiler|active_steps           : 2
COMET INFO:     profiler|cpu                    : True
COMET INFO:     profiler|cuda                   : True
COMET INFO:     profiler|enabled                : False
COMET INFO:     profiler|num_cycles             : 1
COMET INFO:     profiler|output_dir             : ./CometLoggerOut/profiling_outputs
COMET INFO:     profiler|profile_memory         : False
COMET INFO:     profiler|record_shapes          : True
COMET INFO:     profiler|wait_steps             : 5
COMET INFO:     profiler|warmup_steps           : 3
COMET INFO:     profiler|with_flops             : False
COMET INFO:     profiler|with_stack             : False
COMET INFO:     resume_from_checkpoint          : False
COMET INFO:     seed                            : None
COMET INFO:     shuffle                         : True
COMET INFO:     tokenizer|_component_           : torchtune.models.llama3.llama3_tokenizer
COMET INFO:     tokenizer|max_seq_len           : None
COMET INFO:     tokenizer|path                  : /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
COMET INFO:   Uploads:
COMET INFO:     asset                        : 1 (1.39 KB)
COMET INFO:     conda-environment-definition : 1
COMET INFO:     conda-info                   : 1
COMET INFO:     conda-specification          : 1
COMET INFO:     environment details          : 1
COMET INFO:     filename                     : 1
COMET INFO:     installed packages           : 1
COMET INFO:     os packages                  : 1
COMET INFO:     source_code                  : 2 (52.21 KB)
COMET INFO: 
COMET WARNING: To get all data logged automatically, import comet_ml before the following modules: torch.

@Ankur-singh (Author)

Also wanted to bring some attention to this:

All calls to the metric_logger instance are already conditioned on if self._is_rank_zero. So ideally, logger implementations should not be concerned with rank.
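
A minimal sketch of that recipe-side pattern (hypothetical helper; the recipes inline this check rather than wrapping it in a function):

import torch.distributed as dist


def log_config_rank_zero_only(metric_logger, cfg) -> None:
    # Mirror the recipe convention: only rank zero talks to the metric logger,
    # so individual logger implementations never need to reason about rank.
    is_rank_zero = (not dist.is_initialized()) or dist.get_rank() == 0
    if is_rank_zero:
        metric_logger.log_config(cfg)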

@codecov-commenter

Codecov Report

Attention: Patch coverage is 48.14815% with 14 lines in your changes missing coverage. Please review.

Project coverage is 64.10%. Comparing base (779569e) to head (6218583).

Files with missing lines | Patch % | Lines
torchtune/training/metric_logging.py | 48.14% | 14 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2274       +/-   ##
===========================================
+ Coverage   23.97%   64.10%   +40.13%     
===========================================
  Files         358      353        -5     
  Lines       21207    20695      -512     
===========================================
+ Hits         5084    13267     +8183     
+ Misses      16123     7428     -8695     

☔ View full report in Codecov by Sentry.

@felipemello1 felipemello1 (Contributor) left a comment

Thanks for the testing and the changes! Let's just remove the logs and I will merge it :)

torchtune/training/metric_logging.py: three review threads (outdated, resolved)
Comment on lines +218 to +220
# create log_dir if missing
if not os.path.exists(self.log_dir):
    os.makedirs(self.log_dir)

Contributor:

why did we have to add this?

Ankur-singh (Author):

That is a very good question. If log_dir does not exist when calling self._wandb.init(), then WandB creates a directory inside /tmp/ and logs everything there.

Often log_dir is set to ${output_dir}/logs, hence this went undetected.
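
A minimal sketch of the failure mode and the fix (the helper name is made up; wandb.init's dir argument is the standard parameter):

import os

import wandb


def init_wandb_run(log_dir: str, **kwargs):
    # Create log_dir up front: per the observation above, if it is missing,
    # wandb.init silently falls back to a run directory under /tmp/.
    os.makedirs(log_dir, exist_ok=True)
    return wandb.init(dir=log_dir, **kwargs)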

Contributor:

Hi @Ankur-singh, I was able to use

metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger

before, but now it seems like I need to provide a log_dir, otherwise it errors. Just curious if we could roll back to the previous design?

@felipemello1 (Contributor)

will do a last review this Tuesday and merge it. Thank you for the changes :)

@felipemello1 felipemello1 merged commit 75965d4 into pytorch:main Jan 21, 2025
17 checks passed
@RdoubleA RdoubleA mentioned this pull request Jan 21, 2025
@Ankur-singh Ankur-singh deleted the feat/resolved-config branch January 23, 2025 01:43