Logging resolved config #2274

Merged: 5 commits into pytorch:main on Jan 21, 2025

Conversation

Ankur-singh (Contributor)

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses #1968

Changelog

What are the changes made in this PR?

  • Created save_config utility function
  • Implemented DiskLogger.log_config method
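
For context, a minimal sketch of what such a save_config utility could look like. The signature and error handling here are assumptions; only the destination (output_dir) and the torchtune_config.yaml filename are taken from the logs further down.

from pathlib import Path

from omegaconf import DictConfig, OmegaConf


def save_config(config: DictConfig) -> Path:
    """Sketch: write the config YAML to <output_dir>/torchtune_config.yaml."""
    output_dir = Path(config.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_config_path = output_dir / "torchtune_config.yaml"
    # resolve defaults to False, so interpolations like ${output_dir} stay verbatim,
    # matching the generated torchtune_config.yaml shown below.
    OmegaConf.save(config, output_config_path)
    return output_config_path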

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Stdout:

(tune) ankur@nuc:~/github/torchtune$ tune run full_finetune_single_device --config llama3_2/1B_full_single_device output_dir=./logs/ max_steps_per_epoch=10 device=cpu optimizer._component_=torch.optim.AdamW
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./logs/
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ./logs//logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./logs/
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./logs//profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3022706700. Local seed is seed + rank = 3022706700 + 0
Writing logs to logs/logs/log_1737066872.txt
INFO:torchtune.utils._logging:Writing resolved config to logs/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:33<00:00,  3.01s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to logs/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:41<00:00,  4.15s/it]

Generated config file at ./logs/torchtune_config.yaml:

output_dir: ./logs/
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
  max_seq_len: null
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
seed: null
shuffle: true
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3_2
resume_from_checkpoint: false
batch_size: 4
epochs: 1
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
gradient_accumulation_steps: 1
optimizer_in_bwd: true
clip_grad_norm: null
compile: false
device: cpu
enable_activation_checkpointing: false
enable_activation_offloading: false
dtype: bf16
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: true
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
  output_dir: ${output_dir}/profiling_outputs
  cpu: true
  cuda: true
  profile_memory: false
  with_stack: false
  record_shapes: true
  with_flops: false
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2274

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2a04a07 with merge base 779569e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jan 16, 2025.
felipemello1

This comment was marked as resolved.

@felipemello1 felipemello1 (Contributor) left a comment

This looks good, I just left a couple of questions. After you answer, I think it should be good to implement in every other logger.

Thanks for the PR!

torchtune/training/metric_logging.py (review thread, outdated, resolved)
@@ -23,6 +23,20 @@
log = get_logger("DEBUG")


def save_config(config):

Contributor:

question: in the wandb implementation, we resolve the config first

resolved = OmegaConf.to_container(config, resolve=True)

Is this necessary here too?

Ankur-singh (Author):

The to_container method converts an OmegaConf object into standard Python containers. Wandb expects standard Python objects like dictionaries and lists. But for saving the config, we don't need to convert it.

Furthermore, we can use resolve=True when calling OmegaConf.save to resolve the interpolations before saving the config to YAML. However, I didn't see any interpolations in our config files.

Hope this helps :D
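
For illustration, here is a toy example of the distinction being discussed, using a hypothetical two-key config with a single interpolation:

from omegaconf import OmegaConf

cfg = OmegaConf.create({"output_dir": "./logs/", "log_dir": "${output_dir}/logs"})

# wandb/comet want plain Python containers, so interpolations get resolved first.
resolved = OmegaConf.to_container(cfg, resolve=True)
print(resolved["log_dir"])  # ./logs//logs

# Saving to YAML needs no conversion; by default resolve=False keeps ${output_dir} literal.
OmegaConf.save(cfg, "unresolved.yaml")
OmegaConf.save(cfg, "resolved.yaml", resolve=True)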

Contributor:

Thanks! Is checkpointer.output_dir: ${output_dir} an example of interpolation? I think it's good to maintain the same pattern for everything, i.e. let's either change wandb/comet or change save_config.

@Ankur-singh Ankur-singh (Author), Jan 17, 2025:

Yes, checkpointer.output_dir: ${output_dir} is an interpolation.

Just to make it explicit, wandb/comet save the config in two formats:

  1. Dictionary: the resolved dictionary is pushed to the hub.
  2. YAML file: the config file is saved as an artifact. The YAML file is saved as-is (i.e. not using the resolved dictionary).

I'm pasting the log_config method from CometLogger below for reference:

def log_config(self, config: DictConfig) -> None:
    if self.experiment is not None:
        resolved = OmegaConf.to_container(config, resolve=True)
        self.experiment.log_parameters(resolved)

        # Also try to save the config as a file
        try:
            self._log_config_as_file(config)
        except Exception as e:
            log.warning(f"Error saving Config to disk.\nError: \n{e}.")

As you can see, the resolved variable is not used when saving the config as a file. WandBLogger follows the same pattern.

I believe we don't need the resolved dictionary, since we are only interested in saving in YAML format. However, if we wish to resolve the config before saving, we can simply do:

with open("config.yaml", "w") as f:
    f.write(OmegaConf.to_yaml(config, resolve=True))

Hope I didn't misunderstand your comment. Let me know what you think: should we resolve the config or not?

@felipemello1 (Contributor)

It's looking good. Let's please go ahead and implement it for every logger. We should use the same util inside of wandb and comet, if possible, plus whatever additional functions they require. Notice that comet is also saving it to the wrong directory (checkpoint_dir).

If you have bandwidth, running every logger once for a sanity check would be ideal.
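
One plausible shape for that reuse, sketched against the public wandb API; the import path for save_config and the surrounding torchtune wiring are assumptions, not the final implementation:

import wandb
from omegaconf import DictConfig, OmegaConf

from torchtune.training.metric_logging import save_config  # utility added in this PR (assumed import path)


def wandb_log_config(config: DictConfig) -> None:
    """Hypothetical sketch of a WandBLogger.log_config that reuses the shared save_config util."""
    if wandb.run is None:
        return
    # Dictionary view: W&B expects plain Python containers, so resolve interpolations first.
    wandb.config.update(OmegaConf.to_container(config, resolve=True))
    # File view: reuse the shared utility so the YAML lands in output_dir (not checkpoint_dir),
    # then upload it under the run's Files tab.
    saved_path = save_config(config)
    wandb.save(str(saved_path), base_path=str(saved_path.parent))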

@Ankur-singh (Author)

I have updated the metric_logging.py file. In addition to updating the log_config methods, I made some minor changes to the __init__ methods of WandBLogger and CometLogger to prevent errors that were previously hidden.

I'm adding the output from each logger below for reference.

Disk Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./DiskLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ./DiskLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./DiskLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./DiskLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1248908266. Local seed is seed + rank = 1248908266 + 0
INFO:torchtune.utils._logging:Writing config to DiskLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
Writing logs to DiskLoggerOut/logs/log_1737162084.txt

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|| 1/10 [00:04<00:38,  4.27s/it]
1|1|Loss: 2.9877049922943115:  10%|| 1/10 [00:04<00:38,  4.27s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:33,  4.24s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:33,  4.24s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:11<00:25,  3.60s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:11<00:25,  3.60s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:14<00:19,  3.32s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:14<00:19,  3.32s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:19,  3.81s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:19,  3.81s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.40s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.40s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:24<00:09,  3.21s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:24<00:09,  3.21s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:28<00:06,  3.37s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:28<00:06,  3.37s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:30<00:03,  3.16s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:30<00:03,  3.16s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:34<00:00,  3.32s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00,  3.32s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to DiskLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:36<00:00,  3.65s/it]

Stdout Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./StdoutLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.StdoutLogger
  log_dir: ./StdoutLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./StdoutLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./StdoutLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 625184984. Local seed is seed + rank = 625184984 + 0
INFO:torchtune.utils._logging:Writing config to StdoutLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|| 1/10 [00:04<00:38,  4.28s/it]
1|1|Loss: 2.9877049922943115:  10%|| 1/10 [00:04<00:38,  4.28s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:32,  4.07s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:32,  4.07s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:11<00:25,  3.57s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:11<00:25,  3.57s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:13<00:19,  3.25s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:13<00:19,  3.25s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:18,  3.75s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:18,  3.75s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:23<00:09,  3.17s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:23<00:09,  3.17s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:26<00:05,  2.99s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:26<00:05,  2.99s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:28<00:02,  2.76s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:28<00:02,  2.76s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:31<00:00,  2.77s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:31<00:00,  2.77s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to StdoutLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:33<00:00,  3.39s/it]
Step 1 | loss:2.9877049922943115 lr:2e-05 tokens_per_second_per_gpu:90.10209655761719 
Step 2 | loss:1.7919223308563232 lr:2e-05 tokens_per_second_per_gpu:90.45220947265625 
Step 3 | loss:1.48738694190979 lr:2e-05 tokens_per_second_per_gpu:99.32902526855469 
Step 4 | loss:1.3556486368179321 lr:2e-05 tokens_per_second_per_gpu:112.30082702636719 
Step 5 | loss:1.263526201248169 lr:2e-05 tokens_per_second_per_gpu:117.72256469726562 
Step 6 | loss:1.314545750617981 lr:2e-05 tokens_per_second_per_gpu:137.54611206054688 
Step 7 | loss:1.2146070003509521 lr:2e-05 tokens_per_second_per_gpu:143.7611083984375 
Step 8 | loss:1.197435736656189 lr:2e-05 tokens_per_second_per_gpu:120.44074249267578 
Step 9 | loss:1.1718554496765137 lr:2e-05 tokens_per_second_per_gpu:158.5180206298828 
Step 10 | loss:1.088627576828003 lr:2e-05 tokens_per_second_per_gpu:144.5830841064453 

WandB Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./WandBLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  log_dir: ./WandBLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./WandBLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./WandBLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1022127600. Local seed is seed + rank = 1022127600 + 0
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.3
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
INFO:torchtune.utils._logging:Writing config to WandBLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Uploading WandBLoggerOut/torchtune_config.yaml to W&B under Files
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:29<00:00,  2.87s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to WandBLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:32<00:00,  3.20s/it]
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:               global_step ▁▂▃▃▄▅▆▆▇█
wandb:                      loss █▄▂▂▂▂▁▁▁▁
wandb:                        lr ▁▁▁▁▁▁▁▁▁▁
wandb: tokens_per_second_per_gpu ▂▁▅▆▅▇▇▃█▄
wandb: 
wandb: Run summary:
wandb:               global_step 10
wandb:                      loss 1.08863
wandb:                        lr 2e-05
wandb: tokens_per_second_per_gpu 122.5525
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ./WandBLoggerOut/logs/wandb/offline-run-20250118_011118-tmk0azdt
wandb: Find logs at: ./WandBLoggerOut/logs/wandb/offline-run-20250118_011118-tmk0azdt/logs

TensorBoard Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./TensorBoardLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.TensorBoardLogger
  log_dir: ./TensorBoardLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./TensorBoardLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./TensorBoardLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1183309963. Local seed is seed + rank = 1183309963 + 0
INFO:torchtune.utils._logging:Writing config to TensorBoardLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|| 1/10 [00:04<00:37,  4.13s/it]
1|1|Loss: 2.9877049922943115:  10%|| 1/10 [00:04<00:37,  4.13s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:32,  4.03s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:32,  4.03s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:10<00:24,  3.47s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:10<00:24,  3.47s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:13<00:18,  3.12s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:13<00:18,  3.12s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:18,  3.72s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:18,  3.72s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.41s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.41s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:24<00:09,  3.30s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:24<00:09,  3.30s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:26<00:06,  3.16s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:26<00:06,  3.16s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:29<00:02,  3.00s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:29<00:02,  3.00s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:32<00:00,  3.07s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:32<00:00,  3.07s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to TensorBoardLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00,  3.49s/it]

Comet Logger:

INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: ./CometLoggerOut
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
  _component_: torchtune.training.metric_logging.CometLogger
  log_dir: ./CometLoggerOut/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./CometLoggerOut
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: ./CometLoggerOut/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model

INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 4230621073. Local seed is seed + rank = 4230621073 + 0
COMET WARNING: To get all data logged automatically, import comet_ml before the following modules: torch.
COMET INFO: Experiment is live on comet.com https://www.comet.com/ankur-singh/general/9d26c9e13895442aab3965049061eee7

COMET INFO: Couldn't find a Git repository in '/home/demo/github' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
INFO:torchtune.utils._logging:Writing config to CometLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Uploading CometLoggerOut/torchtune_config.yaml to Comet as an asset.
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|█         | 1/10 [00:04<00:36,  4.05s/it]
1|1|Loss: 2.9877049922943115:  10%|█         | 1/10 [00:04<00:36,  4.05s/it]
1|1|Loss: 2.9877049922943115:  20%|██        | 2/10 [00:08<00:33,  4.23s/it]
1|2|Loss: 1.7919223308563232:  20%|██        | 2/10 [00:08<00:33,  4.23s/it]
1|2|Loss: 1.7919223308563232:  30%|███       | 3/10 [00:11<00:25,  3.62s/it]
1|3|Loss: 1.48738694190979:  30%|███       | 3/10 [00:11<00:25,  3.62s/it]  
1|3|Loss: 1.48738694190979:  40%|████      | 4/10 [00:14<00:19,  3.27s/it]
1|4|Loss: 1.3556486368179321:  40%|████      | 4/10 [00:14<00:19,  3.27s/it]
1|4|Loss: 1.3556486368179321:  50%|█████     | 5/10 [00:18<00:18,  3.64s/it]
1|5|Loss: 1.263526201248169:  50%|█████     | 5/10 [00:18<00:18,  3.64s/it] 
1|5|Loss: 1.263526201248169:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  60%|██████    | 6/10 [00:21<00:13,  3.35s/it]
1|6|Loss: 1.314545750617981:  70%|███████   | 7/10 [00:24<00:10,  3.34s/it]
1|7|Loss: 1.2146070003509521:  70%|███████   | 7/10 [00:24<00:10,  3.34s/it]
1|7|Loss: 1.2146070003509521:  80%|████████  | 8/10 [00:27<00:06,  3.22s/it]
1|8|Loss: 1.197435736656189:  80%|████████  | 8/10 [00:27<00:06,  3.22s/it] 
1|8|Loss: 1.197435736656189:  90%|█████████ | 9/10 [00:30<00:03,  3.17s/it]
1|9|Loss: 1.1718554496765137:  90%|█████████ | 9/10 [00:30<00:03,  3.17s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:34<00:00,  3.31s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00,  3.31s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to CometLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.

1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:36<00:00,  3.68s/it]
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     name                  : long_hip_409
COMET INFO:     url                   : https://www.comet.com/ankur-singh/general/9d26c9e13895442aab3965049061eee7
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     loss [10]                      : (1.088627576828003, 2.9877049922943115)
COMET INFO:     lr                             : 2e-05
COMET INFO:     tokens_per_second_per_gpu [10] : (81.59734344482422, 127.8671646118164)
COMET INFO:   Others:
COMET INFO:     hasNestedParams : True
COMET INFO:   Parameters:
COMET INFO:     batch_size                      : 4
COMET INFO:     checkpointer|_component_        : torchtune.training.FullModelHFCheckpointer
COMET INFO:     checkpointer|checkpoint_dir     : /tmp/Llama-3.2-1B-Instruct/
COMET INFO:     checkpointer|checkpoint_files   : ['model.safetensors']
COMET INFO:     checkpointer|model_type         : LLAMA3_2
COMET INFO:     checkpointer|output_dir         : ./CometLoggerOut
COMET INFO:     checkpointer|recipe_checkpoint  : None
COMET INFO:     clip_grad_norm                  : None
COMET INFO:     compile                         : False
COMET INFO:     dataset|_component_             : torchtune.datasets.alpaca_dataset
COMET INFO:     dataset|packed                  : False
COMET INFO:     dtype                           : bf16
COMET INFO:     enable_activation_checkpointing : False
COMET INFO:     enable_activation_offloading    : False
COMET INFO:     epochs                          : 1
COMET INFO:     gradient_accumulation_steps     : 1
COMET INFO:     log_every_n_steps               : 1
COMET INFO:     log_peak_memory_stats           : True
COMET INFO:     loss|_component_                : torchtune.modules.loss.CEWithChunkedOutputLoss
COMET INFO:     metric_logger|_component_       : torchtune.training.metric_logging.CometLogger
COMET INFO:     metric_logger|log_dir           : ./CometLoggerOut/logs
COMET INFO:     model|_component_               : torchtune.models.llama3_2.llama3_2_1b
COMET INFO:     optimizer_in_bwd                : True
COMET INFO:     optimizer|_component_           : torch.optim.AdamW
COMET INFO:     optimizer|lr                    : 2e-05
COMET INFO:     profiler|_component_            : torchtune.training.setup_torch_profiler
COMET INFO:     profiler|active_steps           : 2
COMET INFO:     profiler|cpu                    : True
COMET INFO:     profiler|cuda                   : True
COMET INFO:     profiler|enabled                : False
COMET INFO:     profiler|num_cycles             : 1
COMET INFO:     profiler|output_dir             : ./CometLoggerOut/profiling_outputs
COMET INFO:     profiler|profile_memory         : False
COMET INFO:     profiler|record_shapes          : True
COMET INFO:     profiler|wait_steps             : 5
COMET INFO:     profiler|warmup_steps           : 3
COMET INFO:     profiler|with_flops             : False
COMET INFO:     profiler|with_stack             : False
COMET INFO:     resume_from_checkpoint          : False
COMET INFO:     seed                            : None
COMET INFO:     shuffle                         : True
COMET INFO:     tokenizer|_component_           : torchtune.models.llama3.llama3_tokenizer
COMET INFO:     tokenizer|max_seq_len           : None
COMET INFO:     tokenizer|path                  : /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
COMET INFO:   Uploads:
COMET INFO:     asset                        : 1 (1.39 KB)
COMET INFO:     conda-environment-definition : 1
COMET INFO:     conda-info                   : 1
COMET INFO:     conda-specification          : 1
COMET INFO:     environment details          : 1
COMET INFO:     filename                     : 1
COMET INFO:     installed packages           : 1
COMET INFO:     os packages                  : 1
COMET INFO:     source_code                  : 2 (52.21 KB)
COMET INFO: 
COMET WARNING: To get all data logged automatically, import comet_ml before the following modules: torch.

@Ankur-singh (Author)

Also wanted to bring some attention to this:

All calls to the metric_logger instance are already conditioned on if self._is_rank_zero. So ideally, logger implementations should not be concerned with rank.
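
A minimal sketch of that recipe-side pattern (hypothetical helper; the recipes inline this check rather than wrapping it in a function):

import torch.distributed as dist


def log_config_rank_zero_only(metric_logger, cfg) -> None:
    # Mirror the recipe convention: only rank zero talks to the metric logger,
    # so individual logger implementations never need to reason about rank.
    is_rank_zero = (not dist.is_initialized()) or dist.get_rank() == 0
    if is_rank_zero:
        metric_logger.log_config(cfg)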

@codecov-commenter

Codecov Report

Attention: Patch coverage is 48.14815% with 14 lines in your changes missing coverage. Please review.

Project coverage is 64.10%. Comparing base (779569e) to head (6218583).

Files with missing lines | Patch % | Lines
torchtune/training/metric_logging.py | 48.14% | 14 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2274       +/-   ##
===========================================
+ Coverage   23.97%   64.10%   +40.13%     
===========================================
  Files         358      353        -5     
  Lines       21207    20695      -512     
===========================================
+ Hits         5084    13267     +8183     
+ Misses      16123     7428     -8695     

☔ View full report in Codecov by Sentry.

@felipemello1 felipemello1 (Contributor) left a comment

Thanks for the testing and the changes! Let's just remove the logs and I will merge it :)

torchtune/training/metric_logging.py: three review threads (outdated, resolved)
Comment on lines +218 to +220
# create log_dir if missing
if not os.path.exists(self.log_dir):
    os.makedirs(self.log_dir)

Contributor:

why did we have to add this?

Ankur-singh (Author):

That is a very good question. If log_dir does not exist when calling self._wandb.init(), then WandB creates a directory inside /tmp/ and logs everything there.

Often log_dir is set to ${output_dir}/logs, hence this went undetected.
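
A minimal sketch of the failure mode and the fix (the helper name is made up; wandb.init's dir argument is the standard parameter):

import os

import wandb


def init_wandb_run(log_dir: str, **kwargs):
    # Create log_dir up front: per the observation above, if it is missing,
    # wandb.init silently falls back to a run directory under /tmp/.
    os.makedirs(log_dir, exist_ok=True)
    return wandb.init(dir=log_dir, **kwargs)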

Contributor:

Hi @Ankur-singh, I was able to use

metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger

before, but now it seems like I need to provide a log_dir, otherwise it errors. Just curious if we could roll back to the previous design?

@felipemello1 (Contributor)

will do a last review this Tuesday and merge it. Thank you for the changes :)

@felipemello1 felipemello1 merged commit 75965d4 into pytorch:main Jan 21, 2025
17 checks passed
@RdoubleA RdoubleA mentioned this pull request Jan 21, 2025
@Ankur-singh Ankur-singh deleted the feat/resolved-config branch January 23, 2025 01:43