Logging resolved config #2274
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2274
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 2a04a07 with merge base 779569e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This looks good, I just left a couple of questions. Once you answer them, I think it should be good to implement in every other logger.
Thanks for the PR!
torchtune/training/metric_logging.py (Outdated)
@@ -23,6 +23,20 @@
log = get_logger("DEBUG")


def save_config(config):
Question: in the wandb implementation, we resolve the config first:
resolved = OmegaConf.to_container(config, resolve=True)
Is this necessary here too?
The to_container method is used to convert an OmegaConf object to a standard Python object. Wandb expects standard Python container objects like dictionaries and lists. But for saving the config, we don't need to convert it.
Furthermore, we can use resolve=True when calling OmegaConf.save to resolve the interpolations before saving the config to YAML. However, I didn't see any interpolations in our config files.
Hope this helps :D
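For illustration, here is a minimal standalone OmegaConf sketch (not torchtune code; the config contents are made up) showing the difference between converting to a plain container and saving to YAML:
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "output_dir": "./out",
    "checkpointer": {"output_dir": "${output_dir}"},  # interpolation
})

# wandb/comet expect plain Python containers, so interpolations get resolved here:
resolved = OmegaConf.to_container(cfg, resolve=True)
# -> {'output_dir': './out', 'checkpointer': {'output_dir': './out'}}

# Saving to YAML needs no conversion; resolving is optional:
OmegaConf.save(cfg, "config.yaml")                         # keeps ${output_dir} as-is
OmegaConf.save(cfg, "config_resolved.yaml", resolve=True)  # writes ./out instead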
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Is checkpointer.output_dir: ${output_dir} an example of interpolation? I think it's good to maintain the same pattern for everything, i.e. let's either change wandb/comet or change save_config.
Yes, checkpointer.output_dir: ${output_dir} is an interpolation.
Just to make it explicit, wandb/comet save the config in two formats:
- Dictionary: the resolved dictionary is pushed to the hub.
- YAML file: the config file is saved as an artifact. The YAML file is saved as-is (i.e. not using the resolved dictionary).
I'm pasting the log_config method from CometLogger below for reference:
def log_config(self, config: DictConfig) -> None:
    if self.experiment is not None:
        resolved = OmegaConf.to_container(config, resolve=True)
        self.experiment.log_parameters(resolved)

        # Also try to save the config as a file
        try:
            self._log_config_as_file(config)
        except Exception as e:
            log.warning(f"Error saving Config to disk.\nError: \n{e}.")
As you can see, the resolved variable is not used when saving the config as a file. WandBLogger follows the same pattern.
I believe we don't need the resolved dictionary, as we are only interested in saving in YAML format. However, if we wish to resolve the config before saving, we can simply do:
with open("config.yaml", "w") as f:
    f.write(OmegaConf.to_yaml(conf, resolve=True))
Hope I didn't misunderstand your comment. Let me know what you think: should we resolve the config or not?
It's looking good. Let's please go ahead and implement it for every logger. We should use the same util inside wandb and comet if possible, plus whatever additional function they require. Notice that comet is also saving it to the wrong directory (checkpoint_dir). If you have bandwidth, running every logger once for a sanity check would be ideal.
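For reference, one way such a shared helper might look — a hedged sketch only, assuming it derives the destination from the config's output_dir (the exact signature and filename in the PR may differ; the torchtune_config.yaml name is taken from the logs below):
import os
from omegaconf import DictConfig, OmegaConf

def save_config(config: DictConfig) -> str:
    """Write the config YAML into the run's output_dir and return the file path."""
    output_dir = config.output_dir                    # assumes the config defines output_dir
    os.makedirs(output_dir, exist_ok=True)            # create the directory if missing
    path = os.path.join(output_dir, "torchtune_config.yaml")
    OmegaConf.save(config, path)                      # saved as-is (interpolations not resolved)
    return path
Each logger's log_config could then call this helper and add whatever extra step it needs (e.g. wandb/comet pushing the resolved dictionary).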
I have updated all the loggers. I'm adding the output from each logger below for reference.

Disk Logger:
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: ./DiskLoggerOut
recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ./DiskLoggerOut/logs
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
_component_: torch.optim.AdamW
lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./DiskLoggerOut
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: ./DiskLoggerOut/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1248908266. Local seed is seed + rank = 1248908266 + 0
INFO:torchtune.utils._logging:Writing config to DiskLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
Writing logs to DiskLoggerOut/logs/log_1737162084.txt
0%| | 0/10 [00:00<?, ?it/s]
10%|█ | 1/10 [00:04<00:38, 4.27s/it]
1|1|Loss: 2.9877049922943115: 10%|█ | 1/10 [00:04<00:38, 4.27s/it]
1|1|Loss: 2.9877049922943115: 20%|██ | 2/10 [00:08<00:33, 4.24s/it]
1|2|Loss: 1.7919223308563232: 20%|██ | 2/10 [00:08<00:33, 4.24s/it]
1|2|Loss: 1.7919223308563232: 30%|███ | 3/10 [00:11<00:25, 3.60s/it]
1|3|Loss: 1.48738694190979: 30%|███ | 3/10 [00:11<00:25, 3.60s/it]
1|3|Loss: 1.48738694190979: 40%|████ | 4/10 [00:14<00:19, 3.32s/it]
1|4|Loss: 1.3556486368179321: 40%|████ | 4/10 [00:14<00:19, 3.32s/it]
1|4|Loss: 1.3556486368179321: 50%|█████ | 5/10 [00:18<00:19, 3.81s/it]
1|5|Loss: 1.263526201248169: 50%|█████ | 5/10 [00:18<00:19, 3.81s/it]
1|5|Loss: 1.263526201248169: 60%|██████ | 6/10 [00:21<00:13, 3.40s/it]
1|6|Loss: 1.314545750617981: 60%|██████ | 6/10 [00:21<00:13, 3.40s/it]
1|6|Loss: 1.314545750617981: 70%|███████ | 7/10 [00:24<00:09, 3.21s/it]
1|7|Loss: 1.2146070003509521: 70%|███████ | 7/10 [00:24<00:09, 3.21s/it]
1|7|Loss: 1.2146070003509521: 80%|████████ | 8/10 [00:28<00:06, 3.37s/it]
1|8|Loss: 1.197435736656189: 80%|████████ | 8/10 [00:28<00:06, 3.37s/it]
1|8|Loss: 1.197435736656189: 90%|█████████ | 9/10 [00:30<00:03, 3.16s/it]
1|9|Loss: 1.1718554496765137: 90%|█████████ | 9/10 [00:30<00:03, 3.16s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:34<00:00, 3.32s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00, 3.32s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to DiskLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:36<00:00, 3.65s/it]

Stdout Logger:
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: ./StdoutLoggerOut
recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
_component_: torchtune.training.metric_logging.StdoutLogger
log_dir: ./StdoutLoggerOut/logs
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
_component_: torch.optim.AdamW
lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./StdoutLoggerOut
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: ./StdoutLoggerOut/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 625184984. Local seed is seed + rank = 625184984 + 0
INFO:torchtune.utils._logging:Writing config to StdoutLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
0%| | 0/10 [00:00<?, ?it/s]
10%|█ | 1/10 [00:04<00:38, 4.28s/it]
1|1|Loss: 2.9877049922943115: 10%|█ | 1/10 [00:04<00:38, 4.28s/it]
1|1|Loss: 2.9877049922943115: 20%|██ | 2/10 [00:08<00:32, 4.07s/it]
1|2|Loss: 1.7919223308563232: 20%|██ | 2/10 [00:08<00:32, 4.07s/it]
1|2|Loss: 1.7919223308563232: 30%|███ | 3/10 [00:11<00:25, 3.57s/it]
1|3|Loss: 1.48738694190979: 30%|███ | 3/10 [00:11<00:25, 3.57s/it]
1|3|Loss: 1.48738694190979: 40%|████ | 4/10 [00:13<00:19, 3.25s/it]
1|4|Loss: 1.3556486368179321: 40%|████ | 4/10 [00:13<00:19, 3.25s/it]
1|4|Loss: 1.3556486368179321: 50%|█████ | 5/10 [00:18<00:18, 3.75s/it]
1|5|Loss: 1.263526201248169: 50%|█████ | 5/10 [00:18<00:18, 3.75s/it]
1|5|Loss: 1.263526201248169: 60%|██████ | 6/10 [00:21<00:13, 3.35s/it]
1|6|Loss: 1.314545750617981: 60%|██████ | 6/10 [00:21<00:13, 3.35s/it]
1|6|Loss: 1.314545750617981: 70%|███████ | 7/10 [00:23<00:09, 3.17s/it]
1|7|Loss: 1.2146070003509521: 70%|███████ | 7/10 [00:23<00:09, 3.17s/it]
1|7|Loss: 1.2146070003509521: 80%|████████ | 8/10 [00:26<00:05, 2.99s/it]
1|8|Loss: 1.197435736656189: 80%|████████ | 8/10 [00:26<00:05, 2.99s/it]
1|8|Loss: 1.197435736656189: 90%|█████████ | 9/10 [00:28<00:02, 2.76s/it]
1|9|Loss: 1.1718554496765137: 90%|█████████ | 9/10 [00:28<00:02, 2.76s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:31<00:00, 2.77s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:31<00:00, 2.77s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to StdoutLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:33<00:00, 3.39s/it]
Step 1 | loss:2.9877049922943115 lr:2e-05 tokens_per_second_per_gpu:90.10209655761719
Step 2 | loss:1.7919223308563232 lr:2e-05 tokens_per_second_per_gpu:90.45220947265625
Step 3 | loss:1.48738694190979 lr:2e-05 tokens_per_second_per_gpu:99.32902526855469
Step 4 | loss:1.3556486368179321 lr:2e-05 tokens_per_second_per_gpu:112.30082702636719
Step 5 | loss:1.263526201248169 lr:2e-05 tokens_per_second_per_gpu:117.72256469726562
Step 6 | loss:1.314545750617981 lr:2e-05 tokens_per_second_per_gpu:137.54611206054688
Step 7 | loss:1.2146070003509521 lr:2e-05 tokens_per_second_per_gpu:143.7611083984375
Step 8 | loss:1.197435736656189 lr:2e-05 tokens_per_second_per_gpu:120.44074249267578
Step 9 | loss:1.1718554496765137 lr:2e-05 tokens_per_second_per_gpu:158.5180206298828
Step 10 | loss:1.088627576828003 lr:2e-05 tokens_per_second_per_gpu:144.5830841064453

WandB Logger:
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: ./WandBLoggerOut
recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
_component_: torchtune.training.metric_logging.WandBLogger
log_dir: ./WandBLoggerOut/logs
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
_component_: torch.optim.AdamW
lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./WandBLoggerOut
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: ./WandBLoggerOut/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1022127600. Local seed is seed + rank = 1022127600 + 0
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.3
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
INFO:torchtune.utils._logging:Writing config to WandBLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Uploading WandBLoggerOut/torchtune_config.yaml to W&B under Files
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:29<00:00, 2.87s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to WandBLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:32<00:00, 3.20s/it]
wandb:
wandb:
wandb: Run history:
wandb: global_step ▁▂▃▃▄▅▆▆▇█
wandb: loss █▄▂▂▂▂▁▁▁▁
wandb: lr ▁▁▁▁▁▁▁▁▁▁
wandb: tokens_per_second_per_gpu ▂▁▅▆▅▇▇▃█▄
wandb:
wandb: Run summary:
wandb: global_step 10
wandb: loss 1.08863
wandb: lr 2e-05
wandb: tokens_per_second_per_gpu 122.5525
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ./WandBLoggerOut/logs/wandb/offline-run-20250118_011118-tmk0azdt
wandb: Find logs at: ./WandBLoggerOut/logs/wandb/offline-run-20250118_011118-tmk0azdt/logs

TensorBoard Logger:
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: ./TensorBoardLoggerOut
recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
_component_: torchtune.training.metric_logging.TensorBoardLogger
log_dir: ./TensorBoardLoggerOut/logs
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
_component_: torch.optim.AdamW
lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./TensorBoardLoggerOut
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: ./TensorBoardLoggerOut/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1183309963. Local seed is seed + rank = 1183309963 + 0
INFO:torchtune.utils._logging:Writing config to TensorBoardLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
0%| | 0/10 [00:00<?, ?it/s]
10%|█ | 1/10 [00:04<00:37, 4.13s/it]
1|1|Loss: 2.9877049922943115: 10%|█ | 1/10 [00:04<00:37, 4.13s/it]
1|1|Loss: 2.9877049922943115: 20%|██ | 2/10 [00:08<00:32, 4.03s/it]
1|2|Loss: 1.7919223308563232: 20%|██ | 2/10 [00:08<00:32, 4.03s/it]
1|2|Loss: 1.7919223308563232: 30%|███ | 3/10 [00:10<00:24, 3.47s/it]
1|3|Loss: 1.48738694190979: 30%|███ | 3/10 [00:10<00:24, 3.47s/it]
1|3|Loss: 1.48738694190979: 40%|████ | 4/10 [00:13<00:18, 3.12s/it]
1|4|Loss: 1.3556486368179321: 40%|████ | 4/10 [00:13<00:18, 3.12s/it]
1|4|Loss: 1.3556486368179321: 50%|█████ | 5/10 [00:18<00:18, 3.72s/it]
1|5|Loss: 1.263526201248169: 50%|█████ | 5/10 [00:18<00:18, 3.72s/it]
1|5|Loss: 1.263526201248169: 60%|██████ | 6/10 [00:21<00:13, 3.41s/it]
1|6|Loss: 1.314545750617981: 60%|██████ | 6/10 [00:21<00:13, 3.41s/it]
1|6|Loss: 1.314545750617981: 70%|███████ | 7/10 [00:24<00:09, 3.30s/it]
1|7|Loss: 1.2146070003509521: 70%|███████ | 7/10 [00:24<00:09, 3.30s/it]
1|7|Loss: 1.2146070003509521: 80%|████████ | 8/10 [00:26<00:06, 3.16s/it]
1|8|Loss: 1.197435736656189: 80%|████████ | 8/10 [00:26<00:06, 3.16s/it]
1|8|Loss: 1.197435736656189: 90%|█████████ | 9/10 [00:29<00:02, 3.00s/it]
1|9|Loss: 1.1718554496765137: 90%|█████████ | 9/10 [00:29<00:02, 3.00s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:32<00:00, 3.07s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:32<00:00, 3.07s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to TensorBoardLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00, 3.49s/it]

Comet Logger:
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: ./CometLoggerOut
recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: false
device: cpu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 10
metric_logger:
_component_: torchtune.training.metric_logging.CometLogger
log_dir: ./CometLoggerOut/logs
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
_component_: torch.optim.AdamW
lr: 2.0e-05
optimizer_in_bwd: true
output_dir: ./CometLoggerOut
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: ./CometLoggerOut/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
INFO:torchtune.utils._logging:log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 4230621073. Local seed is seed + rank = 4230621073 + 0
COMET WARNING: To get all data logged automatically, import comet_ml before the following modules: torch.
COMET INFO: Experiment is live on comet.com https://www.comet.com/ankur-singh/general/9d26c9e13895442aab3965049061eee7
COMET INFO: Couldn't find a Git repository in '/home/demo/github' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
INFO:torchtune.utils._logging:Writing config to CometLoggerOut/torchtune_config.yaml
INFO:torchtune.utils._logging:Uploading CometLoggerOut/torchtune_config.yaml to Comet as an asset.
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
0%| | 0/10 [00:00<?, ?it/s]
10%|█ | 1/10 [00:04<00:36, 4.05s/it]
1|1|Loss: 2.9877049922943115: 10%|█ | 1/10 [00:04<00:36, 4.05s/it]
1|1|Loss: 2.9877049922943115: 20%|██ | 2/10 [00:08<00:33, 4.23s/it]
1|2|Loss: 1.7919223308563232: 20%|██ | 2/10 [00:08<00:33, 4.23s/it]
1|2|Loss: 1.7919223308563232: 30%|███ | 3/10 [00:11<00:25, 3.62s/it]
1|3|Loss: 1.48738694190979: 30%|███ | 3/10 [00:11<00:25, 3.62s/it]
1|3|Loss: 1.48738694190979: 40%|████ | 4/10 [00:14<00:19, 3.27s/it]
1|4|Loss: 1.3556486368179321: 40%|████ | 4/10 [00:14<00:19, 3.27s/it]
1|4|Loss: 1.3556486368179321: 50%|█████ | 5/10 [00:18<00:18, 3.64s/it]
1|5|Loss: 1.263526201248169: 50%|█████ | 5/10 [00:18<00:18, 3.64s/it]
1|5|Loss: 1.263526201248169: 60%|██████ | 6/10 [00:21<00:13, 3.35s/it]
1|6|Loss: 1.314545750617981: 60%|██████ | 6/10 [00:21<00:13, 3.35s/it]
1|6|Loss: 1.314545750617981: 70%|███████ | 7/10 [00:24<00:10, 3.34s/it]
1|7|Loss: 1.2146070003509521: 70%|███████ | 7/10 [00:24<00:10, 3.34s/it]
1|7|Loss: 1.2146070003509521: 80%|████████ | 8/10 [00:27<00:06, 3.22s/it]
1|8|Loss: 1.197435736656189: 80%|████████ | 8/10 [00:27<00:06, 3.22s/it]
1|8|Loss: 1.197435736656189: 90%|█████████ | 9/10 [00:30<00:03, 3.17s/it]
1|9|Loss: 1.1718554496765137: 90%|█████████ | 9/10 [00:30<00:03, 3.17s/it]
1|9|Loss: 1.1718554496765137: 100%|██████████| 10/10 [00:34<00:00, 3.31s/it]
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:34<00:00, 3.31s/it]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to CometLoggerOut/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|10|Loss: 1.088627576828003: 100%|██████████| 10/10 [00:36<00:00, 3.68s/it]
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : long_hip_409
COMET INFO: url : https://www.comet.com/ankur-singh/general/9d26c9e13895442aab3965049061eee7
COMET INFO: Metrics [count] (min, max):
COMET INFO: loss [10] : (1.088627576828003, 2.9877049922943115)
COMET INFO: lr : 2e-05
COMET INFO: tokens_per_second_per_gpu [10] : (81.59734344482422, 127.8671646118164)
COMET INFO: Others:
COMET INFO: hasNestedParams : True
COMET INFO: Parameters:
COMET INFO: batch_size : 4
COMET INFO: checkpointer|_component_ : torchtune.training.FullModelHFCheckpointer
COMET INFO: checkpointer|checkpoint_dir : /tmp/Llama-3.2-1B-Instruct/
COMET INFO: checkpointer|checkpoint_files : ['model.safetensors']
COMET INFO: checkpointer|model_type : LLAMA3_2
COMET INFO: checkpointer|output_dir : ./CometLoggerOut
COMET INFO: checkpointer|recipe_checkpoint : None
COMET INFO: clip_grad_norm : None
COMET INFO: compile : False
COMET INFO: dataset|_component_ : torchtune.datasets.alpaca_dataset
COMET INFO: dataset|packed : False
COMET INFO: dtype : bf16
COMET INFO: enable_activation_checkpointing : False
COMET INFO: enable_activation_offloading : False
COMET INFO: epochs : 1
COMET INFO: gradient_accumulation_steps : 1
COMET INFO: log_every_n_steps : 1
COMET INFO: log_peak_memory_stats : True
COMET INFO: loss|_component_ : torchtune.modules.loss.CEWithChunkedOutputLoss
COMET INFO: metric_logger|_component_ : torchtune.training.metric_logging.CometLogger
COMET INFO: metric_logger|log_dir : ./CometLoggerOut/logs
COMET INFO: model|_component_ : torchtune.models.llama3_2.llama3_2_1b
COMET INFO: optimizer_in_bwd : True
COMET INFO: optimizer|_component_ : torch.optim.AdamW
COMET INFO: optimizer|lr : 2e-05
COMET INFO: profiler|_component_ : torchtune.training.setup_torch_profiler
COMET INFO: profiler|active_steps : 2
COMET INFO: profiler|cpu : True
COMET INFO: profiler|cuda : True
COMET INFO: profiler|enabled : False
COMET INFO: profiler|num_cycles : 1
COMET INFO: profiler|output_dir : ./CometLoggerOut/profiling_outputs
COMET INFO: profiler|profile_memory : False
COMET INFO: profiler|record_shapes : True
COMET INFO: profiler|wait_steps : 5
COMET INFO: profiler|warmup_steps : 3
COMET INFO: profiler|with_flops : False
COMET INFO: profiler|with_stack : False
COMET INFO: resume_from_checkpoint : False
COMET INFO: seed : None
COMET INFO: shuffle : True
COMET INFO: tokenizer|_component_ : torchtune.models.llama3.llama3_tokenizer
COMET INFO: tokenizer|max_seq_len : None
COMET INFO: tokenizer|path : /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
COMET INFO: Uploads:
COMET INFO: asset : 1 (1.39 KB)
COMET INFO: conda-environment-definition : 1
COMET INFO: conda-info : 1
COMET INFO: conda-specification : 1
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: os packages : 1
COMET INFO: source_code : 2 (52.21 KB)
COMET INFO:
COMET WARNING: To get all data logged automatically, import comet_ml before the following modules: torch.

Also wanted to bring some attention to
All calls to
6ee8a73 to 6218583 (Compare)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2274       +/-   ##
===========================================
+ Coverage   23.97%   64.10%   +40.13%
===========================================
  Files         358      353        -5
  Lines       21207    20695      -512
===========================================
+ Hits         5084    13267     +8183
+ Misses      16123     7428     -8695

☔ View full report in Codecov by Sentry.
Thanks for the testing and the changes! Let's just remove the logs and I will merge it :)
# create log_dir if missing
if not os.path.exists(self.log_dir):
    os.makedirs(self.log_dir)
why did we have to add this?
Hi @Ankur-singh, I was able to use
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
before, but now it seems like I need to provide a log_dir, otherwise it errors. Just curious if we could roll back to the previous design?
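If backward compatibility is the concern, one option — a hypothetical sketch only, not the actual WandBLogger signature — is to keep log_dir optional and fall back to a default directory, so existing configs that don't set it keep working:
import os
from typing import Optional

class WandBLogger:
    # Hypothetical constructor: log_dir stays optional and falls back to a default,
    # so configs without an explicit log_dir continue to work.
    def __init__(self, project: str = "torchtune", log_dir: Optional[str] = None, **kwargs):
        self.log_dir = log_dir if log_dir is not None else os.path.join(os.getcwd(), "logs")
        os.makedirs(self.log_dir, exist_ok=True)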
Will do a last review this Tuesday and merge it. Thank you for the changes :)
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses: #1968
Changelog
What are the changes made in this PR?
- save_config utility function
- DiskLogger.log_config method

Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- pre-commit install
- pytest tests
- pytest tests -m integration_test
Stdout:
Generated log file at ./logs/torchtune_config.yaml:

UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.
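As an illustration of the intended user experience — a hypothetical sketch, with the config contents made up — logging the resolved config with the DiskLogger could look like:
from omegaconf import OmegaConf
from torchtune.training.metric_logging import DiskLogger

cfg = OmegaConf.create({"output_dir": "./logs", "batch_size": 4})  # made-up config

logger = DiskLogger(log_dir="./logs")
logger.log_config(cfg)            # writes the config YAML (e.g. torchtune_config.yaml)
logger.log("loss", 1.23, step=1)  # regular metric logging is unchanged
logger.close()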