Add Phi4 #2197

Open · wants to merge 7 commits into base: main
10 changes: 5 additions & 5 deletions recipes/configs/phi3/evaluation.yaml
@@ -7,25 +7,25 @@ output_dir: ./ # Not needed

# Model Arguments
model:
- _component_: torchtune.models.phi3.phi3_mini
+ _component_: torchtune.models.phi4.phi4_mini

# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
- checkpoint_dir: /tmp/Phi-3-mini-4k-instruct
+ checkpoint_dir: /tmp/phi-4
Contributor:
Make sure this matches the format of other directories

checkpoint_files: [
model-00001-of-00002.safetensors,
model-00002-of-00002.safetensors
]
recipe_checkpoint: null
output_dir: ${output_dir}
- model_type: PHI3_MINI
+ model_type: PHI4_MINI
resume_from_checkpoint: False

# Tokenizer
tokenizer:
- _component_: torchtune.models.phi3.phi3_mini_tokenizer
- path: /tmp/Phi-3-mini-4k-instruct/tokenizer.model
+ _component_: torchtune.models.phi4.phi4_mini_tokenizer
+ path: /tmp/phi-4/tokenizer.model
max_seq_len: null

# Environment
44 changes: 44 additions & 0 deletions recipes/configs/phi4/evaluation.yaml
Contributor:
It seems you copied this from phi3 but made the changes in phi3/evaluation instead of here.

Contributor:
Yeah I think these two eval files need to be swapped
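For reference, a minimal sketch of the keys that would end up in recipes/configs/phi4/evaluation.yaml once the two files are swapped, simply mirroring the values the PR currently placed in phi3/evaluation.yaml above (model_type follows that diff; the thread further down leans toward keeping PHI3_MINI):

```yaml
# recipes/configs/phi4/evaluation.yaml (sketch; only the keys that change are shown)
model:
  _component_: torchtune.models.phi4.phi4_mini

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/phi-4
  # checkpoint_files would presumably also need the six phi-4 shards
  # listed in mini_full.yaml below, not the two Phi-3 shards
  model_type: PHI4_MINI

tokenizer:
  _component_: torchtune.models.phi4.phi4_mini_tokenizer
  path: /tmp/phi-4/tokenizer.model
```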

@@ -0,0 +1,44 @@
# Config for EleutherEvalRecipe in eleuther_eval.py
#
# To launch, run the following command:
# tune run eleuther_eval --config phi3/evaluation
Contributor:
/phi3/phi4
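Applied to the header of this file, the launch line would presumably read:

```yaml
# To launch, run the following command:
#   tune run eleuther_eval --config phi4/evaluation
```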


output_dir: ./ # Not needed

# Model Arguments
model:
_component_: torchtune.models.phi3.phi3_mini

# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Phi-3-mini-4k-instruct
Contributor:
/Phi-3/Phi-4

checkpoint_files: [
model-00001-of-00002.safetensors,
model-00002-of-00002.safetensors
]
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: PHI3_MINI
Contributor:
/PHI3_MINI/PHI4_MINI

resume_from_checkpoint: False

# Tokenizer
tokenizer:
_component_: torchtune.models.phi3.phi3_mini_tokenizer
path: /tmp/Phi-3-mini-4k-instruct/tokenizer.model
Contributor:
/torchtune.models.phi3.phi3_mini_tokenizer/torchtune.models.phi4.phi4_mini_tokenizer

Contributor:
/Phi-3/Phi-4

max_seq_len: null

# Environment
device: cuda
dtype: bf16
seed: 1234 # It is not recommended to change this seed, b/c it matches EleutherAI's default seed

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096
batch_size: 8
enable_kv_cache: True

# Quantization specific args
quantizer: null
109 changes: 109 additions & 0 deletions recipes/configs/phi4/mini_full.yaml
Contributor:
n00b question: Is "mini" the right nomenclature? Or do they have a family of model sizes like phi4_7b, phi4_13B, etc?

Contributor (Author):
It's a debatable point: the Phi4 description calls it a "mini" model, but in practice it isn't one.

Contributor:
I wonder if we should drop the "mini" and just stick to model sizes, since it's more informative. @ebsmothers, any thoughts?

Contributor:
Yeah seems like they are mostly using model sizes instead of "mini" in public docs, so maybe let's go with 14B instead of mini?
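If the rename lands, the config and builder names in this PR would shift accordingly. A purely hypothetical sketch of the launch line and model block under size-based naming (phi4/14B_full and phi4_14b do not exist yet; they just follow the suggestion above):

```yaml
# tune run --nproc_per_node 4 full_finetune_distributed --config phi4/14B_full
model:
  _component_: torchtune.models.phi4.phi4_14b  # hypothetical builder name if "mini" is dropped
```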

@@ -0,0 +1,109 @@
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Phi4 16K Instruct
#
# This config assumes that you've run the following command before launching
# this run:
# tune download microsoft/phi-4 --output-dir /tmp/phi-4 --hf-token <HF_TOKEN>
#
# Run this config on 4 GPUs using the following:
# tune run --nproc_per_node 4 full_finetune_distributed --config phi4/mini_full
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nproc_per_node 4 full_finetune_distributed --config phi4/mini_full checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# Single device full finetuning requires more memory optimizations. It's
# best to use mini_low_memory.yaml for those cases

output_dir: /tmp/torchtune/phi4_mini/full # /tmp may be deleted by your system. Change it to your preference.

# Model arguments
model:
_component_: torchtune.models.phi4.phi4_mini

# Tokenizer
tokenizer:
_component_: torchtune.models.phi4.phi4_mini_tokenizer
path: /tmp/phi-4/tokenizer.model
max_seq_len: null

# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/phi-4
checkpoint_files: [
model-00001-of-00006.safetensors,
model-00002-of-00006.safetensors,
model-00003-of-00006.safetensors,
model-00004-of-00006.safetensors,
model-00005-of-00006.safetensors,
model-00006-of-00006.safetensors,
]
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: PHI3_MINI
Contributor:
n00b question: Are there any differences between PHI3 and PHI4? Even if there aren't, should we update the model_type for clarity? I believe this is used in the checkpointer to map the HF format to the torchtune format.

Contributor (Author):
According to the tech report, there are differences in the tokenizer and in attention, but in a way that doesn't affect us. Some observations I made above might lead us to a different conclusion, though.

Contributor:
I am tempted to say that even if PHI3_MINI == PHI4_MINI, every model should have its own nomenclature, so there is less cognitive load for the user. @ebsmothers, what do you think?

Contributor:
For now I would stick with the precedent we've set, which is to only use a new model type when the arch changes. This is what we do for the Llama family, where we have LLAMA3, LLAMA3_2, but not LLAMA3_1 or LLAMA3_3. I do agree with your point though @felipemello1 -- we can consider the renaming in a follow-up (at that time I would also probably drop the MINI from Phi model names too)

resume_from_checkpoint: False

# Dataset
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
packed: False # True increases speed
seed: null
shuffle: True

# Fine-tuning arguments
epochs: 1
max_steps_per_epoch: null
batch_size: 2
gradient_accumulation_steps: 8 # Use to increase effective batch size
optimizer:
_component_: torch.optim.AdamW
fused: True
lr: 5e-6
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
compile: False # torch.compile the model + loss, True increases speed + decreases memory
optimizer_in_bwd: False # True saves memory. Requires gradient_accumulation_steps=1

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
dtype: bf16

# Logging
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True


# Profiler (disabled)
profiler:
_component_: torchtune.training.setup_torch_profiler
enabled: False

#Output directory of trace artifacts
output_dir: ${output_dir}/profiling_outputs

#`torch.profiler.ProfilerActivity` types to trace
cpu: True
cuda: True

#trace options passed to `torch.profiler.profile`
profile_memory: False
with_stack: False
record_shapes: True
with_flops: False

# `torch.profiler.schedule` options:
# wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
wait_steps: 5
warmup_steps: 3
active_steps: 2
num_cycles: 1
110 changes: 110 additions & 0 deletions recipes/configs/phi4/mini_full_low_memory.yaml
Contributor:
I think that this was already the naming convention for Phi3, but we should probably add "single_device" to the config name.

Contributor:
Phi3 uses low_memory. Personally I would like to change full_low_memory -> full_single_device across the board, but again would prioritize consistency with Phi3 in this PR.

@@ -0,0 +1,110 @@
# Config for single device full finetuning in full_finetune_single_device.py
# using a Phi4 16K Instruct
#
# This config assumes that you've run the following command before launching
# this run:
# tune download microsoft/phi-4 --output-dir /tmp/phi-4 --hf-token <HF_TOKEN>
#
# The default config uses an optimizer from bitsandbytes. If you do not have it installed,
# you can install it with
# pip install bitsandbytes
#
# To launch on a single device, run the following command from root:
# tune run full_finetune_single_device --config phi4/mini_full_low_memory
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run full_finetune_single_device --config phi4/mini_full_low_memory checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works only for training on single device.

output_dir: /tmp/torchtune/phi4_mini/full_low_memory # /tmp may be deleted by your system. Change it to your preference.

# Model arguments
model:
_component_: torchtune.models.phi4.phi4_mini

# Tokenizer
tokenizer:
_component_: torchtune.models.phi4.phi4_mini_tokenizer
path: /tmp/phi-4/tokenizer.model
max_seq_len: null

# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/phi-4
checkpoint_files: [
model-00001-of-00006.safetensors,
model-00002-of-00006.safetensors,
model-00003-of-00006.safetensors,
model-00004-of-00006.safetensors,
model-00005-of-00006.safetensors,
model-00006-of-00006.safetensors,
]
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: PHI3_MINI
resume_from_checkpoint: False

# Dataset
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
packed: False # True increases speed
seed: null
shuffle: True

# Fine-tuning arguments
epochs: 1
max_steps_per_epoch: null
batch_size: 2
gradient_accumulation_steps: 1 # Use to increase effective batch size
optimizer:
_component_: bitsandbytes.optim.PagedAdamW
lr: 5e-6
optimizer_in_bwd: True # True saves memory. Requires gradient_accumulation_steps=1
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
compile: False # torch.compile the model + loss, True increases speed + decreases memory

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: True # True reduces memory
dtype: bf16

# Logging
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True


# Profiler (disabled)
profiler:
_component_: torchtune.training.setup_torch_profiler
enabled: False

#Output directory of trace artifacts
output_dir: ${output_dir}/profiling_outputs

#`torch.profiler.ProfilerActivity` types to trace
cpu: True
cuda: True

#trace options passed to `torch.profiler.profile`
profile_memory: False
with_stack: False
record_shapes: True
with_flops: False

# `torch.profiler.schedule` options:
# wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
wait_steps: 5
warmup_steps: 3
active_steps: 2
num_cycles: 1