Llama-3.3-70B-Instruct-4bit LoRA Fine-Tuning: No Change (or Instability) - Adapter Issue? #1147
-
Hi everyone. The core problem is that the LoRA adapter seems to have no usable effect on the model's output, despite training completing normally (the loss decreases as expected). It's not a matter of tuning the adapter scale; the adaptation either does nothing or breaks the model outright. Here's what I've tried:
I'm really stuck here, and any insights or suggestions would be greatly appreciated!
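For concreteness, this is the kind of side-by-side check that leads me to say the adapter has no effect (the prompt is elided here, and the adapter path is just wherever the training run saved its weights):

```shell
# Base model, no adapter
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --max-tokens 50 --prompt "<same prompt>"

# Same prompt, with the freshly trained LoRA adapter applied
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --adapter-path adapters \
  --max-tokens 50 --prompt "<same prompt>"
```

In my runs the two outputs are either effectively identical, or the adapted one is broken.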
-
I tried training this:
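(Presumably the same WikiSQL recipe that gets rerun later in this thread, roughly:)

```shell
# LoRA fine-tune of the 4-bit 70B model on the WikiSQL example dataset
# (flags reconstructed from the rerun below; the original run may have differed)
mlx_lm.lora --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --data mlx-community/wikisql \
  --train --iters 100 --batch-size 1 --num-layers 8
```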
And then evaluating it like this:
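(Again, presumably the same command that is rerun below:)

```shell
# Generate with the trained adapter applied to the quantized base model
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --adapter-path adapters --max-tokens 50 \
  --prompt "table: 1-10015132-16
columns: Player, No., Nationality, Position, Years in Toronto, School/Club Team
Q: What is terrence ross' nationality
A: "
```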
And it generated the following which is very reasonable:
So I'm not sure where things are going wrong for you. A few suggestions:
-
Hi @awni, sorry to bother you again, but I've run LoRA fine-tuning several more times and I'm still not getting good results. I'm on MLX version 0.23.2 and have tested different learning rates, layer counts, and dataset sizes.

The training loss is not improving as expected: in my previous runs it decreased steadily over time, but now it stays relatively high even after many iterations. The validation loss also shows no significant improvement, so it's unclear whether the model is learning effectively. I also noticed that the number of trainable parameters has dropped compared to my previous runs.

I went back to the example you gave me that worked before, but now the same setup isn't improving the model like it did, and the loss isn't getting better either. I tried this:

mlx_lm.lora --model mlx-community/Llama-3.3-70B-Instruct-4bit --data mlx-community/wikisql --iters 100 --batch-size 1 --num-layers 8 --train
Loading pretrained model
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 18151.12it/s]
Loading datasets
Loading Hugging Face dataset mlx-community/wikisql.
Training
Trainable parameters: 0.002% (1.638M/70553.706M)
Starting training..., iters: 100
Iter 1: Val loss 3.105, Val took 31.339s
Iter 10: Train loss 2.810, Learning Rate 1.000e-05, It/sec 0.594, Tokens/sec 51.949, Trained Tokens 874, Peak mem 40.752 GB
Iter 20: Train loss 2.847, Learning Rate 1.000e-05, It/sec 0.627, Tokens/sec 46.363, Trained Tokens 1613, Peak mem 40.752 GB
Iter 30: Train loss 2.693, Learning Rate 1.000e-05, It/sec 0.660, Tokens/sec 47.737, Trained Tokens 2336, Peak mem 40.752 GB
Iter 40: Train loss 2.268, Learning Rate 1.000e-05, It/sec 0.485, Tokens/sec 41.760, Trained Tokens 3197, Peak mem 40.752 GB
Iter 50: Train loss 1.915, Learning Rate 1.000e-05, It/sec 0.275, Tokens/sec 24.056, Trained Tokens 4072, Peak mem 40.938 GB
Iter 60: Train loss 1.709, Learning Rate 1.000e-05, It/sec 0.124, Tokens/sec 10.004, Trained Tokens 4880, Peak mem 40.938 GB
Iter 70: Train loss 1.535, Learning Rate 1.000e-05, It/sec 0.280, Tokens/sec 20.595, Trained Tokens 5616, Peak mem 40.938 GB
Iter 80: Train loss 1.468, Learning Rate 1.000e-05, It/sec 0.319, Tokens/sec 26.164, Trained Tokens 6436, Peak mem 40.938 GB
Iter 90: Train loss 1.614, Learning Rate 1.000e-05, It/sec 0.352, Tokens/sec 29.681, Trained Tokens 7280, Peak mem 40.938 GB
Iter 100: Val loss 1.535, Val took 62.628s
Iter 100: Train loss 1.692, Learning Rate 1.000e-05, It/sec 0.359, Tokens/sec 28.107, Trained Tokens 8063, Peak mem 40.938 GB
Iter 100: Saved adapter weights to adapters/adapters.safetensors and adapters/0000100_adapters.safetensors.
Saved final weights to adapters/adapters.safetensors.

Then I ran:

mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --adapter-path adapters --max-tokens 50 \
--prompt "table: 1-10015132-16
columns: Player, No., Nationality, Position, Years in Toronto, School/Club Team
Q: What is terrence ross' nationality
A: "
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 28207.94it/s]
==========
According to the table, Terrence Ross' nationality is American.
==========
Prompt: 79 tokens, 74.713 tokens-per-sec
Generation: 14 tokens, 12.678 tokens-per-sec
Peak memory: 39.968 GB

Has anything changed in recent MLX updates that could affect fine-tuning, or is there something I should adjust? Thanks!!
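In case it helps narrow things down, the next check I was going to try is fusing the adapter into the model and generating from the fused weights, to separate "the adapter weights are effectively empty" from "the adapter isn't being applied at generation time" (the save path below is just a placeholder, and I'm assuming mlx_lm.fuse still accepts these flags):

```shell
# Fuse the trained LoRA adapter into the base model weights
# (assumed flags; the fused-model path is a placeholder)
mlx_lm.fuse --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --adapter-path adapters \
  --save-path fused-llama-3.3-70b-wikisql

# Generate from the fused model with the same WikiSQL-style prompt
mlx_lm.generate --model fused-llama-3.3-70b-wikisql --max-tokens 50 \
  --prompt "table: 1-10015132-16
columns: Player, No., Nationality, Position, Years in Toronto, School/Club Team
Q: What is terrence ross' nationality
A: "
```

If the fused model still answers like the plain base model, that would point at the adapter weights themselves rather than how they are loaded for generation.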
-
We had a bug where …