I’m about to run a full FT on Qwen/Qwen3.5-4B for a PT-BR legal assistant dataset and wanted a sanity check before I burn a bunch of GPU time.
This is not LoRA, just straight full finetuning.
Setup right now:

- model: Qwen/Qwen3.5-4B
- data: chat dataset with a `messages` field
- domain: Brazilian legal
- max length: 1024 tokens
- split: 95/5 random
- epochs: 1
- learning rate: 1e-5
- weight decay: 0.1
- warmup ratio: 0.03
- scheduler: cosine
- per-device batch size: 4
- grad accum: 4 (so effective batch size 16 per device)
- precision: bf16 if available, else fp16
- grad checkpointing: on
- packing: off
- optimizer: adamw_torch_fused
What I’m doing is basically:

1. normalize the `messages` field
2. apply the Qwen chat template
3. drop samples over max length
4. train with `trl.SFTTrainer`
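In code, the prep step looks roughly like this (just a sketch, not my exact script: the `train.jsonl` path, the seed, and the helper names are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen3.5-4B"
MAX_LEN = 1024

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def normalize_messages(example):
    # Keep only the role/content keys and the roles the chat template expects.
    example["messages"] = [
        {"role": m["role"], "content": m["content"]}
        for m in example["messages"]
        if m["role"] in ("system", "user", "assistant")
    ]
    return example

def fits_in_context(example):
    # Render the full conversation with the Qwen chat template, then count
    # tokens; anything over max length is dropped rather than truncated.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return len(tokenizer(text).input_ids) <= MAX_LEN

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder path
dataset = dataset.map(normalize_messages)
dataset = dataset.filter(fits_in_context)
dataset = dataset.train_test_split(test_size=0.05, seed=42)  # 95/5 random split
```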
Core training code is roughly this (output dir is a placeholder, and a couple of argument names shift between TRL versions, but this is the shape):
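```python
import torch
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# bf16 if the GPU supports it, else fp16.
bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if bf16_ok else torch.float16,
)

args = SFTConfig(
    output_dir="qwen-ptbr-legal-sft",   # placeholder
    num_train_epochs=1,
    learning_rate=1e-5,
    weight_decay=0.1,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=bf16_ok,
    fp16=not bf16_ok,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    packing=False,
    max_length=MAX_LEN,                 # max_seq_length on older TRL versions
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,         # tokenizer= on older TRL versions
)
trainer.train()
```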
Main thing I’m trying to figure out is: is this a common/reasonable recipe, or am I missing some Qwen-specific gotcha?
Stuff I’m unsure about:

- should I be using Qwen/Qwen3.5-4B-Base instead of the post-trained one?
- for Qwen chat data, is `messages` + `SFTTrainer` enough, or is there some masking/template detail that matters a lot?
- would you train on the whole formatted conversation, or only on assistant tokens?
- do any of these hparams look obviously off for domain adaptation?
- any known Qwen3.5 full-FT traps?
Not looking for the “best possible” setup, mostly just trying to make sure this is a normal/sane way to do it.
Anyone here already fine-tuned Qwen3.5 and can say whether this looks reasonable?