I’m about to run a full FT on Qwen/Qwen3.5-4B for a PT-BR legal assistant dataset and wanted a sanity check before I burn a bunch of GPU time.
This is not LoRA, just straight full finetuning.
Setup right now:

- model: Qwen/Qwen3.5-4B
- data: chat dataset with a `messages` field
- domain: Brazilian legal
- max length: 1024 tokens
- split: 95/5 random
- epochs: 1
- learning rate: 1e-5
- weight decay: 0.1
- warmup ratio: 0.03
- scheduler: cosine
- per-device batch size: 4
- grad accum: 4 (so effective batch size 16 per device)
- precision: bf16 if available, else fp16
- grad checkpointing: on
- packing: off
- optimizer: adamw_torch_fused
What I’m doing is basically:

1. normalize the `messages` field
2. apply the Qwen chat template
3. drop samples over max length
4. train with `trl.SFTTrainer`
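In code, the prep step looks roughly like this (just a sketch, not my exact script: the `train.jsonl` path, the seed, and the helper names are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen3.5-4B"
MAX_LEN = 1024

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def normalize_messages(example):
    # Keep only the role/content keys and the roles the chat template expects.
    example["messages"] = [
        {"role": m["role"], "content": m["content"]}
        for m in example["messages"]
        if m["role"] in ("system", "user", "assistant")
    ]
    return example

def fits_in_context(example):
    # Render the full conversation with the Qwen chat template, then count
    # tokens; anything over max length is dropped rather than truncated.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return len(tokenizer(text).input_ids) <= MAX_LEN

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder path
dataset = dataset.map(normalize_messages)
dataset = dataset.filter(fits_in_context)
dataset = dataset.train_test_split(test_size=0.05, seed=42)  # 95/5 random split
```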
Core training code is roughly this (output dir is a placeholder, and a couple of argument names shift between TRL versions, but this is the shape):
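```python
import torch
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# bf16 if the GPU supports it, else fp16.
bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if bf16_ok else torch.float16,
)

args = SFTConfig(
    output_dir="qwen-ptbr-legal-sft",   # placeholder
    num_train_epochs=1,
    learning_rate=1e-5,
    weight_decay=0.1,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=bf16_ok,
    fp16=not bf16_ok,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    packing=False,
    max_length=MAX_LEN,                 # max_seq_length on older TRL versions
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,         # tokenizer= on older TRL versions
)
trainer.train()
```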
Main thing I’m trying to figure out is: is this a common/reasonable recipe, or am I missing some Qwen-specific gotcha?
Stuff I’m unsure about:

- should I be using Qwen/Qwen3.5-4B-Base instead of the post-trained one?
- for Qwen chat data, is `messages` + `SFTTrainer` enough, or is there some masking/template detail that matters a lot?
- would you train on the whole formatted conversation, or only on assistant tokens?
- do any of these hparams look obviously off for domain adaptation?
- any known Qwen3.5 full-FT traps?
Not looking for the “best possible” setup, mostly just trying to make sure this is a normal/sane way to do it.
Anyone here already fine-tuned Qwen3.5 and can say whether this looks reasonable?