Implementing Weight Decay Masking with nnx.Optimizer and Filters #4737

cadazar · 2025-05-05T10:40:56Z

Hi, I have been using the NNX API to train a custom language model and needed to apply weight decay only to certain parameters (excluding biases, normalization parameters, and embeddings) while using nnx.Optimizer. I found that creating two nnx.Optimizer instances with different base optimizers (e.g. optax.adamw for the weight-decayed group, optax.adam for the rest) and restricting each one to a filtered parameter subset via nnx.filterlib works well without any significant overhead:

import jax
import jax.numpy as jnp
import optax
from flax import nnx

# define a learning rate scheduler
lr_sched = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=training_args.lr,
    warmup_steps=training_args.warmup_steps,
    decay_steps=training_args.total_steps,
    end_value=0.0,
)

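# parameter path keys which should NOT have weight decay
# (by default PathContains looks for the key among the components of the
# parameter path, so "bias" or "scale" match attribute names of that exact
# name; an entry such as "ln" only matches a component named exactly "ln")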
no_wd_patterns = ["bias", "ln", "layernorm", "layer_norm", "scale", "embedding"]

def create_optimizers(model):
    """
    Create two NNX optimizers operating on disjoint parameter subsets:
    
    1. wd_optimizer: AdamW with weight decay, applied to all parameters 
       NOT matching any pattern in `no_wd_patterns`.
    2. no_wd_optimizer: plain Adam (no weight decay), applied ONLY to 
       parameters matching one of the `no_wd_patterns`.
    """
    # base Optax optimizers
    base_wd_opt    = optax.adamw(learning_rate=lr_sched, weight_decay=training_args.wdr)
    base_no_wd_opt = optax.adam(learning_rate=lr_sched)

    # optimizer for weight-decayed params (the complement of no_wd_patterns)
    wd_optimizer = nnx.Optimizer(
        nnx.variables(model, nnx.Param),
        base_wd_opt,
        wrt=nnx.filterlib.Not(
            nnx.filterlib.Any(
                *[nnx.filterlib.PathContains(p) for p in no_wd_patterns],
                # bare strings are interpreted as tag filters
                # (nnx.filterlib.WithTag), not as path patterns
                "drop",
                "pos_mask",
            )
        )
    )

    # optimizer for non-weight-decayed params (exactly those matching no_wd_patterns)
    no_wd_optimizer = nnx.Optimizer(
        nnx.variables(model, nnx.Param),
        base_no_wd_opt,
        wrt=nnx.filterlib.Any(*[nnx.filterlib.PathContains(p) for p in no_wd_patterns])
    )

    return [wd_optimizer, no_wd_optimizer]
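
Before training, it can help to check which parameters land in which group. The following is a minimal plain-Python sketch that mirrors the filter logic, assuming PathContains tests whether a key is one of the components of the parameter path (worth confirming against the installed Flax version). The parameter paths and the `is_no_wd` helper are hypothetical and only serve as an illustration.

```python
# Keys that should NOT receive weight decay (same list as no_wd_patterns).
NO_WD_KEYS = ("bias", "ln", "layernorm", "layer_norm", "scale", "embedding")


def is_no_wd(path):
    """Return True if any component of `path` equals one of NO_WD_KEYS."""
    return any(key in path for key in NO_WD_KEYS)


# Hypothetical parameter paths, written as tuples of path components.
paths = [
    ("embed", "embedding"),
    ("blocks", 0, "attn", "kernel"),
    ("blocks", 0, "attn", "bias"),
    ("blocks", 0, "ln1", "scale"),
    ("lm_head", "kernel"),
]

no_wd = [p for p in paths if is_no_wd(p)]
wd = [p for p in paths if not is_no_wd(p)]

# The two groups must be disjoint and together cover every parameter.
assert set(no_wd).isdisjoint(wd)
assert len(no_wd) + len(wd) == len(paths)

for p in wd:
    print("weight decay:   ", p)
for p in no_wd:
    print("no weight decay:", p)
```

Under that assumption, ("blocks", 0, "ln1", "scale") ends up in the no-decay group because of its "scale" component, not because "ln" is a substring of "ln1".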


# … later, inside your train loop …
optimizers = create_optimizers(model)

for opt in optimizers:
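    # compute gradients only for this optimizer's slice; note that each
    # iteration runs its own forward/backward pass, and the second pass
    # already sees the parameters updated by the first optimizer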
    (loss, aux), grad = nnx.value_and_grad(
        loss_fn, has_aux=True, argnums=nnx.DiffState(0, opt.wrt)
    )(model)

    # optional: element-wise gradient clipping to [-1, 1]
    # (jax.tree.map replaces the deprecated jax.tree_map)
    grad = jax.tree.map(lambda g: jnp.clip(g, -1.0, 1.0), grad)

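    # apply the update; this also advances opt.step
    opt.update(grad)
    # the schedule is a pure function of the step, so calling it only reads
    # the current learning rate (e.g. for logging)
    current_lr = lr_sched(opt.step.value)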

Hope this is of use to someone out there.
