
Custom modeling for training #801


Merged: 74 commits merged into main from custom_modeling_introduction on May 15, 2025

Conversation

@michaelbenayoun (Member) commented on Mar 3, 2025

What does this PR do?

Custom modeling code for training

Features

This PR adds support for custom modeling code for training.
Each model's custom modeling code lives under optimum/neuron/models/training.

Having custom modeling code allows us to implement Neuron-specific behavior in a cleaner way than dynamic patching.
It becomes easy to:

  • Fuse linear layers together for efficiency (see the sketch below)
  • Use custom linear layers such as GQAQKVColumnParallelLinear, useful with high TP sizes.
  • Use custom kernels, such as the flash attention kernel

In this PR we provide a first full custom implementation with Llama.
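
As an illustration of the fused-linears point above, here is a minimal sketch of what such a fusion can look like in a Llama-style MLP. The module and attribute names are hypothetical and are not taken from this PR:

```python
import torch
from torch import nn


class FusedGateUpMLP(nn.Module):
    """Hypothetical sketch: fuse the Llama gate_proj and up_proj into one linear.

    A single matmul over the concatenated weight replaces two separate ones,
    which is cheaper on Neuron and maps naturally onto a column-parallel
    layout when TP > 1.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # One (2 * intermediate_size, hidden_size) weight replaces the two
        # (intermediate_size, hidden_size) weights of gate_proj / up_proj.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(hidden_states).chunk(2, dim=-1)
        return self.down_proj(self.act(gate) * up)
```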

Model weight transformations

Because custom modeling code can diverge from the vanilla Transformers implementation, we need a way to make sure that we can load checkpoints from Transformers, and that we can save checkpoints in the original format as well.

To do that, we provide an API built around the ModelWeightTransformationSpec classes.
These classes represent the transformation relative to the vanilla Transformers implementation and are attached directly to the modules that contain these transformations.

For now two exist:

  • FusedLinearsSpec: represents a transformation when multiple linear layers are fused into a single linear layer (possibly a parallel linear)
  • GQAQKVColumnParallelLinearSpec: represents the transformation of separate query, key, and value projections into a single GQAQKVColumnParallelLinear projection.

Then, during loading, saving, and consolidation, we use these specs to make sure every weight matches the corresponding Transformers weight.
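
To make that round trip concrete, here is a rough sketch (not the actual API of this PR) of the kind of mapping a FusedLinearsSpec has to describe so that a fused checkpoint can be written back in the vanilla Transformers layout:

```python
import torch


def split_fused_gate_up(fused_weight: torch.Tensor, intermediate_size: int) -> dict:
    """Recover the original Transformers weights from a fused gate/up projection.

    A FusedLinearsSpec describes this kind of relationship; the real spec also
    covers the reverse direction (loading) and tensor-parallel sharding, which
    are omitted here.
    """
    gate_proj_weight, up_proj_weight = fused_weight.split(intermediate_size, dim=0)
    return {
        "mlp.gate_proj.weight": gate_proj_weight,
        "mlp.up_proj.weight": up_proj_weight,
    }
```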

Known issues

  • There seems to be an issue when saving a checkpoint with DP > 1 during training. After initial investigation, it seems to be a compiler bug, but it will require more work. I suggest handling it in another PR.

Training example

Specs

  • Model: meta-llama/Llama-3.2-3B-Instruct
  • Dataset: databricks/databricks-dolly-15k
  • Trainer: NeuronSFTTrainer (see the launch sketch after this list)
  • DP=4, TP=8
  • Gradient accumulation steps = 16 => Effective batch size = 4 x 16 = 64
  • Sequence length = 2048 with packing = True
  • 3 epochs
  • Learning rate = 5e-4, warmup ratio = 0.3, lr scheduler type = "cosine"
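
A rough sketch of how such a run could be launched is shown below. It assumes the NeuronSFTTrainer / NeuronSFTConfig API from optimum-neuron and a LlamaForCausalLM class exposed by the custom modeling added in this PR; the exact import paths and argument names are assumptions, not quotes from this PR:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# NeuronSFTTrainer is named in the specs above; NeuronSFTConfig and the custom
# LlamaForCausalLM import path are assumptions made for this sketch.
from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer
from optimum.neuron.models.training import LlamaForCausalLM

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Hypothetical configuration mirroring the specs above (DP=4 comes from
# launching 32 workers with tensor_parallel_size=8).
training_args = NeuronSFTConfig(
    output_dir="llama-3.2-3b-dolly-sft",
    tensor_parallel_size=8,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=5e-4,
    warmup_ratio=0.3,
    lr_scheduler_type="cosine",
    max_seq_length=2048,
    packing=True,
)

model = LlamaForCausalLM.from_pretrained(model_id)

trainer = NeuronSFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```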

Loss curve

[Weights & Biases loss-curve chart, 24/04/2025 16:12]

To be done in later PRs:

  • Support for PP
  • Support for LoRA
  • Refactor save_pretrained as it was done for from_pretrained in this PR.
  • Add a test that checks the model can overfit


@dacorvo (Collaborator) left a comment

Thank you for addressing most of my comments. Waiting for the final version including refactoring and tests to review.

unexpected_keys = {k for k in unexpected_keys if "rotary_emb.inv_freq" not in k}

model.tie_weights()
# TODO: stopped here, start from here tomorrow.
Collaborator:

Suggested change:
- # TODO: stopped here, start from here tomorrow.

Member Author:

Done.

for name, mod in not_initialized_submodules.items():
if isinstance(mod, GQAQKVColumnParallelLinear):
# There is a bug in initialization for this module.
# In any case, we will always have weights for this in the case of `from_pretrained`.
Collaborator:

So we will have an issue when training from scratch, right?

Member Author:

No, because we won't call from_pretrained in this case.
I can investigate this issue, but it does not seem like a top priority in this specific context.

)

@classmethod
def from_pretrained(
Collaborator:

Thanks, it is much clearer now. I would personally have dropped more sections of code that correspond to:

  • multiple models (unless I missed something, we actually only support training single models on Neuron),
  • the less usual model deployment paradigms (depending on how/when weights are loaded), as I am not entirely sure we would support them anyway.

pretrained_model_name_or_path = str(pretrained_model_name_or_path)
is_local = os.path.isdir(pretrained_model_name_or_path)
if is_local:
if from_tf and os.path.isfile(
Collaborator:

Do we support from_tf, from_flax?

Member Author:

Removed the remaining artifacts for from_tf and from_flax.

archive_file = os.path.join(
pretrained_model_name_or_path, subfolder, _add_variant(SAFE_WEIGHTS_NAME, variant)
)
elif use_safetensors is not False and os.path.isfile(
Collaborator:

Suggested change:
- elif use_safetensors is not False and os.path.isfile(
+ elif use_safetensors and os.path.isfile(

Member Author:

It's not equivalent because use_safetensors can be None.
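
For context, a small illustration of the three-state semantics of use_safetensors (described here following the usual transformers convention):

```python
# use_safetensors has three states:
#   True  -> require a safetensors checkpoint, fail otherwise
#   None  -> "auto": prefer safetensors if a .safetensors file exists, else fall back to .bin
#   False -> never use safetensors
use_safetensors = None

if use_safetensors is not False:
    print("reached for True and None (auto)")  # what the original condition checks

if use_safetensors:
    print("reached only for True")  # the suggested change would drop the auto case
```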

kwargs={"ignore_errors_during_conversion": True, **cached_file_kwargs},
name="Thread-auto_conversion",
).start()
else:
Collaborator:

Do we really want to support this?

Member Author:

I have not tested it yet, but since it is supported in transformers, why not?

@michaelbenayoun (Member Author), replying to review feedback:

> Thank you for this pull request: massive contribution. The general organization of the files makes sense and I think the modelization_utils.py and transformation_utils.py files are in particular good starting points. I have requested a few changes that I think are important in this first iteration to:
>
>   • make it clearer what is expected to be provided by someone adding a new model,
>   • keep things simple in this first iteration (I personally think it is easier to add things we support later on instead of keeping placeholders).
>
> Also, I think the basic features must be tested, and it was unclear to me from reading the pull request which parts are actually tested, apart from eager forward inference (and, by the way, we must include a flash_attention test since we support the option).

  • What is expected is simply adding the proper transformation specs and inheriting from CustomModule (see the sketch after this list).
  • In this PR we test:
    • The forward pass: we check that the forward pass from the original implementation and the one from the custom modeling produce the same outputs. We test different settings (eager attention, regular qkv, fused qkv, qkv gqa replication). We do not test flash attention because its outputs do not match exactly for now; I was still able to train a model with flash attention, though.
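
A rough sketch of what a contributor provides, as mentioned in the first bullet above. It uses the CustomModule and FusedLinearsSpec names from this PR, but the import path, inheritance pattern, spec attribute, and constructor arguments are illustrative assumptions, not the actual API:

```python
from torch import nn

# CustomModule and FusedLinearsSpec are the names introduced by this PR; the
# import path and signatures shown here are assumptions made for this sketch.
from optimum.neuron.models.training import CustomModule, FusedLinearsSpec


class MyLlamaMLP(nn.Module, CustomModule):
    """Hypothetical custom MLP whose gate/up projections are fused."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        # Declare how this module's weights relate to the vanilla Transformers
        # layout, so that loading, saving, and consolidation can round-trip
        # checkpoints between the two formats.
        self.specs = FusedLinearsSpec(
            fused_linear_name="gate_up_proj",
            linear_names=["gate_proj", "up_proj"],
        )

    # The forward pass is omitted here; it would use gate_up_proj exactly like
    # the fused MLP sketch shown earlier in this conversation.
```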

@michaelbenayoun (Member Author), replying to review feedback:

> A lot of work contributed here, thanks! This will help add new models in a more efficient way. I added a few comments, but it mostly boils down to these requests:
>
>   • Remove the dependency on transformers modeling code
>   • Reuse and merge with existing code; avoid duplication for sharding and tests.
>   • Consider adding a complete test that shows training works (overfitting?)

I suggest we handle the last point in another PR to avoid making this PR bigger than it already is.


model.tie_weights()
# TODO: stopped here, start from here tomorrow.
if device_map is None:
Collaborator:

I thought we did not support device_map, so it should be None, right?

Member Author:

We support a subset of features: device_map in [None, "xla", "cpu"].

@dacorvo dismissed their stale review on April 30, 2025 15:14

I won't be able to review further until I get back, so I'm trusting Alvaro's review.

@tengomucho (Collaborator) left a comment

Note that the linear tests now fail. I think that should be fixed before merging this.

@tengomucho (Collaborator) left a comment

LGTM, thanks a lot

@tengomucho merged commit 66d1977 into main on May 15, 2025
8 of 9 checks passed
@tengomucho deleted the custom_modeling_introduction branch on May 15, 2025 08:32