Granite modeling for training #830
Conversation
The old functions are deprecated and produce a warning.
This commit is just a copy of Transformers' 4.48.1 implementation of Granite modeling. The code will be subsequently modified to allow parallelism during training.
The model parallelization is verified with a simple inference.
Note that, for now, outputs are gathered after each linear layer; otherwise the bfloat16 result diverges too much from the one obtained with float32.
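For illustration, here is a minimal sketch of the gather-after-linear pattern (a hypothetical helper assuming an initialized torch.distributed process group; the actual code presumably relies on the NxD parallel layers):

import torch
import torch.distributed as dist

def linear_with_gather(x, weight_shard):
    # Each tensor-parallel rank holds a slice of the output features.
    local_out = torch.nn.functional.linear(x, weight_shard)
    # Gather the partial outputs from all ranks and concatenate along
    # the feature dimension so downstream ops see the full tensor.
    gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_out)
    return torch.cat(gathered, dim=-1)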
This enables parallelization of the attention using the eager algorithm, so the attention calculation can be performed in a sharded way, accelerating it.
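As a rough sketch of what eager attention over a shard of heads looks like (shapes and names are illustrative, not the PR's actual code):

import math
import torch

def eager_attention_local(q, k, v, mask=None):
    # q, k, v: [batch, local_heads, seq, head_dim] -- each tensor-parallel
    # rank computes attention only for its own subset of heads.
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores + mask
    probs = torch.softmax(scores, dim=-1)
    # Per-rank results are later combined by the output projection.
    return probs @ v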
Remove the use_cache, cache_position and past_key_values arguments where possible, since they are not used when training models and there is no benefit in keeping them here.
After some tests, only the eager attention kernel has proven reliable, so it is the only option left in the modeling code for this model.
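Concretely, dropping the cache machinery reduces the forward signature to something like this (a hypothetical simplification, not the exact PR code):

import torch.nn as nn

class GraniteAttentionForTraining(nn.Module):
    # use_cache, cache_position and past_key_values are gone: training
    # always runs a full forward pass, so no KV cache is needed.
    def forward(self, hidden_states, attention_mask=None, position_ids=None):
        ...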
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
from transformers.models.granite import GraniteConfig

class NeuronGraniteConfig(GraniteConfig):
I would rather create a completely different config instead of extending the GraniteConfig. This is what I do for LLM models.
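For instance, something like a standalone config that holds only the Neuron-specific knobs, instead of subclassing (a sketch; the field names are assumptions):

from dataclasses import dataclass

@dataclass
class NeuronGraniteTrainingConfig:
    # Parallelism-specific settings, kept separate so the original
    # GraniteConfig is left untouched.
    tp_degree: int = 1
    sequence_parallel: bool = False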
For weight fusing / GQA QKV etc. I have this API: https://github.com/huggingface/optimum-neuron/pull/801/files#diff-7fda1f4deb9a79e4201376095bd5eabc0650da1fb05672493d7949af0820be18R101-R107
WDYT?
Basically the idea is to have a common API so that we can share the code for splitting the loaded weights in from_pretrained and for consolidating the checkpoints afterwards.
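As I understand it, the common API could be sketched like this (hypothetical names, not the actual #801 interface):

class ParallelWeightAPI:
    # Hypothetical common interface for weight splitting/consolidation.

    def split_weights(self, state_dict: dict, tp_rank: int, tp_size: int) -> dict:
        # Called from from_pretrained: slice each full weight into the
        # shard owned by this tensor-parallel rank.
        raise NotImplementedError

    def consolidate_weights(self, shards: list) -> dict:
        # Called when saving: stitch the per-rank shards back into a
        # single full checkpoint.
        raise NotImplementedError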
I agree it would be good to have a common API for these kinds of things, and in general even for layers. I prefer having simpler functions, though. I propose to isolate this, and then we can make the API uniform once your PR is ready.
About the mp_config: I will separate the config as you had done, and we can merge the mp_config implementations once your PR is ready.
It is specific to inference.
The configuration is now separated into a specific config that is passed between modules, so that the original config is not modified.
slice_tensor and fuse_weights can be reused in future modeling implementations.
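For reference, a plausible shape for these helpers (the actual signatures in the PR may differ):

import torch

def slice_tensor(tensor, dim, rank, num_partitions):
    # Return the contiguous slice of `tensor` along `dim` owned by `rank`.
    size = tensor.size(dim) // num_partitions
    return tensor.narrow(dim, rank * size, size).contiguous()

def fuse_weights(tensors, dim=0):
    # Concatenate several weights (e.g. gate/up projections) into a
    # single fused weight along `dim`.
    return torch.cat(tensors, dim=dim)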
LGTM!
Let's address the remaining points once #801 is merged as agreed offline.
Thanks @tengomucho !!
It is not used for now and it should not be passed to the parent class.
The bidirectional dict just requires a key, which can be the class name for simplicity.
LGTM, thanks!
What does this PR do?
This PR introduces the modeling code for Granite, based on the NxD Core modeling example for Llama.
So far it is possible to instantiate ibm-granite/granite-3.2-2b-instruct and perform inference on the sharded model. The modeling code is based on transformers and adapted to include sharding (only tested with TP parallelism).
Flash attention is not used so far because it requires unpadded prompts to generate correct outputs. It was not possible to use fused layers, because in Granite self.num_heads != self.num_key_value_heads. The attention kernel implementation is therefore just the eager kernel. This is NOT a complete implementation for training: parallel loss is not implemented, nor are a complete training test and/or example. These will follow soon, but I thought it was better to submit this now to get feedback sooner.
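To illustrate the fused-layers constraint: with grouped-query attention the Q and K/V projections have different output sizes, so a single fused QKV weight cannot be split into equal per-head blocks (the numbers below are illustrative, not the actual model's):

import torch

hidden_size, head_dim = 2048, 64
num_heads, num_key_value_heads = 32, 8  # GQA: fewer K/V heads than Q heads

q_proj = torch.nn.Linear(hidden_size, num_heads * head_dim)            # -> 2048
k_proj = torch.nn.Linear(hidden_size, num_key_value_heads * head_dim)  # -> 512
v_proj = torch.nn.Linear(hidden_size, num_key_value_heads * head_dim)  # -> 512
# With num_heads != num_key_value_heads, the concatenated QKV weight has
# unequal Q and K/V blocks, so the projections are kept separate here.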