Use model compression pathways #1419


Open
kylesayrs wants to merge 8 commits into main

Conversation

@kylesayrs (Collaborator) commented on May 8, 2025

Purpose

  • Use the in-memory model compression pathway to reduce memory requirements when saving models
  • These changes, along with the postprocessing changes, move users towards a pattern where they are aware of the model's status (frozen/compressed) and call save_pretrained manually (see the sketch below)
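
A minimal sketch of that pattern, condensed from the full script under Testing below (the model and recipe are just examples):

```python
import torch
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# load the model and apply one-shot quantization; afterwards the model is frozen in memory
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
oneshot(model=model, recipe=QuantizationModifier(targets="Linear", scheme="W4A16"))

# the user saves explicitly; compression happens in memory as part of the save
model.save_pretrained("save_dir", save_compressed=True)
```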

Prerequisites

Changes

  • Modify save_pretrained_wrapper to use compress_model(model) rather than compress(state_dict) (see the sketch after this list)
  • Modify save_pretrained_wrapper so that the state dict is only retrieved if not skipping compression stats
  • Modify save_pretrained_wrapper to save dictionary and python files, even if there is no explicit compressor
  • Modify save_checkpoint (used by training) to decompress after the checkpoint is saved
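
Roughly, the change to the save pathway looks like the sketch below (this assumes compressed-tensors' ModelCompressor interface; the real wrapper also handles configs, recipes, and compression stats):

```python
from compressed_tensors.compressors import ModelCompressor


def save_compressed(model, save_directory: str):
    # build a compressor from the model's quantization/sparsity configuration
    compressor = ModelCompressor.from_pretrained_model(model)

    # previous pathway: materialize a second, full copy of the weights
    # compressed_state_dict = compressor.compress(model, state_dict=model.state_dict())

    # new pathway: compress the model's modules in place, avoiding the duplicate state dict
    compressor.compress_model(model)
    model.save_pretrained(save_directory)
```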

Example/Testing Changes

As far as I can tell, the table below lists all of the instances where a model undergoes saving that is not immediately followed by script exit.

| File Path | Solution |
| --- | --- |
| examples/trl_mixin/ex_trl_constant.py | Decompress in between stages |
| test_oneshot_and_finetune.py | Decompress in between stages |
| examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py | Do not save in between stages to avoid a compressed state |
| test_oneshot_and_finetune_with_tokenizer.py | Do not save in between stages to avoid a compressed state |
| test_oneshot_then_finetune.py | No work is required, as the model is decompressed upon loading from disk |
| test_compress_tensor_utils.py | Fix the test to use dispatch_model (which is what transformers actually uses) rather than cpu_offload (see the sketch after this table) |
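
For the test_compress_tensor_utils.py fix, the intent is to place the model with accelerate's dispatch_model, which is what transformers uses under the hood, rather than offloading the whole model with cpu_offload. A rough sketch (the helper name and device map are illustrative, not the actual test code):

```python
from accelerate import dispatch_model


def offload_like_transformers(model):
    # transformers places modules according to a device map (via accelerate's dispatch_model)
    # rather than calling cpu_offload on the whole model; this illustrative map offloads the
    # decoder layers to CPU while keeping GPU 0 as the execution device
    device_map = {"model.embed_tokens": 0, "model.layers": "cpu", "model.norm": 0, "lm_head": 0}
    return dispatch_model(model, device_map=device_map, main_device=0)
```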

Testing

State Dict In Memory (memory timeline comparison: previous vs. now)
oneshot_save.py

```python
import torch
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from pttp import TensorProfiler

# MODEL_ID = "DeepSeek-V3_local_bf16"
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

with TensorProfiler() as prof:
    prof.mark_event("Load model")
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    prof.mark_event("Oneshot")
    oneshot(
        model=model,
        recipe=QuantizationModifier(targets="Linear", scheme="W4A16"),
        trust_remote_code_model=True,
    )

    prof.mark_event("Save model")
    model.save_pretrained("sav_testing", save_compressed=True, skip_compression_stats=True)

prof.save_memory_timeline("save_timeline.png")
```

Signed-off-by: Kyle Sayers <[email protected]>

github-actions bot commented May 8, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

kylesayrs added 2 commits May 14, 2025 11:33
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs added the ready When a PR is ready for review label May 14, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs removed the ready When a PR is ready for review label May 14, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label May 19, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs marked this pull request as ready for review May 20, 2025 04:44
@kylesayrs kylesayrs changed the title [WIP] Use model compression pathways Use model compression pathways May 20, 2025