
Quark Quantizer Support #1207

Open. Wants to merge 28 commits into base: main.
Conversation

shobrienDMA (Contributor)

This allows Quark Quantized models to be processed by ONNX Runtime GenAI.

Quark models must be exported in hf_format.

An example quantize_quark.py command:

```shell
python quantize_quark.py --model_dir /[Model_Path] \
    --output_dir /[Output_Model_Path] \
    --quant_scheme w_uint4_per_group_asym \
    --num_calib_data 128 \
    --quant_algo awq \
    --dataset pileval_for_awq_benchmark \
    --seq_len 512 \
    --model_export hf_format \
    --data_type float32
```

It also supports different group sizes for different layers, depending on what is present in the config.json that Quark produces. A Quark config can look like:

...
  "quantization_config": {
    "algo_config": {
      "model_decoder_layers": "model.layers",
      "name": "awq",
      "num_attention_heads": -1,
      "num_key_value_heads": -1,
      "scaling_layers": [
        {
          "inp": "self_attn.q_proj",
          "layers": [
            "self_attn.q_proj",
            "self_attn.k_proj",
            "self_attn.v_proj"
          ],
          "module2inspect": "self_attn",
          "prev_op": "input_layernorm"
        },
        {
          "inp": "self_attn.o_proj",
          "layers": [
            "self_attn.o_proj"
          ],
          "prev_op": "self_attn.v_proj"
        },
        {
          "inp": "mlp.gate_proj",
          "layers": [
            "mlp.gate_proj",
            "mlp.up_proj"
          ],
          "module2inspect": "mlp",
          "prev_op": "post_attention_layernorm"
        },
        {
          "inp": "mlp.down_proj",
          "layers": [
            "mlp.down_proj"
          ],
          "prev_op": "mlp.up_proj"
        }
      ]
    },
    "exclude": [],
    "export": {
      "kv_cache_group": [],
      "pack_method": "reorder",
      "weight_format": "real_quantized",
      "weight_merge_groups": null
    },
    "global_quant_config": {
      "bias": null,
      "input_tensors": null,
      "output_tensors": null,
      "target_device": null,
      "weight": {
        "ch_axis": 1,
        "dtype": "uint4",
        "group_size": 128,
        "is_dynamic": false,
        "observer_cls": "PerGroupMinMaxObserver",
        "qscheme": "per_group",
        "round_method": "half_even",
        "scale_type": "float",
        "symmetric": false
      }
    },
    "layer_quant_config": {
      "lm_head": {
        "bias": null,
        "input_tensors": null,
        "output_tensors": null,
        "target_device": null,
        "weight": {
          "ch_axis": 1,
          "dtype": "uint4",
          "group_size": 32,
          "is_dynamic": false,
          "observer_cls": "PerGroupMinMaxObserver",
          "qscheme": "per_group",
          "round_method": "half_even",
          "scale_type": "float",
          "symmetric": false
        }
      }
    },
    "layer_type_quant_config": {},
    "quant_method": "quark",
    "quant_mode": "eager_mode"
  },
...

As you can see, the lm_head entry in layer_quant_config has a different group size (32) from the global one (128).
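A minimal sketch of how a converter might resolve the effective group size for a layer from such a config, with layer_quant_config overriding global_quant_config (the helper name is illustrative, not the PR's actual code):

```python
# Sketch: resolve the per-layer group size from a Quark quantization_config
# dict. layer_quant_config entries override the global weight settings.
def get_local_group_size(layer_name: str, quant_config: dict) -> int:
    layer_cfg = quant_config.get("layer_quant_config", {}).get(layer_name)
    if layer_cfg is not None and layer_cfg.get("weight") is not None:
        return layer_cfg["weight"]["group_size"]
    return quant_config["global_quant_config"]["weight"]["group_size"]

quant_config = {
    "global_quant_config": {"weight": {"group_size": 128}},
    "layer_quant_config": {"lm_head": {"weight": {"group_size": 32}}},
}
print(get_local_group_size("lm_head", quant_config))                          # 32
print(get_local_group_size("model.layers.0.self_attn.q_proj", quant_config))  # 128
```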

BowenBao (Contributor)

cc @kunal-vaishnavi this is the PR for Quark integration, please take a look and let us know what you think, thanks!

self.local_bits = self.global_bits
self.local_group_size = self.global_group_size

# Per-layer quantization support
Contributor

Can we make this section before the if-elif-else checks cleaner by having a method that sets all properties for Quark vs non-Quark scenarios (similar to self.get_config)? I'm thinking something like this could work.

for name, tensor in weights.items():
    module = self.set_module(...)

    if tensor.dtype == torch.bfloat16:
        ...
def set_module(self, ...):
    # Set any shared attributes and variables

    if quant_type == "quark":
        # Set Quark-related attributes and variables
    else:
        # Set non-Quark-related attributes and variables
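Fleshed out, that suggested dispatch could look something like the sketch below (class and attribute names are hypothetical, not the PR's actual code):

```python
# Illustrative sketch of the suggested refactor: one method that sets
# per-module attributes for Quark vs. non-Quark scenarios.
class ModelSketch:
    def __init__(self, quant_type: str):
        self.quant_type = quant_type
        # Hypothetical per-layer overrides read from quantization_config.
        self.layer_group_sizes = {"lm_head": 32}
        self.local_bits = 4
        self.local_group_size = 128

    def set_module(self, name: str):
        # Shared attributes for every module would be set here.
        if self.quant_type == "quark":
            # Quark: honor per-layer overrides from the config.
            self.local_group_size = self.layer_group_sizes.get(name, 128)
        else:
            # Non-Quark: keep the global settings.
            self.local_group_size = 128

m = ModelSketch("quark")
m.set_module("lm_head")
print(m.local_group_size)  # 32
```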

Contributor
Updated with get_local_bits and get_local_group_size methods that are overridden by QuarkModel.
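The override pattern described here might look roughly like this (a sketch, assuming simplified class bodies, not the PR's actual code):

```python
# Sketch: base-class hooks that return the global settings, overridden by a
# Quark-specific subclass that consults per-layer overrides.
class Model:
    global_bits, global_group_size = 4, 128

    def get_local_bits(self, name: str) -> int:
        return self.global_bits

    def get_local_group_size(self, name: str) -> int:
        return self.global_group_size

class QuarkModel(Model):
    # Hypothetical per-layer overrides parsed from quantization_config.
    layer_quant_config = {"lm_head": {"bits": 4, "group_size": 32}}

    def get_local_group_size(self, name: str) -> int:
        cfg = self.layer_quant_config.get(name)
        return cfg["group_size"] if cfg else super().get_local_group_size(name)

print(QuarkModel().get_local_group_size("lm_head"))       # 32
print(QuarkModel().get_local_group_size("mlp.down_proj")) # 128
```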

if isinstance(self.lm_head, TensorModule):
    weight = self.lm_head.weight
    bias = self.lm_head.bias
    self.lm_head = QuantizedTensorModule(self.local_bits, self.local_group_size)
Contributor
Can we find a cleaner way to initialize self.lm_head to be a TensorModule versus a QuantizedTensorModule beforehand rather than using the self._initialize_quantized_lm_head approach or this approach?

Contributor
This part of code is removed and merged with existing blocks for updating self.lm_head.

# Set `g_idx` to None since it's not used in `MatMulNBits`
q_tensors.g_idx = None

# Unpack and repack all `Quantized TensorModule` classes in MLP
Contributor
Suggested change:
- # Unpack and repack all `Quantized TensorModule` classes in MLP
+ # Unpack and repack all `QuantizedTensorModule` classes in MLP

Contributor
done
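For context, unpacking real_quantized uint4 weights (two 4-bit values per byte) can be sketched with NumPy as below; the nibble order is illustrative, and Quark's "reorder" pack_method may use a different layout:

```python
import numpy as np

# Sketch: unpack uint4 values stored two per byte (low nibble first).
# Illustrative only; the actual pack layout depends on the exporter.
def unpack_uint4(packed: np.ndarray) -> np.ndarray:
    low = packed & 0x0F           # low nibble of each byte
    high = (packed >> 4) & 0x0F   # high nibble of each byte
    # Interleave low/high so each byte expands to two uint4 values.
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

packed = np.array([[0x21, 0x43]], dtype=np.uint8)  # encodes 1, 2, 3, 4
print(unpack_uint4(packed).tolist())  # [[1, 2, 3, 4]]
```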

@kunal-vaishnavi kunal-vaishnavi changed the title [WIP] Quark Quantizer Support Quark Quantizer Support Mar 25, 2025
self.quant_type = quant_type
self.embedding = TensorModule()
self.final_norm = TensorModule()
self.lm_head = TensorModule()
self.layers = {}
self.num_layers = num_layers
self._quant_attrs = quant_attrs
self._load_quant_config(quant_attrs) # codeql[py/init-calls-subclass]

Check warning (Code scanning / CodeQL): `__init__` method calls overridden method.
Call to self._load_quant_config in __init__ method, which is overridden by method QuarkModel._load_quant_config.
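The warning flags a real pitfall: when `__init__` calls a method a subclass overrides, the override runs before the subclass's own `__init__` has set its attributes. A minimal generic illustration (unrelated to the PR's classes):

```python
# Generic illustration of the CodeQL "init calls overridden method" pitfall.
class Base:
    def __init__(self):
        self.value = self.compute()  # dispatches to the subclass override

    def compute(self):
        return 0

class Child(Base):
    def __init__(self):
        super().__init__()  # override runs here, before self.factor exists
        self.factor = 10

    def compute(self):
        # self.factor is not yet set when called from Base.__init__.
        return getattr(self, "factor", None)

print(Child().value)  # None, not 10
```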
weights = load_file(os.path.join(input_path, weight_file))

# Map weights to modules
for name, tensor in weights.items():

# Per-layer quantization support
local_bits = self.get_layer_bits(name) # codeql[py/init-calls-subclass]

Check warning (Code scanning / CodeQL): `__init__` method calls overridden method.
Call to self.get_layer_bits in __init__ method, which is overridden by method QuarkModel.get_layer_bits.

# Per-layer quantization support
local_bits = self.get_layer_bits(name) # codeql[py/init-calls-subclass]
local_group_size = self.get_layer_group_size(name) # codeql[py/init-calls-subclass]

Check warning (Code scanning / CodeQL): `__init__` method calls overridden method.
Call to self.get_layer_group_size in __init__ method, which is overridden by method QuarkModel.get_layer_group_size.
Reverted, as neither `# noqa` nor `# codeql[py/init-calls-subclass]` suppresses the warning.