Quark Quantizer Support #1207
base: main
Conversation
…into amd_shobrien/per_layer_support
# Conflicts:
#	src/python/py/models/builder.py
cc @kunal-vaishnavi this is the PR for Quark integration, please take a look and let us know what you think, thanks!
self.local_bits = self.global_bits
self.local_group_size = self.global_group_size

# Per-layer quantization support
Can we make this section before the `if-elif-else` checks cleaner by having a method that sets all properties for Quark vs non-Quark scenarios (similar to `self.get_config`)? I'm thinking something like this could work:
for name, tensor in weights.items():
    module = self.set_module(...)
    if tensor.dtype == torch.bfloat16:
        ...

def set_module(self, ...):
    # Set any shared attributes and variables
    if quant_type == "quark":
        # Set Quark-related attributes and variables
        ...
    else:
        # Set non-Quark-related attributes and variables
        ...
Updated with `get_local_bits` and `get_local_group_size` methods that are overridden by QuarkModel.
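For context, a minimal sketch of that override pattern, assuming a dict-based per-layer table; apart from `get_local_bits`, `get_local_group_size`, `global_bits`, and `global_group_size`, the names and values below are illustrative, not the PR's actual implementation:

```python
class Model:
    def __init__(self, global_bits=4, global_group_size=128):
        self.global_bits = global_bits
        self.global_group_size = global_group_size

    def get_local_bits(self, name):
        # Non-Quark models quantize every layer with the global settings.
        return self.global_bits

    def get_local_group_size(self, name):
        return self.global_group_size


class QuarkModel(Model):
    def __init__(self, layer_quant_config=None, **kwargs):
        super().__init__(**kwargs)
        # Hypothetical per-layer table, e.g. {"lm_head": {"bits": 4, "group_size": 32}}
        self.layer_quant_config = layer_quant_config or {}

    def get_local_bits(self, name):
        layer = self.layer_quant_config.get(name)
        return layer["bits"] if layer else self.global_bits

    def get_local_group_size(self, name):
        layer = self.layer_quant_config.get(name)
        return layer["group_size"] if layer else self.global_group_size
```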
if isinstance(self.lm_head, TensorModule):
    weight = self.lm_head.weight
    bias = self.lm_head.bias
    self.lm_head = QuantizedTensorModule(self.local_bits, self.local_group_size)
Can we find a cleaner way to initialize `self.lm_head` to be a `TensorModule` versus a `QuantizedTensorModule` beforehand, rather than using the `self._initialize_quantized_lm_head` approach or this approach?
This part of the code has been removed and merged with the existing blocks that update `self.lm_head`.
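A sketch of the kind of up-front initialization the question is about, assuming the builder's `TensorModule` and `QuantizedTensorModule` classes are in scope; `lm_head_is_quantized` is a hypothetical flag derived from the quant config, not the PR's actual code (the PR instead folded the logic into the existing `self.lm_head` update blocks):

```python
# Decide the module type once, up front, instead of replacing a TensorModule
# with a QuantizedTensorModule after the fact.
# `lm_head_is_quantized` is a hypothetical flag, not the PR's actual code.
if lm_head_is_quantized:
    self.lm_head = QuantizedTensorModule(self.local_bits, self.local_group_size)
else:
    self.lm_head = TensorModule()
```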
# Set `g_idx` to None since it's not used in `MatMulNBits`
q_tensors.g_idx = None

# Unpack and repack all `Quantized TensorModule` classes in MLP
- # Unpack and repack all `Quantized TensorModule` classes in MLP
+ # Unpack and repack all `QuantizedTensorModule` classes in MLP
done
self.quant_type = quant_type
self.embedding = TensorModule()
self.final_norm = TensorModule()
self.lm_head = TensorModule()
self.layers = {}
self.num_layers = num_layers
self._quant_attrs = quant_attrs
self._load_quant_config(quant_attrs)  # codeql[py/init-calls-subclass]
Check warning — Code scanning / CodeQL: `__init__` method calls overridden method `_load_quant_config` (QuarkModel._load_quant_config).
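For context on this warning: when a base-class `__init__` calls a method that a subclass overrides, the override runs while the subclass's own `__init__` has not yet finished, so it can touch attributes that are not set yet. A generic sketch of the failure mode (names here are illustrative, not taken from the PR):

```python
class Base:
    def __init__(self):
        # CodeQL py/init-calls-subclass: this dispatches to Child.load_config,
        # which runs before Child.__init__ has set self.extra.
        self.load_config()

    def load_config(self):
        self.config = {}


class Child(Base):
    def __init__(self):
        super().__init__()    # Child.load_config runs inside this call...
        self.extra = "ready"  # ...but this line has not executed yet.

    def load_config(self):
        self.config = {"extra": self.extra}


try:
    Child()
except AttributeError as err:
    print(err)  # 'Child' object has no attribute 'extra'
```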
weights = load_file(os.path.join(input_path, weight_file))

# Map weights to modules
for name, tensor in weights.items():

    # Per-layer quantization support
    local_bits = self.get_layer_bits(name)  # codeql[py/init-calls-subclass]
Check warning — Code scanning / CodeQL: `__init__` method calls overridden method `get_layer_bits` (QuarkModel.get_layer_bits).
    # Per-layer quantization support
    local_bits = self.get_layer_bits(name)  # codeql[py/init-calls-subclass]
    local_group_size = self.get_layer_group_size(name)  # codeql[py/init-calls-subclass]
Check warning — Code scanning / CodeQL: `__init__` method calls overridden method `get_layer_group_size` (QuarkModel.get_layer_group_size).
Revert, as neither #noqa nor #codeql[py/init-calls-subclass] works.
This reverts commit 540bd8d.
This allows Quark-quantized models to be processed by ONNX Runtime GenAI.
Quark models must be exported in hf_format.
An example quark_quantize.py command:
It also allows different group sizes for different layers, depending on what is present in the config.json that Quark produces. A Quark config can look like:
As you can see, the lm_head entry in layer_quant_config has a different group size.
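A sketch of the shape being described, written as a Python dict standing in for the config.json fragment; the key names and numbers are assumptions for illustration (the description above only states that lm_head appears under layer_quant_config with its own group size):

```python
# Illustrative only: not the verbatim schema that Quark emits.
quant_config = {
    "global_quant_config": {
        "weight": {"dtype": "int4", "group_size": 128},
    },
    "layer_quant_config": {
        "lm_head": {
            "weight": {"dtype": "int4", "group_size": 32},  # differs from the global 128
        },
    },
}

print(quant_config["layer_quant_config"]["lm_head"]["weight"]["group_size"])  # 32
```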