
[DRAFT] [NO_MERGE] GPTQv2 for llama-mx #782

Draft
stamalakhov wants to merge 5 commits into Samsung:main from stamalakhov:llama_gptqv2_mx


@stamalakhov stamalakhov commented Jun 15, 2026

Contributor

This draft assesses GPTQv2 for llama mx quantization.

Results for mx quantization of HuggingFaceTB/SmolLM2-135M-Instruct (all activations for matmul, rms_norm, and softmax are mxint8):

| Config ID | PPL |
| --- | --- |
| FP32 | 17.38 |
| GPTQv2+PTQ_mse_256_samples | 22.73 |
| GPTQv2,PTQ_mse_256_samples | 22.81 |
| SPQ_GPTQv2+PTQ_mse_128_samples_adapt_percdamp | 19.87 |
| SPQ_GPTQv2+PTQ_mse_256_samples_adapt_percdamp | 19.93 |
| SPQ_GPTQv2+PTQ_smse_128_samples_adapt_percdamp | 19.87 |
| SPQ_GPTQv2+PTQ_mse_for_gptq_128_samples_adapt_percdamp | 19.72 |
| SPQ_GPTQv2+PTQ_mse_for_gptq_256_samples_adapt_percdamp | 19.66 |
| SPQ_GPTQv2+PTQ_smse_for_gptq_128_samples_adapt_percdamp | 19.97 |
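
To make the gaps in the table above easier to read, here is a small sketch that ranks the quantized configs and reports each PPL as a relative increase over the FP32 baseline. The numbers are copied from the table; the helper name `relative_increase` is only illustrative and is not part of this PR.

```python
# PPL values copied verbatim from the results table in this PR description.
FP32_PPL = 17.38

results = {
    "GPTQv2+PTQ_mse_256_samples": 22.73,
    "GPTQv2,PTQ_mse_256_samples": 22.81,
    "SPQ_GPTQv2+PTQ_mse_128_samples_adapt_percdamp": 19.87,
    "SPQ_GPTQv2+PTQ_mse_256_samples_adapt_percdamp": 19.93,
    "SPQ_GPTQv2+PTQ_smse_128_samples_adapt_percdamp": 19.87,
    "SPQ_GPTQv2+PTQ_mse_for_gptq_128_samples_adapt_percdamp": 19.72,
    "SPQ_GPTQv2+PTQ_mse_for_gptq_256_samples_adapt_percdamp": 19.66,
    "SPQ_GPTQv2+PTQ_smse_for_gptq_128_samples_adapt_percdamp": 19.97,
}


def relative_increase(ppl, baseline=FP32_PPL):
    """Fractional PPL increase over the baseline (0.10 means +10%)."""
    return (ppl - baseline) / baseline


# Sort from best (lowest PPL) to worst.
ranked = sorted(results.items(), key=lambda item: item[1])

for name, ppl in ranked:
    print(f"{name:<58} {ppl:6.2f}  +{relative_increase(ppl) * 100:.1f}%")
```

Read this way, every SPQ_GPTQv2 row stays below 20 PPL, while the two plain GPTQv2 rows are above 22.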

This draft tries to get a fully quantized model.
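
For readers unfamiliar with mxint8, the following is a minimal fake-quantization sketch of what "mxint8 activations" could mean. It assumes the OCP Microscaling convention (blocks of 32 elements sharing one power-of-two scale, with INT8 elements carrying 6 fractional bits); the quantizer actually used in this PR may differ in block size, rounding, or clipping. The function name `mxint8_fake_quant` is hypothetical.

```python
import math

BLOCK_SIZE = 32  # assumption: MX block size from the OCP Microscaling spec
FRAC_BITS = 6    # assumption: INT8 element = sign + 1 integer bit + 6 fractional bits


def mxint8_fake_quant(values, block_size=BLOCK_SIZE):
    """Quantize then dequantize a flat list of floats, block by block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            # An all-zero block needs no scale.
            out.extend(0.0 for _ in block)
            continue
        # Shared power-of-two scale chosen from the largest magnitude in the block.
        shared_exp = math.floor(math.log2(amax))
        step = 2.0 ** shared_exp / (1 << FRAC_BITS)
        for v in block:
            q = round(v / step)          # integer code for this element
            q = max(-127, min(127, q))   # clip to a symmetric int8 range
            out.append(q * step)         # dequantize back to float
    return out


if __name__ == "__main__":
    print(mxint8_fake_quant([1.0, 0.5, -0.25, 0.0]))
```

Because the scale is shared per block, one large outlier coarsens the step for its 31 neighbours, which is one reason calibration choices can affect PPL.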

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>