# Add GSM8K evaluation script and AWQ+FP8 results (#2330)
**Merged** · +61 −0 · changes shown from 8 of 14 commits
**Commits (14):**

- `6ed0e44` Add GSM8K evaluation script and AWQ+FP8 results (rtj1)
- `a883c17` Address review feedback: improve sed specificity and add path validation (rtj1)
- `14bc4cc` Update examples/awq/gsm8k_eval.py (rtj1)
- `680d8e7` Merge branch 'main' into add-gsm8k-eval-fp8 (rtj1)
- `1151344` Merge branch 'main' into add-gsm8k-eval-fp8 (HDCharles)
- `151a5b9` Fix code style and formatting issues (rtj1)
- `0f35746` Merge branch 'main' into add-gsm8k-eval-fp8 (brian-dellabetta)
- `9d425dd` Update results with Llama-3-8B evaluation data (rtj1)
- `002b77a` Address review feedback from brian-dellabetta (rtj1)
- `c99cb45` Merge branch 'main' into add-gsm8k-eval-fp8 (HDCharles)
- `6110243` Remove redundant summary line per HDCharles review (rtj1)
- `6601810` Merge branch 'main' into add-gsm8k-eval-fp8 (rtj1)
- `dd7077a` cleaning (HDCharles)
- `344b76d` Add discussion on FP8_BLOCK vs FP8_DYNAMIC performance (HDCharles)
### `RESULTS.md` (new file, +67 lines)

# AWQ + FP8 Quantization Results

Closes #2305

**Model:** Meta-Llama-3-8B-Instruct
**Hardware:** 8x NVIDIA A100-SXM4-80GB
**Date:** Feb 10, 2026
## Summary

Ran the example scripts with both FP8 schemes (FP8_DYNAMIC and FP8_BLOCK) on Meta-Llama-3-8B-Instruct, then evaluated on GSM8K as requested in #2305. FP8_DYNAMIC performs better overall.

This PR adds:
- `gsm8k_eval.py` - evaluation script for running GSM8K benchmarks
- `RESULTS.md` - results and reproducible workflow
## GSM8K Results

| Scheme | Strict Match | Flexible Extract |
|--------|-------------|------------------|
| **FP8_DYNAMIC** | **76.42%** | **76.19%** |
| FP8_BLOCK | 75.21% | 74.98% |

FP8_DYNAMIC wins by ~1.2 percentage points on strict matching. Both achieve similar performance on flexible extraction.

**Evaluation details:**
- 1,319 test samples
- Batch size: 16
- Model: Meta-Llama-3-8B-Instruct
## Model Checkpoints

- FP8_DYNAMIC: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-dynamic
- FP8_BLOCK: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-block
## Setup

Use the existing example scripts from the repo:

```bash
cd examples/awq
python fp8_dynamic_llama_example.py
python fp8_block_llama_example.py
```
## Evaluation

Use `gsm8k_eval.py` for running benchmarks:

```bash
python gsm8k_eval.py <model_path>
```

Or directly with lm-eval:

```bash
lm_eval \
    --model hf \
    --model_args pretrained=<model_path>,dtype=auto \
    --tasks gsm8k \
    --batch_size 16 \
    --output_path <output_dir>
```

**Important:** Set `batch_size=16` explicitly. The default `auto` picks a batch size of 1 here, which significantly increases evaluation time.
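After a run, the accuracy numbers live in the JSON that lm-eval writes under `--output_path`. Below is a minimal sketch for pulling them out; the `results_*.json` filename pattern and the metric key names (e.g. `exact_match,strict-match`) vary across lm-eval versions, so treat both as assumptions and match on substrings rather than exact keys:

```python
import json
from pathlib import Path


def extract_gsm8k_scores(results: dict) -> dict:
    """Pull the GSM8K accuracy metrics out of a parsed lm-eval results dict.

    Metric key names differ across lm-eval versions (e.g.
    "exact_match,strict-match"), so match on substrings and skip
    stderr entries and non-numeric values.
    """
    task = results["results"]["gsm8k"]
    scores = {}
    for key, value in task.items():
        if "stderr" in key or not isinstance(value, (int, float)):
            continue
        if "strict" in key:
            scores["strict_match"] = value
        elif "flexible" in key:
            scores["flexible_extract"] = value
    return scores


def load_results(output_dir: str) -> dict:
    """Load the newest results_*.json that lm-eval wrote under output_dir."""
    latest = max(
        Path(output_dir).rglob("results_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return json.loads(latest.read_text())
```

Usage would be `extract_gsm8k_scores(load_results("<output_dir>"))`, returning fractions such as `{"strict_match": 0.7642, "flexible_extract": 0.7619}`.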
## Recommendation

**Use FP8_DYNAMIC** for AWQ quantization: it preserves accuracy better (76.42% vs 75.21% on GSM8K strict matching) with otherwise similar model characteristics.
### `examples/awq/gsm8k_eval.py` (new file, +61 lines)
| """ | ||
| GSM8K evaluation script for AWQ+FP8 quantized models. | ||
|
|
||
| Usage: | ||
| python gsm8k_eval.py <model_path> | ||
|
|
||
| Example: | ||
| python gsm8k_eval.py ./Qwen2.5-0.5B-Instruct-awq-fp8-dynamic | ||
| """ | ||
|
|
||
| import argparse | ||
| import os | ||
| import subprocess | ||
| import sys | ||
|
|
||
|
|
||
| def evaluate_model(model_path): | ||
| """Run GSM8K eval using lm-eval.""" | ||
| print(f"\nEvaluating {model_path} on GSM8K...") | ||
|
|
||
| # Output dir based on model path | ||
| output_dir = os.path.basename(model_path.rstrip("/")) + "_gsm8k_results" | ||
|
|
||
| # Run lm-eval with batch_size=16 | ||
| # Note: Don't use batch_size=auto, it defaults to 1 which is super slow | ||
| cmd = [ | ||
| "lm_eval", | ||
| "--model", | ||
| "hf", | ||
| "--model_args", | ||
| f"pretrained={model_path},dtype=auto", | ||
| "--tasks", | ||
| "gsm8k", | ||
| "--batch_size", | ||
| "16", | ||
| "--output_path", | ||
| output_dir, | ||
| ] | ||
|
|
||
| try: | ||
| subprocess.run(cmd, check=True) | ||
| print(f"\nResults saved to {output_dir}/") | ||
| except subprocess.CalledProcessError as e: | ||
| print(f"Evaluation failed: {e}") | ||
| sys.exit(1) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| parser = argparse.ArgumentParser(description="Eval quantized models on GSM8K") | ||
| parser.add_argument("model_path", help="Path to quantized model directory") | ||
| args = parser.parse_args() | ||
|
|
||
| if not os.path.isdir(args.model_path): | ||
| print(f"Error: Model path not found: {args.model_path}", file=sys.stderr) | ||
| sys.exit(1) | ||
|
|
||
| if not os.path.isdir(args.model_path): | ||
| print(f"Error: Model path not found: {args.model_path}", file=sys.stderr) | ||
| sys.exit(1) | ||
|
|
||
| evaluate_model(args.model_path) |
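The comparison table in `RESULTS.md` can be regenerated mechanically from score fractions like those lm-eval reports. A small illustrative sketch (the `format_comparison` helper is ours, not part of the PR):

```python
def format_comparison(rows):
    """Render {scheme: {"strict_match": f, "flexible_extract": f}} as a
    markdown table, with fractions shown as percentages."""
    lines = [
        "| Scheme | Strict Match | Flexible Extract |",
        "|--------|-------------|------------------|",
    ]
    for scheme, s in rows.items():
        lines.append(
            f"| **{scheme}** | {s['strict_match']:.2%} | {s['flexible_extract']:.2%} |"
        )
    return "\n".join(lines)


print(format_comparison({
    "FP8_DYNAMIC": {"strict_match": 0.7642, "flexible_extract": 0.7619},
    "FP8_BLOCK": {"strict_match": 0.7521, "flexible_extract": 0.7498},
}))
```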