
[not for land] Gsm8k eval results#2347

Closed
HDCharles wants to merge 6 commits into main from gsm8k-eval-results

Conversation

@HDCharles
Collaborator

rtj1 and others added 4 commits February 5, 2026 13:53
Closes #2305

This PR adds:
- gsm8k_eval.py: Evaluation script for running GSM8K benchmarks on quantized models
- RESULTS.md: Quantization and evaluation results for Qwen2.5-0.5B-Instruct with FP8_DYNAMIC and FP8_BLOCK schemes

Key findings:
- FP8_DYNAMIC achieves 22.67% strict match vs 17.97% for FP8_BLOCK on GSM8K
- Both schemes achieve ~1.2x compression (1.1GB -> 0.92GB)
- Quantized models uploaded to HuggingFace Hub for reproducibility

Evaluated on Google Colab L4 GPU (22.5GB) using the existing example scripts.
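
The ~1.2x figure above follows directly from the reported checkpoint sizes. A trivial sketch of the arithmetic (figures taken from the findings above; the helper name is illustrative, not part of the PR):

```python
def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """On-disk compression ratio of a quantized checkpoint vs the original."""
    return original_gb / quantized_gb


# Figures from the PR description: 1.1 GB -> 0.92 GB for both FP8 schemes.
ratio = compression_ratio(1.1, 0.92)
print(f"{ratio:.2f}x")  # roughly 1.2x, matching the reported compression
```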

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jagarlamudi <76727507+rtj1@users.noreply.github.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new evaluation script and comprehensive results for AWQ+FP8 quantized models on the GSM8K benchmark. It provides a direct comparison between FP8_DYNAMIC and FP8_BLOCK quantization schemes, highlighting their performance on a Qwen2.5-0.5B-Instruct model and offering a clear recommendation for optimal accuracy preservation. The changes enable standardized benchmarking and provide valuable insights into the effectiveness of different quantization approaches.

Highlights

  • New Evaluation Script: A new Python script, gsm8k_eval.py, was added to facilitate the evaluation of AWQ+FP8 quantized models on the GSM8K benchmark.
  • Quantization Results Documentation: A RESULTS.md file was introduced, detailing the performance of FP8_DYNAMIC and FP8_BLOCK quantization schemes on the Qwen2.5-0.5B-Instruct model when evaluated on GSM8K.
  • Performance Comparison and Recommendation: The evaluation results show that FP8_DYNAMIC preserves accuracy better on GSM8K (22.67% strict match) compared to FP8_BLOCK (17.97%), leading to a recommendation for using FP8_DYNAMIC for AWQ quantization.
  • Reproducible Workflow: The RESULTS.md file also provides a comprehensive, reproducible workflow for quantizing and evaluating models, including setup commands and runtime estimates.


Changelog
  • examples/awq/RESULTS.md
    • Documented AWQ + FP8 quantization results, including model compression, GSM8K evaluation scores, setup instructions, and a reproducible workflow.
    • Provided a recommendation to use FP8_DYNAMIC based on superior accuracy on the GSM8K benchmark.
  • examples/awq/gsm8k_eval.py
    • Implemented a Python script to automate the evaluation of quantized models on the GSM8K dataset using the lm-eval framework.
    • Ensured efficient evaluation by explicitly setting a batch size of 16 for the lm-eval command.
Activity
  • No specific human activity (comments, reviews) has been recorded for this pull request yet.
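Per the changelog above, gsm8k_eval.py drives the lm-eval framework with a fixed batch size of 16. A minimal sketch of such a wrapper is shown below; it assumes the `lm_eval` CLI from lm-evaluation-harness is installed, and the function names here are illustrative, not the actual contents of gsm8k_eval.py:

```python
import os
import shlex
import subprocess
import sys


def build_lm_eval_command(model_path: str, batch_size: int = 16) -> list[str]:
    """Assemble an lm-eval CLI invocation for a GSM8K run (illustrative sketch)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", "gsm8k",
        "--batch_size", str(batch_size),
    ]


def evaluate_model(model_path: str) -> None:
    """Validate the model directory, then shell out to lm-eval."""
    if not os.path.isdir(model_path):
        print(f"Error: Model path not found: {model_path}", file=sys.stderr)
        sys.exit(1)
    cmd = build_lm_eval_command(model_path)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Keeping the command construction in its own function makes the batch-size and task choices easy to inspect without launching an evaluation.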

Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
@mergify
Contributor

mergify bot commented Feb 11, 2026

⚠️ The sha of the head commit of this PR conflicts with #2330. Mergify cannot evaluate rules on this PR. ⚠️

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces evaluation results for AWQ+FP8 quantization in a RESULTS.md file and adds a Python script, gsm8k_eval.py, for running the evaluation. The results documentation is clear and well-structured. The evaluation script, however, contains a significant bug with duplicated code and logic placed outside the if __name__ == '__main__' guard. This could lead to runtime errors if the script is imported as a module. I've provided a suggestion to refactor the script's entry point to resolve these issues.

Comment on lines +42 to +55
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Eval quantized models on GSM8K")
    parser.add_argument("model_path", help="Path to quantized model directory")
    args = parser.parse_args()

    if not os.path.isdir(args.model_path):
        print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
        sys.exit(1)

if not os.path.isdir(args.model_path):
    print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
    sys.exit(1)

evaluate_model(args.model_path)
Contributor


Severity: high

This block contains duplicated code and has script logic outside the if __name__ == "__main__" guard. The check for os.path.isdir is repeated, and evaluate_model is called at the module level. This will cause a NameError if the file is imported as a module, as args will not be defined. The script-running logic should be consolidated within the if __name__ == "__main__" block to fix the duplication and prevent import-time errors.

Suggested change

Before:

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Eval quantized models on GSM8K")
    parser.add_argument("model_path", help="Path to quantized model directory")
    args = parser.parse_args()

    if not os.path.isdir(args.model_path):
        print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
        sys.exit(1)

if not os.path.isdir(args.model_path):
    print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
    sys.exit(1)

evaluate_model(args.model_path)

After:

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Eval quantized models on GSM8K")
    parser.add_argument("model_path", help="Path to quantized model directory")
    args = parser.parse_args()

    if not os.path.isdir(args.model_path):
        print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
        sys.exit(1)

    evaluate_model(args.model_path)

@HDCharles HDCharles closed this Feb 11, 2026
@HDCharles HDCharles reopened this Feb 11, 2026
Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
@HDCharles HDCharles closed this Feb 11, 2026
rtj1 added a commit to rtj1/llm-compressor that referenced this pull request Feb 11, 2026
- Update RESULTS.md with HDCharles's Llama-3-8B-Instruct evaluation results
- FP8_DYNAMIC: 76.42% strict match vs FP8_BLOCK: 75.21%
- Run make style with proper dev dependencies (pip install -e .[dev])
- Fix code formatting per maintainer feedback

Results from: vllm-project#2347

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>

Labels

documentation Improvements or additions to documentation
