
[not for land] Gsm8k eval results#2347

Closed
HDCharles wants to merge 6 commits into main from gsm8k-eval-results

Conversation

@HDCharles
Collaborator

rtj1 and others added 4 commits February 5, 2026 13:53
Closes #2305

This PR adds:
- gsm8k_eval.py: Evaluation script for running GSM8K benchmarks on quantized models
- RESULTS.md: Quantization and evaluation results for Qwen2.5-0.5B-Instruct with FP8_DYNAMIC and FP8_BLOCK schemes

Key findings:
- FP8_DYNAMIC achieves 22.67% strict match vs 17.97% for FP8_BLOCK on GSM8K
- Both schemes achieve ~1.2x compression (1.1GB -> 0.92GB)
- Quantized models uploaded to HuggingFace Hub for reproducibility

Evaluated on Google Colab L4 GPU (22.5GB) using the existing example scripts.
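
The ~1.2x figure above follows directly from the reported checkpoint sizes. A trivial sketch of the arithmetic (figures taken from the findings above; the helper name is illustrative, not part of the PR):

```python
def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """On-disk compression ratio of a quantized checkpoint vs the original."""
    return original_gb / quantized_gb


# Figures from the PR description: 1.1 GB -> 0.92 GB for both FP8 schemes.
ratio = compression_ratio(1.1, 0.92)
print(f"{ratio:.2f}x")  # roughly 1.2x, matching the reported compression
```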

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jagarlamudi <76727507+rtj1@users.noreply.github.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new evaluation script and comprehensive results for AWQ+FP8 quantized models on the GSM8K benchmark. It provides a direct comparison between FP8_DYNAMIC and FP8_BLOCK quantization schemes, highlighting their performance on a Qwen2.5-0.5B-Instruct model and offering a clear recommendation for optimal accuracy preservation. The changes enable standardized benchmarking and provide valuable insights into the effectiveness of different quantization approaches.

Highlights

  • New Evaluation Script: A new Python script, gsm8k_eval.py, was added to facilitate the evaluation of AWQ+FP8 quantized models on the GSM8K benchmark.
  • Quantization Results Documentation: A RESULTS.md file was introduced, detailing the performance of FP8_DYNAMIC and FP8_BLOCK quantization schemes on the Qwen2.5-0.5B-Instruct model when evaluated on GSM8K.
  • Performance Comparison and Recommendation: The evaluation results show that FP8_DYNAMIC preserves accuracy better on GSM8K (22.67% strict match) compared to FP8_BLOCK (17.97%), leading to a recommendation for using FP8_DYNAMIC for AWQ quantization.
  • Reproducible Workflow: The RESULTS.md file also provides a comprehensive, reproducible workflow for quantizing and evaluating models, including setup commands and runtime estimates.


Changelog
  • examples/awq/RESULTS.md
    • Documented AWQ + FP8 quantization results, including model compression, GSM8K evaluation scores, setup instructions, and a reproducible workflow.
    • Provided a recommendation to use FP8_DYNAMIC based on superior accuracy on the GSM8K benchmark.
  • examples/awq/gsm8k_eval.py
    • Implemented a Python script to automate the evaluation of quantized models on the GSM8K dataset using the lm-eval framework.
    • Ensured efficient evaluation by explicitly setting a batch size of 16 for the lm-eval command.
Activity
  • No specific human activity (comments, reviews) has been recorded for this pull request yet.
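Per the changelog above, gsm8k_eval.py drives the lm-eval framework with a fixed batch size of 16. A minimal sketch of such a wrapper is shown below; it assumes the `lm_eval` CLI from lm-evaluation-harness is installed, and the function names here are illustrative, not the actual contents of gsm8k_eval.py:

```python
import os
import shlex
import subprocess
import sys


def build_lm_eval_command(model_path: str, batch_size: int = 16) -> list[str]:
    """Assemble an lm-eval CLI invocation for a GSM8K run (illustrative sketch)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", "gsm8k",
        "--batch_size", str(batch_size),
    ]


def evaluate_model(model_path: str) -> None:
    """Validate the model directory, then shell out to lm-eval."""
    if not os.path.isdir(model_path):
        print(f"Error: Model path not found: {model_path}", file=sys.stderr)
        sys.exit(1)
    cmd = build_lm_eval_command(model_path)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Keeping the command construction in its own function makes the batch-size and task choices easy to inspect without launching an evaluation.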

Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
@mergify
Contributor

mergify bot commented Feb 11, 2026

⚠️ The sha of the head commit of this PR conflicts with #2330. Mergify cannot evaluate rules on this PR. ⚠️

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces evaluation results for AWQ+FP8 quantization in a RESULTS.md file and adds a Python script, gsm8k_eval.py, for running the evaluation. The results documentation is clear and well-structured. The evaluation script, however, contains a significant bug with duplicated code and logic placed outside the if __name__ == '__main__' guard. This could lead to runtime errors if the script is imported as a module. I've provided a suggestion to refactor the script's entry point to resolve these issues.

Comment on lines +42 to +55
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Eval quantized models on GSM8K")
    parser.add_argument("model_path", help="Path to quantized model directory")
    args = parser.parse_args()

    if not os.path.isdir(args.model_path):
        print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
        sys.exit(1)

if not os.path.isdir(args.model_path):
    print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
    sys.exit(1)

evaluate_model(args.model_path)
Contributor


Severity: high

This block contains duplicated code and has script logic outside the if __name__ == "__main__" guard. The check for os.path.isdir is repeated, and evaluate_model is called at the module level. This will cause a NameError if the file is imported as a module, as args will not be defined. The script-running logic should be consolidated within the if __name__ == "__main__" block to fix the duplication and prevent import-time errors.

Suggested change

Before:

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Eval quantized models on GSM8K")
    parser.add_argument("model_path", help="Path to quantized model directory")
    args = parser.parse_args()

    if not os.path.isdir(args.model_path):
        print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
        sys.exit(1)

if not os.path.isdir(args.model_path):
    print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
    sys.exit(1)

evaluate_model(args.model_path)

After:

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Eval quantized models on GSM8K")
    parser.add_argument("model_path", help="Path to quantized model directory")
    args = parser.parse_args()

    if not os.path.isdir(args.model_path):
        print(f"Error: Model path not found: {args.model_path}", file=sys.stderr)
        sys.exit(1)

    evaluate_model(args.model_path)

@HDCharles HDCharles closed this Feb 11, 2026
@HDCharles HDCharles reopened this Feb 11, 2026
Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
@HDCharles HDCharles closed this Feb 11, 2026
rtj1 added a commit to rtj1/llm-compressor that referenced this pull request Feb 11, 2026
- Update RESULTS.md with HDCharles's Llama-3-8B-Instruct evaluation results
- FP8_DYNAMIC: 76.42% strict match vs FP8_BLOCK: 75.21%
- Run make style with proper dev dependencies (pip install -e .[dev])
- Fix code formatting per maintainer feedback

Results from: vllm-project#2347

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>

Labels

documentation Improvements or additions to documentation
