
[Draft] Add math benchmarks #1570

Open
wants to merge 121 commits into
base: master

Collaborator

@hallerite hallerite commented Feb 7, 2025

Description

This PR introduces a base class for math benchmarks and provides implementations for:

  • GSM8K benchmark
  • MATH benchmark
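
As a rough illustration of the layout described above (one shared base class plus dataset-specific subclasses), here is a minimal sketch. The names `MathBenchmarkSketch`, `LastLineBenchmarkSketch`, `extract_answer`, and `score` are hypothetical and are not taken from this PR's diff; the exact-match scoring and the "last line is the answer" rule are assumptions made only for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MathBenchmarkSketch(ABC):
    r"""Hypothetical base class: shared scoring, dataset-specific parsing."""

    @abstractmethod
    def extract_answer(self, response: str) -> str:
        r"""Pull the final answer out of a model response."""

    def score(
        self, responses: List[str], references: List[str]
    ) -> Dict[str, float]:
        # Exact-match accuracy over paired responses and references.
        if len(responses) != len(references):
            raise ValueError("responses and references differ in length")
        correct = sum(
            self.extract_answer(response) == reference.strip()
            for response, reference in zip(responses, references)
        )
        total = len(references)
        return {"accuracy": correct / total if total else 0.0}


class LastLineBenchmarkSketch(MathBenchmarkSketch):
    # Assumption: the final answer is the last non-empty line of a response.
    def extract_answer(self, response: str) -> str:
        lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
        return lines[-1] if lines else ""
```

With this split, a new dataset would only need to override the parsing step, while scoring stays in one place.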

Motivation and Context

This PR addresses and closes #1510.

Types of Changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update (adds/modifies project documentation)
  • Example update (adds/modifies example code)

Implemented Tasks ✅

  • Implement math base benchmark
  • Implement GSM8K benchmark
  • Implement MATH benchmark
  • Add unit tests
  • Add example code on how to use benchmarks
  • Update documentation with usage details

Checklist 📝

Please go over all the following points and put an x in the boxes that apply.
If you're unsure about any, feel free to ask!

  • I have read the CONTRIBUTION GUIDE (required).
  • My changes require a documentation update.
  • I have updated tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

Draft Status 🚧

Current Progress:

  • ✅ Core benchmark implementations are complete.
  • 🚧 Work in Progress: Tests, examples, and documentation updates.

Next Steps:

  • Implement unit tests to ensure benchmark reliability.
  • Provide example usage to guide users.
  • Finalize and update documentation.

Collaborator

@zjrwtx zjrwtx left a comment


Thanks @hallerite, great work! The docstrings still need to be polished, though; please refer to:
https://github.com/camel-ai/camel/blob/master/CONTRIBUTING.md#guideline-for-writing-docstrings



class GSM8KBenchmark(MathBenchmark):
    """Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""
Collaborator


Suggested change
-    """Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""
+    r"""Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""

Collaborator


This is an example of how a docstring can be improved.
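
To make the requested style concrete, here is a minimal sketch of a fuller docstring. It assumes the linked guideline asks for a raw triple-quoted string, lines of at most 79 characters, and an `Args:` section with types and defaults; check the guideline itself for the authoritative rules. The class name `GSM8KBenchmarkDocSketch` and both constructor parameters are hypothetical and may not match the PR's actual signature.

```python
class GSM8KBenchmarkDocSketch:
    r"""Benchmark for evaluating ChatAgents on the GSM8K dataset from
    Hugging Face Hub.

    Args:
        data_dir (str): Directory used to cache the dataset.
        force_download (bool): Whether to download the dataset again
            even if a cached copy exists. (default: :obj:`False`)
    """

    def __init__(self, data_dir: str, force_download: bool = False) -> None:
        # Illustrative parameters; the PR's real signature may differ.
        self.data_dir = data_dir
        self.force_download = force_download
```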

Collaborator

zjrwtx commented Feb 8, 2025

Can we add an example under the examples directory?

Wendong-Fan and others added 14 commits February 11, 2025 13:25
…oning data with thought process (Long Cot data)from deepseek R1 (#1532)

Co-authored-by: yifeng.wang <[email protected]>
Co-authored-by: Wendong <[email protected]>
Co-authored-by: Wendong-Fan <[email protected]>
…mel (#1493)

Co-authored-by: 任信行 <[email protected]>
Co-authored-by: Harry Ye <[email protected]>
Co-authored-by: Wendong-Fan <[email protected]>
Member

@Wendong-Fan Wendong-Fan left a comment


Thanks @hallerite and @apokryphosx! I left some comments below. Please also remember to run pre-commit run --all-files locally before pushing the code; there are currently some errors.

Comment on lines 40 to 45
for config in self.DATASET_CONFIGS:
    dataset = load_dataset(
        self.DATASET_REPO,
        config,
        cache_dir=str(self.data_dir),
        download_mode="force_redownload" if force_download else "reuse_dataset_if_exists"
Member


Why not use def download(self) -> "MATHBenchmark"?

Collaborator Author


Because that functionality already comes with the Hugging Face datasets library.
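
One way the two positions in this exchange could be reconciled is a thin `download` method that returns `self` and simply delegates to `load_dataset`. The sketch below is an assumption about what that might look like, not the PR's code: `DownloadSketch`, `pick_download_mode`, and the injectable `loader` parameter are hypothetical names, and the loader injection exists only so the example can run without network access.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Sequence


def pick_download_mode(force_download: bool) -> str:
    # Same two mode strings as in the snippet under review.
    return "force_redownload" if force_download else "reuse_dataset_if_exists"


class DownloadSketch:
    r"""Hypothetical wrapper; not the PR's MATHBenchmark class."""

    def __init__(
        self,
        repo: str,
        configs: Sequence[str],
        data_dir: str,
        loader: Optional[Callable[..., Any]] = None,
    ) -> None:
        self.repo = repo
        self.configs = list(configs)
        self.data_dir = Path(data_dir)
        # An injected loader allows offline use; None means use `datasets`.
        self._loader = loader
        self.datasets: Dict[str, Any] = {}

    def download(self, force_download: bool = False) -> "DownloadSketch":
        loader = self._loader
        if loader is None:
            # Imported lazily so the class can be defined without the package.
            from datasets import load_dataset

            loader = load_dataset
        for config in self.configs:
            self.datasets[config] = loader(
                self.repo,
                config,
                cache_dir=str(self.data_dir),
                download_mode=pick_download_mode(force_download),
            )
        # Returning self allows chaining calls after download().
        return self
```

Caching and re-download behaviour stay entirely with the `datasets` library here; the method only adds a uniform entry point.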

@hallerite
Collaborator Author

Thanks for the comments @Wendong-Fan. I will implement the changes later today.

Development

Successfully merging this pull request may close these issues.

[Feature Request] Math and code benchmark to evaluate trained model