[Draft] Add math benchmarks #1570

old-hallerite · 2025-02-07T14:49:55Z

Description

This PR introduces a base class for math benchmarks and provides implementations for:

GSM8K benchmark
MATH benchmark

Motivation and Context

This PR addresses and closes #1510.

Types of Changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds core functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update (adds/modifies project documentation)
Example update (adds/modifies example code)

Implemented Tasks ✅

Implement math base benchmark
Implement GSM8K benchmark
Implement MATH benchmark
Add unit tests
Add example code on how to use benchmarks
Update documentation with usage details

Checklist 📝

Please go over all the following points and put an x in the boxes that apply.
If you're unsure about any, feel free to ask!

I have read the CONTRIBUTION GUIDE (required).
My changes require a documentation update.
I have updated tests accordingly. (required for a bug fix or a new feature)
I have updated the documentation accordingly.

Draft Status 🚧

Current Progress:

✅ Core benchmark implementations are complete.
🚧 Work in Progress: Tests, examples, and documentation updates.

Next Steps:

Implement unit tests to ensure benchmark reliability.
Provide example usage to guide users.
Finalize and update documentation.

zjrwtx

Thanks @hallerite, great work! The docstrings still need to be polished, though. Please refer to:
https://github.com/camel-ai/camel/blob/master/CONTRIBUTING.md#guideline-for-writing-docstrings

zjrwtx · 2025-02-08T09:52:44Z

camel/benchmarks/math_benchmarks/gsm8k.py

+
+
+class GSM8KBenchmark(MathBenchmark):
+    """Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""


Suggested change

-    """Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""
+    r"""Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""

This is an example of the docstring polish I mean: start the docstring with a raw string (r""").
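For a fuller picture, here is a minimal sketch of how the class docstring might look once polished. It assumes the linked guideline asks for a raw triple-quoted string, lines wrapped at 79 characters, and an `Args:` section with types. The `MathBenchmark` stand-in and the `data_dir` and `save_to` parameters are hypothetical, used only so the snippet runs on its own; they are not taken from the PR's code.

```python
class MathBenchmark:
    r"""Stand-in base class so this sketch is self-contained (hypothetical)."""


class GSM8KBenchmark(MathBenchmark):
    r"""Benchmark for evaluating ChatAgents on the GSM8K dataset from
    Hugging Face Hub.

    Args:
        data_dir (str): Directory where the dataset is stored.
            (hypothetical parameter, for illustration only)
        save_to (str): Path where the results are written.
            (hypothetical parameter, for illustration only)
    """

    def __init__(self, data_dir: str, save_to: str) -> None:
        # Store the paths; the real class may take different arguments.
        self.data_dir = data_dir
        self.save_to = save_to
```

The one-line summary is wrapped rather than shortened, so the wording from the original docstring is preserved.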

zjrwtx · 2025-02-08T12:09:04Z

Can we add an example under the examples directory?

old-hallerite · 2025-03-04T12:17:38Z

@apokryphosx added math-verify as a dependency, but since it is very new, it seems it cannot be resolved. Any idea what we should do? Without it, the benchmarks are much less powerful.

cc: @Wendong-Fan
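One possible stopgap, sketched here under assumptions rather than taken from the PR, is to treat math-verify as an optional dependency: import it lazily and fall back to a much weaker string comparison when it is missing. The sketch assumes the package exposes `parse` and `verify` under the module name `math_verify`, as its README describes; `answers_match` is a hypothetical helper name.

```python
def answers_match(gold: str, prediction: str) -> bool:
    """Compare two answers, preferring math-verify when it is installed."""
    try:
        # Lazy import keeps math-verify optional (assumed module name).
        from math_verify import parse, verify
    except ImportError:
        # Fallback, far less powerful: whitespace-insensitive equality.
        return "".join(gold.split()) == "".join(prediction.split())
    # With math-verify available, compare the parsed expressions.
    return bool(verify(parse(gold), parse(prediction)))
```

The fallback only catches answers that are textually identical, so results obtained without math-verify would understate accuracy on answers written in a different but equivalent form.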

old-hallerite added 9 commits January 30, 2025 17:03

feat(benchmark): initial GSM8K benchmark setup

0a6458b

feat(benchmark): initial MATH benchmark setup

2421a17

add new base class for math envs

0e8a480

pass mode explicitly

46acc02

handle data loading and shuffling in base class

34d7915

feat: add preprocessing step for dataset

eace104

fix: update gsm8k to new math_benchmark base

bfccedf

feat: add MATH benchmark

f1b7085

fix: update __init__.py

b10806a

old-hallerite requested review from zjrwtx and Wendong-Fan February 7, 2025 14:49

zjrwtx reviewed Feb 8, 2025

Wendong-Fan assigned old-hallerite Feb 9, 2025

Wendong-Fan added the New Feature label Feb 9, 2025

Wendong-Fan added this to the Sprint 22 milestone Feb 9, 2025

Wendong-Fan and others added 14 commits February 11, 2025 13:25

fix: Deepseek r1 reasoning content (#1523)

c88bce1

feat: Support OpenAI o3 mini (#1533)

6beebb8

fix: WolframAlpha result from all subpods (#1534)

2c2ec6c

chore: Qwen datagen cookbook add graph (#1530)

3c695aa

Co-authored-by: Wendong-Fan <[email protected]>

feat: Integrate moonshot models to camel (#1526)

14bc56a

docs: update human cookbook using qwen model (#1555)

31e54cc

Co-authored-by: Wendong <[email protected]>

docs: Add Self-Improving Math Reasoning Data Distillation (#1556)

f4b5452

feat: Integrate siliconflow model (#1541)

22b141e

Co-authored-by: Wendong-Fan <[email protected]> Co-authored-by: Wendong <[email protected]>

chore: Update supported models (#1557)

a239505

chore: Support Gemini 2.0 flash thinking and pro (#1558)

ee9b32b

chore: Rename tool call instances (#1492)

00f2db3

Co-authored-by: Wendong-Fan <[email protected]> Co-authored-by: Wendong <[email protected]>

feat: Add SemanticScholarToolkits to integrate Semantic Scholar to camel (#1493)

ba31cba

Co-authored-by: 任信行 <[email protected]> Co-authored-by: Harry Ye <[email protected]> Co-authored-by: Wendong-Fan <[email protected]>

enhance: Semantic Scholar (#1562)

6faa9ee

old-hallerite and others added 28 commits February 23, 2025 12:46

fix: use lazy imports

ee4d952

fix: add "valid" to input validation

a412ee2

chore: polish docstring

2f3865d

fix: reset state of agent after every response

664dbff

fix: vectorize evaluation

07ee4ae

fix: run pre-commit

4ace02d

style: Fix line lenghts to adhere to style checks

e979df2

style: Fix formating to pass through pre commit

7baf965

Merge branch 'feat/benchmarks' of https://github.com/camel-ai/camel into feat/benchmarks

f452a71

fix: restore base class

9e2fd25

fix: adjust lazy imports and annotate mutable class attributes with ClassVar

7791817

chore: fix formatting

99e6be8

fix: Fixed circular import errors

e4960b0

fix: Fixed download function to not pass itself as directory

089b171

fix: Added a pass@1k as default Mode at the start of the run function

876ea7d

fix: Fixed load function to not pass itself as a directory

c630c91

fix: Changed it so save_to directory gets made if it doesn't exist

d581603

chore: fix pre-commit errors

7173270

fix: change bool_ to bool return type

daf14e1

chore: fix line-length

ed9afe1

fix: Utilized Huggingface Math-Verify to correctly parse and evaluate the Agents Output

9521041

style: Fixed code style to adhere to pre-commit style checks

6bc7ffe

fix: Removed debugging and ran experiment with API credentials

bfbdde1

style: Improved code readability

436b588

fix: Improved readability, ran experiment with API and fixed errors

0fc67b7

style: Ran pre-commit and fixed code style

36328e1

fix: Fixed load function to not pass itself as a directory

70bb28e

fix: Updated pyproject with huggingface math verify and added it to mypy overrides since it doesn't have a typing package

c720957

Merge remote-tracking branch 'origin/master' into feat/benchmarks

e63b580
