Onboarding Omni_Math Benchmark #96

Open
wants to merge 38 commits into
base: main
d6fab5c
omni math initial commit - data+inf
Jan 29, 2025
421e432
adding omni math to init
Jan 29, 2025
4e13fea
merge conflicts
Feb 9, 2025
e27f483
omni math working pipeline
Feb 11, 2025
e86da30
added tests
Feb 13, 2025
47621d8
refactor
Feb 14, 2025
458e6a1
omni_math config changes
Feb 19, 2025
6eedb67
omni math - usage fix
Feb 20, 2025
8f9891f
Merge branch 'main' into vidhisha/omni_math
Feb 20, 2025
ab3e1c8
new requirements
Feb 20, 2025
872a2ae
added phi4-vllm model
Feb 21, 2025
0edbe39
eval updates
Feb 24, 2025
4e735ec
removed extractusage
Feb 24, 2025
dcb49bd
merge conflicts
Feb 24, 2025
0662b43
edits for latest extract usage transform
Feb 25, 2025
49c47ed
removed pvt configs comment
Feb 25, 2025
902af53
removed pvt configs comment
Feb 25, 2025
5e21288
column rename fix
Feb 25, 2025
c5320d1
test fix
Feb 25, 2025
7ce1a79
added majority_vote_label
Feb 25, 2025
1320168
fixed pipeline tests
Feb 25, 2025
5bbae5f
PR cleanup
Feb 25, 2025
6ac84c3
PR cleanup
Feb 25, 2025
63fc364
removed req_vllm
Feb 25, 2025
d2fa49e
fixed tests
Feb 25, 2025
587a750
merge conflict
Feb 27, 2025
54813f6
added comment on new args
Feb 27, 2025
38413c3
added utils test cases
Mar 3, 2025
4b3d2be
retrigger checks
Mar 3, 2025
454bf9c
edits for token usage for false is_valid
Mar 12, 2025
7229e47
Merge branch 'vidhisha/omni_math' of https://github.com/microsoft/eur…
Mar 12, 2025
0e6e9f3
merge conflicts
Mar 12, 2025
9e2ddd9
clean comments
Mar 12, 2025
08452d0
transform changes + config updates
Mar 13, 2025
6a860b3
Merge branch 'main' into vidhisha/omni_math
Mar 13, 2025
d205a13
config updates
Mar 14, 2025
fedf8b3
Merge branch 'main' into vidhisha/omni_math
Mar 14, 2025
60f82c1
ba-cal edits
Mar 18, 2025
4 changes: 3 additions & 1 deletion eureka_ml_insights/data_utils/ba_calendar_utils.py
@@ -31,9 +31,11 @@ def parse_output_answer(response):

response = response.replace("**", "").replace("\n", "")

match = re.findall(r"(?i)(?<=Final Answer: )(\w+)(?=\s|\W|$)", response)
match = re.findall(r"Final Answer:\s*(\w+ \d{2}:\d{2}-\d{2}:\d{2})", response)

if match:
answer = match[len(match) - 1]
elif "No common time slot available".lower() in response.lower():
answer = "No common time slot available"

return answer
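
The new pattern above expects a weekday followed by a time range after "Final Answer:". A minimal standalone sketch of that parsing step follows; the function name `parse_final_answer` and the empty-string fallback are assumptions for illustration, since the hunk does not show how `answer` is initialized.

```python
import re


def parse_final_answer(response: str) -> str:
    """Hypothetical standalone version of the parsing step shown in the hunk."""
    # Drop markdown bold markers and newlines before matching.
    response = response.replace("**", "").replace("\n", "")
    # Expect "<Weekday> HH:MM-HH:MM" after "Final Answer:".
    matches = re.findall(r"Final Answer:\s*(\w+ \d{2}:\d{2}-\d{2}:\d{2})", response)
    if matches:
        return matches[-1]  # keep the last stated answer
    if "no common time slot available" in response.lower():
        return "No common time slot available"
    return ""  # assumption: empty string when nothing is found


print(parse_final_answer("**Final Answer:** Monday 09:00-10:00"))
```

Note that a response such as "Final Answer: Tuesday" no longer matches, because the pattern requires the full time range.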
76 changes: 76 additions & 0 deletions eureka_ml_insights/data_utils/omni_math_utils.py
@@ -0,0 +1,76 @@
import math
import re
from dataclasses import dataclass

import pandas as pd

from .transform import DFTransformBase

# @staticmethod
def parse_output_answer(response):
    """
    Parse the input string to extract the model judgement.
    Parameters:
        response (str): Input string containing model judgement as '## Equivalence Judgement: X '.
    Returns:
        dict: A dict of extracted final answer and model based judgement.
    """
    if response is None or response == '':
        return {}

    parts = response.split("## ")
    data = {}

    for part in parts[1:]:
        lines = part.strip().split("\n")
        # Titles are lowercased here, so the comparison below must be lowercase too.
        title = lines[0].strip().replace('#', '').replace('*', '').lower()
        content = "\n".join(lines[1:]).strip()

        if title == "justification":
            data[title] = content
        else:
            data[title] = lines[1].strip() if len(lines) > 1 else ''

    return data

@dataclass
class Omni_Math_ParseLabel(DFTransformBase):
    model_output_column: str
    model_answer_column: str

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.model_answer_column] = df[self.model_output_column].apply(self.extract_label)
        return df

    @staticmethod
    def extract_label(response):
        data = parse_output_answer(response)
        label = 'Equivalence Judgement'.lower()
        model_label = data[label] if label in data else ''
        # Strip markdown markers and case once before comparing.
        normalized_label = model_label.strip().replace('#', '').replace('*', '').lower()
        numeric_label = math.nan
        if normalized_label == 'true':
            numeric_label = 1
        elif normalized_label == 'false':
            numeric_label = 0
        # NaN never compares equal to itself, so test with math.isnan instead of ==.
        if math.isnan(numeric_label):
            print(data.get(label), model_label, numeric_label)
        return numeric_label


@dataclass
class Omni_Math_ParseSolution(DFTransformBase):
    model_output_column: str
    model_answer_column: str

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.model_answer_column] = df[self.model_output_column].apply(self.extract_solution)
        return df

    @staticmethod
    def extract_solution(response):
        data = parse_output_answer(response)
        label = 'Student Final Answer'.lower()
        model_label = data[label] if label in data else ''
        return model_label
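
To see how the judge report is meant to be consumed, here is a small standalone sketch of the same parsing idea applied to a sample report. The names `parse_report` and `to_numeric_label` are hypothetical and the sample text is made up; the sketch does not import anything from the repository.

```python
import math


def parse_report(response):
    """Split a '## Title' markdown report into a lowercase-title -> content dict."""
    if not response:
        return {}
    data = {}
    for part in response.split("## ")[1:]:
        lines = part.strip().split("\n")
        title = lines[0].strip().replace("#", "").replace("*", "").lower()
        if title == "justification":
            # Keep the whole remaining text for the justification section.
            data[title] = "\n".join(lines[1:]).strip()
        else:
            # Other sections are expected to hold a single line of content.
            data[title] = lines[1].strip() if len(lines) > 1 else ""
    return data


def to_numeric_label(report):
    """Map TRUE/FALSE to 1/0; anything else becomes NaN."""
    raw = report.get("equivalence judgement", "")
    norm = raw.strip().replace("#", "").replace("*", "").lower()
    return {"true": 1, "false": 0}.get(norm, math.nan)


sample = (
    "## Student Final Answer\n216\n\n"
    "## Equivalence Judgement\nFALSE\n\n"
    "## Justification\nThe counts differ.\n\n=== report over ==="
)
report = parse_report(sample)
print(report["student final answer"], to_numeric_label(report))
```

In this sketch the trailing "=== report over ===" marker ends up inside the justification text, which may or may not matter depending on how that field is used downstream.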


60 changes: 49 additions & 11 deletions eureka_ml_insights/data_utils/transform.py
@@ -11,7 +11,6 @@
import logging

from eureka_ml_insights.configs.config import ModelConfig

from eureka_ml_insights.models import (
ClaudeModel,
ClaudeReasoningModel,
@@ -402,11 +401,14 @@ def transform(self, df: pd.DataFrame, encoding="cl100k_base") -> pd.DataFrame:

@dataclass
class MajorityVoteTransform:
"""Applies the majority vote transformation to the specified model output column per id_col."""
"""Applies the majority vote transformation to the specified model output column per id_col.
If model_label_column is provided, the corresponding label or score of the majority vote output will also be added."""

model_output_col: str = "model_output" # Default column name for model outputs
model_label_column: str = None # Column name for model labels or scores corresponding to model outputs
id_col: str = "data_point_id" # Default column name for IDs
majority_vote_col: str = "majority_vote"
majority_vote_col: str = "majority_vote" # Default column name for storing majority vote
majority_label_col: str = "majority_label" # Default column name for storing label corresponding to majority vote output

def transform(self, df: pd.DataFrame, random_state:int=0) -> pd.DataFrame:
"""
@@ -420,12 +422,34 @@ def transform(self, df: pd.DataFrame, random_state:int=0) -> pd.DataFrame:
Returns:
pd.DataFrame: Transformed dataframe with majority vote for each id_col.
"""
# Step 1: Group by 'ID' and calculate the majority vote within each group
df[self.majority_vote_col] = df.groupby(self.id_col)[self.model_output_col].transform(
lambda x: x.dropna().mode().sample(n=1, random_state=random_state).iloc[0] if not x.dropna().mode().empty else pd.NA
)

return df
# # Step 1: Group by 'ID' and calculate the majority vote within each group
# df[self.majority_vote_col] = df.groupby(self.id_col)[self.model_output_col].transform(
# lambda x: x.dropna().mode().sample(n=1, random_state=random_state).iloc[0] if not x.dropna().mode().empty else pd.NA
# )
# return df

result_df = df.groupby(self.id_col).apply(self.majority_vote, self.model_output_col, self.model_label_column, self.majority_vote_col, self.majority_label_col, random_state=random_state)
return result_df

@staticmethod
def majority_vote(group, model_output_col, model_label_col, majority_vote_col, majority_label_col, random_state:int=0):
"""
Calculate majority vote for each group.
Args:
group (pd.DataFrame): Input dataframe containing model_output_col, id_col and label.
model_output_col (str): Model output column name
model_label_col (str): Model label column name
majority_vote_col (str): Majority vote column name
majority_label_col (str): Majority label column name
Returns:
pd.DataFrame: Transformed dataframe with majority vote for each id_col.
"""
x = group[model_output_col]
majority_value = x.dropna().mode().sample(n=1, random_state=random_state).iloc[0] if not x.dropna().mode().empty else pd.NA
group[majority_vote_col] = majority_value
if model_label_col:
group[majority_label_col] = group.loc[group[model_output_col] == majority_value, model_label_col].iloc[0]
return group
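
The idea of carrying a label along with the majority vote can be tried on a toy dataframe. The sketch below is one possible way to do it, not the repository's implementation: the helper name `add_majority_vote`, the `label` column, and the choice to fill groups without valid answers with `pd.NA` are all assumptions.

```python
import pandas as pd


def add_majority_vote(df, id_col="data_point_id", output_col="model_output",
                      label_col="label", random_state=0):
    """Attach the per-id majority answer and the label of one matching row."""
    out = df.copy()
    votes, labels = {}, {}
    for key, group in out.groupby(id_col):
        modes = group[output_col].dropna().mode()
        if modes.empty:
            # No valid answers in this group: leave both columns missing.
            votes[key], labels[key] = pd.NA, pd.NA
            continue
        # Ties between modes are broken by sampling one with a fixed seed.
        winner = modes.sample(n=1, random_state=random_state).iloc[0]
        votes[key] = winner
        labels[key] = group.loc[group[output_col] == winner, label_col].iloc[0]
    out["majority_vote"] = out[id_col].map(votes)
    out["majority_label"] = out[id_col].map(labels)
    return out


df = pd.DataFrame({
    "data_point_id": [1, 1, 1, 2, 2],
    "model_output": ["4", "4", "5", None, None],
    "label": [1, 1, 0, 0, 0],
})
result = add_majority_vote(df)
print(result)
```

Handling the empty-mode case before the label lookup avoids indexing into an empty selection when every output in a group is missing.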

@dataclass
class ExtractUsageTransform:
@@ -470,9 +494,23 @@ def transform(self, df: pd.DataFrame) -> pd.DataFrame:
# if the model is one for which the usage of completion tokens is known, use that corresponding column for the model
# otherwise, use the default "n_output_tokens" which is computed with a universal tokenizer as shown in TokenCounterTransform()
if usage_completion_read_col:
df[self.usage_completion_output_col] = df[self.prepend_completion_read_col + "usage"].apply(lambda x: x[usage_completion_read_col])
df[self.usage_completion_output_col] = df.apply(lambda x: self._extract_usage(x, usage_completion_read_col), axis=1)
elif self.prepend_completion_read_col + "n_output_tokens" in df.columns:
df[self.usage_completion_output_col] = df[self.prepend_completion_read_col + "n_output_tokens"]
else:
df[self.usage_completion_output_col] = np.nan
return df
return df

def _extract_usage(self, row, usage_completion_read_col):
"""
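Extracts the token usage for a given row if its usage entry is non-empty.
Args:
row (pd.Series): A row of the dataframe.
usage_completion_read_col (str): Key of the completion token count inside the usage dict.
Returns:
int: The token usage for the row, or np.nan if the usage entry is missing or empty.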
"""
#if row[self.prepend_completion_read_col + "is_valid"]:
if row[self.prepend_completion_read_col + "usage"] :
return row[self.prepend_completion_read_col + "usage"][usage_completion_read_col]
return np.nan
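
A toy illustration of the row-wise extraction above, written as a standalone sketch. The column name `usage` and the key `completion_tokens` are assumptions chosen for the example; in the transform they come from `prepend_completion_read_col` and `usage_completion_read_col`.

```python
import numpy as np
import pandas as pd


def extract_usage(row, usage_col="usage", key="completion_tokens"):
    """Return the token count from a usage dict, or NaN when usage is missing."""
    usage = row[usage_col]
    if isinstance(usage, dict) and key in usage:
        return usage[key]
    return np.nan


df = pd.DataFrame({"usage": [{"completion_tokens": 12}, None, {}]})
# Rows with a missing or empty usage entry fall back to NaN instead of raising.
df["usage_completion"] = df.apply(extract_usage, axis=1)
print(df["usage_completion"].tolist())
```

Compared with a plain truthiness check, testing for the key as well also covers usage dicts that exist but lack the expected field.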

2 changes: 1 addition & 1 deletion eureka_ml_insights/metrics/ba_calendar_metrics.py
@@ -355,4 +355,4 @@ def compute_constrainedness_programmatic(self, instance):
constrainedness_bucket = round(math.floor(constrainedness_ratio / 0.1) * 0.1, 4)

# Add test result
return {'constrainedness': constrainedness_ratio, 'constrainedness_bucket': constrainedness_bucket}
return {'constrainedness': constrainedness_ratio, 'constrainedness_bucket': constrainedness_bucket}
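
For reference, the bucketing arithmetic in this hunk can be checked in isolation. The function name below is hypothetical; the expression is copied from the line above.

```python
import math


def constrainedness_bucket(ratio: float) -> float:
    """Round a ratio down to a multiple of 0.1, using the expression from the hunk."""
    return round(math.floor(ratio / 0.1) * 0.1, 4)


print(constrainedness_bucket(0.35))
print(constrainedness_bucket(0.3))
```

One thing to be aware of: because of binary floating point, `0.3 / 0.1` evaluates to slightly less than 3, so a ratio of exactly 0.3 lands in the 0.2 bucket. Whether that matters depends on how the ratios are produced.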
9 changes: 6 additions & 3 deletions eureka_ml_insights/models/models.py
@@ -9,7 +9,7 @@

import anthropic
import tiktoken
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.identity import AzureCliCredential, DefaultAzureCredential, get_bearer_token_provider

from eureka_ml_insights.secret_management import get_secret

@@ -256,7 +256,8 @@ def __post_init__(self):
}
except ValueError:
self.bearer_token_provider = get_bearer_token_provider(
DefaultAzureCredential(), self.auth_scope
#DefaultAzureCredential(), self.auth_scope
AzureCliCredential(), self.auth_scope
)
self.headers = {
"Content-Type": "application/json",
@@ -470,7 +471,8 @@ def get_client(self):
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
DefaultAzureCredential(), self.auth_scope
#DefaultAzureCredential(), self.auth_scope
AzureCliCredential(), self.auth_scope
)
return AzureOpenAI(
azure_endpoint=self.url,
@@ -495,6 +497,7 @@ class DirectOpenAIClientMixIn(KeyBasedAuthMixIn):
def get_client(self):
from openai import OpenAI


return OpenAI(
base_url=self.base_url,
api_key=self.api_key,
@@ -0,0 +1,3 @@
You are a genius math graduate student solving math competition problems. Reflect on what the problem is asking, explain your reasoning and potential mistakes when answering.

{{problem}}
@@ -0,0 +1,152 @@
# CONTEXT #
I am a teacher, and I have some high-level math problems. I am tasked with evaluating the correctness of a student's answer.
Below, I am provided with a problem and a reference answer. Additionally, a student's answer is provided. My job is to assess whether the student's answer captures the same meaning as the reference answer, even when expressed with different wording or format.

# OBJECTIVE #
I need you to judge whether the student's answer is correct given the ground truth answer.

Your tasks include:
A. Identify Mathematical or Notational Equivalence: Pay special attention to any LaTeX expressions in both answers. Confirm that the mathematical relationships, variables, and operations conveyed are equivalent.
B. Provide a Justification: Conclude with a brief explanation as to why you believe the student's output is correct or incorrect, highlighting any key differences in meaning or content.

# STYLE #
Teaching report.

# TONE #
Professional, scientific.

# AUDIENCE #
Students. Enable them to better understand whether the answer they produce is correct.

# RESPONSE: MARKDOWN REPORT #
## Student Final Answer
[Extract the student's final answer, which is enclosed in "\\boxed{}".]
## Equivalence Judgement
[Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]
## Justification
[Conclude with a brief explanation as to why you believe the student's answer is correct or incorrect.]


# ATTENTION #
- The reference answer is ALWAYS correct. You should carefully judge whether the student gives the same answer as the reference answer.
- The Equivalence Judgement is only TRUE or FALSE. The answer is FALSE even if the student's final answer is almost correct with minor mistakes.
- The answer is contained within the "boxed" section, so you can focus solely on comparing the content in the student's answer box with the reference answer, without needing to consider the intermediate steps.
- Add "=== report over ===" at the end of the report.

<example math solution>
**Question**:
The numbers $1447$ , $1005$ and $1231$ have something in common: each is a $4$ -digit number beginning with $1$ that has exactly two identical digits. How many such numbers are there?

**Reference Answer**
432

**Student Solution**:
To determine how many 4-digit numbers beginning with 1 have exactly two identical digits, we need to consider the structure of these numbers. A number of this form can be represented as \(1abc\), where \(a\), \(b\), and \(c\) are digits from 0 to 9, and exactly one of the digits \(a\), \(b\), or \(c\) is 1, while the other two are different from each other and different from 1.

We can break down the problem into the following steps:

1. **Choose the digit that is 1**: There are 3 positions (a, b, c) where the digit 1 can be placed. So, there are 3 choices for the position of the digit 1.

2. **Choose the other two digits**: The remaining two digits must be different from each other and different from 1. Therefore, we have 9 choices for the first of these two digits (0-9 except 1) and 8 choices for the second digit (0-9 except 1 and the first digit). This gives us \(9 \times 8 = 72\) choices for the other two digits.

Combining these choices, the total number of 4-digit numbers beginning with 1 that have exactly two identical digits is:
\[3 \times 72 = 216.\]

Thus, the number of such numbers is \(\boxed{216}\).

</example math solution>

## Student Final Answer
216

## Equivalence Judgement
FALSE

## Justification
The student's answer of 216 is incorrect in the context of the problem, which asks for the total count of 4-digit numbers beginning with 1 that have exactly two identical digits. The reference answer is 432.
In the student's solution, they consider only cases where the digit '1' is one of the identical digits. However, the problem also includes the scenario where the identical digits could be different from '1'. Thus, the student's calculation does not account for all valid configurations. The discrepancy in figures indicates that the student's answer does not share the same meaning as the reference answer.

=== report over ===


<example math solution>
**Question**:
Susana and Brenda play a game writing polynomials on the board. Susana starts and they play taking turns.
1) On the preparatory turn (turn 0), Susana chooses a positive integer $n_0$ and writes the polynomial $P_0(x)=n_0$.
2) On turn 1, Brenda chooses a positive integer $n_1$, different from $n_0$, and either writes the polynomial

$$P_1(x)=n_1x+P_0(x) \textup{ or } P_1(x)=n_1x-P_0(x)$$

3) In general, on turn $k$, the respective player chooses an integer $n_k$, different from $n_0, n_1, \ldots, n_{k-1}$, and either writes the polynomial
$$P_k(x)=n_kx^k+P_{k-1}(x) \textup{ or } P_k(x)=n_kx^k-P_{k-1}(x)$$
The first player to write a polynomial with at least one whole number root wins. Find and describe a winning strategy.

**Reference Answer**
\\text{No winning strategy; the game can last indefinitely.}

**Student Solution**:
To determine a winning strategy for Susana, we need to analyze the game and the conditions under which a player can win. The goal is to write a polynomial that has at least one integer root.

Let's start by examining the polynomials that are written on the board after each turn.

1. On turn 0, Susana writes the polynomial \( P_0(x) = n_0 \), where \( n_0 \) is a positive integer. This polynomial is a constant polynomial with no integer roots unless \( n_0 = 0 \), which is not allowed since \( n_0 \) is a positive integer.

2. On turn 1, Brenda chooses a positive integer \( n_1 \) different from \( n_0 \) and writes either \( P_1(x) = n_1 x + P_0(x) = n_1 x + n_0 \) or \( P_1(x) = n_1 x - P_0(x) = n_1 x - n_0 \). In both cases, \( P_1(x) \) is a linear polynomial. A linear polynomial \( ax + b \) has an integer root if and only if \( b/a \) is an integer. Therefore, Brenda can choose \( n_1 \) and \( n_0 \) such that \( n_1 n_0 \) is divisible by \( n_1 \), which is always possible since \( n_1 \) and \( n_0 \) are different positive integers. For example, if \( n_1 = 2 \) and \( n_0 = 2 \), then \( P_1(x) = 2x - 2 \) has the integer root \( x = 1 \). However, since \( n_1 \) must be different from \( n_0 \), we need to consider the next turn.

3. On turn 2, Susana chooses a positive integer \( n_2 \) different from \( n_0 \) and \( n_1 \) and writes either \( P_2(x) = n_2 x^2 + P_1(x) \) or \( P_2(x) = n_2 x^2 - P_1(x) \). To ensure a win, Susana should choose \( n_2 \) such that the resulting polynomial has an integer root. Notice that if \( P_1(x) = n_1 x + n_0 \) or \( P_1(x) = n_1 x - n_0 \), then \( P_2(x) = n_2 x^2 + P_1(x) \) or \( P_2(x) = n_2 x^2 - P_1(x) \) will have an integer root if \( n_2 \) is chosen appropriately. For example, if \( P_1(x) = 2x - 2 \), then \( P_2(x) = n_2 x^2 + 2x - 2 \) or \( P_2(x) = n_2 x^2 - 2x + 2 \) will have an integer root if \( n_2 \) is chosen such that the polynomial can be factored to have an integer root.

By following this strategy, Susana can always choose \( n_2 \) such that the resulting polynomial has an integer root. Therefore, Susana has a winning strategy.

The winning strategy for Susana is to choose \( n_2 \) such that the resulting polynomial has an integer root. Specifically, if \( P_1(x) = n_1 x + n_0 \) or \( P_1(x) = n_1 x - n_0 \), then Susana should choose \( n_2 \) such that \( P_2(x) = n_2 x^2 + P_1(x) \) or \( P_2(x) = n_2 x^2 - P_1(x) \) has an integer root.

The answer is: \(\boxed{\text{Susana has a winning strategy.}}\)

</example math solution>


## Student Final Answer
\text{Susana has a winning strategy.}

## Equivalence Judgement
FALSE

## Justification
The student's answer states that "Susana has a winning strategy," which is not in agreement with the reference answer indicating "No winning strategy; the game can last indefinitely." The essence of the problem is that neither player can guarantee a win because the game can go on indefinitely without either player being able to write a polynomial with an integer root. The student's conclusion contradicts this, suggesting that a specific winning strategy exists for Susana, which is fundamentally incorrect according to the problem's conditions. Thus, the student's answer does not capture the same meaning as the reference answer.


=== report over ===

<example math solution>
**Question**:
Determine if there exists a (three-variable) polynomial $P(x,y,z)$ with integer coefficients satisfying the following property: a positive integer $n$ is [i]not[/i] a perfect square if and only if there is a triple $(x,y,z)$ of positive integers such that $P(x,y,z) = n$.

**Reference Answer**
P(x,y,z) = x^2 + y^2 + z^2 + 2xyz

**Student Solution**:
**Claim:** There exists a three-variable polynomial $P(x, y

</example math solution>

## Student Final Answer
None

## Equivalence Judgement
FALSE

## Justification
The student's answer does not provide a final conclusion and the final answer is missing, which indicates a lack of conclusion.

=== report over ===

<math solution>
**Question**:
{{problem}}

**Reference Answer**
{{answer}}

**Student Solution**:

{{model_output}}
</math solution>
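
The final block of the prompt is filled per data point through the `{{problem}}`, `{{answer}}` and `{{model_output}}` placeholders. The sketch below shows the substitution with plain string replacement; the helper `fill_template` and the sample values are hypothetical, and the pipeline may well use a template engine instead.

```python
TEMPLATE = (
    "<math solution>\n"
    "**Question**:\n{{problem}}\n\n"
    "**Reference Answer**\n{{answer}}\n\n"
    "**Student Solution**:\n\n{{model_output}}\n"
    "</math solution>"
)


def fill_template(template: str, **fields: str) -> str:
    """Replace each {{name}} placeholder with the matching field value."""
    for name, value in fields.items():
        template = template.replace("{{" + name + "}}", value)
    return template


prompt = fill_template(
    TEMPLATE,
    problem="What is 2 + 2?",
    answer="4",
    model_output="The sum is \\boxed{4}.",
)
print(prompt)
```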