validate usage columns for extract usage #110

Merged
merged 16 commits into from
Mar 16, 2025
14 changes: 13 additions & 1 deletion eureka_ml_insights/data_utils/transform.py
@@ -464,9 +464,21 @@ def transform(self, df: pd.DataFrame) -> pd.DataFrame:
         # if the model is one for which the usage of completion tokens is known, use that corresponding column for the model
         # otherwise, use the default "n_output_tokens" which is computed with a universal tokenizer as shown in TokenCounterTransform()
         if usage_completion_read_col:
-            df[self.usage_completion_output_col] = df[self.prepend_completion_read_col + "usage"].apply(lambda x: x[usage_completion_read_col])
+            df[self.usage_completion_output_col] = df.apply(lambda x: self._extract_usage(x, usage_completion_read_col), axis=1)
         elif self.prepend_completion_read_col + "n_output_tokens" in df.columns:
             df[self.usage_completion_output_col] = df[self.prepend_completion_read_col + "n_output_tokens"]
         else:
             df[self.usage_completion_output_col] = np.nan
         return df
+
+    def _extract_usage(self, row, usage_completion_read_col):
+        """
+        Extracts the token usage for a given row if is_valid is True.
+        Args:
+            row (pd.Series): A row of the dataframe.
+        Returns:
+            int: The token usage for the row.
+        """
+        if row[self.prepend_completion_read_col + "is_valid"]:
+            return row[self.prepend_completion_read_col + "usage"][usage_completion_read_col]
+        return np.nan
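
To illustrate why the row-wise check matters, here is a minimal standalone sketch, not the repository's code. The column prefix `model_output_`, the usage key `completion_tokens`, and the assumption that an invalid row stores `None` in its usage column are all hypothetical and chosen only for illustration. Under those assumptions, indexing every usage entry directly raises on the invalid row, while the guarded extraction returns NaN for it.

```python
import numpy as np
import pandas as pd

# Hypothetical names for illustration only; the real transform derives them
# from self.prepend_completion_read_col and a model-specific usage key.
PREFIX = "model_output_"
USAGE_KEY = "completion_tokens"

df = pd.DataFrame(
    {
        PREFIX + "is_valid": [True, False, True],
        PREFIX + "usage": [
            {USAGE_KEY: 12},
            None,  # assumed shape of a failed call: no usage dict recorded
            {USAGE_KEY: 30},
        ],
    }
)


def extract_usage(row: pd.Series, usage_key: str):
    """Return the token count for valid rows and NaN otherwise."""
    if row[PREFIX + "is_valid"]:
        return row[PREFIX + "usage"][usage_key]
    return np.nan


# Pre-change approach: subscripting every usage entry fails on the None row.
try:
    df[PREFIX + "usage"].apply(lambda x: x[USAGE_KEY])
    old_approach_failed = False
except TypeError:
    old_approach_failed = True

# Post-change approach: row-wise extraction guarded by the validity flag.
usage = df.apply(lambda row: extract_usage(row, USAGE_KEY), axis=1)
```

Because the result contains NaN, pandas stores the extracted column as floats, so downstream code that expects integer token counts may need to handle the missing values explicitly.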