
CLI for judges#26

Merged
freddiev4 merged 7 commits into quotient-ai:main from alewarne:main
May 23, 2025

Conversation

@alewarne
Contributor

While working with the judges library, I implemented a basic CLI that may be useful to others. It supports the common use case of specifying a judge, a model, and a JSON file (or string), where each entry contains "input", "output", and "expected" keys. The script instantiates the selected judge and calls .judge() for each JSON entry, saving the results to a new JSON file.
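The per-entry loop described above could be sketched roughly as follows. This is illustrative only: `run_entries` is a hypothetical helper name, though the `input`/`output`/`expected` keys and the `.judge()` call with a `score` on the result mirror what the PR diff shows.

```python
# Illustrative sketch of the CLI's core loop, based on the PR description.
# "run_entries" is a hypothetical name; the result fields mirror the JSON
# keys plus the judge's score and reasoning.
from typing import Dict, List


def run_entries(judge, entries: List[Dict[str, str]]) -> List[Dict]:
    results = []
    for entry in entries:
        judgment = judge.judge(
            input=entry["input"],
            output=entry["output"],
            expected=entry.get("expected"),
        )
        results.append({
            **entry,
            "judgment": judgment.score,
            "reasoning": judgment.reasoning,
        })
    return results
```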

To make all available judges discoverable via the CLI, any new judge implementation should be registered in `judges/__init__.py`.

Happy to hear feedback or ideas for improvements!

@freddiev4
Member

freddiev4 commented May 21, 2025

@alewarne this is pretty sweet! I've got some thoughts & suggestions for how to improve this -- I think this could be a good start for a data viewer CLI.

Will add my thoughts here later today.

Also quick question: how / where did you come across judges?

Member

@freddiev4 freddiev4 left a comment


Hey @alewarne thanks for making this PR! I think this is a great addition and a starting point for a good CLI data viewer / app.

I have some suggestions and asks below to improve this.

README.md Outdated

## CLI

We provide a command-line interface for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.
Member


Suggested change
We provide a command-line interface for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.
`judges` also provides a command-line interface (CLI) for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.

README.md Outdated
Comment on lines +109 to +111

The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will
either be saved in the output file or printed to std if no output file is specified.
Member


Suggested change
The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will
either be saved in the output file or printed to std if no output file is specified.
The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will be saved to the output file or printed to `stdout` if no output file is specified.

cli.py Outdated
app = typer.Typer()


def parse_json_dict(json_dict: str) -> List[Dict[str, str]]:
Member


What do you think about just swapping this all out with a pydantic.BaseModel? It's already a dependency and will be used more after I wrap up #25.

Then we can just do:

from typing import Optional

from pydantic import BaseModel

class Sample(BaseModel):
    input: str
    output: str
    expected: Optional[str] = None

And validate each row with that or with

class Dataset(BaseModel):
    samples: List[Sample]
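To make the suggestion concrete, here is a sketch of validating a whole JSON payload with those two models (pydantic v2 syntax; the sample rows are made up):

```python
# Sketch of validating CLI input rows with pydantic, per the review
# suggestion. Rows are illustrative, not from the PR's test data.
from typing import List, Optional

from pydantic import BaseModel


class Sample(BaseModel):
    input: str
    output: str
    expected: Optional[str] = None  # "expected" may be omitted per row


class Dataset(BaseModel):
    samples: List[Sample]


rows = [
    {"input": "2+2?", "output": "4", "expected": "4"},
    {"input": "Capital of France?", "output": "Paris"},  # no "expected" key
]
# Malformed rows (e.g. a missing "input") raise pydantic.ValidationError here.
dataset = Dataset(samples=rows)
```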

Contributor Author


I didn't know about pydantic before. That's really handy! 👍

cli.py Outdated
"input": entry["input"],
"output": entry["output"],
"expected": entry["expected"],
"judgement": judgement.score,
Member


Suggested change
"judgement": judgement.score,
"judgment": judgement.score,

Contributor Author


You never stop learning ... 😄

cli.py Outdated
# Process each entry
for i, entry in enumerate(entries):
try:
judgement = judge.judge(
Member


Suggested change
judgement = judge.judge(
judgment = judge.judge(

cli.py Outdated


@app.command()
def main(judge: choices_judges, model_name: str, json_dict: str, out: str = None):
Member


Suggested change
def main(judge: choices_judges, model_name: str, json_dict: str, out: str = None):
def main(judge: choices_judges, model_name: str, json_dict: str, output: str = None):

It'd also be great to have a commonly used shortcut of -o mapping to output
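One way to wire that shortcut (and the matching ones for model and input) with `typer.Option` could look like this. This is a hypothetical sketch, not the PR's final code, and the command body is elided:

```python
# Hypothetical sketch of short flags (-m/-i/-o) via typer.Option.
import typer

app = typer.Typer()


@app.command()
def main(
    judge: str,
    model: str = typer.Option(..., "--model", "-m", help="Model name"),
    input: str = typer.Option(..., "--input", "-i", help="JSON string or file path"),
    output: str = typer.Option(None, "--output", "-o", help="Where to save results"),
):
    ...  # instantiate the judge and run the evaluation loop
```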

Contributor Author


Integrated shortcuts for model, output and input.

README.md Outdated
Comment on lines +84 to +85
- `model_name`: The name of the model to use (e.g., "gpt-4", "<litellm_provider>/<model_name>")
- `json_input`: Either a JSON string or path to a JSON file containing test cases
Member


Rather than having these be positional arguments, I think it'd be easier to know what you're running if they are named options.

Then

judges PollMultihopCorrectness gpt-4 test_cases.json --out results.json

becomes:

judges PollMultihopCorrectness --model gpt-4 --input test_cases.json --output results.json

Member


Using this file would require someone to clone the repo. I think it'd be friendlier to add:

[tool.poetry.scripts]
judges = "judges.cli.entrypoint:app"

to the pyproject.toml, then users can run:

judges PollMultihopCorrectness --model gpt-4 --input test_cases.json --output results.json

Contributor Author


I included it under [project.scripts]; that section is the standard (PEP 621) place to declare console scripts in pyproject.toml.
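For reference, the [project.scripts] form of that entry point (module path taken from the earlier suggestion) would be:

```toml
[project.scripts]
judges = "judges.cli.entrypoint:app"
```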

Member

@freddiev4 freddiev4 left a comment


Awesome! Will merge and just shuffle a few small things in the README after.

@freddiev4 freddiev4 merged commit 2de8b47 into quotient-ai:main May 23, 2025
0 of 3 checks passed