
CLI for judges#26

Merged
freddiev4 merged 7 commits into quotient-ai:main from alewarne:main
May 23, 2025

Conversation

@alewarne
Contributor

While working with the judges library, I implemented a basic CLI that may be useful to others. It supports the common use case of specifying a judge, a model, and a JSON file (or string), where each entry contains "input", "output", and "expected" keys. The script instantiates the selected judge and calls .judge() for each JSON entry, saving the results to a new JSON file.
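The per-entry loop described above could be sketched roughly as follows. This is illustrative only: `run_entries` is a hypothetical helper name, though the `input`/`output`/`expected` keys and the `.judge()` call with a `score` on the result mirror what the PR diff shows.

```python
# Illustrative sketch of the CLI's core loop, based on the PR description.
# "run_entries" is a hypothetical name; the result fields mirror the JSON
# keys plus the judge's score and reasoning.
from typing import Dict, List


def run_entries(judge, entries: List[Dict[str, str]]) -> List[Dict]:
    results = []
    for entry in entries:
        judgment = judge.judge(
            input=entry["input"],
            output=entry["output"],
            expected=entry.get("expected"),
        )
        results.append({
            **entry,
            "judgment": judgment.score,
            "reasoning": judgment.reasoning,
        })
    return results
```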

To make all available judges discoverable via the CLI, any new judge implementation should be registered in `judges/__init__.py`.

Happy to hear feedback or ideas for improvements!

@freddiev4
Member

freddiev4 commented May 21, 2025

@alewarne this is pretty sweet! I've got some thoughts & suggestions for how to improve this -- I think this could be a good start for a data viewer CLI.

Will add my thoughts here later today.

Also quick question: how / where did you come across judges?

Member

@freddiev4 freddiev4 left a comment


Hey @alewarne thanks for making this PR! I think this is a great addition and a starting point for a good CLI data viewer / app.

I have some suggestions and asks below to improve this.

README.md Outdated

## CLI

We provide a command-line interface for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.
Member


Suggested change
We provide a command-line interface for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.
`judges` also provides a command-line interface (CLI) for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.

README.md Outdated
Comment on lines +109 to +111

The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will
either be saved in the output file or printed to std if no output file is specified.
Member


Suggested change
The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will
either be saved in the output file or printed to std if no output file is specified.
The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will be saved to the output file or printed to `stdout` if no output file is specified.

cli.py Outdated
app = typer.Typer()


def parse_json_dict(json_dict: str) -> List[Dict[str, str]]:
Member


What do you think about just swapping this all out with a pydantic.BaseModel? It's already a dependency and will be used more after I wrap up #25.

Then we can just do:

from typing import Optional

from pydantic import BaseModel

class Sample(BaseModel):
    input: str
    output: str
    expected: Optional[str] = None

And validate each row with that or with

class Dataset(BaseModel):
    samples: List[Sample]
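To make the suggestion concrete, here is a sketch of validating a whole JSON payload with those two models (pydantic v2 syntax; the sample rows are made up):

```python
# Sketch of validating CLI input rows with pydantic, per the review
# suggestion. Rows are illustrative, not from the PR's test data.
from typing import List, Optional

from pydantic import BaseModel


class Sample(BaseModel):
    input: str
    output: str
    expected: Optional[str] = None  # "expected" may be omitted per row


class Dataset(BaseModel):
    samples: List[Sample]


rows = [
    {"input": "2+2?", "output": "4", "expected": "4"},
    {"input": "Capital of France?", "output": "Paris"},  # no "expected" key
]
# Malformed rows (e.g. a missing "input") raise pydantic.ValidationError here.
dataset = Dataset(samples=rows)
```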

Contributor Author


I didn't know about pydantic before. That's really handy! 👍

cli.py Outdated
"input": entry["input"],
"output": entry["output"],
"expected": entry["expected"],
"judgement": judgement.score,
Member


Suggested change
"judgement": judgement.score,
"judgment": judgement.score,

Contributor Author


You never stop learning ... 😄

cli.py Outdated
# Process each entry
for i, entry in enumerate(entries):
try:
judgement = judge.judge(
Member


Suggested change
judgement = judge.judge(
judgment = judge.judge(

cli.py Outdated


@app.command()
def main(judge: choices_judges, model_name: str, json_dict: str, out: str = None):
Member


Suggested change
def main(judge: choices_judges, model_name: str, json_dict: str, out: str = None):
def main(judge: choices_judges, model_name: str, json_dict: str, output: str = None):

It'd also be great to have a commonly used shortcut of -o mapping to output
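One way to wire that shortcut (and the matching ones for model and input) with `typer.Option` could look like this. This is a hypothetical sketch, not the PR's final code, and the command body is elided:

```python
# Hypothetical sketch of short flags (-m/-i/-o) via typer.Option.
import typer

app = typer.Typer()


@app.command()
def main(
    judge: str,
    model: str = typer.Option(..., "--model", "-m", help="Model name"),
    input: str = typer.Option(..., "--input", "-i", help="JSON string or file path"),
    output: str = typer.Option(None, "--output", "-o", help="Where to save results"),
):
    ...  # instantiate the judge and run the evaluation loop
```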

Contributor Author


Integrated shortcuts for model, output and input.

README.md Outdated
Comment on lines +84 to +85
- `model_name`: The name of the model to use (e.g., "gpt-4", "<litellm_provider>/<model_name>")
- `json_input`: Either a JSON string or path to a JSON file containing test cases
Member


Rather than having these be positional arguments, I think it'd be easier to know what you're running if they are named options.

Then

judges PollMultihopCorrectness gpt-4 test_cases.json --out results.json

becomes:

judges PollMultihopCorrectness --model gpt-4 --input test_cases.json --output results.json

Member


Using this file would require someone to clone the repo. I think it'd be friendlier to add:

[tool.poetry.scripts]
judges = "judges.cli.entrypoint:app"

to the pyproject.toml, then users can run:

judges PollMultihopCorrectness --model gpt-4 --input test_cases.json --output results.json

Contributor Author


I included it under [project.scripts]; that section is the standard (PEP 621) place to declare console scripts in pyproject.toml.
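For reference, the [project.scripts] form of that entry point (module path taken from the earlier suggestion) would be:

```toml
[project.scripts]
judges = "judges.cli.entrypoint:app"
```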

Member

@freddiev4 freddiev4 left a comment


Awesome! Will merge and just shuffle a few small things in the README after.

@freddiev4 freddiev4 merged commit 2de8b47 into quotient-ai:main May 23, 2025
0 of 3 checks passed