Conversation
@alewarne this is pretty sweet! I've got some thoughts & suggestions for how to improve this -- I think this could be a good start for a data viewer CLI. Will add my thoughts here later today. Also quick question: how / where did you come across
README.md (Outdated)

> ## CLI
>
> We provide a command-line interface for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.

Suggested change:

> `judges` also provides a command-line interface (CLI) for evaluating model outputs using various judges. The CLI supports both single and batch evaluations.
README.md (Outdated)

> The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will either be saved in the output file or printed to std if no output file is specified.

Suggested change:

> The CLI will return a JSON object containing the original input, output, expected values, judgment score, and reasoning for each test case. It will be saved to the output file or printed to `stdout` if no output file is specified.
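For illustration, a minimal sketch of that save-or-print behavior. The field names follow the README text above; the `emit` helper and the sample values are hypothetical, not the PR's actual code:

```python
import json
import sys

# One result entry, with the fields named in the README text above.
# The concrete values are made up for illustration.
result = {
    "input": "What is 2 + 2?",
    "output": "4",
    "expected": "4",
    "judgment": 1.0,
    "reasoning": "The output matches the expected answer.",
}

def emit(results, output_path=None):
    """Write results to the output file, or to stdout if no file is given."""
    payload = json.dumps(results, indent=2)
    if output_path:
        with open(output_path, "w") as fh:
            fh.write(payload)
    else:
        sys.stdout.write(payload + "\n")

emit([result])  # no output file given: prints the JSON to stdout
```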
cli.py (Outdated)

```python
app = typer.Typer()

def parse_json_dict(json_dict: str) -> List[Dict[str, str]]:
```

What do you think about just swapping this all out with a `pydantic.BaseModel`? It's already a dependency and will be used more after I wrap up #25. Then we can just do:

```python
from typing import List, Optional

from pydantic import BaseModel

class Sample(BaseModel):
    input: str
    output: str
    expected: Optional[str]
```

and validate each row with that, or with:

```python
class Dataset(BaseModel):
    samples: List[Sample]
```

I didn't know pydantic before. That's really handy! 👍
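A quick sketch of how those models could validate rows (assuming pydantic is installed; the sample data is invented, and `expected` is given a `None` default so the key may be omitted):

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError

class Sample(BaseModel):
    input: str
    output: str
    expected: Optional[str] = None  # default so the key may be omitted

class Dataset(BaseModel):
    samples: List[Sample]

raw = '[{"input": "2+2?", "output": "4", "expected": "4"}]'
dataset = Dataset(samples=json.loads(raw))
print(dataset.samples[0].output)  # "4"

# A malformed row (missing the required "output" key) raises a ValidationError:
try:
    Dataset(samples=[{"input": "2+2?"}])
except ValidationError:
    print("invalid row rejected")
```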
cli.py (Outdated)

```python
"input": entry["input"],
"output": entry["output"],
"expected": entry["expected"],
"judgement": judgement.score,
```

Suggested change:

```python
"judgment": judgement.score,
```

You never stop learning ... 😄
cli.py (Outdated)

```python
# Process each entry
for i, entry in enumerate(entries):
    try:
        judgement = judge.judge(
```

Suggested change:

```python
judgment = judge.judge(
```
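The batch loop visible in the diff could be sketched like this. `ExactMatchJudge` and the `Judgment` dataclass are stand-ins for illustration, not the library's real classes, since the full `judge.judge()` signature isn't shown in the diff:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    score: float
    reasoning: str

class ExactMatchJudge:
    """Toy judge for illustration: scores 1.0 on an exact string match."""
    def judge(self, input: str, output: str, expected: str) -> Judgment:
        matched = output == expected
        return Judgment(
            score=1.0 if matched else 0.0,
            reasoning="exact match" if matched else "output differs from expected",
        )

entries = [
    {"input": "2+2?", "output": "4", "expected": "4"},
    {"input": "Capital of France?", "output": "Lyon", "expected": "Paris"},
]

judge = ExactMatchJudge()
results = []
for i, entry in enumerate(entries):
    try:
        judgment = judge.judge(
            input=entry["input"], output=entry["output"], expected=entry["expected"]
        )
        results.append({**entry, "judgment": judgment.score, "reasoning": judgment.reasoning})
    except Exception as exc:  # keep going on a bad entry, record the error instead
        results.append({**entry, "error": str(exc)})

print([r["judgment"] for r in results])  # → [1.0, 0.0]
```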
cli.py (Outdated)

```python
@app.command()
def main(judge: choices_judges, model_name: str, json_dict: str, out: str = None):
```

Suggested change:

```python
def main(judge: choices_judges, model_name: str, json_dict: str, output: str = None):
```

It'd also be great to have the commonly used shortcut `-o` mapping to `--output`.

Integrated shortcuts for model, output, and input.
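With typer (which cli.py already uses), those shortcuts could look like the following. The option names mirror the discussion, but the exact signature here is an assumption, not the merged code:

```python
import typer

app = typer.Typer()

@app.command()
def main(
    judge: str = typer.Argument(..., help="Name of the judge class to use"),
    model: str = typer.Option("gpt-4", "--model", "-m"),
    input: str = typer.Option(None, "--input", "-i"),
    output: str = typer.Option(None, "--output", "-o"),
):
    # Real logic would instantiate and run the judge; here we just echo
    # the parsed options to show the short flags resolving.
    typer.echo(f"judge={judge} model={model} input={input} output={output}")
```

Invoked as, e.g., `judges PollMultihopCorrectness -m gpt-4 -i test_cases.json -o results.json`.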
README.md (Outdated)

> - `model_name`: The name of the model to use (e.g., "gpt-4", "<litellm_provider>/<model_name>")
> - `json_input`: Either a JSON string or path to a JSON file containing test cases

Rather than having these be arguments, I think it'd be easier to know what you're running if they are parameters. Then

```shell
judges PollMultihopCorrectness gpt-4 test_cases.json --out results.json
```

becomes:

```shell
judges PollMultihopCorrectness --model gpt-4 --input test_cases.json --output results.json
```
Using this file would require someone to clone the repo. I think it'd be friendlier to add:

```toml
[tool.poetry.scripts]
judges = "judges.cli.entrypoint:app"
```

to the `pyproject.toml`; then users can run:

```shell
judges PollMultihopCorrectness --model gpt-4 --input test_cases.json --output results.json
```

I included it under `[project.scripts]`, which is the standard entry-point table defined by PEP 621.
freddiev4 left a comment

Awesome! Will merge and just shuffle a few small things in the README after.
While working with the `judges` library, I implemented a basic CLI that may be useful to others. It supports the common use case of specifying a judge, a model, and a JSON file (or string), where each entry contains `"input"`, `"output"`, and `"expected"` keys. The script instantiates the selected judge and calls `.judge()` on each JSON entry, saving the results to a new JSON file.

To make all available judges discoverable via the CLI, any new judge implementation should be registered in `judges/__init__.py`.

Happy to hear feedback or ideas for improvements!