This application is built using the nyx_client SDK to compare the performance of different Language Models (LLMs) on a set of predefined questions. It evaluates OpenAI and Cohere models by:
- Asking a set of predefined questions
- Measuring response time
- Judging the accuracy and source usage of the responses
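The core measurement pattern is simple: time each model's answer, then ask a judge model to score it against the evaluation criteria. The sketch below illustrates that flow under stated assumptions; `ask_model`, `judge_model`, and `judge_prompt` are illustrative placeholders, not the script's actual API.

```python
import time


def time_answer(ask_model, question: str) -> tuple[str, float]:
    """Call an LLM via a caller-supplied function and measure wall-clock time."""
    start = time.perf_counter()
    answer = ask_model(question)
    return answer, time.perf_counter() - start


def judge_answer(judge_model, judge_prompt: str, question: str, answer: str) -> str:
    """Ask a judge model to rate the accuracy and source usage of an answer."""
    # The judge receives the evaluation criteria plus the question/answer pair.
    return judge_model(f"{judge_prompt}\n\nQuestion: {question}\nAnswer: {answer}")
```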
To run the evaluation:

- Install the required packages: `pip install -r examples/requirements.txt`
- Set up your API keys for OpenAI and Cohere in your environment variables or through the nyx_client configuration (see the sketch after this list).
- Run the script: `python examples/advanced/evaluate/evaluate.py`
- The script will query the different LLMs with the predefined questions and evaluate their responses.
- Results are saved to `llm_comparison_results.json`.
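As a quick pre-flight check, you can confirm the API keys are visible to the script before running it. The variable names below (`OPENAI_API_KEY`, `CO_API_KEY`) are the conventional ones for the OpenAI and Cohere SDKs; the exact names this example reads are an assumption, so adjust as needed.

```python
import os
import sys

# Assumed environment variable names; adjust to match your setup.
REQUIRED_KEYS = ("OPENAI_API_KEY", "CO_API_KEY")

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    sys.exit(f"Missing API keys: {', '.join(missing)}")
print("All API keys found, ready to run the evaluation.")
```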
- Modify the `input_prompts` in `config.py` to change the evaluation questions.
- Adjust the `clients` tuple in `config.py` to include or exclude specific LLM models.
- Update the `JUDGE_PROMPT` in `config.py` to refine the evaluation criteria.
Larger models (gpt-4o and command-r-plus) are commented out in the `clients` tuple to manage costs. Uncomment them in `config.py` to include them in the evaluation.
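For orientation, a `config.py` along these lines would match the options described above. The structure is a sketch: only the names `input_prompts`, `clients`, and `JUDGE_PROMPT` come from this example, the question strings are placeholders, and the client entries are shown as plain model-name strings rather than whatever client objects the script actually constructs.

```python
# Sketch of the tunable values in config.py (structure assumed, names from the README).

# Questions the script asks each model (placeholders shown here).
input_prompts = [
    "First evaluation question goes here",
    "Second evaluation question goes here",
]

# Models to evaluate; larger models are commented out to manage costs.
clients = (
    "gpt-4o-mini",
    "command-r",
    # "gpt-4o",
    # "command-r-plus",
)

# Criteria the judge model uses to score each answer (wording illustrative).
JUDGE_PROMPT = (
    "Score the answer from 0 to 10 for factual accuracy and note whether it "
    "cites the provided sources. Return the score and a short justification."
)
```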
The script generates a JSON file with detailed results for each LLM, including:
- Accuracy scores
- Source usage
- Execution time
- Input questions and outputs
This data can be used for further analysis and comparison of LLM performance.
Optionally, it can be uploaded to https://endearing-twilight-588648.netlify.app/ to explore the results.
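If you would rather analyse the output locally, the results file can be loaded with the standard library. The field names used below (`model`, `accuracy`, `execution_time`) are assumptions for illustration; inspect the generated `llm_comparison_results.json` for the actual schema.

```python
import json

# Load the results produced by evaluate.py.
with open("llm_comparison_results.json") as f:
    results = json.load(f)

# The file may be a list of records or a dict keyed by model; handle both.
records = results.values() if isinstance(results, dict) else results

# Field names are assumed for illustration; check the file for the real keys.
for entry in records:
    model = entry.get("model", "unknown")
    accuracy = entry.get("accuracy")
    elapsed = entry.get("execution_time")
    print(f"{model}: accuracy={accuracy}, time={elapsed}s")
```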