This repository contains tools for evaluating question quality in human-AI interactions, specifically for the CogSci2025 conference presentation.
This project evaluates the quality of questions in conversations based on predefined rubrics. It uses the Anthropic API to assess the effectiveness of these questions according to different contexts and goals, with a focus on cognitive science research.
eval_anth.py: Main evaluation script that processes sample conversations and applies the rubricRubric_GQ.json: Evaluation criteria for good follow-up questionssystem_prompt.txt: System prompt template for Claude_src/: Directory containing sample conversation data_output/: Directory where evaluation results are stored.env: Configuration for API keys (not included in repository)
To run this evaluation tool, install the required dependencies:
pip install -r requirements.txt
- Clone this repository
- Install dependencies using the command above
- Create a
.envfile in the root directory with your Anthropic API key:Note: TheANTHROPIC_API_KEY=your_api_key_here
.envfile is included in.gitignoreand will not be uploaded to the repository for security reasons.
Run the evaluation script with:
python eval_anth.pyThe script will:
- Load sample conversations from the specified input file
- Apply the evaluation rubric with the configured variables
- Generate evaluations using Claude
- Save results to the
_outputdirectory with a timestamp
You can modify the following variables in eval_anth.py:
RUBRIC_VARIABLES: Customize context variables like "answerer" and "goal"MODEL_NAME: Change the Claude model versionMODEL_TEMPERATURE: Adjust the randomness of Claude's responsesMAX_TOKENS: Set the maximum token length for responses