-
Notifications
You must be signed in to change notification settings - Fork 2
Home
A practical toolkit for designing, testing, and validating AI tutor chatbots for academic use
This toolkit helps academics, teachers, and course coordinators design AI tutors that support learning without giving away answers to graded questions.
Modern LLMs are excellent at answering questionsβsometimes too good for academic integrity. This pipeline ensures your AI tutor:
- Encourages critical thinking and learning
- Refuses to provide direct answers to graded questions
- Resists manipulation and jailbreak attempts
- Maintains a helpful, supportive tone
- Can be customised for any subject or course
No programming experience required beyond running a few terminal commands - which this guide provides!
- Train an AI tutor customised to your course and syllabus
- Test tutor behavior against adversarial student prompts designed to trick it
- Automatically evaluate its responses at scale, saving manual evaluation time (optional)
- Deploy a chatbot with confidence that academic integrity is maintained
- A chatbot frontend or user interface
- A replacement for your LMS or existing platforms
- A tool that generates direct answers to assessment questions
Instead, this is a testing and validation pipeline to prepare your AI tutor for production use.
This pipeline can be used to shape and verify chatbot behaviour without the need to know how to code. The bot provided has been designed to behave as a university tutor, upholding academic integrity standards, but these tools can be customised to mould behaviour and tone to fit any desired purpose.
1. Customize System Prompt β 2. Test with Prompts β 3. Evaluate Results β 4. Deploy
-
Define behavior: Edit
system_prompt.txtto specify how your tutor should act -
Stress-test: Run
llm_batch_processor.pywith test prompts (friendly, adversarial, manipulative) -
Evaluate: Use
llm_evaluator.pyto automatically score responses against a custom rubric - Iterate: Refine your system prompt based on results
- Deploy: Use your validated prompt in any chatbot platform
- Getting Started Guide - Install, setup, and first run
- Quick Start Tutorial - Get running in 5 minutes
- System Prompt Guide - Understanding and customizing tutor behavior
- Batch Processing - Testing your tutor at scale
- Automated Evaluation - Scoring responses automatically
- Customization Examples - Subject-specific adaptations
- Cost & Pricing - Budget optimization strategies
- Troubleshooting - Common issues and solutions
- Advanced Topics - Deployment, security testing, results interpretation
- Validated across 140+ adversarial prompts
- Zero direct answer leaks on graded questions
- Resists jailbreaks, emotional manipulation, and authority appeals
Traditional NLP metrics (ROUGE, BLEU) penalize tutors for NOT giving answersβexactly what we don't want!
Our custom rubric evaluates:
- Refusal correctness (does it avoid giving away answers?)
- Pedagogical quality (is the guidance helpful?)
- Manipulation resistance
- Tone and professionalism
- Test 140 prompts for ~$1.50-3.00 (full pipeline)
- Budget testing: Use
gpt-4o-minifor $0.10-0.30 - Production validation: Use
gpt-5.2orgpt-5.1
- Edit system prompts without coding
- Create subject-specific rubrics
- Add course-specific red flags
- Works with any OpenAI model
| File | Purpose |
|---|---|
system_prompt.txt |
Defines tutor behavior (customizable) |
llm_batch_processor.py |
Tests tutor with multiple prompts |
llm_evaluator.py |
Automatically scores responses |
rubric.txt |
Evaluation criteria (customizable) |
| File | Description |
|---|---|
prompts.csv |
Example adversarial test prompts |
responses.csv |
Sample tutor responses (GPT-5 tested) |
evaluated_responses_gpt5.csv |
Scored results with reasoning |
This toolkit is designed for:
- π¨βπ« University lecturers and course coordinators
- π©βπ Teaching staff and TAs
- π¬ Academic researchers studying AI in education
- π» Educational technologists
Anyone with:
- A local coding environment (e.g., VS Code)
- An OpenAI API key
- A commitment to responsible AI deployment
Example: Biology Course Coordinator
Dr. Smith teaches BIO301 and wants to deploy an AI tutor that helps students understand cellular processes without giving away exam answers.
Steps:
- Edits
system_prompt.txtto add specific exam questions as red flags- Creates
bio301_test_prompts.csvwith variations of actual student questions- Runs batch processor to test tutor responses
- Reviews evaluation scores and identifies 2 cases where hints were too strong
- Adjusts system prompt to be more cautious with those topics
- Re-tests and validates: all responses maintain appropriate boundaries
- Deploys system prompt to university's chatbot platform
Result: Students get helpful guidance on concepts while exam integrity is preserved.
Ready to build your AI tutor? Start here:
- Check Troubleshooting for common issues
- Open a GitHub issue for bugs or questions
- Review existing issues for solutions
- Share your customized system prompts (anonymized)
- Submit subject-specific prompt libraries
- Improve documentation or add examples
- Report bugs or suggest features
MIT License - Free to use, modify, and distribute
See full license text in the repository.
If you use this pipeline in academic work:
@software{ai_tutor_pipeline_2025,
title={AI Tutor Testing Pipeline},
author={Thomas John Filsell},
year={2025},
url={https://github.com/[your-repo]}
}This toolkit is intentionally:
- Transparent - All code and prompts are open
- Customizable - Adapt to any subject or institution
- Model-agnostic - Works with any OpenAI model
- Practical - Designed for real classroom use, not theory
Goal: Make safe, effective academic AI tutors practicalβnot theoretical.
Ready to get started? β Installation Guide
Questions? β Troubleshooting | Open an Issue