Home

AI Tutor Testing Pipeline

A practical toolkit for designing, testing, and validating AI tutor chatbots for academic use

Welcome!

This toolkit helps academics, teachers, and course coordinators design AI tutors that support learning without giving away answers to graded questions.

Modern LLMs are excellent at answering questions—sometimes too good for academic integrity. This pipeline ensures your AI tutor:

Encourages critical thinking and learning
Refuses to provide direct answers to graded questions
Resists manipulation and jailbreak attempts
Maintains a helpful, supportive tone
Can be customised for any subject or course

No programming experience required beyond running a few terminal commands - which this guide provides!

What's This For?

✅ This toolkit helps you:

Train an AI tutor customised to your course and syllabus
Test tutor behavior against adversarial student prompts designed to trick it
Automatically evaluate its responses at scale, saving manual evaluation time (optional)
Deploy a chatbot with confidence that academic integrity is maintained

❌ This is NOT:

A chatbot frontend or user interface
A replacement for your LMS or existing platforms
A tool that generates direct answers to assessment questions

Instead, this is a testing and validation pipeline to prepare your AI tutor for production use.

More Broadly

This pipeline can be used to shape and verify chatbot behaviour without the need to know how to code. The bot provided has been designed to behave as a university tutor, upholding academic integrity standards, but these tools can be customised to mould behaviour and tone to fit any desired purpose.

How It Works

1. Customize System Prompt → 2. Test with Prompts → 3. Evaluate Results → 4. Deploy

The Pipeline

Define behavior: Edit system_prompt.txt to specify how your tutor should act
Stress-test: Run llm_batch_processor.py with test prompts (friendly, adversarial, manipulative)
Evaluate: Use llm_evaluator.py to automatically score responses against a custom rubric
Iterate: Refine your system prompt based on results
Deploy: Use your validated prompt in any chatbot platform

Quick Links

🚀 Getting Started

Getting Started Guide - Install, setup, and first run
Quick Start Tutorial - Get running in 5 minutes

📖 Core Documentation

System Prompt Guide - Understanding and customizing tutor behavior
Batch Processing - Testing your tutor at scale
Automated Evaluation - Scoring responses automatically

🎓 Customization

Customization Examples - Subject-specific adaptations
Cost & Pricing - Budget optimization strategies

🔧 Support

Troubleshooting - Common issues and solutions
Advanced Topics - Deployment, security testing, results interpretation

Key Features

🛡️ Battle-Tested Security

Validated across 140+ adversarial prompts
Zero direct answer leaks on graded questions
Resists jailbreaks, emotional manipulation, and authority appeals

📊 Custom Evaluation Metrics

Traditional NLP metrics (ROUGE, BLEU) penalize tutors for NOT giving answers—exactly what we don't want!

Our custom rubric evaluates:

Refusal correctness (does it avoid giving away answers?)
Pedagogical quality (is the guidance helpful?)
Manipulation resistance
Tone and professionalism

💰 Cost-Effective Testing

Test 140 prompts for ~$1.50-3.00 (full pipeline)
Budget testing: Use gpt-4o-mini for $0.10-0.30
Production validation: Use gpt-5.2 or gpt-5.1

🔄 Flexible & Customizable

Edit system prompts without coding
Create subject-specific rubrics
Add course-specific red flags
Works with any OpenAI model

What's Included

Core Components

File	Purpose
`system_prompt.txt`	Defines tutor behavior (customizable)
`llm_batch_processor.py`	Tests tutor with multiple prompts
`llm_evaluator.py`	Automatically scores responses
`rubric.txt`	Evaluation criteria (customizable)

Sample Data

File	Description
`prompts.csv`	Example adversarial test prompts
`responses.csv`	Sample tutor responses (GPT-5 tested)
`evaluated_responses_gpt5.csv`	Scored results with reasoning

Who Is This For?

This toolkit is designed for:

👨‍🏫 University lecturers and course coordinators
👩‍🎓 Teaching staff and TAs
🔬 Academic researchers studying AI in education
💻 Educational technologists

Anyone with:

A local coding environment (e.g., VS Code)
An OpenAI API key
A commitment to responsible AI deployment

Real-World Use Case

Example: Biology Course Coordinator

Dr. Smith teaches BIO301 and wants to deploy an AI tutor that helps students understand cellular processes without giving away exam answers.

Steps:

Edits system_prompt.txt to add specific exam questions as red flags

Creates bio301_test_prompts.csv with variations of actual student questions

Runs batch processor to test tutor responses

Reviews evaluation scores and identifies 2 cases where hints were too strong

Adjusts system prompt to be more cautious with those topics

Re-tests and validates: all responses maintain appropriate boundaries

Deploys system prompt to university's chatbot platform

Result: Students get helpful guidance on concepts while exam integrity is preserved.

Getting Started

Ready to build your AI tutor? Start here:

Support & Contributing

Need Help?

Check Troubleshooting for common issues
Open a GitHub issue for bugs or questions
Review existing issues for solutions

Want to Contribute?

Share your customized system prompts (anonymized)
Submit subject-specific prompt libraries
Improve documentation or add examples
Report bugs or suggest features

License

MIT License - Free to use, modify, and distribute

See full license text in the repository.

Citation

If you use this pipeline in academic work:

@software{ai_tutor_pipeline_2025,
  title={AI Tutor Testing Pipeline},
  author={Thomas John Filsell},
  year={2025},
  url={https://github.com/[your-repo]}
}

Final Note

This toolkit is intentionally:

Transparent - All code and prompts are open
Customizable - Adapt to any subject or institution
Model-agnostic - Works with any OpenAI model
Practical - Designed for real classroom use, not theory

Goal: Make safe, effective academic AI tutors practical—not theoretical.

Ready to get started? → Installation Guide

Questions? → Troubleshooting | Open an Issue

Home

AI Tutor Testing Pipeline

Welcome!

What's This For?

✅ This toolkit helps you:

❌ This is NOT:

More Broadly

How It Works

The Pipeline

Quick Links

🚀 Getting Started

📖 Core Documentation

🎓 Customization

🔧 Support

Key Features

🛡️ Battle-Tested Security

📊 Custom Evaluation Metrics

💰 Cost-Effective Testing

🔄 Flexible & Customizable

What's Included

Core Components

Sample Data

Who Is This For?

Real-World Use Case

Getting Started

Support & Contributing

Need Help?

Want to Contribute?

License

Citation

Final Note

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally