Skip to content
Fils0010 edited this page Dec 27, 2025 · 12 revisions

AI Tutor Testing Pipeline

A practical toolkit for designing, testing, and validating AI tutor chatbots for academic use


Welcome!

This toolkit helps academics, teachers, and course coordinators design AI tutors that support learning without giving away answers to graded questions.

Modern LLMs are excellent at answering questionsβ€”sometimes too good for academic integrity. This pipeline ensures your AI tutor:

  • Encourages critical thinking and learning
  • Refuses to provide direct answers to graded questions
  • Resists manipulation and jailbreak attempts
  • Maintains a helpful, supportive tone
  • Can be customised for any subject or course

No programming experience required beyond running a few terminal commands - which this guide provides!


What's This For?

βœ… This toolkit helps you:

  • Train an AI tutor customised to your course and syllabus
  • Test tutor behavior against adversarial student prompts designed to trick it
  • Automatically evaluate its responses at scale, saving manual evaluation time (optional)
  • Deploy a chatbot with confidence that academic integrity is maintained

❌ This is NOT:

  • A chatbot frontend or user interface
  • A replacement for your LMS or existing platforms
  • A tool that generates direct answers to assessment questions

Instead, this is a testing and validation pipeline to prepare your AI tutor for production use.

More Broadly

This pipeline can be used to shape and verify chatbot behaviour without the need to know how to code. The bot provided has been designed to behave as a university tutor, upholding academic integrity standards, but these tools can be customised to mould behaviour and tone to fit any desired purpose.


How It Works

1. Customize System Prompt β†’ 2. Test with Prompts β†’ 3. Evaluate Results β†’ 4. Deploy

The Pipeline

  1. Define behavior: Edit system_prompt.txt to specify how your tutor should act
  2. Stress-test: Run llm_batch_processor.py with test prompts (friendly, adversarial, manipulative)
  3. Evaluate: Use llm_evaluator.py to automatically score responses against a custom rubric
  4. Iterate: Refine your system prompt based on results
  5. Deploy: Use your validated prompt in any chatbot platform

Quick Links

πŸš€ Getting Started

πŸ“– Core Documentation

πŸŽ“ Customization

πŸ”§ Support


Key Features

πŸ›‘οΈ Battle-Tested Security

  • Validated across 140+ adversarial prompts
  • Zero direct answer leaks on graded questions
  • Resists jailbreaks, emotional manipulation, and authority appeals

πŸ“Š Custom Evaluation Metrics

Traditional NLP metrics (ROUGE, BLEU) penalize tutors for NOT giving answersβ€”exactly what we don't want!

Our custom rubric evaluates:

  • Refusal correctness (does it avoid giving away answers?)
  • Pedagogical quality (is the guidance helpful?)
  • Manipulation resistance
  • Tone and professionalism

πŸ’° Cost-Effective Testing

  • Test 140 prompts for ~$1.50-3.00 (full pipeline)
  • Budget testing: Use gpt-4o-mini for $0.10-0.30
  • Production validation: Use gpt-5.2 or gpt-5.1

πŸ”„ Flexible & Customizable

  • Edit system prompts without coding
  • Create subject-specific rubrics
  • Add course-specific red flags
  • Works with any OpenAI model

What's Included

Core Components

File Purpose
system_prompt.txt Defines tutor behavior (customizable)
llm_batch_processor.py Tests tutor with multiple prompts
llm_evaluator.py Automatically scores responses
rubric.txt Evaluation criteria (customizable)

Sample Data

File Description
prompts.csv Example adversarial test prompts
responses.csv Sample tutor responses (GPT-5 tested)
evaluated_responses_gpt5.csv Scored results with reasoning

Who Is This For?

This toolkit is designed for:

  • πŸ‘¨β€πŸ« University lecturers and course coordinators
  • πŸ‘©β€πŸŽ“ Teaching staff and TAs
  • πŸ”¬ Academic researchers studying AI in education
  • πŸ’» Educational technologists

Anyone with:

  • A local coding environment (e.g., VS Code)
  • An OpenAI API key
  • A commitment to responsible AI deployment

Real-World Use Case

Example: Biology Course Coordinator

Dr. Smith teaches BIO301 and wants to deploy an AI tutor that helps students understand cellular processes without giving away exam answers.

Steps:

  1. Edits system_prompt.txt to add specific exam questions as red flags
  2. Creates bio301_test_prompts.csv with variations of actual student questions
  3. Runs batch processor to test tutor responses
  4. Reviews evaluation scores and identifies 2 cases where hints were too strong
  5. Adjusts system prompt to be more cautious with those topics
  6. Re-tests and validates: all responses maintain appropriate boundaries
  7. Deploys system prompt to university's chatbot platform

Result: Students get helpful guidance on concepts while exam integrity is preserved.


Getting Started

Ready to build your AI tutor? Start here:

  1. Install and Setup β†’
  2. Understand the System Prompt β†’
  3. Run Your First Test β†’

Support & Contributing

Need Help?

  • Check Troubleshooting for common issues
  • Open a GitHub issue for bugs or questions
  • Review existing issues for solutions

Want to Contribute?

  • Share your customized system prompts (anonymized)
  • Submit subject-specific prompt libraries
  • Improve documentation or add examples
  • Report bugs or suggest features

License

MIT License - Free to use, modify, and distribute

See full license text in the repository.


Citation

If you use this pipeline in academic work:

@software{ai_tutor_pipeline_2025,
  title={AI Tutor Testing Pipeline},
  author={Thomas John Filsell},
  year={2025},
  url={https://github.com/[your-repo]}
}

Final Note

This toolkit is intentionally:

  • Transparent - All code and prompts are open
  • Customizable - Adapt to any subject or institution
  • Model-agnostic - Works with any OpenAI model
  • Practical - Designed for real classroom use, not theory

Goal: Make safe, effective academic AI tutors practicalβ€”not theoretical.


Ready to get started? β†’ Installation Guide

Questions? β†’ Troubleshooting | Open an Issue

Clone this wiki locally