-
Notifications
You must be signed in to change notification settings - Fork 2
Home
(Originally developed and validated as an AI Tutor Testing Pipeline)
A practical toolkit for designing, testing, and validating the behaviour of AI chatbots — without writing code
This repository provides a general-purpose pipeline for shaping, stress-testing, and validating the behaviour of Large Language Model (LLM) chatbots.
It was originally developed to create a safe, academically responsible AI tutor — one that helps students learn without giving away answers — and has been rigorously tested and validated for that purpose. The tools provided help academics, teachers, and course coordinators design AI tutors that support learning without giving away answers to graded questions.
However, the same pipeline can be used far more broadly.
If you can edit a text file, you can use this system.
You define how a chatbot should behave using a system prompt, test that behaviour against carefully chosen inputs, and optionally use another LLM to automatically verify whether the behaviour is being followed — all without writing a single line of code.
🎓 AI Tutors that uphold academic integrity
- Encourage learning and critical thinking.
- Refuse to provide direct answers to graded questions.
- Resist jailbreaks, manipulation, and authority appeals.
- Maintain a supportive, professional teaching tone.
- Adapt easily to specific courses, subjects, or assessment schemes.
This use case has been:
- Tested against 140+ adversarial prompts.
- Manually verified.
- Automatically evaluated using a custom rubric.
- Ready for real deployment in academic settings.
This pipeline is not limited to tutoring. It can be used to:
- Shape the tone, boundaries, and behaviour of any LLM chatbot.
- Verify safety, refusal behaviour, policy compliance, and more.
- Test robustness against adversarial or manipulative prompts.
- Compare different system prompt designs.
- Validate that an LLM consistently follows your rules.
- Automate behaviour testing at scale.
Examples:
- A corporate compliance assistant.
- A mental-health support bot with strict safety rules.
- A customer-support bot that must avoid legal advice.
- A study coach that gives guidance but never solutions.
You do not need programming experience to use this pipeline. All usage assumes:
- You have a local development environment (e.g., VS Code).
- You can open a folder.
- You can edit text files.
- You can run provided commands in a terminal.
You are not expected to modify Python code (unless you want to).
1. Define Behaviour (Customise System Prompt) → 2. Stress-Test → 3. Verify Behaviour → 4. Iterate → 5. Deploy
-
Describe how your chatbot should behave → Edit
system_prompt.txt(or provide your own text file) - Test that behaviour → Provide prompts designed to test, trick, or break it.
- Review the responses → Manually (via CSV output), or automatically using an LLM-as-judge.
- Refine the behaviour → Adjust the system prompt or evaluation rubric.
- Deploy with confidence → Once you're happy with your chatbot's behaviour, use the validated system prompt in your chatbot platform.
- Shape chatbot behaviour via system prompts.
- Test responses at scale using CSV inputs.
- Record refusals, errors, and token usage.
- Automatically evaluate behavior using custom rubrics.
- Support iterative refinement.
- Provide a chatbot UI or frontend.
- Replace your LMS or chatbot platform.
- Generate direct answers to graded questions (AI Tutor)
- Require you to train or fine-tune models.
Think of this as a behaviour design, testing, and validation layer — not the chatbot itself.
- Getting Started Guide - Install, setup, and first run for non-technical users
- Quick Start Tutorial - Get running in ~5 minutes
- System Prompt Guide - Customising and refining tutor behavior
- Batch Processing - Testing your tutor at scale
- Automated Evaluation - Scoring responses automatically
- Customisation Examples - Example behaviour adaptations
- Cost & Pricing - Managing API usage
- Troubleshooting - Common issues and solutions
- Advanced Topics - Deployment, security testing, results interpretation
- Validated across 140+ adversarial prompts
- Zero direct answer leaks on graded questions
- Resists jailbreaks, emotional manipulation, and authority appeals
Traditional LLM metrics (e.g., BLEU, ROUGE) often reward answer correctness but penalise refusal - inappropriate for assessing pedagogical quality in the context of academic integrity. This pipeline evaluates what actually matters:
- Did it refuse when it should (avoid giving away answers to graded questions)?
- Pedagogical quality (is the guidance helpful, educational, not evasive?)
- Was the tone appropriate?
- Did it subtly leak answers?
- Does it resist manipulation?
- Small test suites cost cents.
- Full validation runs typically cost a few dollars.
- Test 140 prompts for ~$1.50-3.00 (full pipeline)
- Budget testing: Use
gpt-4o-minifor $0.10-0.30 - Production validation: Use
gpt-5.2orgpt-5.1
- Edit system prompts without coding
- Create subject-specific rubrics
- Add course-specific red flags
- Works with any OpenAI model
| File | Purpose |
|---|---|
system_prompt.txt |
Defines tutor behavior (customisable) |
llm_batch_processor.py |
Tests tutor with multiple prompts |
llm_evaluator.py |
Automatically scores responses using rubric.txt |
rubric.txt |
Evaluation criteria you can customise |
| File | Description |
|---|---|
prompts.csv |
Example adversarial test prompts |
responses.csv |
Sample tutor responses (GPT-5 and human tested) |
evaluated_responses_gpt5.csv |
Scored results with reasoning |
The AI Tutor toolkit is designed for:
- Educators, university lecturers, and course coordinators
- Teaching staff and academic support teams
- Academic researchers studying AI in education
- Educational technologists
- 💻 Anyone designing constrained LLM behavior.
Requirements:
- A computer
- A local coding environment (e.g., VS Code or similar IDE)
- An OpenAI API key
- A commitment to responsible AI deployment
- No coding background required.
Example: Biology Course Coordinator
Dr. Smith teaches BIO301 and wants to deploy an AI tutor that helps students understand cellular processes without giving away exam answers.
Steps:
- Edits
system_prompt.txtto add specific exam questions as red flags- Creates
bio301_test_prompts.csvwith variations of actual student questions- Runs batch processor to test tutor responses
- Reviews evaluation scores and identifies 2 cases where hints were too strong
- Adjusts system prompt to be more cautious with those topics
- Re-tests and validates: all responses maintain appropriate boundaries
- Deploys system prompt to university's chatbot platform
Result: Students get helpful guidance on concepts while exam integrity is preserved.
Ready to build your AI tutor? Start here:
- Check Troubleshooting for common issues
- Open a GitHub issue for bugs or questions
- Review existing issues for solutions
- Share your customised system prompts
- Submit subject-specific prompt libraries
- Improve documentation or add examples
- Report bugs or suggest features
MIT License - Free to use, modify, and distribute
See full license text in the repository.
If you use this pipeline in academic work:
@software{ai_tutor_pipeline_2025,
title={AI Tutor Testing Pipeline},
author={Thomas Filsell},
year={2025},
url={https://github.com/FIls0010/LLM_tutor}
}- Web-based UI for non-technical users.
- Prompt editing and evaluation dashboards.
- Model comparison tooling.
- Integration with common chatbot platforms.
This project exists to make responsible, constrained, and verifiable AI behavior practical — not theoretical. Whether you’re building an AI tutor, a safety-critical assistant, or a carefully bounded chatbot, this pipeline gives you control, visibility, and confidence.
Ready to get started? → Installation Guide
Questions? → Troubleshooting | Open an Issue