Home

LLM Behaviour Design & Validation Pipeline

(Originally developed and validated as an AI Tutor Testing Pipeline)

A practical toolkit for designing, testing, and validating the behaviour of AI chatbots — without writing code

Welcome!

This repository provides a general-purpose pipeline for shaping, stress-testing, and validating the behaviour of Large Language Model (LLM) chatbots.

It was originally developed to create a safe, academically responsible AI tutor — one that helps students learn without giving away answers — and has been rigorously tested and validated for that purpose. The tools provided help academics, teachers, and course coordinators design AI tutors that support learning without giving away answers to graded questions.

However, the same pipeline can be used far more broadly.

If you can edit a text file, you can use this system.

You define how a chatbot should behave using a system prompt, test that behaviour against carefully chosen inputs, and optionally use another LLM to automatically verify whether the behaviour is being followed — all without writing a single line of code.

What Can This Be Used For?

Primary (Ready-for-Production) Use Case

🎓 AI Tutors that uphold academic integrity

Encourage learning and critical thinking.
Refuse to provide direct answers to graded questions.
Resist jailbreaks, manipulation, and authority appeals.
Maintain a supportive, professional teaching tone.
Adapt easily to specific courses, subjects, or assessment schemes.

This use case has been:

Tested against 140+ adversarial prompts.
Manually verified.
Automatically evaluated using a custom rubric.
Ready for real deployment in academic settings.

Broader Applications

This pipeline is not limited to tutoring. It can be used to:

Shape the tone, boundaries, and behaviour of any LLM chatbot.
Verify safety, refusal behaviour, policy compliance, and more.
Test robustness against adversarial or manipulative prompts.
Compare different system prompt designs.
Validate that an LLM consistently follows your rules.
Automate behaviour testing at scale.

Examples:

A corporate compliance assistant.
A mental-health support bot with strict safety rules.
A customer-support bot that must avoid legal advice.
A study coach that gives guidance but never solutions.

No Coding Required

You do not need programming experience to use this pipeline. All usage assumes:

You have a local development environment (e.g., VS Code).
You can open a folder.
You can edit text files.
You can run provided commands in a terminal.

You are not expected to modify Python code (unless you want to).

How It Works (High-level Workflow)

1. Define Behaviour (Customise System Prompt) → 2. Stress-Test → 3. Verify Behaviour → 4. Iterate → 5. Deploy

The Pipeline

In Plain Language

Describe how your chatbot should behave → Edit system_prompt.txt (or provide your own text file)
Test that behaviour → Provide prompts designed to test, trick, or break it.
Review the responses → Manually (via CSV output), or automatically using an LLM-as-judge.
Refine the behaviour → Adjust the system prompt or evaluation rubric.
Deploy with confidence → Once you're happy with your chatbot's behaviour, use the validated system prompt in your chatbot platform.

What This Is (and Isn’t)

✅ This toolkit does:

Shape chatbot behaviour via system prompts.
Test responses at scale using CSV inputs.
Record refusals, errors, and token usage.
Automatically evaluate behavior using custom rubrics.
Support iterative refinement.

❌ This toolkit does not:

Provide a chatbot UI or frontend.
Replace your LMS or chatbot platform.
Generate direct answers to graded questions (AI Tutor)
Require you to train or fine-tune models.

Think of this as a behaviour design, testing, and validation layer — not the chatbot itself.

Quick Links

Getting Started

Getting Started Guide - Install, setup, and first run for non-technical users
Quick Start Tutorial - Get running in ~5 minutes

📖 Core Documentation

System Prompt Guide - Customising and refining tutor behavior
Batch Processing - Testing your tutor at scale
Automated Evaluation - Scoring responses automatically

Customisation

Customisation Examples - Example behaviour adaptations
Cost & Pricing - Managing API usage

🔧 Support

Troubleshooting - Common issues and solutions
Advanced Topics - Deployment, security testing, results interpretation

Key Features (AI Tutor)

🛡️ Battle-Tested Security

Validated across 140+ adversarial prompts
Zero direct answer leaks on graded questions
Resists jailbreaks, emotional manipulation, and authority appeals

📊 Custom Evaluation (Why This Exists)

Traditional LLM metrics (e.g., BLEU, ROUGE) often reward answer correctness but penalise refusal - inappropriate for assessing pedagogical quality in the context of academic integrity. This pipeline evaluates what actually matters:

Did it refuse when it should (avoid giving away answers to graded questions)?
Pedagogical quality (is the guidance helpful, educational, not evasive?)
Was the tone appropriate?
Did it subtly leak answers?
Does it resist manipulation?

💰 Cost-Effective Testing

Small test suites cost cents.
Full validation runs typically cost a few dollars.
Test 140 prompts for ~$1.50-3.00 (full pipeline)
Budget testing: Use gpt-4o-mini for $0.10-0.30
Production validation: Use gpt-5.2 or gpt-5.1

🔄 Flexible & Customisable

Edit system prompts without coding
Create subject-specific rubrics
Add course-specific red flags
Works with any OpenAI model

What's Included

Core Components

File	Purpose
`system_prompt.txt`	Defines tutor behavior (customisable)
`llm_batch_processor.py`	Tests tutor with multiple prompts
`llm_evaluator.py`	Automatically scores responses using rubric.txt
`rubric.txt`	Evaluation criteria you can customise

Sample Data

File	Description
`prompts.csv`	Example adversarial test prompts
`responses.csv`	Sample tutor responses (GPT-5 and human tested)
`evaluated_responses_gpt5.csv`	Scored results with reasoning

Who Is This For?

The AI Tutor toolkit is designed for:

Educators, university lecturers, and course coordinators
Teaching staff and academic support teams
Academic researchers studying AI in education
Educational technologists
💻 Anyone designing constrained LLM behavior.

Requirements:

A computer
A local coding environment (e.g., VS Code or similar IDE)
An OpenAI API key
A commitment to responsible AI deployment
No coding background required.

Real-World Use Case

Example: Biology Course Coordinator

Dr. Smith teaches BIO301 and wants to deploy an AI tutor that helps students understand cellular processes without giving away exam answers.

Steps:

Edits system_prompt.txt to add specific exam questions as red flags

Creates bio301_test_prompts.csv with variations of actual student questions

Runs batch processor to test tutor responses

Reviews evaluation scores and identifies 2 cases where hints were too strong

Adjusts system prompt to be more cautious with those topics

Re-tests and validates: all responses maintain appropriate boundaries

Deploys system prompt to university's chatbot platform

Result: Students get helpful guidance on concepts while exam integrity is preserved.

Getting Started

Ready to build your AI tutor? Start here:

Support & Contributing

Need Help?

Check Troubleshooting for common issues
Open a GitHub issue for bugs or questions
Review existing issues for solutions

Want to Contribute?

Share your customised system prompts
Submit subject-specific prompt libraries
Improve documentation or add examples
Report bugs or suggest features

License

MIT License - Free to use, modify, and distribute

See full license text in the repository.

Citation

If you use this pipeline in academic work:

@software{ai_tutor_pipeline_2025,
  title={AI Tutor Testing Pipeline},
  author={Thomas Filsell},
  year={2025},
  url={https://github.com/FIls0010/LLM_tutor}
}

Future Work

Web-based UI for non-technical users.
Prompt editing and evaluation dashboards.
Model comparison tooling.
Integration with common chatbot platforms.

Final Note

This project exists to make responsible, constrained, and verifiable AI behavior practical — not theoretical. Whether you’re building an AI tutor, a safety-critical assistant, or a carefully bounded chatbot, this pipeline gives you control, visibility, and confidence.

Ready to get started? → Installation Guide

Questions? → Troubleshooting | Open an Issue

Home

LLM Behaviour Design & Validation Pipeline

Welcome!

What Can This Be Used For?

Primary (Ready-for-Production) Use Case

Broader Applications

No Coding Required

How It Works (High-level Workflow)

The Pipeline

In Plain Language

What This Is (and Isn’t)

✅ This toolkit does:

❌ This toolkit does not:

Quick Links

Getting Started

📖 Core Documentation

Customisation

🔧 Support

Key Features (AI Tutor)

🛡️ Battle-Tested Security

📊 Custom Evaluation (Why This Exists)

💰 Cost-Effective Testing

🔄 Flexible & Customisable

What's Included

Core Components

Sample Data

Who Is This For?

Real-World Use Case

Getting Started

Support & Contributing

Need Help?

Want to Contribute?

License

Citation

Future Work

Final Note

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally