Skip to content
Fils0010 edited this page Dec 28, 2025 · 12 revisions

LLM Behaviour Design & Validation Pipeline

(Originally developed and validated as an AI Tutor Testing Pipeline)

A practical toolkit for designing, testing, and validating the behaviour of AI chatbots — without writing code


Welcome!

This repository provides a general-purpose pipeline for shaping, stress-testing, and validating the behaviour of Large Language Model (LLM) chatbots.

It was originally developed to create a safe, academically responsible AI tutor — one that helps students learn without giving away answers — and has been rigorously tested and validated for that purpose. The tools provided help academics, teachers, and course coordinators design AI tutors that support learning without giving away answers to graded questions.

However, the same pipeline can be used far more broadly.

If you can edit a text file, you can use this system.

You define how a chatbot should behave using a system prompt, test that behaviour against carefully chosen inputs, and optionally use another LLM to automatically verify whether the behaviour is being followed — all without writing a single line of code.


What Can This Be Used For?

Primary (Ready-for-Production) Use Case

🎓 AI Tutors that uphold academic integrity

  • Encourage learning and critical thinking.
  • Refuse to provide direct answers to graded questions.
  • Resist jailbreaks, manipulation, and authority appeals.
  • Maintain a supportive, professional teaching tone.
  • Adapt easily to specific courses, subjects, or assessment schemes.

This use case has been:

  • Tested against 140+ adversarial prompts.
  • Manually verified.
  • Automatically evaluated using a custom rubric.
  • Ready for real deployment in academic settings.

Broader Applications

This pipeline is not limited to tutoring. It can be used to:

  • Shape the tone, boundaries, and behaviour of any LLM chatbot.
  • Verify safety, refusal behaviour, policy compliance, and more.
  • Test robustness against adversarial or manipulative prompts.
  • Compare different system prompt designs.
  • Validate that an LLM consistently follows your rules.
  • Automate behaviour testing at scale.

Examples:

  • A corporate compliance assistant.
  • A mental-health support bot with strict safety rules.
  • A customer-support bot that must avoid legal advice.
  • A study coach that gives guidance but never solutions.

No Coding Required

You do not need programming experience to use this pipeline. All usage assumes:

  1. You have a local development environment (e.g., VS Code).
  2. You can open a folder.
  3. You can edit text files.
  4. You can run provided commands in a terminal.

You are not expected to modify Python code (unless you want to).


How It Works (High-level Workflow)

1. Define Behaviour (Customise System Prompt)2. Stress-Test3. Verify Behaviour4. Iterate5. Deploy

The Pipeline

In Plain Language

  1. Describe how your chatbot should behave → Edit system_prompt.txt (or provide your own text file)
  2. Test that behaviour → Provide prompts designed to test, trick, or break it.
  3. Review the responses → Manually (via CSV output), or automatically using an LLM-as-judge.
  4. Refine the behaviour → Adjust the system prompt or evaluation rubric.
  5. Deploy with confidence → Once you're happy with your chatbot's behaviour, use the validated system prompt in your chatbot platform.

What This Is (and Isn’t)

✅ This toolkit does:

  • Shape chatbot behaviour via system prompts.
  • Test responses at scale using CSV inputs.
  • Record refusals, errors, and token usage.
  • Automatically evaluate behavior using custom rubrics.
  • Support iterative refinement.

❌ This toolkit does not:

  • Provide a chatbot UI or frontend.
  • Replace your LMS or chatbot platform.
  • Generate direct answers to graded questions (AI Tutor)
  • Require you to train or fine-tune models.

Think of this as a behaviour design, testing, and validation layer — not the chatbot itself.


Quick Links

Getting Started

📖 Core Documentation

Customisation

🔧 Support


Key Features (AI Tutor)

🛡️ Battle-Tested Security

  • Validated across 140+ adversarial prompts
  • Zero direct answer leaks on graded questions
  • Resists jailbreaks, emotional manipulation, and authority appeals

📊 Custom Evaluation (Why This Exists)

Traditional LLM metrics (e.g., BLEU, ROUGE) often reward answer correctness but penalise refusal - inappropriate for assessing pedagogical quality in the context of academic integrity. This pipeline evaluates what actually matters:

  • Did it refuse when it should (avoid giving away answers to graded questions)?
  • Pedagogical quality (is the guidance helpful, educational, not evasive?)
  • Was the tone appropriate?
  • Did it subtly leak answers?
  • Does it resist manipulation?

💰 Cost-Effective Testing

  • Small test suites cost cents.
  • Full validation runs typically cost a few dollars.
  • Test 140 prompts for ~$1.50-3.00 (full pipeline)
  • Budget testing: Use gpt-4o-mini for $0.10-0.30
  • Production validation: Use gpt-5.2 or gpt-5.1

🔄 Flexible & Customisable

  • Edit system prompts without coding
  • Create subject-specific rubrics
  • Add course-specific red flags
  • Works with any OpenAI model

What's Included

Core Components

File Purpose
system_prompt.txt Defines tutor behavior (customisable)
llm_batch_processor.py Tests tutor with multiple prompts
llm_evaluator.py Automatically scores responses using rubric.txt
rubric.txt Evaluation criteria you can customise

Sample Data

File Description
prompts.csv Example adversarial test prompts
responses.csv Sample tutor responses (GPT-5 and human tested)
evaluated_responses_gpt5.csv Scored results with reasoning

Who Is This For?

The AI Tutor toolkit is designed for:

  • Educators, university lecturers, and course coordinators
  • Teaching staff and academic support teams
  • Academic researchers studying AI in education
  • Educational technologists
  • 💻 Anyone designing constrained LLM behavior.

Requirements:

  • A computer
  • A local coding environment (e.g., VS Code or similar IDE)
  • An OpenAI API key
  • A commitment to responsible AI deployment
  • No coding background required.

Real-World Use Case

Example: Biology Course Coordinator

Dr. Smith teaches BIO301 and wants to deploy an AI tutor that helps students understand cellular processes without giving away exam answers.

Steps:

  1. Edits system_prompt.txt to add specific exam questions as red flags
  2. Creates bio301_test_prompts.csv with variations of actual student questions
  3. Runs batch processor to test tutor responses
  4. Reviews evaluation scores and identifies 2 cases where hints were too strong
  5. Adjusts system prompt to be more cautious with those topics
  6. Re-tests and validates: all responses maintain appropriate boundaries
  7. Deploys system prompt to university's chatbot platform

Result: Students get helpful guidance on concepts while exam integrity is preserved.


Getting Started

Ready to build your AI tutor? Start here:

  1. Install and Setup →
  2. Understand the System Prompt →
  3. Run Your First Test →

Support & Contributing

Need Help?

  • Check Troubleshooting for common issues
  • Open a GitHub issue for bugs or questions
  • Review existing issues for solutions

Want to Contribute?

  • Share your customised system prompts
  • Submit subject-specific prompt libraries
  • Improve documentation or add examples
  • Report bugs or suggest features

License

MIT License - Free to use, modify, and distribute

See full license text in the repository.


Citation

If you use this pipeline in academic work:

@software{ai_tutor_pipeline_2025,
  title={AI Tutor Testing Pipeline},
  author={Thomas Filsell},
  year={2025},
  url={https://github.com/FIls0010/LLM_tutor}
}

Future Work

  • Web-based UI for non-technical users.
  • Prompt editing and evaluation dashboards.
  • Model comparison tooling.
  • Integration with common chatbot platforms.

Final Note

This project exists to make responsible, constrained, and verifiable AI behavior practical — not theoretical. Whether you’re building an AI tutor, a safety-critical assistant, or a carefully bounded chatbot, this pipeline gives you control, visibility, and confidence.


Ready to get started?Installation Guide

Questions?Troubleshooting | Open an Issue

Clone this wiki locally