Assignment 4: Context-Aware Customer Service Chatbot

Accept this assignment: GitHub Classroom Link

Due: February 6, 2026 at 11:59 PM EST

Click the link above to create your private repository for this assignment. Complete your work in Google Colab, then push your notebook to the repository before the deadline.


Timeline: 1 Week

Overview

In this assignment, you will build a sophisticated, context-aware customer service chatbot that uses modern transformer models and retrieval-augmented generation (RAG) techniques. Unlike traditional rule-based chatbots, your system will leverage semantic understanding to match customer queries with relevant knowledge base entries and generate contextually appropriate responses.

This is a 1-week assignment designed to be achievable with GenAI assistance (ChatGPT, Claude, GitHub Copilot). You can focus on system design, integration, and evaluation rather than getting bogged down in low-level implementation details.

You will implement a complete RAG pipeline that:

  • Uses transformer-based encoders (BERT or similar) to understand customer queries semantically
  • Performs efficient semantic search over a knowledge base using vector similarity
  • Retrieves relevant context to ground responses in factual information
  • Handles multi-turn conversations while maintaining context
  • Compares your approach against simpler keyword-matching baselines

This assignment simulates a real-world application where customers need accurate, context-aware assistance, and where hallucinated or incorrect information could damage user trust.

Learning Objectives

By completing this assignment, you will develop the following skills:

  1. Semantic Understanding: Apply transformer-based models (BERT, Sentence-BERT) to encode text into meaningful vector representations
  2. Information Retrieval: Implement efficient semantic search using vector similarity and libraries like FAISS
  3. Retrieval-Augmented Generation (RAG): Combine retrieval and generation to produce grounded, factual responses
  4. Evaluation Design: Develop metrics to assess chatbot quality, including retrieval accuracy and response relevance
  5. System Architecture: Design and implement a complete end-to-end conversational AI system
  6. Baseline Comparison: Understand the importance of baselines by comparing against keyword-matching approaches
  7. Production Considerations: Handle edge cases, multi-turn conversations, and system scalability

Background

Context-Aware Language Understanding

Traditional customer service systems often rely on keyword matching or simple pattern recognition (like your Assignment 1 ELIZA chatbot). However, customers express the same need in many different ways:

  • "I can't log into my account"
  • "My password isn't working"
  • "I'm having authentication issues"
  • "The login page keeps rejecting my credentials"

Transformer-based models like BERT can understand that these queries are semantically similar, even when they share few keywords. This is achieved through:

  1. Contextualized Embeddings: BERT produces vector representations where semantically similar text has similar vectors
  2. Semantic Similarity: Using cosine similarity or other distance metrics to find relevant knowledge base entries
  3. Dense Retrieval: Unlike sparse keyword methods (TF-IDF, BM25), dense vector representations capture deeper semantic meaning

Retrieval-Augmented Generation (RAG)

RAG systems combine the strengths of retrieval (finding relevant information) and generation (producing natural language). The typical pipeline:

  1. Encode: Convert the user query into a vector representation
  2. Retrieve: Find the most similar entries in your knowledge base
  3. Augment: Include retrieved context with the query
  4. Generate: Produce a response grounded in the retrieved information

This approach helps prevent hallucination and ensures responses are factually grounded in your knowledge base.
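As a concrete illustration, here is the whole pipeline in miniature; `model`, `index`, `knowledge_base`, and `llm` are placeholders for the components you will build in the tasks below:

# The RAG pipeline in miniature; each step is built out in Your Tasks.
# `model`, `index`, `knowledge_base`, and `llm` are placeholders.
def rag_respond(query, model, index, knowledge_base, llm, k=3):
    query_vec = model.encode([query]).astype("float32")  # 1. Encode
    _, idx = index.search(query_vec, k)                  # 2. Retrieve
    context = "\n".join(knowledge_base[i]["answer"] for i in idx[0])  # 3. Augment
    return llm(f"Context:\n{context}\n\nQuestion: {query}")  # 4. Generate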

Key Papers and Concepts

  • BERT (Devlin et al., 2018): Bidirectional encoder representations from transformers
  • Sentence-BERT (Reimers & Gurevych, 2019): Modified BERT for efficient sentence embeddings
  • RAG (Lewis et al., 2020): Retrieval-augmented generation for knowledge-intensive tasks
  • Dense Passage Retrieval (Karpukhin et al., 2020): Using dense representations for passage retrieval

Dataset

You will use a customer service FAQ dataset. We recommend one of the following options:

Option 1: HuggingFace Dataset (Recommended)

Use the customer_support_twitter dataset or similar customer service datasets from HuggingFace:

from datasets import load_dataset

# Load customer support conversations
dataset = load_dataset("salesken/customer_support_twitter")

Alternatively, explore these datasets:

  • salesken/customer_support_twitter: Real customer support conversations
  • banking77: Banking customer service intents
  • SetFit/customer_support: Multi-domain customer support dataset

Option 2: Create Your Own Knowledge Base

You can create a synthetic knowledge base for a specific domain (e-commerce, banking, tech support, etc.):

knowledge_base = [
    {
        "question": "How do I reset my password?",
        "answer": "To reset your password, click 'Forgot Password' on the login page. Enter your email address, and we'll send you a reset link. Follow the link to create a new password.",
        "category": "account_access"
    },
    {
        "question": "What is your return policy?",
        "answer": "We offer a 30-day return policy for most items. Products must be in original condition with tags attached. Refunds are processed within 5-7 business days.",
        "category": "returns"
    },
    # Add 100+ entries for a meaningful knowledge base
]

Option 3: Web Scraping

Scrape FAQs from public customer support pages (ensure compliance with terms of service):

# Example: parse an FAQ page (the URL and CSS selectors are placeholders;
# adapt them to the site you scrape, and check its terms of service)
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/faq")
soup = BeautifulSoup(response.text, "html.parser")
faqs = [{"question": q.get_text(strip=True), "answer": a.get_text(strip=True)}
        for q, a in zip(soup.select(".faq-question"), soup.select(".faq-answer"))]

Requirements:

  • Minimum 100 FAQ entries for meaningful evaluation
  • Diverse topics/categories (at least 5-10 categories)
  • Both simple and complex queries
  • Include some ambiguous questions that require context

Your Tasks

1. Build a Semantic Search System

Implement a semantic search system that can find relevant FAQ entries given a customer query.

Requirements:

a) Encode the Knowledge Base:

  • Use a pre-trained sentence transformer model (e.g., sentence-transformers/all-MiniLM-L6-v2)
  • Generate embeddings for all FAQ questions/answers
  • Store embeddings for efficient retrieval

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode every FAQ question into one embedding matrix
questions = [entry["question"] for entry in knowledge_base]
embeddings = model.encode(questions, convert_to_numpy=True, show_progress_bar=True)

b) Implement Efficient Search:

  • Use FAISS or similar library for fast similarity search
  • Implement both cosine similarity and L2 distance metrics
  • Support retrieving top-k results

import faiss
import numpy as np

# FAISS expects float32 input
embeddings = np.asarray(embeddings, dtype="float32")

# Exact L2 index; for cosine similarity, L2-normalize the vectors
# and use faiss.IndexFlatIP instead (see Tips for Success)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

c) Query Processing:

  • Encode incoming queries using the same model
  • Retrieve top-k most similar FAQ entries
  • Return results with similarity scores (a minimal sketch follows)
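A minimal sketch of the query path, using the `model`, `index`, and `knowledge_base` objects defined above:

# Encode the query and fetch the top-k nearest FAQ entries; with
# IndexFlatL2, smaller distances mean closer matches
def search(query, k=5):
    query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(query_vec, k)
    return [(knowledge_base[i], float(d))
            for i, d in zip(indices[0], distances[0])]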

2. Implement a Baseline (Keyword Matching)

Create a simple baseline using traditional keyword matching to demonstrate the value of semantic search.

Requirements:

  • Implement TF-IDF with cosine similarity OR BM25
  • Use the same knowledge base
  • Compare retrieval quality against your semantic approach

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF baseline: vectorize FAQ questions once, rank by cosine similarity
vectorizer = TfidfVectorizer(stop_words="english")
faq_matrix = vectorizer.fit_transform(questions)
scores = cosine_similarity(vectorizer.transform([query]), faq_matrix)[0]
top_k_idx = scores.argsort()[::-1][:5]  # indices of the 5 best keyword matches

3. Build the Retrieval Mechanism

Develop a complete retrieval system that:

a) Handles Query Variations:

  • Test with paraphrased queries
  • Handle typos and misspellings (optional: add spelling correction)
  • Support queries of varying length

b) Context Filtering:

  • Implement category-based filtering (if applicable)
  • Use confidence thresholds to reject low-quality matches (see the sketch below)
  • Handle "no good match" scenarios gracefully

c) Re-ranking (Advanced):

  • Optionally implement a re-ranking step using cross-encoders
  • This can improve retrieval quality for ambiguous queries (a sketch follows)
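A sketch of the re-ranking step, using the cross-encoder suggested under Recommended Models; `candidates` is assumed to be the (entry, score) list from first-stage retrieval:

# Score each (query, candidate) pair jointly and re-sort the candidates
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, entry["question"]) for entry, _ in candidates]
rerank_scores = reranker.predict(pairs)  # one relevance score per pair
reranked = sorted(zip(candidates, rerank_scores),
                  key=lambda item: item[1], reverse=True)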

4. Generate Contextual Responses

Use the retrieved context to generate helpful responses.

Requirements:

a) Template-Based Generation (Minimum):

  • Use retrieved FAQ answers directly
  • Format responses naturally
  • Include confidence indicators

b) LLM-Based Generation (Recommended):

  • Use a lightweight LLM (via Ollama, HuggingFace, or OpenAI API)
  • Provide retrieved context in the prompt
  • Generate natural, conversational responses

# Example prompt structure
def generate_response(query, retrieved_contexts):
    prompt = f"""You are a helpful customer service assistant.

Customer Question: {query}

Relevant Information:
{retrieved_contexts}

Provide a helpful, accurate response based on the information above.
Do not make up information not present in the context."""

    # call_llm is a placeholder for your client (Ollama, HuggingFace, or OpenAI)
    return call_llm(prompt)

c) Response Quality:

  • Ensure responses are grounded in retrieved context
  • Handle cases where no good match exists
  • Provide clear, actionable information

5. Handle Multi-Turn Conversations

Extend your system to maintain context across multiple conversation turns.

Requirements:

a) Conversation State:

  • Track conversation history
  • Maintain context from previous turns
  • Update query encoding with conversation context (sketched after the example flow below)

b) Context Integration:

  • Combine current query with relevant prior context
  • Implement conversation summarization for long dialogues
  • Handle follow-up questions ("What about shipping?", "How long does that take?")

c) Example Multi-Turn Flow:

User: "I want to return an item"
Bot: "Our return policy allows returns within 30 days..."

User: "How do I start the process?"
Bot: [Uses context that this is about returns]

User: "What about shipping costs?"
Bot: [Understands this relates to return shipping]
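A minimal state-tracking sketch, assuming that simply prepending recent turns to the new query (rather than full summarization) is enough to keep retrieval on topic:

# Prepend recent turns so follow-ups like "What about shipping?"
# retrieve in the context of the ongoing topic
class Conversation:
    def __init__(self, max_turns=3):
        self.history = []
        self.max_turns = max_turns

    def contextualized_query(self, query):
        recent = " ".join(self.history[-self.max_turns:])
        self.history.append(query)
        return f"{recent} {query}".strip()

Encode the contextualized query instead of the raw one when searching the knowledge base.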

6. Evaluate Response Quality

Develop comprehensive evaluation metrics for your system.

Required Metrics:

a) Retrieval Metrics:

  • Precision@k: Of the k retrieved documents, how many are relevant?
  • Recall@k: Of all relevant documents, how many are in top-k?
  • MRR (Mean Reciprocal Rank): Position of first relevant result
  • Create a test set with ground-truth relevant FAQs for queries (a sketch of these metrics follows)
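A minimal sketch of two of these metrics, assuming `retrieved` is a ranked list of FAQ ids for one query and `relevant` is its ground-truth set:

# Per-query Precision@k and reciprocal rank; average reciprocal_rank
# over all test queries to get MRR
def precision_at_k(retrieved, relevant, k):
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def reciprocal_rank(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0  # no relevant document retrieved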

b) Response Quality:

  • Semantic Similarity: Compare generated response to ground truth
  • Factual Grounding: Verify responses don't hallucinate information
  • Human Evaluation: Test on sample queries and manually assess quality

c) Baseline Comparison:

  • Compare semantic search vs. keyword matching on all metrics
  • Use statistical tests to determine significance (see the sketch below)
  • Visualize results with charts/graphs
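For example, a paired test over per-query scores; this sketch assumes `semantic_scores` and `keyword_scores` are equal-length lists of Precision@k values for the same test queries:

# Paired significance test on per-query retrieval scores
from scipy.stats import wilcoxon

stat, p_value = wilcoxon(semantic_scores, keyword_scores)
print(f"Wilcoxon signed-rank test: statistic={stat:.3f}, p={p_value:.4f}")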

d) Error Analysis:

  • Identify failure modes (when does the system fail?)
  • Categorize errors (retrieval failures vs. generation issues)
  • Provide examples of good and bad responses

7. Advanced Features (Optional Bonus)

Implement one or more of these for extra credit:

  • Hybrid Search: Combine semantic and keyword search for better results
  • Cross-Encoder Re-ranking: Use a cross-encoder model to re-rank retrieved results
  • Query Expansion: Expand queries with synonyms or related terms
  • Active Learning: Identify uncertain cases for human review
  • Multimodal Support: Handle queries about images or documents
  • Intent Classification: Classify query intent before retrieval
  • Conversation Analytics: Track common issues and query patterns

Technical Requirements

Required Libraries

# Core ML/NLP
transformers>=4.30.0
sentence-transformers>=2.2.0
torch>=2.0.0

# Vector Search
faiss-cpu>=1.7.4  # or faiss-gpu for GPU support

# Traditional IR (baseline)
scikit-learn>=1.3.0
rank-bm25>=0.2.2

# Data Handling
datasets>=2.14.0
pandas>=2.0.0
numpy>=1.24.0

# Visualization
matplotlib>=3.7.0
seaborn>=0.12.0
plotly>=5.14.0

# Optional: LLM Integration
openai>=0.27.0  # if using OpenAI
# or use Ollama for local LLMs

Recommended Models

Sentence Encoders (choose one or compare multiple):

  • sentence-transformers/all-MiniLM-L6-v2 (fast, good quality)
  • sentence-transformers/all-mpnet-base-v2 (higher quality)
  • BAAI/bge-small-en-v1.5 (strong retrieval quality for its size)
  • intfloat/e5-small-v2 (efficient, strong performance)

Cross-Encoders (for re-ranking):

  • cross-encoder/ms-marco-MiniLM-L-6-v2

Generation Models (optional):

  • Local: Llama 3 8B via Ollama
  • API: OpenAI GPT-3.5-turbo or GPT-4
  • HuggingFace: Flan-T5, GPT-2, or similar

Computational Requirements

  • CPU: Sufficient for most encoder models with FAISS
  • GPU: Recommended for faster embedding generation with larger models
  • Memory: 8GB+ RAM recommended
  • Storage: ~2GB for models and data

The entire assignment should run in Google Colab on a free-tier GPU and is designed to be completable in 1 week with GenAI assistance.

Deliverables

Submit a Google Colaboratory notebook that includes:

1. Code Implementation (60%)

  • Complete, runnable code for all required components
  • Clear code organization with functions/classes
  • Proper error handling and edge case management
  • Comments explaining key design decisions

2. Documentation (20%)

  • Markdown cells explaining your approach for each section
  • System architecture diagram or description
  • Model selection justification
  • Design decisions and trade-offs

3. Evaluation and Analysis (15%)

  • Comprehensive evaluation metrics implementation
  • Baseline comparison with statistical analysis
  • Visualizations (charts, confusion matrices, example outputs)
  • Error analysis with specific examples

4. Examples and Demo (5%)

  • At least 10 example queries with system responses
  • Include both successful and failure cases
  • Demo of multi-turn conversation (at least 3 turns)
  • Comparison of semantic vs. keyword baseline on same queries

Required Sections in Notebook

  1. Introduction: Overview of your system
  2. Data Loading: Load and explore the knowledge base
  3. Semantic Search Implementation: Encoder model and FAISS
  4. Baseline Implementation: TF-IDF or BM25
  5. Response Generation: Template or LLM-based
  6. Multi-Turn Handling: Conversation state management
  7. Evaluation: Metrics, comparisons, and analysis
  8. Examples: Interactive demos
  9. Conclusion: Findings, limitations, future improvements

Evaluation Criteria

Your assignment will be graded on the following criteria:

Technical Implementation (40 points)

  • Semantic Search (15 pts): Correct implementation of encoder + FAISS
  • Baseline (5 pts): Working keyword-matching baseline
  • Response Generation (10 pts): Quality and grounding of responses
  • Multi-Turn (10 pts): Effective conversation context handling

Evaluation and Analysis (25 points)

  • Metrics (10 pts): Proper implementation of retrieval and quality metrics
  • Comparison (10 pts): Thorough baseline comparison with statistics
  • Error Analysis (5 pts): Insightful analysis of failure modes

Code Quality and Documentation (20 points)

  • Code Organization (10 pts): Clean, modular, well-commented code
  • Documentation (10 pts): Clear markdown explanations throughout

Examples and Presentation (10 points)

  • Quality Examples (5 pts): Diverse, illustrative examples
  • Demo (5 pts): Working interactive demonstration

Creativity and Innovation (5 points)

  • Advanced Features (3 pts): Implementation of optional features
  • Novel Insights (2 pts): Unique observations or improvements

Total: 100 points

Grading Rubric

  • A (90-100): Exceptional implementation with advanced features, thorough evaluation, and clear documentation
  • B (80-89): Complete implementation of all required components with good evaluation
  • C (70-79): Working system with basic evaluation and some documentation gaps
  • D (60-69): Partial implementation with significant gaps
  • F (<60): Incomplete or non-functional submission

Tips for Success

Getting Started (1-Week Timeline)

  1. Start Simple: Begin with a small knowledge base (20-30 FAQs) to test your pipeline
  2. Incremental Development: Build and test each component separately before integration
  3. Use Examples: Work through concrete examples at each step
  4. Validate Early: Check that embeddings and retrieval make sense before moving to generation
  5. Leverage GenAI: Use ChatGPT, Claude, or GitHub Copilot to accelerate implementation. Ask for help understanding libraries, debugging errors, and optimizing code.

Common Pitfalls to Avoid

  1. Ignoring Normalization: L2-normalize embeddings before computing cosine similarity
  2. Wrong Distance Metric: FAISS's IndexFlatL2 returns squared L2 distances, not cosine similarities; for cosine ranking, normalize your vectors and use IndexFlatIP (see the sketch after this list)
  3. Memory Issues: Batch embedding generation for large knowledge bases
  4. Overfitting to Examples: Test on diverse, unseen queries
  5. Hallucination: Always ground responses in retrieved context
  6. Ignoring Edge Cases: Handle no-match scenarios gracefully
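A sketch that resolves pitfalls 1 and 2 together, assuming `embeddings` is the float32 matrix built earlier:

# Cosine similarity in FAISS: normalize rows in place, then use an
# inner-product index (inner product of unit vectors == cosine)
import faiss

faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)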

Debugging Strategies

  1. Print Similarities: Inspect actual similarity scores to understand retrieval
  2. Manual Inspection: Look at retrieved documents for sample queries
  3. Embedding Visualization: Use t-SNE/UMAP to visualize embedding space
  4. Start Small: Debug with 10 FAQs before scaling to 100+

Performance Optimization

  1. Cache Embeddings: Don't re-encode the knowledge base on every run (a caching sketch follows this list)
  2. Batch Processing: Encode multiple queries at once
  3. FAISS GPU: Use GPU-accelerated FAISS for large knowledge bases
  4. Model Selection: Balance model size with quality needs
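A minimal caching sketch for tip 1; the file name is arbitrary:

# Cache embeddings to disk so reruns skip re-encoding
import os
import numpy as np

CACHE = "faq_embeddings.npy"  # arbitrary file name
if os.path.exists(CACHE):
    embeddings = np.load(CACHE)
else:
    embeddings = model.encode(questions, convert_to_numpy=True)
    np.save(CACHE, embeddings)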

Going Beyond Requirements

  • Implement A/B testing framework to compare different models
  • Add conversation analytics dashboard
  • Create a web interface using Gradio or Streamlit
  • Fine-tune sentence transformers on your domain data
  • Implement feedback loops for continuous improvement

Resources and References

Key Papers

  1. BERT: Devlin et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805

  2. Sentence-BERT: Reimers & Gurevych (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084

  3. RAG: Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". arXiv:2005.11401

  4. Dense Passage Retrieval: Karpukhin et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering". arXiv:2004.04906

  5. ColBERT: Khattab & Zaharia (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT". arXiv:2004.12832

Additional Reading

  • "Building Chatbots with Python" - Sumit Raj
  • "Natural Language Processing with Transformers" - Lewis Tunstall et al.
  • "Speech and Language Processing" (Chapter on Question Answering) - Jurafsky & Martin

Related Techniques

  • Hybrid Search: Combining dense and sparse retrieval
  • Query Expansion: Enhancing queries with related terms
  • Pseudo-Relevance Feedback: Using top results to refine queries
  • Learning to Rank: ML approaches to re-ranking results

Submission Guidelines

GitHub Classroom Submission

This assignment is submitted via GitHub Classroom. Follow these steps:

  1. Accept the assignment: Click the assignment link provided in Canvas or by your instructor

  2. Clone your repository:

    git clone https://github.com/ContextLab/customer-service-bot-llm-course-YOUR_USERNAME.git
  3. Complete your work:

    • Work in Google Colab, Jupyter, or your preferred environment
    • Save your notebook to the repository
  4. Commit and push your changes:

    git add .
    git commit -m "Complete customer service chatbot assignment"
    git push
  5. Verify submission: Check that your latest commit appears in your GitHub repository before the deadline

Deadline: February 6, 2026 at 11:59 PM EST

Notebook Requirements

  1. Runtime: The notebook must run from start to finish without errors
  2. Permissions: Ensure the notebook is accessible (include in your GitHub repository)
  3. Dependencies: All required packages should be installed in the notebook
  4. Data: Include code to automatically download any required datasets
  5. Output: Keep cell outputs visible in your submission

Before Submission Checklist

  • Notebook runs completely in a fresh Colab session
  • All required sections are included with markdown explanations
  • Code is well-commented and organized
  • Evaluation metrics are properly implemented and visualized
  • At least 10 diverse examples are included
  • Multi-turn conversation demo is working
  • Baseline comparison is complete with statistical analysis
  • All visualizations are clear and properly labeled
  • No hardcoded paths (use relative paths or automatic downloads)
  • Cell outputs are visible and meaningful

Academic Integrity

You are encouraged to:

  • Use generative AI tools (ChatGPT, Claude, Copilot) to help write code
  • Collaborate with classmates on understanding concepts
  • Search for tutorials and examples online
  • Ask questions in class or office hours

You must:

  • Write your own analysis and explanations
  • Understand every line of code you submit
  • Cite any significant code you use from external sources
  • Submit your own original work

Violations of academic integrity will result in a failing grade for the assignment and potential course-level consequences.

Questions?

If you have questions about the assignment:

  1. Check this README thoroughly
  2. Review the resources and references section
  3. Post questions in the course forum
  4. Attend office hours
  5. Email the instructor/TA with specific questions

Good luck, and have fun building your customer service chatbot!


This assignment is designed to give you hands-on experience with modern NLP techniques used in production systems. The skills you develop here—semantic search, RAG, and evaluation—are directly applicable to real-world AI applications.
