An intelligent ML-powered system that automatically classifies and prioritizes student questions in AI/ML education contexts. This project addresses the real challenge of managing hundreds of student questions efficiently by categorizing them into technical domains and urgency levels.
In AI education settings, instructors receive numerous questions across various topics (Python basics, ML algorithms, debugging, conceptual understanding). Manually triaging these questions is time-consuming and inconsistent. This system uses NLP and machine learning to:
- Classify questions by topic (Python/Programming, Machine Learning, Deep Learning, Data Processing, Conceptual/Theory)
- Assess urgency level (Critical/Blocking, High Priority, Normal, Low Priority)
- Route to appropriate resources or instructors based on classification (see the inference sketch below)
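A minimal sketch of what inference might look like once both models are trained. The artifact names match the `models/` directory listed in the project structure below; the actual loading code in `predict.py` may differ:

```python
import pickle

# Load the shared TF-IDF vectorizer and both classifiers
# (artifact names follow the models/ directory in this repo).
with open("models/vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("models/category_classifier.pkl", "rb") as f:
    category_clf = pickle.load(f)
with open("models/urgency_classifier.pkl", "rb") as f:
    urgency_clf = pickle.load(f)

question = "How do I fix this AttributeError in my neural network?"
X = vectorizer.transform([question])  # the same features feed both models

print("Category:", category_clf.predict(X)[0])
print("Urgency: ", urgency_clf.predict(X)[0])
```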
- Synthetic dataset generation based on real educational patterns
- Data includes: question text, category labels, urgency levels
- Train/test split with stratification to maintain class balance (sketched below)
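A sketch of the stratified split. The column names are assumptions, not necessarily what `questions.csv` actually uses; the 200-row test set matches the evaluation reports further down:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names here are assumed; adjust to match questions.csv.
df = pd.read_csv("data/questions.csv")

# Stratifying on the label keeps each class's proportion identical
# in train and test; test_size=200 matches the reports below.
X_train, X_test, y_train, y_test = train_test_split(
    df["question"],
    df["category"],
    test_size=200,
    stratify=df["category"],
    random_state=42,
)
```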
- Text preprocessing: lowercasing, tokenization, handling code snippets
- TF-IDF vectorization for text representation
- Tuned parameters: max_features=5000, ngram_range=(1,2)
- Captures both single words and bigrams for better context
- Handles code-specific terminology and technical vocabulary (see the vectorization sketch below)
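A minimal sketch of the vectorization step using the parameters quoted above. The `preprocess` helper is illustrative only, not the actual implementation in `src/preprocessing.py`:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(text: str) -> str:
    """Lowercase and normalize whitespace while leaving code-like
    tokens (AttributeError, model.fit, np.array) intact."""
    return re.sub(r"\s+", " ", text.lower()).strip()

# Parameters quoted above: 5000 features, unigrams + bigrams.
vectorizer = TfidfVectorizer(
    preprocessor=preprocess,
    max_features=5000,
    ngram_range=(1, 2),
)

questions = [
    "How do I fix this AttributeError in my neural network?",
    "What is the difference between precision and recall?",
]
X = vectorizer.fit_transform(questions)
print(X.shape)  # (2, n_features)
```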
- Primary Model: Logistic Regression with L2 regularization
- Alternative explored: Random Forest for comparison
- Multi-class classification with balanced class weights
- Hyperparameter tuning via grid search (see the training sketch below)
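A sketch of training with grid search, continuing from the vectorization and split sketches above (`X_train_vec` and `y_train` are assumed in scope). The `C` grid and scoring metric are hypothetical, shown only to illustrate the shape of the search:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L2 is scikit-learn's default penalty; balanced class weights
# upweight minority classes during training.
base_model = LogisticRegression(
    penalty="l2",
    class_weight="balanced",
    max_iter=1000,
)

# Hypothetical search space -- the grid used in train.py may differ.
grid = GridSearchCV(
    base_model,
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1_macro",
    cv=5,
)
grid.fit(X_train_vec, y_train)
category_clf = grid.best_estimator_
print(grid.best_params_)
```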
- Classification metrics: Precision, Recall, F1-score
- Confusion matrix analysis to identify misclassification patterns
- Cross-validation to ensure generalization
- Performance analysis across different question types (see the evaluation sketch below)
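A sketch of the evaluation step, again continuing from the sketches above (it assumes `category_clf`, the vectorized splits, and labels are in scope). These scikit-learn calls produce reports in the format shown next:

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

# Held-out metrics in the same format as the reports below.
y_pred = category_clf.predict(X_test_vec)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 5-fold cross-validation on the training set to check generalization.
scores = cross_val_score(category_clf, X_train_vec, y_train,
                         cv=5, scoring="f1_macro")
print(f"CV macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```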
```
                    precision    recall  f1-score   support

 Conceptual/Theory       1.00      1.00      1.00        40
   Data Processing       1.00      1.00      1.00        40
     Deep Learning       1.00      1.00      1.00        40
  Machine Learning       1.00      1.00      1.00        40
Python/Programming       1.00      1.00      1.00        40

          accuracy                           1.00       200
         macro avg       1.00      1.00      1.00       200
      weighted avg       1.00      1.00      1.00       200
```
Analysis: Perfect classification on the test set, with 79.8% mean prediction confidence. The model strongly leverages technical keywords to distinguish question categories. Since the dataset is synthetic and the five categories are separated by distinctive vocabulary, the perfect scores here should be read as a ceiling, not as expected performance on real student questions.
```
              precision    recall  f1-score   support

    Critical       0.47      0.50      0.48        28
        High       0.59      0.58      0.59        50
         Low       0.34      0.67      0.45        18
      Normal       0.77      0.63      0.69       104

    accuracy                           0.60       200
   macro avg       0.54      0.60      0.55       200
weighted avg       0.64      0.60      0.62       200
```
Analysis: Lower accuracy (60%) reflects the inherent subjectivity of urgency assessment. Main confusion occurs between Normal and High priority questions, which aligns with real-world ambiguity.
```bash
# Install dependencies
pip install -r requirements.txt

# Train both classifiers
python train.py

# Classify a single question
python predict.py "How do I fix this AttributeError in my neural network?"

# Evaluate on the held-out test set
python evaluate.py
```

```
student-question-classifier/
├── README.md                       # Project documentation
├── requirements.txt                # Python dependencies
├── data/
│   ├── generate_data.py            # Synthetic data generation
│   └── questions.csv               # Generated training data
├── src/
│   ├── preprocessing.py            # Text preprocessing utilities
│   ├── feature_engineering.py      # TF-IDF and feature extraction
│   └── models.py                   # Model definitions and training
├── train.py                        # Main training script
├── evaluate.py                     # Model evaluation script
├── predict.py                      # Inference script
├── models/                         # Saved model artifacts
│   ├── category_classifier.pkl
│   ├── urgency_classifier.pkl
│   └── vectorizer.pkl
└── notebooks/
    └── exploratory_analysis.ipynb  # Data exploration and visualization
```
Problem: Not all question categories appear equally frequently in real educational settings. Solution: Applied class weighting in the model to ensure minority classes receive appropriate attention during training.
Problem: Questions containing code snippets have different linguistic patterns than natural language. Solution: Preserved code structure in preprocessing while still extracting semantic meaning through character n-grams.
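A sketch of one way to combine word-level and character-level features as the paragraph above describes. The specific n-gram ranges here are illustrative assumptions, not the tuned values from `src/feature_engineering.py`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Word n-grams capture topic vocabulary; char_wb n-grams stay robust
# to code tokens like "model.fit(" that word tokenizers break apart.
features = FeatureUnion([
    ("word", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                             max_features=5000)),
])

questions = ["Why does model.fit(X, y) raise a ValueError?"]
X = features.fit_transform(questions)
print(X.shape)
```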
Problem: Some questions span multiple categories (e.g., "How do I implement gradient descent in Python?"). Solution: Built separate models for category and urgency to allow independent classification. Future work could explore multi-label classification.
Problem: Urgency is contextual and subjective compared to topic classification. Solution: Trained on patterns like "not working", "error", "urgent", "deadline" combined with question sentiment analysis.
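For illustration only, a toy version of the keyword-cue idea; in practice the model learns these patterns through TF-IDF weights rather than a hand-built list, and the sentiment component is omitted here:

```python
import numpy as np

# Illustrative cue list based on the patterns named above;
# not the feature set actually used by the trained model.
URGENCY_CUES = ["not working", "error", "urgent", "deadline", "crash", "stuck"]

def urgency_cue_features(question: str) -> np.ndarray:
    """Binary flags marking which urgency cues appear in a question."""
    text = question.lower()
    return np.array([int(cue in text) for cue in URGENCY_CUES])

print(urgency_cue_features(
    "My training loop is not working and the deadline is tomorrow"))
# -> [1 0 0 1 0 0]
```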
- Deep Learning Approach: Implement BERT-based classification for better semantic understanding
- Active Learning: Incorporate instructor feedback to continuously improve classification
- Multi-label Support: Allow questions to belong to multiple categories simultaneously
- Confidence Scores: Add probability outputs to flag uncertain classifications for manual review
- Real-time API: Deploy as a REST API for integration with learning management systems
- Expanded Features: Include student history, previous questions, and course progress context
This project emerged from real challenges in teaching AI/ML courses where:
- 50-100+ students generate 200+ questions per course
- Questions range from basic Python syntax to advanced ML theory
- Response time directly impacts student learning and retention
- Instructors need to prioritize high-impact interventions
The classification system enables:
- Automated routing to teaching assistants based on expertise
- Priority queuing for critical blocking issues
- Self-service recommendations by matching to FAQ/documentation
- Analytics on common confusion points to improve curriculum
This is a learning project, but suggestions and improvements are welcome:
- Fork the repository
- Create a feature branch
- Make your changes with clear commit messages
- Submit a pull request with description
MIT License - feel free to use this for educational purposes.
Christopher Lee
- Product Management Consultant & AI Educator
- Teaching AI/ML Mastery classes in Queens, NY
- GitHub: @pmchrislee
- Email: [email protected]
- Built as part of learning journey in practical ML engineering
- Inspired by real challenges in AI education delivery
- Thanks to the open-source ML community for excellent tools and resources