This project implements an automated evaluation system for elicited imitation task (EIT) responses, scoring learner transcriptions against prompt sentences with a meaning-based rubric.
The system prioritizes semantic understanding over exact string matching, combining SBERT embeddings, feature engineering, and a neural network for robust scoring.
- Semantic similarity using Sentence-BERT (SBERT)
- Hybrid feature engineering (semantic + lexical + structural)
- Feature interaction modeling (absolute difference, element-wise product)
- Attention-based neural network for scoring
- Class imbalance handling (oversampling + weighted loss)
- Reproducible pipeline with structured outputs
- Used SBERT (all-MiniLM-L6-v2)
- Generated 384-dimensional embeddings for:
  - Prompt sentences
  - Learner transcriptions
- Prompt embeddings
- Learner embeddings
- Absolute difference
- Element-wise product
- Cosine similarity
- Edit distance (Levenshtein)
- Word overlap ratio
- Length difference
All features are combined into a single 1540-dimensional vector: 384 + 384 + 384 + 384 + 4 = 1540.
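The feature construction above can be sketched end to end. This is a minimal NumPy version: random vectors stand in for the 384-dimensional SBERT embeddings, and a small dynamic-programming function stands in for RapidFuzz's Levenshtein distance; the exact definitions of the overlap and length features are assumptions, since the README does not spell them out.

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    # Plain DP edit distance (stand-in for RapidFuzz's implementation).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def build_features(prompt_emb, learner_emb, prompt, learner):
    # Semantic scalar: cosine similarity between the two embeddings.
    cos = float(np.dot(prompt_emb, learner_emb) /
                (np.linalg.norm(prompt_emb) * np.linalg.norm(learner_emb)))
    # Lexical/structural scalars (assumed definitions).
    p_words, l_words = set(prompt.lower().split()), set(learner.lower().split())
    scalars = np.array([
        cos,
        levenshtein(prompt, learner),
        len(p_words & l_words) / max(len(p_words), 1),   # word overlap ratio
        abs(len(prompt.split()) - len(learner.split())),  # length difference
    ])
    # 384 + 384 + 384 + 384 + 4 = 1540 dimensions.
    return np.concatenate([prompt_emb, learner_emb,
                           np.abs(prompt_emb - learner_emb),
                           prompt_emb * learner_emb,
                           scalars])

# Random stand-ins for SBERT embeddings of a prompt/response pair.
rng = np.random.default_rng(0)
p_emb, l_emb = rng.normal(size=384), rng.normal(size=384)
feats = build_features(p_emb, l_emb,
                       "the cat sat on the mat", "the cat sat on a mat")
print(feats.shape)  # (1540,)
```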
Custom PyTorch model (EITScorer):
- Fully connected layers: 1540 → 512 → 256
- Batch Normalization
- Attention mechanism
- Dropout (0.3)
- Output: 5 classes (scores 0–4)
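The architecture above can be sketched in PyTorch. The README does not specify the attention mechanism, so the sigmoid feature-gate below is one plausible reading; layer sizes, batch norm, dropout rate, and the 5-class output follow the listed specs.

```python
import torch
import torch.nn as nn

class EITScorer(nn.Module):
    """Sketch: 1540 -> 512 -> 256 -> 5 classes, with batch norm,
    a simple feature-attention gate, and dropout 0.3."""

    def __init__(self, in_dim: int = 1540, n_classes: int = 5):
        super().__init__()
        # Attention gate: learned per-feature weights in (0, 1)
        # applied multiplicatively to the input vector.
        self.attn = nn.Sequential(nn.Linear(in_dim, in_dim), nn.Sigmoid())
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes),  # logits for scores 0-4
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x * self.attn(x))

model = EITScorer().eval()
logits = model(torch.randn(8, 1540))
print(logits.shape)  # torch.Size([8, 5])
```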
- Stratified train-test split
- Oversampling minority classes
- Class-weighted loss function
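The three imbalance-handling steps above fit together as follows. This is a sketch on synthetic labels: the class counts, inverse-frequency weighting formula, and oversample-to-majority strategy are illustrative assumptions, not the project's exact settings.

```python
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

# Toy imbalanced score labels (0-4); X is a stand-in feature matrix.
y = np.array([4] * 50 + [3] * 30 + [2] * 12 + [1] * 5 + [0] * 3)
X = np.random.randn(len(y), 8)

# 1) Stratified split preserves the score distribution in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 2) Oversample each minority class up to the majority-class count.
counts = Counter(y_tr)
target = max(counts.values())
idx = np.concatenate([
    np.random.choice(np.where(y_tr == c)[0], size=target, replace=True)
    for c in counts])
X_bal, y_bal = X_tr[idx], y_tr[idx]

# 3) Inverse-frequency class weights for the cross-entropy loss.
weights = torch.tensor([len(y_tr) / (5 * counts[c]) for c in range(5)],
                       dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=weights)
print(Counter(y_bal))
```

Oversampling is applied only to the training split, so the test split still reflects the natural score distribution.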
- Consistency checks across similar errors
- Manual validation for semantic correctness
- Edge case testing (paraphrases, incomplete responses)
- Feature ablation insights
```
AutoEIT/
├── python.ipynb
├── Soumya_TestResults.csv
├── Soumya_TestResults.pdf
├── auto_eit_clean_dataset.csv
└── README.md
```
- Python
- PyTorch
- Sentence-Transformers (SBERT)
- Scikit-learn
- RapidFuzz
- Pandas / NumPy
```bash
pip install -r requirements.txt
jupyter notebook
```
- Capturing meaning beyond surface-level matching
- Handling class imbalance in scoring labels
- Designing robust multi-feature representations
- Preventing overfitting
- Fine-tune SBERT on domain-specific data
- Add explainable scoring (feature importance)
- Integrate LLM-based evaluation
- Deploy as an API
Developed as part of the AutoEIT Evaluation Test for a GSoC application (CERN Human-AI Team).
Feel free to fork, improve, or suggest enhancements!