AI-powered answer sheet evaluation system using modern NLP techniques
Built with Sentence-Transformers, TextBlob, PyMuPDF, and PDFPlumber
The Automated Answer Sheet Evaluation System is an AI-powered solution designed to revolutionize how academic institutions grade student answer sheets. Using advanced Natural Language Processing (NLP) techniques, our system can:
- Parse PDF answer sheets with complex layouts
- Extract answers and match them to questions
- Evaluate responses based on grammar, keywords, and semantic similarity
- Generate comprehensive score reports
This project represents a significant advancement in educational technology, reducing grading time by up to 85% compared to manual methods.
| PDF Processor 📄→📝 |
→ | Answer Parser 📝→❓❔ |
→ | Scoring Engine ❓❔→🔢 |
→ | Result Generator 🔢→📊 |
| Extracts text with layout preservation | Matches questions to answers | Evaluates with weighted scoring | Creates detailed reports |
- Intelligent PDF Processing: Handles various formats, column layouts, and page structures
- Adaptive Scoring System: Weighted evaluation based on grammar (10-20%), keywords (40-60%), and semantic similarity (20-50%)
- Flexible Question Detection: Supports 15+ question numbering formats (Q1, 1), Question 2, etc.)
- Semantic Understanding: Recognizes conceptually correct answers even with different phrasing
- Format Tolerance: Handles spacing issues, line breaks, and various formatting inconsistencies
- PDF Processing: PyMuPDF, PDFPlumber
- NLP & ML:
- TextBlob (Grammar Analysis)
- Sentence-Transformers (Semantic Similarity)
- Regular Expressions (Answer Parsing)
- Data Handling: Pandas, NumPy
Our system employs a three-pronged approach to scoring:
| Component | Tool | Weight | Function |
|---|---|---|---|
| Grammar Check | TextBlob | 10-20% | Evaluates spelling, syntax, and structural correctness |
| Keyword Matching | Custom Algorithm | 40-60% | Identifies presence of critical concepts and terms |
| Semantic Similarity | Sentence-Transformers | 20-50% | Measures conceptual alignment with model answers |
| ✅ Completed | 🚧 In Progress | 🔮 Future Goals |
|
- Core PDF processing engine - Answer extraction algorithm - Scoring system fundamentals - Initial test dataset creation - Proof-of-concept in Colab |
- Improving parser accuracy - Expanding test dataset - Local server implementation - Directory structure refinement - Enhanced error handling |
- Web-based frontend - Handwriting recognition - Multilingual support - Diagram/equation evaluation - LMS integration |
One of the most innovative aspects of this project is our approach to dataset creation:
- Source Material: We started with the Software Engineering Interview Questions dataset from Kaggle
- Transformation Process: Developed Python scripts to generate various PDF formats of answer sheets
- Test Variations: Created four distinct types of answer sheets:
test_perfect.pdf- Ideal formatting with proper structuretest_perfect_refined.pdf- Ideal content with varying spacingtest_anomalous.pdf- Challenging format with irregular question orderingtest_anomalous_refined.pdf- Complex formatting with intentional errors
- Expansion Plan: Currently developing scripts to generate 50-60 additional synthetic answer sheets with controlled variations to further improve parsing accuracy
This methodical approach to dataset creation enables systematic testing and improvement of our parsing algorithms across a wide variety of real-world scenarios.
Clone the repository git clone https://github.com/yourusername/answer-sheet-evaluation-system.git
Install dependencies pip install -r requirements.txt
Run the Jupyter notebook in Google Colab
or
For local development (future implementation) python src/main.py --pdf path/to/answer_sheet.pdf --rollno S001 --name "John Doe"
- Google Colab notebook (
Answer_Sheet_Evaluation_System.ipynb) - Requires uploaded PDFs and CSV files
- Standalone application with proper directory structure
- Web interface for easier interaction
- Containerized deployment for educational institutions
Current Capabilities:
- PDF text extraction rate: ~70%
- Scoring accuracy: ~75% alignment with human evaluators
- Processing speed: ~2.3 seconds per page
- Cost efficiency: ~$0.01 per sheet
Current Limitations:
- Handwriting recognition not yet implemented
- No support for diagrams or mathematical equations
- English-only language support
- Limited to text-based PDFs
- Requires well-structured answer formats for best results
| Phase | Focus | Status |
|---|---|---|
| Phase 1 | Core functionality and proof of concept | ✅ Complete |
| Phase 2 | Improved parsing accuracy and expanded dataset | 🚧 In Progress |
| Phase 3 | Web interface and local server implementation | 🔮 Planned |
| Phase 4 | Advanced features (OCR, multilingual support) | 🔮 Future |
| Phase 5 | Integration with LMS platforms | 🔮 Vision |
This project is licensed under the MIT License - see the LICENSE file for details.