This application analyzes educational text passages to identify optimal intervention points by mapping content to specific academic skills and providing targeted discussion questions for follow-up learning. It uses large language models (LLMs) to intelligently detect, rate, and explain skill alignment—helping educators personalize instruction and improve learning outcomes.
See the full output here: `output/combined_data_final.xlsx`
- Uses a comprehensive taxonomy of 69 educational competencies from `input/skills.csv` (see the loading sketch after this list)
- Skills span a wide range of domains:
  - Science (e.g., life sciences, physics, earth science)
  - Social Studies (e.g., history, geography, civics)
  - Language Arts
  - Mathematics
  - Arts & Physical Education
  - Digital Literacy
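For orientation, here is a minimal sketch of loading the taxonomy with pandas. The `domain` and `skill` column names are assumptions for illustration, not necessarily the actual headers in `skills.csv`:

```python
import pandas as pd

# Load the skill taxonomy (column names "domain" and "skill" are assumed here;
# check input/skills.csv for the actual headers).
skills_df = pd.read_csv("input/skills.csv")

print(f"Loaded {len(skills_df)} skills")             # expected: 69
print(skills_df.groupby("domain")["skill"].count())  # skills per domain
```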
- Powered by the Groq LLM API (llama3-70b-8192) through `llm_service.py` (see the call sketch below)
- Uses structured prompt templates to ensure consistency
- Low temperature (0.01) for deterministic, repeatable outputs
Each model response includes:
- Identified skill tag(s)
- Alignment rating (scale of 0–10)
- Pedagogical explanation
- Highlighted text excerpt supporting the alignment
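As a rough sketch of what the underlying call might look like, using the Groq Python client in JSON mode. The prompt text below is a placeholder, not the actual template from `llm_service.py`:

```python
import json
import os

from groq import Groq  # pip install groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

story_text = (
    "Some days, Dad and I go in the car. Dad drives. I ride. "
    "Some days, Dad and I go on the train."
)

# Placeholder prompts for illustration; the real structured templates live in llm_service.py.
response = client.chat.completions.create(
    model="llama3-70b-8192",
    temperature=0.01,  # near-deterministic, repeatable outputs
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You analyze stories for skill alignment. Respond only with JSON."},
        {"role": "user", "content": f"Identify aligned skills, rate each 0-10, and quote the excerpt:\n{story_text}"},
    ],
)

# Expected shape: {"skills": [{"skill", "explanation", "story_excerpt", "rating"}, ...]}
result = json.loads(response.choices[0].message.content)
```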
The system pinpoints passages that (see the filtering sketch after this list):
- Strongly align with specific skills (ratings: 9–10)
- Show partial alignment or emerging understanding (ratings: 5–6)
- Offer opportunities for teacher-led discussion or review
- Map multiple skills to the same passage when relevant
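A hedged sketch of how these rating bands could be pulled out of the combined output with pandas. The column names are assumptions; check the workbook for the real schema:

```python
import pandas as pd

# Column names ("rating", "story_excerpt") are assumed for illustration.
alignments = pd.read_excel("output/combined_data_final.xlsx")

strong = alignments[alignments["rating"] >= 9]             # strong alignment (9-10)
emerging = alignments[alignments["rating"].between(5, 6)]  # partial / emerging (5-6)

# Passages that picked up more than one skill.
multi_skill = alignments.groupby("story_excerpt").filter(lambda g: len(g) > 1)
```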
The system generates targeted discussion points that:
- Reinforce key concepts through guided questioning
- Connect skills across different subject areas
- Promote critical thinking with open-ended prompts
- Support differentiated instruction with varying difficulty levels
- Python-based processing pipeline
- Structured prompt engineering with JSON output
- Various prompt techniques (few-shot learning, tool calling, prompt chaining)
- LLM output stored and analyzed using DataFrames
- Embedding-based dataset joins to reduce hallucinations (see the sketch after this list)
- Final output: Excel reports for easy review & collaboration
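Here is how an embedding-based join can work in practice: snap the model's free-form skill label back to the closest entry in the canonical taxonomy instead of trusting the raw string. The embedding model and skill strings below are illustrative, not necessarily what the pipeline uses:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative inputs: canonical taxonomy names vs. a free-form label returned by the LLM.
taxonomy_skills = ["Knows about transportation", "Understands plant life cycles"]
llm_skill = "Knowledge of different transportation modes"

model = SentenceTransformer("all-MiniLM-L6-v2")
taxonomy_emb = model.encode(taxonomy_skills, convert_to_tensor=True)
llm_emb = model.encode(llm_skill, convert_to_tensor=True)

# Join the LLM output to the closest canonical skill by cosine similarity.
scores = util.cos_sim(llm_emb, taxonomy_emb)[0]
best_match = taxonomy_skills[int(scores.argmax())]
print(best_match, float(scores.max()))
```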
- Sign up at Groq and obtain a free API key.
- Set up your environment variable:

  ```bash
  export GROQ_API_KEY='your-api-key-here'
  ```

- Set up the virtual environment:

  ```bash
  # Create a new virtual environment
  python -m venv skill-venv

  # Activate the virtual environment
  source skill-venv/bin/activate   # On macOS/Linux
  # or
  .\skill-venv\Scripts\activate    # On Windows

  # Install dependencies
  pip install -r requirements.txt
  ```

- Run the skill alignment script:

  ```bash
  python run_01_align_skills_to_stories.py
  ```

  This will process the stories from `input/stories.csv` and generate skill alignments.

- Combine the data:

  ```bash
  python run_02_combine_data.py
  ```

  This will generate the final combined output in `output/combined_data_final.xlsx`.

- (Optional) Generate discussion questions:

  ```bash
  python run_03_generate_discussion_questions.py
  ```

  This will create additional discussion questions in `output/discussion_questions.xlsx`.
The final outputs will be available in the `output/` directory:
- `combined_data_final.xlsx`: Main output with skill alignments
- `discussion_questions.xlsx`: Secondary output with questions for discussing the identified skills
The `llm_service.py` file provides a robust implementation for processing educational content using the Groq LLM API. Here's a detailed breakdown of its functionality:
- Key Prompt Components:
  - Skills Augmented Analysis: Analyzes text passages to identify and rate educational skills
  - Discussion Question Generation: Creates targeted questions based on identified skills
  - Few-Shot Learning: Uses example-based prompting from `examples/few_shot_examples_discussion_questions.json` (see the sketch after this list)
  - Custom Tooling: Supports OpenAI-style function calling for structured outputs
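A sketch of how the few-shot examples might be folded into the prompt. The example file's field names (`story`, `questions`) and the message layout are assumptions; the real template lives in `llm_service.py`:

```python
import json

with open("examples/few_shot_examples_discussion_questions.json") as f:
    examples = json.load(f)

new_story_text = "Some days, Dad and I go in the car. Dad drives. I ride."

# Each curated example becomes a user/assistant turn that demonstrates the expected JSON output.
messages = [{"role": "system", "content": "Generate discussion questions as JSON."}]
for ex in examples:
    messages.append({"role": "user", "content": ex["story"]})
    messages.append({"role": "assistant", "content": json.dumps(ex["questions"])})
messages.append({"role": "user", "content": new_story_text})
```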
- Output Structure & Sample Output:
  - Skills Analysis Output Structure:

    ```json
    {
      "skills": [
        {
          "skill": "skill description",
          "explanation": "why it is aligned",
          "story_excerpt": "where in the story to stop to review this skill",
          "rating": 0-10
        }
      ]
    }
    ```

    Sample Output:

    ```json
    {
      "skills": [
        {
          "skill": "Knows about transportation",
          "explanation": "The story mentions going in a car and on a train, showing an understanding of different modes of transportation.",
          "story_excerpt": "Some days, Dad and I go in the car. Dad drives. I ride. Some days, Dad and I go on the train.",
          "rating": 10
        }
      ]
    }
    ```

  - Discussion Questions Output Structure:

    ```json
    [
      {
        "question": "question text",
        "type": "Recall/Comprehension/Application",
        "instructional_purpose": "purpose of the question"
      }
    ]
    ```

    Sample Output:

    ```json
    {
      "questions": [
        {
          "question": "What are two ways the family travels?",
          "type": "Recall",
          "instructional_purpose": "Assess whether the student can recall the modes of transportation mentioned in the story."
        },
        {
          "question": "Why did the family choose to take the train for their vacation?",
          "type": "Comprehension",
          "instructional_purpose": "Assess whether the student understands the reason behind the family's transportation choice."
        },
        {
          "question": "What other ways can people travel besides cars and trains?",
          "type": "Application",
          "instructional_purpose": "Requires the student to think about other modes of transportation beyond what was mentioned in the story."
        }
      ]
    }
    ```
- Quality Control:
  - Prompt Templates: Implements structured prompt templates
  - Validation: Uses JSON schema validation (see the sketch after this list)
  - Error Handling: Includes comprehensive error handling and retry mechanisms
  - Debugging: Supports debugging through message printing
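One way the validation and retry pieces might fit together; the schema and function names here are illustrative rather than the actual ones in `llm_service.py`:

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Assumed schema mirroring the skills output structure shown above.
SKILLS_SCHEMA = {
    "type": "object",
    "required": ["skills"],
    "properties": {
        "skills": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["skill", "explanation", "story_excerpt", "rating"],
                "properties": {"rating": {"type": "number", "minimum": 0, "maximum": 10}},
            },
        }
    },
}

def parse_with_retry(call_llm, max_attempts=3):
    """Re-issue the LLM request until the response parses and validates, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        raw = call_llm()
        try:
            data = json.loads(raw)
            validate(data, SKILLS_SCHEMA)
            return data
        except (json.JSONDecodeError, ValidationError) as err:
            print(f"Attempt {attempt} failed: {err}")  # debugging via message printing
    raise RuntimeError("LLM output never passed validation")
```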
- Sample Usage:

  ```bash
  python llm_service.py
  ```

  To see a sample output for one story, check out the `output/sample_prompt_chain.txt` file, which demonstrates the full processing pipeline from story analysis to question generation.
- Randomly sample and review LLM-generated outputs.
- Human raters evaluate skill alignment, clarity, and pedagogical value.
- Compare human and model ratings to better engineer prompts.
- Identify skill categories or content formats where the model underperforms.
- Use a separate model with a prompt that mimics human evaluation behavior to assess content (a judge sketch follows this list).
- Helps reduce reliance on manual reviews for future outputs.
- Run controlled comparisons of LLM-generated interventions.
- Use engagement or comprehension metrics to assess effectiveness.
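A minimal sketch of what such an LLM-as-a-Judge could look like, reusing the same Groq client. The rubric, prompt, and function name are assumptions rather than existing code:

```python
import json
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

JUDGE_PROMPT = (
    "You are an experienced educator reviewing an automated skill alignment. "
    "Rate alignment, clarity, and pedagogical value from 1-5 and explain briefly. "
    'Respond as JSON: {"alignment": n, "clarity": n, "pedagogical_value": n, "explanation": "..."}'
)

def judge_alignment(story_excerpt: str, skill: str, explanation: str) -> dict:
    """Ask a separate model to score one generated alignment the way a human rater would."""
    response = client.chat.completions.create(
        model="llama3-70b-8192",
        temperature=0.01,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Skill: {skill}\nExplanation: {explanation}\nExcerpt: {story_excerpt}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```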
- Improve LLM output validation and error handling
- Implement a scalable LLM-as-a-Judge system for reviews
- Add another prompt for skills assessment
- Add dynamic text highlighting based on skill strength
- Integrate student engagement metrics for optimization
- Visualize and track skill dependencies across stories