7 changes: 7 additions & 0 deletions student_analysis_pipeline/.env.example
@@ -0,0 +1,7 @@
# Database Configuration
# format: postgresql://user:pass@host:port/dbname
DATABASE_URL=postgresql://user:password@localhost:5433/pilotgenai_dev_pg

# Portkey LLM Gateway Configuration
PORTKEY_API_KEY=your_api_key_here
PORTKEY_BASE_URL=https://ai-gateway.apps.cloud.rt.nyu.edu/v1
25 changes: 25 additions & 0 deletions student_analysis_pipeline/.gitignore
@@ -0,0 +1,25 @@
# Python
__pycache__/
*.py[cod]
*$py.class
venv/
.env

# Project Specific
extracted_data/
outputs/
temp/
centralized_data/
analysis_results*.json
*.json
!requirements.txt
!todo.md
!.env.example

# OS
.DS_Store

# IDE
.vscode/
.idea/
.ipynb_checkpoints/
110 changes: 108 additions & 2 deletions student_analysis_pipeline/README.md
@@ -5,6 +5,7 @@ Automated homework analysis using LLM-as-judge to evaluate student performance a
## Overview

This pipeline analyzes student homework conversations and produces:

- **Quantitative metrics**: Total attempted, solved, and errors
- **Qualitative metrics**: Topic proficiency (Mastered vs Needs Practice)
- **Practice problems**: 4-5 new problems for weak topics
@@ -13,11 +14,33 @@ This pipeline analyzes student homework conversations and produces:

- Python 3.7+
- `requests` library (for Portkey API calls)
- `python-dotenv` (for environment configuration)

## Security & Environment Variables

**CRITICAL: Never commit your `.env` file to version control.**

1. **Create your `.env` file**:
Copy the template provided:

```bash
cp .env.example .env
```

2. **Configure sensitive values**:
Open `.env` and fill in your:
   - `DATABASE_URL` (includes the DB password)
   - `PORTKEY_API_KEY` (your LLM gateway key)

The pipeline and connector API will automatically load these values at runtime.
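
Under the hood this is the usual python-dotenv flow. A minimal stdlib-only sketch of the same fail-fast idea (the `require_env` helper is hypothetical, not part of the pipeline's code):

```python
import os

def require_env(name: str) -> str:
    # Fail fast with an actionable message instead of a KeyError deep in a run.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; copy .env.example to .env and fill in your values"
        )
    return value

# Placeholder value for illustration only -- never hard-code real credentials.
os.environ["PORTKEY_API_KEY"] = "your_api_key_here"
print(require_env("PORTKEY_API_KEY"))
```

Checking both variables once at startup surfaces a missing `.env` immediately rather than midway through an analysis run.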

## File Structure

```
student_analysis_pipeline/
├── db_connector/ # Database extraction API
│ ├── connector_api.py # FastAPI service
│ └── db_models.py # SQLAlchemy ORM models
├── main.py # Pipeline orchestrator (run this)
├── pipeline.py # Core pipeline logic (Steps 1-4)
├── data_loader.py # Input file parsers
```

@@ -97,6 +120,7 @@ python3 main.py
To analyze a different student or homework:

Edit paths in `main.py`:

```python
QUESTIONS_PATH = "hw4/hw4_question.md"
SOLUTIONS_PATH = "hw4/hw4_reference_solution.md"
CHAT_PATH = "hw4/student_conversations/ab12167_hw4_chats.json"
```
## API Configuration

The pipeline uses NYU's Portkey gateway:

- Base URL: `https://ai-gateway.apps.cloud.rt.nyu.edu/v1`
- Model: GPT-4o (`@gpt-4o/gpt-4o`)
- Credentials are loaded in `utils.py` (from your `.env` file)
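
The gateway speaks the OpenAI-compatible chat completions format. A sketch of building such a request — the payload shape and the `x-portkey-api-key` header are assumptions based on typical Portkey usage; the pipeline's actual client lives in `utils.py`:

```python
BASE_URL = "https://ai-gateway.apps.cloud.rt.nyu.edu/v1"
MODEL = "@gpt-4o/gpt-4o"

def build_chat_request(system_prompt: str, user_prompt: str) -> dict:
    # OpenAI-compatible chat payload, as routed through the Portkey gateway.
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

payload = build_chat_request(
    "You are an expert Calculus tutor acting as a judge.",
    "Evaluate the student's answer to question 1.",
)
# A real call (requires the key from .env) would POST this payload to
# f"{BASE_URL}/chat/completions" with an x-portkey-api-key header.
print(payload["model"])
```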
@@ -118,9 +143,10 @@ The pipeline uses NYU's Portkey gateway:

To use it for other subjects (Algebra, Statistics, Physics, etc.), you MUST update the following in `pipeline.py`:

### Changes Required:
### Changes Required

**1. Step 1: Topic Mapping (Line 16)**

```python
# Current (Calculus-specific):
system_prompt = """You are an expert Calculus tutor. Your task is to identify the calculus concepts each question tests."""

# Change to (generic):
system_prompt = """You are an expert [SUBJECT] tutor. Your task is to identify the [subject] concepts each question tests."""
```

**2. Step 2: Per-Question Evaluation (Line 49)**

```python
# Current (Calculus-specific):
system_prompt = """You are an expert Calculus tutor acting as a judge."""

# Change to (generic):
system_prompt = """You are an expert [SUBJECT] tutor acting as a judge."""
```

**3. Step 4: Practice Problem Generation (Line 238)**

```python
# Current (Calculus-specific):
system_prompt = """You are an expert Calculus tutor. Generate practice problems..."""

# Change to (generic):
system_prompt = """You are an expert [SUBJECT] tutor. Generate practice problems..."""
```

### Examples for Different Subjects:
### Examples for Different Subjects

**For Algebra:**

- Replace "Calculus tutor" with "Algebra tutor"
- Replace "calculus concepts" with "algebra concepts"

**For Statistics:**

- Replace "Calculus tutor" with "Statistics tutor"
- Replace "calculus concepts" with "statistics concepts"

**For Physics:**

- Replace "Calculus tutor" with "Physics tutor"
- Replace "calculus concepts" with "physics concepts"

**For Chemistry:**

- Replace "Calculus tutor" with "Chemistry tutor"
- Replace "calculus concepts" with "chemistry concepts"
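
Rather than editing three prompts by hand, all of the substitutions above could be centralized in one constant — a hypothetical refactor sketch, not the pipeline's current structure:

```python
SUBJECT = "Calculus"  # change once: "Algebra", "Statistics", "Physics", "Chemistry", ...

# The three system prompts from pipeline.py, parameterized on the subject:
TOPIC_PROMPT = (
    f"You are an expert {SUBJECT} tutor. "
    f"Your task is to identify the {SUBJECT.lower()} concepts each question tests."
)
JUDGE_PROMPT = f"You are an expert {SUBJECT} tutor acting as a judge."
PRACTICE_PROMPT = f"You are an expert {SUBJECT} tutor. Generate practice problems..."

print(JUDGE_PROMPT)
```

With this layout, retargeting the pipeline to a new subject is a one-line change instead of three coordinated edits.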

@@ -179,27 +211,101 @@ system_prompt = """You are an expert [SUBJECT] tutor. Generate practice problems
This project includes several documentation files to help you understand and use the pipeline:

**Core Documentation:**

- `README.md` (this file) - Main documentation and usage guide
- `metrics_definitions.md` - Detailed definitions of all metrics and evaluation criteria

**Technical Documentation:**

- `HOW_LLM_WORKS.md` - Explanation of how LLM evaluation works and why we send full conversation
- `LLM_CALLS_ANALYSIS.md` - Breakdown of all 17 LLM calls, input data, and efficiency analysis

**Input Files:**

- `hw4_question.md` - Homework questions
- `hw4_reference_solution.md` - Reference solutions
- `ab12167_hw4_conversation.md` - Student conversation with AI tutor

**Code Files:**

- `main.py` - Pipeline orchestrator (run this)
- `pipeline.py` - Core pipeline logic (Steps 1-4)
- `data_loader.py` - Input file parsers
- `utils.py` - Portkey API client with retry logic
- `export_conversation.py` - Script to convert JSON chats to markdown

**Output:**

- `analysis_results.json` - Generated analysis report

**Start by reading `metrics_definitions.md` to understand what the pipeline measures, then read `HOW_LLM_WORKS.md` to understand the implementation.**

---

## Database Connector (API)

The `db_connector` subfolder contains a standalone FastAPI service designed to extract student conversation data directly from the Open WebUI PostgreSQL database.

### Features

- **Group/Model Filtering**: Extract data for a specific group (e.g., "Math_Class") and model (e.g., "Homework4").
- **Local Storage**: Automatically creates folders and saves student conversations as JSON arrays (aggregated by user).
- **Time Range Support**: Optional filtering by `start_date` and `end_date` (Unix timestamps).
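
For the optional time range, the Unix timestamps can be derived from calendar dates before building the query string — a sketch assuming the parameter names given in the list above:

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

def export_url(base: str, group: str, model: str,
               start: Optional[datetime] = None,
               end: Optional[datetime] = None) -> str:
    # start_date / end_date are passed to the connector as Unix timestamps.
    params = {"group": group, "model": model}
    if start is not None:
        params["start_date"] = int(start.timestamp())
    if end is not None:
        params["end_date"] = int(end.timestamp())
    return f"{base}/export?{urlencode(params)}"

url = export_url("http://localhost:8000", "Math_Class", "Homework4",
                 start=datetime(2024, 9, 1, tzinfo=timezone.utc))
print(url)  # → http://localhost:8000/export?group=Math_Class&model=Homework4&start_date=1725148800
```

Using timezone-aware datetimes avoids off-by-hours drift between your local machine and the database's stored timestamps.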

### How to Run the API

#### Installation

```bash
cd student_analysis_pipeline
pip install -r requirements.txt
```

#### Option A: Local Deployment (for Development/Testing)

To connect to the database from your local machine, you must establish a tunnel (port-forward) to the OpenShift database pod.

1. **Start the Tunnel**:

```bash
oc port-forward pod/<DB_POD_NAME> 5433:5432 -n rit-genai-naga-dev
```

2. **Set Local URL**:

```bash
export DATABASE_URL="postgresql://user:pass@localhost:5433/dbname"
```

3. **Run API**:

```bash
uvicorn connector_api:app --reload
```

#### Option B: Server Deployment (OpenShift/Production)

When deployed on the server, the API connects directly to the internal database service.

1. **Environment Variables**:
Ensure `DATABASE_URL` points to the internal service DNS (default port 5432).

```bash
DATABASE_URL="postgresql://user:pass@db-service-name:5432/dbname"
```

2. **Run API**:

```bash
uvicorn connector_api:app --host 0.0.0.0 --port 8000
```

### Triggering an Export

Once the server is running, trigger an extraction via `curl`:

```bash
curl "http://localhost:8000/export?group=MY_GROUP&model=MY_MODEL"
```

The data will be saved to `student_analysis_pipeline/extracted_data/{group}__{model}/`.
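
Each file in that folder holds one student's conversations as a JSON array (per the Features list above). A sketch of loading an export for downstream analysis — the per-student filename layout beyond the `{group}__{model}` folder name is an assumption:

```python
import json
import tempfile
from pathlib import Path

def load_export(root: str, group: str, model: str) -> dict:
    # Map each student file (by stem) to its list of conversations.
    folder = Path(root) / f"{group}__{model}"
    return {p.stem: json.loads(p.read_text()) for p in sorted(folder.glob("*.json"))}

# Demo with a temporary folder standing in for extracted_data/:
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "Math_Class__Homework4"
    folder.mkdir()
    (folder / "ab12167.json").write_text(json.dumps([{"role": "user", "content": "hi"}]))
    data = load_export(tmp, "Math_Class", "Homework4")
    print(list(data))  # → ['ab12167']
```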