
feat/past-exam-ingestion #10

@alretum

Description

Implement the past exam ingestion service. When a user uploads past exam PDFs for a module, this service extracts their full text content and sends it to Claude Sonnet 4.6 to produce a structured style fingerprint — a concise summary of how exam questions are written, what types of questions appear, difficulty distribution, and distractor patterns. This style fingerprint is stored in the style_guides table and injected into every MCQ generation prompt for the module, ensuring generated questions match the real exam's style.

Context

Refer to the project context document for the full tech stack, architecture, and design principles. This issue depends on feat/file-upload, feat/pdf-to-markdown, and feat/enriched-markdown-store being completed. The PDFParser.extract_full_text() method introduced in feat/pdf-to-markdown is used here to extract raw text from exam PDFs. The style_guides table was defined in feat/database-schema.

Key design decisions:

  • Style extraction uses Claude Sonnet 4.6 — this is a reasoning-intensive one-time task per module and quality matters significantly here since it affects every generated question
  • Style extraction is not done via the Batch API — it is a single synchronous call (one per module) and the latency is acceptable
  • If a module has multiple past exam PDFs uploaded, their text is concatenated before sending to the LLM — more examples produce a better style fingerprint
  • The style fingerprint is stored as structured JSON in style_guides.style_summary so it can be reliably injected into generation prompts
  • The style guide is regenerated from scratch if new exam PDFs are uploaded — it is not incrementally updated

Todos

Backend

  • Add to core/config.py:

    • STYLE_EXTRACTION_MODEL (default: "claude-sonnet-4-6")
    • STYLE_EXTRACTION_MAX_TOKENS (default: 1500)
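The two settings above might look like this in core/config.py — a sketch assuming the config module uses plain module-level constants; adapt if the project uses a settings class:

```python
# core/config.py -- additions for the style extraction service.
# Defaults are the values from this issue; override via environment
# or config file however the project already handles settings.
STYLE_EXTRACTION_MODEL: str = "claude-sonnet-4-6"
STYLE_EXTRACTION_MAX_TOKENS: int = 1500
```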
  • Create /backend/app/services/exam_ingestion_service.py with:

    extract_exam_text(module_id: UUID, db: AsyncSession) -> str

    • Fetches all File records for the module where file_type == exam_pdf
    • For each file, calls PDFParser.extract_full_text(file.stored_path)
    • Concatenates all extracted texts separated by \n\n--- EXAM SEPARATOR ---\n\n
    • Returns the combined string
    • Raises NoExamFilesError if no exam PDFs exist for the module
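The concatenation and empty-case behaviour above could be factored into a small pure helper so it is easy to unit-test. A sketch — `NoExamFilesError` and the separator string are from this issue; the helper name `combine_exam_texts` is hypothetical:

```python
EXAM_SEPARATOR = "\n\n--- EXAM SEPARATOR ---\n\n"


class NoExamFilesError(Exception):
    """Raised when a module has no uploaded exam PDF files."""


def combine_exam_texts(texts: list[str]) -> str:
    """Join the extracted text of each exam PDF with a visible separator.

    extract_exam_text() would call PDFParser.extract_full_text() for each
    exam_pdf File record and pass the results here; keeping the join pure
    means the separator and empty-case logic can be tested without a DB.
    """
    if not texts:
        raise NoExamFilesError(
            "No past exam files uploaded for this module. "
            "Upload at least one exam PDF before extracting style."
        )
    return EXAM_SEPARATOR.join(texts)
```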

    build_style_extraction_prompt(exam_text: str) -> str

    • Returns a prompt instructing Claude Sonnet to analyse the exam text and produce a JSON style fingerprint
    • The prompt must instruct the model to identify and return only a JSON object (no preamble, no markdown fences) with the following structure:
      {
        "question_styles": [
          "Questions typically begin with action verbs like 'Which', 'What', 'Identify'",
          "Questions often present a scenario or code snippet before asking the question"
        ],
        "distractor_patterns": [
          "Wrong answers are plausible variations of the correct answer, not obviously wrong",
          "Distractors often include common misconceptions students have"
        ],
        "difficulty_distribution": {
          "easy": 0.2,
          "medium": 0.5,
          "hard": 0.3
        },
        "question_types_observed": ["single_correct", "multi_correct", "code_based", "definition_based"],
        "average_options_count": 4,
        "topic_focus_notes": "Questions focus heavily on edge cases and runtime complexity rather than basic definitions",
        "additional_notes": "Any other stylistic observations relevant to generating realistic questions"
      }
    • The prompt should include the first 8000 tokens of exam text (truncate if longer to stay within context limits)
    • Instruct the model to base observations only on the provided exam content, not general knowledge
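A minimal sketch of the prompt builder. Since the issue does not specify a tokenizer, the 8000-token cap is approximated here at roughly 4 characters per token — swap in a real token count if one is available; the schema hint is abbreviated from the structure above:

```python
MAX_EXAM_TOKENS = 8000
APPROX_CHARS_PER_TOKEN = 4  # rough heuristic; replace with a real tokenizer if available

# Abbreviated version of the JSON structure specified in this issue.
STYLE_SCHEMA_HINT = """{
  "question_styles": [...],
  "distractor_patterns": [...],
  "difficulty_distribution": {"easy": 0.0, "medium": 0.0, "hard": 0.0},
  "question_types_observed": [...],
  "average_options_count": 0,
  "topic_focus_notes": "...",
  "additional_notes": "..."
}"""


def build_style_extraction_prompt(exam_text: str) -> str:
    # Truncate to stay within context limits (first ~8000 tokens of exam text).
    truncated = exam_text[: MAX_EXAM_TOKENS * APPROX_CHARS_PER_TOKEN]
    return (
        "Analyse the past exam content below and describe how its questions are written.\n"
        "Base every observation ONLY on the provided exam content, not on general knowledge.\n"
        "Return ONLY a JSON object with exactly this structure -- "
        "no preamble, no markdown fences:\n"
        f"{STYLE_SCHEMA_HINT}\n\n"
        f"<exam_content>\n{truncated}\n</exam_content>"
    )
```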

    extract_style_fingerprint(module_id: UUID, db: AsyncSession) -> StyleGuide

    • Calls extract_exam_text() to get combined exam content
    • Calls build_style_extraction_prompt() to construct the prompt
    • Sends to Anthropic API: client.messages.create(model=STYLE_EXTRACTION_MODEL, ...)
    • Parses the JSON response — if parsing fails, retries once with an explicit instruction to return only valid JSON
    • Upserts the StyleGuide record for the module (create if not exists, update if exists)
    • Sets style_guides.raw_content to the concatenated exam text
    • Sets style_guides.style_summary to the parsed JSON string
    • Returns the StyleGuide record
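The parse-and-retry step could be isolated from the Anthropic call so it is testable without the network. A sketch: `call_model` is a hypothetical callable that, in the real service, would wrap `client.messages.create(model=STYLE_EXTRACTION_MODEL, ...)` and return the reply text:

```python
import json


def parse_style_json(call_model) -> dict:
    """Parse the model's style-fingerprint reply, retrying once on bad JSON.

    call_model(extra_instruction) sends the prompt (plus an optional
    corrective instruction on retry) and returns the reply text.
    """
    raw = call_model(None)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        raw = call_model(
            "Your previous reply was not valid JSON. "
            "Return ONLY a valid JSON object, with no preamble and no fences."
        )
        # If this still fails, the caller maps it to HTTP 500 and stores
        # the raw response in style_guides.raw_content for debugging.
        return json.loads(raw)
```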
  • Create Pydantic schemas in /backend/app/schemas/style_guide.py:

    • StyleGuideResponse: id, module_id, style_summary (parsed as dict), created_at, updated_at
    • StyleExtractionStatus: has_exam_files (bool), has_style_guide (bool), style_guide (StyleGuideResponse or null)
  • Create API endpoints in /backend/app/api/routes/style_guide.py (all protected):

    • POST /api/modules/{module_id}/extract-style — triggers style extraction, runs synchronously (not background task — fast enough and user should wait for confirmation), returns StyleGuideResponse
    • GET /api/modules/{module_id}/style-guide — returns current style guide or 404 if not yet extracted
    • DELETE /api/modules/{module_id}/style-guide — deletes the style guide so it can be re-extracted after uploading more exam files
  • Add error handling:

    • NoExamFilesError → HTTP 400: "No past exam files uploaded for this module. Upload at least one exam PDF before extracting style."
    • JSON parse failure after retry → HTTP 500: store raw response text in style_guides.raw_content and style_guides.style_summary as null, return error with the raw text so the user can report it
    • Anthropic API error → HTTP 502 with message forwarded from the API error
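The three failure cases above could be centralised in one mapping helper that the route's exception handlers call. A sketch — `NoExamFilesError` is from this issue, while `StyleJsonParseError`, `AnthropicApiError`, and `http_error_for` are hypothetical names standing in for whatever the service actually raises:

```python
# Minimal stand-ins for the service's exception types (NoExamFilesError is
# named in this issue; the other two are hypothetical placeholders).
class NoExamFilesError(Exception): ...
class StyleJsonParseError(Exception): ...
class AnthropicApiError(Exception): ...


def http_error_for(exc: Exception) -> tuple[int, str]:
    """Map a service-level failure to (status_code, detail) per the rules above."""
    if isinstance(exc, NoExamFilesError):
        return 400, (
            "No past exam files uploaded for this module. "
            "Upload at least one exam PDF before extracting style."
        )
    if isinstance(exc, StyleJsonParseError):
        # The raw LLM response was already stored in style_guides.raw_content
        # so the user can report it; surface it in the error detail.
        return 500, f"Style extraction returned malformed JSON: {exc}"
    if isinstance(exc, AnthropicApiError):
        # Forward the upstream API's message.
        return 502, str(exc)
    raise exc  # anything else is a genuine bug; let it propagate
```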

Frontend

  • On the module detail page, add an "Exam Style" section that shows:
    • Upload status: how many exam PDFs are uploaded
    • An "Extract Style Guide" button — enabled only when at least one exam PDF is uploaded
    • A loading spinner while extraction is in progress (this call is synchronous, typically 5–15 seconds)
    • Once extracted: render the style guide as a readable summary card showing:
      • Question styles (as a bullet list)
      • Difficulty distribution (as a simple bar or percentage display)
      • Distractor patterns
      • Topic focus notes
    • A "Re-extract" button to regenerate after uploading additional exam files
  • The "Generate Questions" button (introduced in feat/mcq-generation) should be disabled with a tooltip "Extract exam style guide first" if no style guide exists for the module

Acceptance Criteria

  • POST /api/modules/{module_id}/extract-style with one or more exam PDFs uploaded returns a valid StyleGuideResponse with a non-null style_summary
  • The style_summary is valid JSON matching the expected schema (all required fields present)
  • With two exam PDFs uploaded, both are concatenated and the style guide reflects content from both
  • Calling POST /api/modules/{module_id}/extract-style a second time overwrites the existing style guide — no duplicate records
  • GET /api/modules/{module_id}/style-guide returns the style guide after extraction and 404 before
  • DELETE /api/modules/{module_id}/style-guide removes the record and subsequent GET returns 404
  • Calling extract with no exam PDFs uploaded returns HTTP 400 with the correct error message
  • If the LLM returns malformed JSON, the service retries once before returning an error — the raw response is stored for debugging
  • The frontend style guide card renders all fields correctly and the "Re-extract" button correctly regenerates the guide
  • The "Generate Questions" button is disabled when no style guide exists
