feat/past-exam-ingestion #10
Description
feat/past-exam-ingestion
Implement the past exam ingestion service. When a user uploads past exam PDFs for a module, this service extracts their full text content and sends it to Claude Sonnet 4.6 to produce a structured style fingerprint — a concise summary of how exam questions are written, what types of questions appear, difficulty distribution, and distractor patterns. This style fingerprint is stored in the style_guides table and injected into every MCQ generation prompt for the module, ensuring generated questions match the real exam's style.
Context
Refer to the project context document for the full tech stack, architecture, and design principles. This issue depends on feat/file-upload, feat/pdf-to-markdown, and feat/enriched-markdown-store being completed. The PDFParser.extract_full_text() method introduced in feat/pdf-to-markdown is used here to extract raw text from exam PDFs. The style_guides table was defined in feat/database-schema.
Key design decisions:
- Style extraction uses Claude Sonnet 4.6 — this is a reasoning-intensive one-time task per module and quality matters significantly here since it affects every generated question
- Style extraction is not done via the Batch API — it is a single synchronous call (one per module) and the latency is acceptable
- If a module has multiple past exam PDFs uploaded, their text is concatenated before sending to the LLM — more examples produce a better style fingerprint
- The style fingerprint is stored as structured JSON in `style_guides.style_summary` so it can be reliably injected into generation prompts
- The style guide is regenerated from scratch if new exam PDFs are uploaded — it is not incrementally updated
Todos
Backend
- [ ] Add to `core/config.py`:
  - `STYLE_EXTRACTION_MODEL` (default: `"claude-sonnet-4-6"`)
  - `STYLE_EXTRACTION_MAX_TOKENS` (default: `1500`)
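A minimal sketch of the two settings, using a plain stdlib dataclass for illustration — the real `core/config.py` presumably has its own settings class (e.g. pydantic-settings), and `StyleExtractionSettings` is a hypothetical name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleExtractionSettings:
    """Illustrative stand-in for the project's settings object."""
    STYLE_EXTRACTION_MODEL: str = "claude-sonnet-4-6"
    STYLE_EXTRACTION_MAX_TOKENS: int = 1500
```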
- [ ] Create `/backend/app/services/exam_ingestion_service.py` with:
  - `extract_exam_text(module_id: UUID, db: AsyncSession) -> str`
    - Fetches all `File` records for the module where `file_type == exam_pdf`
    - For each file, calls `PDFParser.extract_full_text(file.stored_path)`
    - Concatenates all extracted texts separated by `\n\n--- EXAM SEPARATOR ---\n\n`
    - Returns the combined string
    - Raises `NoExamFilesError` if no exam PDFs exist for the module
  - `build_style_extraction_prompt(exam_text: str) -> str`
    - Returns a prompt instructing Claude Sonnet to analyse the exam text and produce a JSON style fingerprint
    - The prompt must instruct the model to identify and return only a JSON object (no preamble, no markdown fences) with the following structure:

      ```json
      {
        "question_styles": [
          "Questions typically begin with action verbs like 'Which', 'What', 'Identify'",
          "Questions often present a scenario or code snippet before asking the question"
        ],
        "distractor_patterns": [
          "Wrong answers are plausible variations of the correct answer, not obviously wrong",
          "Distractors often include common misconceptions students have"
        ],
        "difficulty_distribution": { "easy": 0.2, "medium": 0.5, "hard": 0.3 },
        "question_types_observed": ["single_correct", "multi_correct", "code_based", "definition_based"],
        "average_options_count": 4,
        "topic_focus_notes": "Questions focus heavily on edge cases and runtime complexity rather than basic definitions",
        "additional_notes": "Any other stylistic observations relevant to generating realistic questions"
      }
      ```

    - The prompt should include the first 8000 tokens of exam text (truncate if longer to stay within context limits)
    - Instruct the model to base observations only on the provided exam content, not general knowledge
  - `extract_style_fingerprint(module_id: UUID, db: AsyncSession) -> StyleGuide`
    - Calls `extract_exam_text()` to get combined exam content
    - Calls `build_style_extraction_prompt()` to construct the prompt
    - Sends to the Anthropic API: `client.messages.create(model=STYLE_EXTRACTION_MODEL, ...)`
    - Parses the JSON response — if parsing fails, retries once with an explicit instruction to return only valid JSON
    - Upserts the `StyleGuide` record for the module (create if not exists, update if exists)
    - Sets `style_guides.raw_content` to the concatenated exam text
    - Sets `style_guides.style_summary` to the parsed JSON string
    - Returns the `StyleGuide` record
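The concatenation and parse-with-retry steps above can be sketched with the database and Anthropic client stubbed out. `NoExamFilesError` and the separator come from this issue; `combine_exam_texts`, `parse_fingerprint`, and the `call_llm` callable are hypothetical stand-ins for illustration only:

```python
import json

SEPARATOR = "\n\n--- EXAM SEPARATOR ---\n\n"


class NoExamFilesError(Exception):
    """Raised when a module has no uploaded exam PDFs."""


def combine_exam_texts(texts: list[str]) -> str:
    # Stands in for extract_exam_text(): the per-PDF texts would come from
    # PDFParser.extract_full_text() over the module's exam_pdf File records.
    if not texts:
        raise NoExamFilesError("No past exam files uploaded for this module.")
    return SEPARATOR.join(texts)


def parse_fingerprint(call_llm) -> dict:
    # Parse the model's reply; on malformed JSON, retry exactly once with an
    # explicit "return only valid JSON" hint, then let the second failure
    # propagate (mapped to HTTP 500 by the route layer).
    raw = call_llm()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        raw = call_llm(retry_hint="Return only valid JSON.")
        return json.loads(raw)
```

In the real service, `call_llm` would wrap `client.messages.create(model=STYLE_EXTRACTION_MODEL, ...)` and pull the text out of the response content; the stub keeps the sketch runnable without the SDK.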
- [ ] Create Pydantic schemas in `/backend/app/schemas/style_guide.py`:
  - `StyleGuideResponse`: id, module_id, style_summary (parsed as dict), created_at, updated_at
  - `StyleExtractionStatus`: has_exam_files (bool), has_style_guide (bool), style_guide (`StyleGuideResponse` or null)
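A sketch of the two schemas, assuming Pydantic v2; the field names follow the todo, and the concrete types (UUID, datetime) are reasonable guesses from the rest of the stack:

```python
from datetime import datetime
from typing import Optional
from uuid import UUID

from pydantic import BaseModel


class StyleGuideResponse(BaseModel):
    id: UUID
    module_id: UUID
    style_summary: dict  # the parsed JSON fingerprint, not the raw string
    created_at: datetime
    updated_at: datetime


class StyleExtractionStatus(BaseModel):
    has_exam_files: bool
    has_style_guide: bool
    style_guide: Optional[StyleGuideResponse] = None
```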
- [ ] Create API endpoints in `/backend/app/api/routes/style_guide.py` (all protected):
  - `POST /api/modules/{module_id}/extract-style` — triggers style extraction, runs synchronously (not a background task; it is fast enough, and the user should wait for confirmation), returns `StyleGuideResponse`
  - `GET /api/modules/{module_id}/style-guide` — returns the current style guide, or 404 if not yet extracted
  - `DELETE /api/modules/{module_id}/style-guide` — deletes the style guide so it can be re-extracted after uploading more exam files
- [ ] Add error handling:
  - `NoExamFilesError` → HTTP 400: `"No past exam files uploaded for this module. Upload at least one exam PDF before extracting style."`
  - JSON parse failure after retry → HTTP 500: store the raw response text in `style_guides.raw_content` with `style_guides.style_summary` as null, and return the error with the raw text so the user can report it
  - Anthropic API error → HTTP 502 with the message forwarded from the API error
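The status-code mapping above could look like the following sketch. `NoExamFilesError` is from this issue; `StyleJSONError`, `AnthropicAPIError`, and `to_http_error` are hypothetical names (the real Anthropic SDK raises its own exception types), redefined here so the snippet stands alone:

```python
class NoExamFilesError(Exception): ...
class StyleJSONError(Exception): ...      # JSON still invalid after one retry
class AnthropicAPIError(Exception): ...   # stand-in for the SDK's API error


def to_http_error(exc: Exception) -> tuple[int, str]:
    """Map a service-layer exception to (status_code, detail message)."""
    if isinstance(exc, NoExamFilesError):
        return 400, ("No past exam files uploaded for this module. "
                     "Upload at least one exam PDF before extracting style.")
    if isinstance(exc, StyleJSONError):
        # raw response text is persisted to style_guides.raw_content first
        return 500, f"Model returned malformed JSON: {exc}"
    if isinstance(exc, AnthropicAPIError):
        return 502, str(exc)  # forward the upstream API message
    raise exc  # anything else is a genuine bug, let it surface
```

In the FastAPI routes this tuple would feed an `HTTPException(status_code=..., detail=...)`.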
Frontend
- On the module detail page, add an "Exam Style" section that shows:
- Upload status: how many exam PDFs are uploaded
- An "Extract Style Guide" button — enabled only when at least one exam PDF is uploaded
- A loading spinner while extraction is in progress (this call is synchronous, typically 5–15 seconds)
- Once extracted: render the style guide as a readable summary card showing:
- Question styles (as a bullet list)
- Difficulty distribution (as a simple bar or percentage display)
- Distractor patterns
- Topic focus notes
- A "Re-extract" button to regenerate after uploading additional exam files
- The "Generate Questions" button (introduced in `feat/mcq-generation`) should be disabled with the tooltip `"Extract exam style guide first"` if no style guide exists for the module
Acceptance Criteria
- [ ] `POST /api/modules/{module_id}/extract-style` with one or more exam PDFs uploaded returns a valid `StyleGuideResponse` with a non-null `style_summary`
- [ ] The `style_summary` is valid JSON matching the expected schema (all required fields present)
- [ ] With two exam PDFs uploaded, both are concatenated and the style guide reflects content from both
- [ ] Calling `POST /api/modules/{module_id}/extract-style` a second time overwrites the existing style guide — no duplicate records
- [ ] `GET /api/modules/{module_id}/style-guide` returns the style guide after extraction and 404 before
- [ ] `DELETE /api/modules/{module_id}/style-guide` removes the record, and a subsequent `GET` returns 404
- [ ] Calling extract with no exam PDFs uploaded returns HTTP 400 with the correct error message
- [ ] If the LLM returns malformed JSON, the service retries once before returning an error — the raw response is stored for debugging
- [ ] The frontend style guide card renders all fields correctly, and the "Re-extract" button correctly regenerates the guide
- [ ] The "Generate Questions" button is disabled when no style guide exists
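The "no duplicate records" criterion hinges on the upsert being keyed by module. A toy sketch, with an in-memory dict standing in for the `style_guides` table (all names here are illustrative, not the real persistence layer):

```python
# One record per module_id: a second extraction overwrites, never inserts.
store: dict[str, dict] = {}


def upsert_style_guide(module_id: str, summary: dict) -> dict:
    record = store.get(module_id, {"module_id": module_id})
    record["style_summary"] = summary
    store[module_id] = record
    return record


upsert_style_guide("mod-1", {"average_options_count": 4})
upsert_style_guide("mod-1", {"average_options_count": 5})  # re-extract
assert len(store) == 1  # overwritten, no duplicate
```

With SQLAlchemy this would typically be a select-then-update on `module_id` (or a unique constraint plus an `ON CONFLICT` upsert).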