
feat/past-exam-ingestion #10

@alretum

Description

Implement the past exam ingestion service. When a user uploads past exam PDFs for a module, this service extracts their full text content and sends it to Claude Sonnet 4.6 to produce a structured style fingerprint — a concise summary of how exam questions are written, what types of questions appear, difficulty distribution, and distractor patterns. This style fingerprint is stored in the style_guides table and injected into every MCQ generation prompt for the module, ensuring generated questions match the real exam's style.

Context

Refer to the project context document for the full tech stack, architecture, and design principles. This issue depends on feat/file-upload, feat/pdf-to-markdown, and feat/enriched-markdown-store being completed. The PDFParser.extract_full_text() method introduced in feat/pdf-to-markdown is used here to extract raw text from exam PDFs. The style_guides table was defined in feat/database-schema.

Key design decisions:

  • Style extraction uses Claude Sonnet 4.6 — this is a reasoning-intensive one-time task per module and quality matters significantly here since it affects every generated question
  • Style extraction is not done via the Batch API — it is a single synchronous call (one per module) and the latency is acceptable
  • If a module has multiple past exam PDFs uploaded, their text is concatenated before sending to the LLM — more examples produce a better style fingerprint
  • The style fingerprint is stored as structured JSON in style_guides.style_summary so it can be reliably injected into generation prompts
  • The style guide is regenerated from scratch if new exam PDFs are uploaded — it is not incrementally updated

Todos

Backend

  • Add to core/config.py:

    • STYLE_EXTRACTION_MODEL (default: "claude-sonnet-4-6")
    • STYLE_EXTRACTION_MAX_TOKENS (default: 1500)
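The two settings above might look like this in core/config.py — a sketch assuming the config module uses plain module-level constants; adapt if the project uses a settings class:

```python
# core/config.py -- additions for the style extraction service.
# Defaults are the values from this issue; override via environment
# or config file however the project already handles settings.
STYLE_EXTRACTION_MODEL: str = "claude-sonnet-4-6"
STYLE_EXTRACTION_MAX_TOKENS: int = 1500
```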
  • Create /backend/app/services/exam_ingestion_service.py with:

    extract_exam_text(module_id: UUID, db: AsyncSession) -> str

    • Fetches all File records for the module where file_type == exam_pdf
    • For each file, calls PDFParser.extract_full_text(file.stored_path)
    • Concatenates all extracted texts separated by \n\n--- EXAM SEPARATOR ---\n\n
    • Returns the combined string
    • Raises NoExamFilesError if no exam PDFs exist for the module
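The concatenation and empty-case behaviour above could be factored into a small pure helper so it is easy to unit-test. A sketch — `NoExamFilesError` and the separator string are from this issue; the helper name `combine_exam_texts` is hypothetical:

```python
EXAM_SEPARATOR = "\n\n--- EXAM SEPARATOR ---\n\n"


class NoExamFilesError(Exception):
    """Raised when a module has no uploaded exam PDF files."""


def combine_exam_texts(texts: list[str]) -> str:
    """Join the extracted text of each exam PDF with a visible separator.

    extract_exam_text() would call PDFParser.extract_full_text() for each
    exam_pdf File record and pass the results here; keeping the join pure
    means the separator and empty-case logic can be tested without a DB.
    """
    if not texts:
        raise NoExamFilesError(
            "No past exam files uploaded for this module. "
            "Upload at least one exam PDF before extracting style."
        )
    return EXAM_SEPARATOR.join(texts)
```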

    build_style_extraction_prompt(exam_text: str) -> str

    • Returns a prompt instructing Claude Sonnet to analyse the exam text and produce a JSON style fingerprint
    • The prompt must instruct the model to identify and return only a JSON object (no preamble, no markdown fences) with the following structure:
      {
        "question_styles": [
          "Questions typically begin with action verbs like 'Which', 'What', 'Identify'",
          "Questions often present a scenario or code snippet before asking the question"
        ],
        "distractor_patterns": [
          "Wrong answers are plausible variations of the correct answer, not obviously wrong",
          "Distractors often include common misconceptions students have"
        ],
        "difficulty_distribution": {
          "easy": 0.2,
          "medium": 0.5,
          "hard": 0.3
        },
        "question_types_observed": ["single_correct", "multi_correct", "code_based", "definition_based"],
        "average_options_count": 4,
        "topic_focus_notes": "Questions focus heavily on edge cases and runtime complexity rather than basic definitions",
        "additional_notes": "Any other stylistic observations relevant to generating realistic questions"
      }
    • The prompt should include the first 8000 tokens of exam text (truncate if longer to stay within context limits)
    • Instruct the model to base observations only on the provided exam content, not general knowledge
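A minimal sketch of the prompt builder. Since the issue does not specify a tokenizer, the 8000-token cap is approximated here at roughly 4 characters per token — swap in a real token count if one is available; the schema hint is abbreviated from the structure above:

```python
MAX_EXAM_TOKENS = 8000
APPROX_CHARS_PER_TOKEN = 4  # rough heuristic; replace with a real tokenizer if available

# Abbreviated version of the JSON structure specified in this issue.
STYLE_SCHEMA_HINT = """{
  "question_styles": [...],
  "distractor_patterns": [...],
  "difficulty_distribution": {"easy": 0.0, "medium": 0.0, "hard": 0.0},
  "question_types_observed": [...],
  "average_options_count": 0,
  "topic_focus_notes": "...",
  "additional_notes": "..."
}"""


def build_style_extraction_prompt(exam_text: str) -> str:
    # Truncate to stay within context limits (first ~8000 tokens of exam text).
    truncated = exam_text[: MAX_EXAM_TOKENS * APPROX_CHARS_PER_TOKEN]
    return (
        "Analyse the past exam content below and describe how its questions are written.\n"
        "Base every observation ONLY on the provided exam content, not on general knowledge.\n"
        "Return ONLY a JSON object with exactly this structure -- "
        "no preamble, no markdown fences:\n"
        f"{STYLE_SCHEMA_HINT}\n\n"
        f"<exam_content>\n{truncated}\n</exam_content>"
    )
```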

    extract_style_fingerprint(module_id: UUID, db: AsyncSession) -> StyleGuide

    • Calls extract_exam_text() to get combined exam content
    • Calls build_style_extraction_prompt() to construct the prompt
    • Sends to Anthropic API: client.messages.create(model=STYLE_EXTRACTION_MODEL, ...)
    • Parses the JSON response — if parsing fails, retries once with an explicit instruction to return only valid JSON
    • Upserts the StyleGuide record for the module (create if not exists, update if exists)
    • Sets style_guides.raw_content to the concatenated exam text
    • Sets style_guides.style_summary to the parsed JSON string
    • Returns the StyleGuide record
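The parse-and-retry step could be isolated from the Anthropic call so it is testable without the network. A sketch: `call_model` is a hypothetical callable that, in the real service, would wrap `client.messages.create(model=STYLE_EXTRACTION_MODEL, ...)` and return the reply text:

```python
import json


def parse_style_json(call_model) -> dict:
    """Parse the model's style-fingerprint reply, retrying once on bad JSON.

    call_model(extra_instruction) sends the prompt (plus an optional
    corrective instruction on retry) and returns the reply text.
    """
    raw = call_model(None)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        raw = call_model(
            "Your previous reply was not valid JSON. "
            "Return ONLY a valid JSON object, with no preamble and no fences."
        )
        # If this still fails, the caller maps it to HTTP 500 and stores
        # the raw response in style_guides.raw_content for debugging.
        return json.loads(raw)
```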
  • Create Pydantic schemas in /backend/app/schemas/style_guide.py:

    • StyleGuideResponse: id, module_id, style_summary (parsed as dict), created_at, updated_at
    • StyleExtractionStatus: has_exam_files (bool), has_style_guide (bool), style_guide (StyleGuideResponse or null)
  • Create API endpoints in /backend/app/api/routes/style_guide.py (all protected):

    • POST /api/modules/{module_id}/extract-style — triggers style extraction, runs synchronously (not background task — fast enough and user should wait for confirmation), returns StyleGuideResponse
    • GET /api/modules/{module_id}/style-guide — returns current style guide or 404 if not yet extracted
    • DELETE /api/modules/{module_id}/style-guide — deletes the style guide so it can be re-extracted after uploading more exam files
  • Add error handling:

    • NoExamFilesError → HTTP 400: "No past exam files uploaded for this module. Upload at least one exam PDF before extracting style."
    • JSON parse failure after retry → HTTP 500: store raw response text in style_guides.raw_content and style_guides.style_summary as null, return error with the raw text so the user can report it
    • Anthropic API error → HTTP 502 with message forwarded from the API error
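The three failure cases above could be centralised in one mapping helper that the route's exception handlers call. A sketch — `NoExamFilesError` is from this issue, while `StyleJsonParseError`, `AnthropicApiError`, and `http_error_for` are hypothetical names standing in for whatever the service actually raises:

```python
# Minimal stand-ins for the service's exception types (NoExamFilesError is
# named in this issue; the other two are hypothetical placeholders).
class NoExamFilesError(Exception): ...
class StyleJsonParseError(Exception): ...
class AnthropicApiError(Exception): ...


def http_error_for(exc: Exception) -> tuple[int, str]:
    """Map a service-level failure to (status_code, detail) per the rules above."""
    if isinstance(exc, NoExamFilesError):
        return 400, (
            "No past exam files uploaded for this module. "
            "Upload at least one exam PDF before extracting style."
        )
    if isinstance(exc, StyleJsonParseError):
        # The raw LLM response was already stored in style_guides.raw_content
        # so the user can report it; surface it in the error detail.
        return 500, f"Style extraction returned malformed JSON: {exc}"
    if isinstance(exc, AnthropicApiError):
        # Forward the upstream API's message.
        return 502, str(exc)
    raise exc  # anything else is a genuine bug; let it propagate
```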

Frontend

  • On the module detail page, add an "Exam Style" section that shows:
    • Upload status: how many exam PDFs are uploaded
    • An "Extract Style Guide" button — enabled only when at least one exam PDF is uploaded
    • A loading spinner while extraction is in progress (this call is synchronous, typically 5–15 seconds)
    • Once extracted: render the style guide as a readable summary card showing:
      • Question styles (as a bullet list)
      • Difficulty distribution (as a simple bar or percentage display)
      • Distractor patterns
      • Topic focus notes
    • A "Re-extract" button to regenerate after uploading additional exam files
  • The "Generate Questions" button (introduced in feat/mcq-generation) should be disabled with a tooltip "Extract exam style guide first" if no style guide exists for the module

Acceptance Criteria

  • POST /api/modules/{module_id}/extract-style with one or more exam PDFs uploaded returns a valid StyleGuideResponse with a non-null style_summary
  • The style_summary is valid JSON matching the expected schema (all required fields present)
  • With two exam PDFs uploaded, both are concatenated and the style guide reflects content from both
  • Calling POST /api/modules/{module_id}/extract-style a second time overwrites the existing style guide — no duplicate records
  • GET /api/modules/{module_id}/style-guide returns the style guide after extraction and 404 before
  • DELETE /api/modules/{module_id}/style-guide removes the record and subsequent GET returns 404
  • Calling extract with no exam PDFs uploaded returns HTTP 400 with the correct error message
  • If the LLM returns malformed JSON, the service retries once before returning an error — the raw response is stored for debugging
  • The frontend style guide card renders all fields correctly and the "Re-extract" button correctly regenerates the guide
  • The "Generate Questions" button is disabled when no style guide exists
