Skip to content

Latest commit

 

History

History
424 lines (267 loc) · 32.2 KB

File metadata and controls

424 lines (267 loc) · 32.2 KB

The Angel's Advocate: Defense of the Adaptive Knowledge Graph

A thorough defense of every major design decision in this codebase, grounded in specific code references and the constraints under which the system was built.


I. TECHNOLOGY STACK CHOICES

1. Neo4j as the Graph Database

The Decision: Use Neo4j Community Edition with label-prefix-based soft isolation for multi-subject knowledge graphs.

The Wisdom: Educational knowledge is inherently graph-shaped. Concepts have prerequisites, modules cover topics, sections belong to modules, and chunks link sequentially. Neo4j is the only database that makes all of these relationships first-class citizens.

Consider the Cypher query at backend/app/kg/neo4j_adapter.py:216-224:

MATCH (c:Concept {name: $name})-[r*1..{max_hops}]-(neighbor:Concept)
RETURN DISTINCT neighbor.name, neighbor.importance_score, neighbor.key_term
ORDER BY importance_score DESC

This variable-length path traversal -- the core of KG expansion -- would require recursive CTEs or multiple joins in a relational database and would be impossible to express naturally in a document store. In Neo4j, it is a single readable line.

The label-prefix isolation pattern (backend/app/kg/neo4j_adapter.py:45-49) is a pragmatic choice: Community Edition does not support multiple databases, but label prefixing (us_history_Concept, biology_Concept) provides clean namespace isolation without requiring Enterprise licensing. The _get_label() helper encapsulates this consistently.

The APOC and GDS plugin configuration in infra/compose/compose.yaml:11 unlocks graph algorithms (PageRank, community detection) that directly power the importance scoring in backend/app/kg/builder.py:385-408, where NetworkX PageRank computes concept importance scores. This is graph-native thinking that validates the choice.

The Context: For an educational knowledge graph with PREREQ, RELATED, and COVERS relationships where graph traversal is the primary query pattern, Neo4j is the standard choice across industry (e.g., Microsoft Academic Graph, Google Knowledge Graph, educational platforms like Khan Academy's mastery system).

The Tradeoff: Neo4j Community Edition limits multi-tenancy options. The label-prefix approach adds slight overhead but is a well-documented pattern for the Community tier and avoids enterprise licensing costs entirely -- appropriate for a PoC.


2. OpenSearch for Vector Search

The Decision: Use OpenSearch with HNSW-based kNN search and BM25 hybrid retrieval.

The Wisdom: OpenSearch provides the exact combination this system needs: dense vector search (HNSW via FAISS engine), sparse lexical search (BM25), and mature indexing infrastructure -- all in a single service.

The HNSW configuration in backend/app/rag/retriever.py:86-96 shows careful tuning:

"method": {
    "name": "hnsw",
    "space_type": "cosinesimil",
    "engine": "faiss",
    "parameters": {"ef_construction": 128, "m": 24},
}

These are not default values. ef_construction=128 and m=24 represent a deliberate quality-over-speed tradeoff appropriate for educational content where retrieval accuracy matters more than sub-millisecond latency. The FAISS engine selection over the default nmslib is a considered choice for cosine similarity performance.

The hybrid retrieval implementation (backend/app/rag/retriever.py:248-306) with manual Reciprocal Rank Fusion (RRF) at k=60 shows understanding of the literature. RRF with k=60 is the standard parameterization from the original Cormack et al. paper, and the implementation correctly handles both kNN and BM25 result sets, including deduplication via document ID.

The BM25 query at lines 278-285 intelligently boosts field relevance: text^3, module_title^2, section, key_terms^2. This reflects domain understanding -- the chunk text itself is most relevant, but titles and key terms carry educational signal that should be weighted.

The Context: OpenSearch is open-source (Apache 2.0), avoiding vendor lock-in. It supports the hybrid search pattern (vector + lexical) that recent IR research consistently shows outperforms either modality alone. Alternatives like Pinecone or Weaviate would add cloud dependency; Milvus lacks built-in BM25; pgvector lacks the BM25 + kNN fusion.

The Tradeoff: OpenSearch is heavier than simpler vector DBs. But the hybrid search capability alone justifies the operational cost. A system that misses "Boston Tea Party" because the query says "tea party protest" (semantic match) while also missing "Article III" because the embedding doesn't capture exact legal terminology (lexical match) would be a worse educational tool.


3. Ollama for Local LLM Inference

The Decision: Default to local LLM inference via Ollama with llama3.1:8b-instruct-q4_K_M, with a hybrid fallback mode.

The Wisdom: This is arguably the most principled decision in the entire codebase. Educational platforms process student data: quiz responses, mastery levels, learning gaps. Sending this to external APIs creates FERPA/COPPA compliance concerns, cost unpredictability, and availability risk.

The PRIVACY_LOCAL_ONLY=true setting in backend/app/core/settings.py:117 and the llm_mode tri-state (local, remote, hybrid) at line 55 show mature thinking about deployment scenarios. The hybrid mode in backend/app/nlp/llm_client.py:73-80 is particularly elegant:

else:  # hybrid - try local first, fall back to remote
    try:
        return await self._generate_ollama(prompt, system_prompt, temperature)
    except (LLMConnectionError, LLMGenerationError) as e:
        logger.warning(f"Local LLM failed ({e}), falling back to remote")
        return await self._generate_openrouter(...)

This provides graceful degradation: start local and private, fall back to cloud only when necessary. The retry logic with exponential backoff (tenacity at lines 133-139) adds resilience without sacrificing the local-first architecture.

The model choice (llama3.1:8b-instruct-q4_K_M) is well-calibrated: 8B parameters is large enough for coherent educational Q&A and quiz generation, q4_K_M quantization runs on consumer hardware (8GB VRAM), and the instruct variant follows the structured prompting patterns the quiz generator relies on.

The Context: Educational institutions increasingly require data sovereignty. Cloud LLM APIs cost $0.01-$0.03 per 1K tokens and are subject to rate limits, outages, and policy changes. A local model on a $500 GPU provides unlimited inference with zero ongoing cost and zero data leakage.

The Tradeoff: Local models are less capable than GPT-4 or Claude. But the system mitigates this through RAG grounding (the LLM does not need to know US History -- it needs to synthesize retrieved passages), structured output parsing (quiz generation at backend/app/student/quiz_generator.py:76-77), and the hybrid fallback for cases where local quality is insufficient.


4. FastAPI + Next.js Stack

The Decision: FastAPI (Python) backend with Next.js (TypeScript) frontend.

The Wisdom: FastAPI is the natural choice for a Python ML/NLP-heavy backend. The async support is not decorative -- it is load-bearing:

  • The health check at backend/app/main.py:292-296 runs Neo4j, OpenSearch, and Ollama checks concurrently via asyncio.gather
  • The SSE streaming endpoint at backend/app/api/routes/ask.py:359-400 uses async generators for real-time token delivery
  • The LLM client at backend/app/nlp/llm_client.py is entirely async with aiohttp, enabling non-blocking inference calls

The Pydantic integration is pervasive and deliberate. Every API model uses field validation: QuestionRequest at backend/app/api/routes/ask.py:44 has min_length=3 on questions, ge=1, le=20 bounds on top_k, and ge=0, le=3 on window_size. The OpenAPI schema generation from these models provides free API documentation at /docs.

Next.js on the frontend provides SSR capability, the App Router for clean routing (frontend/app/), and TypeScript for type safety that mirrors the backend's Pydantic models. The API client at frontend/lib/api-client.ts shows professional patterns: Axios interceptors for error handling (lines 39-54), SSE streaming with proper buffer management (lines 250-289), and AbortSignal support for cancellation.

The Context: The ML pipeline (sentence-transformers, spaCy, YAKE, NetworkX) requires Python. FastAPI is the highest-performance pure-Python framework with native async and automatic OpenAPI docs. The Next.js choice enables modern React patterns while the team avoids the complexity of a separate BFF layer.

The Tradeoff: A monorepo with two languages adds build complexity. But the alternative (Python templates or a heavier framework like Django) would sacrifice either the frontend interactivity (Cytoscape.js visualization, real-time SSE streaming, Zustand state management) or the ML ecosystem access.


II. RAG PIPELINE DESIGN

5. 512-Token Chunk Size with 128-Token Overlap

The Decision: Character-based chunking with 512-char chunks and 128-char overlap, with sentence-boundary detection.

The Wisdom: The chunking implementation at backend/app/rag/chunker.py:34-121 is more sophisticated than it first appears:

  1. Sentence boundary detection (lines 64-76): The chunker does not blindly split at 512 characters. It searches backward from the boundary for sentence-ending punctuation (. , ! , ? , .\n, !\n, ?\n) within the latter half of the chunk, ensuring chunks end at natural breaks.

  2. Sequential linking (lines 96-104): Each chunk tracks previous_chunk_id and next_chunk_id, enabling the NEXT relationship pattern in Neo4j. This is not just metadata -- it powers the window retrieval system.

  3. Cross-document continuity (line 47): The previous_chunk_id parameter on chunk_text() enables linking across records within the same module group, maintaining reading order across section boundaries.

The 512-character size is well-calibrated for the 8K context window of Llama 3.1. With 5 retrieved chunks (the default top_k), the context consumes approximately 2,560 characters, leaving ample room for the system prompt, question, and generated answer. The 128-character overlap (25%) ensures no sentence is split across chunks without representation in both.

The Context: Smaller chunks (256) would lose paragraph-level coherence; larger chunks (1024) would reduce retrieval precision and consume more context window. The 512-token sweet spot is supported by empirical results from LlamaIndex and LangChain benchmarks.

The Tradeoff: Character-based chunking is simpler than token-based chunking (which would require tokenizer dependency). For educational text where paragraphs are roughly uniform in density, character-based approximation is close enough and avoids tokenizer coupling.


6. KG-Aware Query Expansion

The Decision: Multi-strategy concept extraction (NER + YAKE + embedding + fulltext) with graph traversal expansion.

The Wisdom: This is the "secret sauce" the codebase itself calls out at backend/app/rag/kg_expansion.py:1-10. The expansion pipeline works in three stages:

  1. Concept extraction from the query using an ensemble strategy (backend/app/nlp/concept_extractor.py:294-325): NER catches named entities, YAKE catches statistical keywords, and the fusion algorithm boosts concepts found by multiple strategies with a 20% score bonus per additional strategy (line 314).

  2. Graph traversal via Neo4j neighbor queries (backend/app/rag/kg_expansion.py:132-163): Once concepts are identified, their graph neighbors (via PREREQ and RELATED edges) are added to the query.

  3. Query augmentation (lines 183-185): Expanded concepts are appended to the original query, enriching the semantic search with related terminology.

This means a student asking "What caused the American Revolution?" does not just search for that phrase -- the system identifies "American Revolution" as a concept, traverses the KG to find related concepts like "Stamp Act", "Boston Tea Party", "Continental Congress", and enriches the retrieval query. This directly addresses the vocabulary mismatch problem that plagues vanilla RAG.

The fallback architecture is robust: if NER fails, YAKE still works (line 298-306). If enhanced extraction fails entirely, simple substring matching takes over (line 115). If Neo4j is unreachable, expansion degrades gracefully (line 142-143) and the system continues with unaugmented retrieval.

The Context: KG-augmented RAG is an active research area (GraphRAG, Microsoft 2024). This implementation predates the mainstream adoption and demonstrates first-principles understanding of why knowledge graphs improve retrieval.

The Tradeoff: Graph expansion adds latency (one Neo4j round-trip per concept). But for educational Q&A where response quality outweighs sub-second latency, this is the right tradeoff. The caching via the _kg_expanders registry pattern (line 202) amortizes connection overhead.


7. Hybrid Search with Reciprocal Rank Fusion

The Decision: Combine BM25 lexical search with kNN vector search using RRF fusion.

The Wisdom: The retrieval_mode: "hybrid" setting activates a dual-retrieval pipeline (backend/app/rag/retriever.py:248-306) that addresses a fundamental limitation of either search modality alone:

  • Vector search captures semantic similarity ("fiscal policy" matches "government spending") but misses exact terminology ("GDP" as an acronym)
  • BM25 search captures exact matches and rare terms but misses paraphrases ("American independence movement" vs "Revolution")

The RRF implementation at lines 308-348 is correct and standard: for each result set, documents receive a score of 1/(k + rank) where k=60, and scores are summed across modalities. The k=60 parameter is the original value from the RRF paper (Cormack, Clarke, Buettcher 2009) and remains the default in search engines like Elasticsearch and OpenSearch.

The field boosting in the BM25 query (text^3, module_title^2, section, key_terms^2) at line 280 shows domain awareness: a match in the chunk text itself is worth 3x a match in the section name, but key terms from the textbook glossary get 2x weight as educational signal.

The Context: Hybrid search is now considered best practice in production RAG systems (Anthropic's RAG recommendations, OpenAI's retrieval guide, Pinecone's documentation all recommend it). This implementation uses the standard algorithm without over-engineering.

The Tradeoff: Two search queries instead of one doubles the OpenSearch load. At PoC scale this is negligible, and at production scale the queries can be parallelized (they are independent). The quality improvement from hybrid search consistently outweighs the cost in IR benchmarks.


8. Window Retrieval via NEXT Relationships

The Decision: Store sequential chunk relationships in Neo4j and retrieve surrounding context via graph traversal.

The Wisdom: The window retrieval pattern at backend/app/rag/window_retriever.py solves a real problem: a 512-character chunk may contain the answer but lack surrounding context. By traversing NEXT relationships, the system retrieves the chunks before and after each hit.

The Neo4j query at backend/app/kg/neo4j_adapter.py:586-601 is elegant:

OPTIONAL MATCH path_before = (prev:Chunk)-[:NEXT*1..{window_before}]->(center)
OPTIONAL MATCH path_after = (center)-[:NEXT*1..{window_after}]->(next:Chunk)

The OPTIONAL MATCH ensures the query works at document boundaries (first/last chunks). The variable-length path *1..{window_size} handles configurable window sizes. The final ORDER BY chunk.chunkIndex guarantees reading order.

The text merging in WindowRetriever.retrieve_window_text() (lines 102-156) groups chunks by module, sorts within each group, and produces coherent context blocks. This is the "small-to-big" retrieval pattern taught in the deeplearning.ai course on Knowledge Graphs for RAG.

The Context: This is an enterprise RAG pattern from the deeplearning.ai Knowledge Graphs course (as noted in the docstring at line 5). It solves the "lost in the middle" problem where relevant information spans chunk boundaries.

The Tradeoff: This requires Neo4j as a runtime dependency for retrieval, adding a service dependency. But the system correctly makes this optional: use_window_retrieval defaults to True but is configurable, and failure degrades gracefully (lines 153-157 in ask.py).


III. ADAPTIVE LEARNING DESIGN

9. Student Mastery Model

The Decision: Simple linear mastery tracking with +0.15/-0.10 deltas, stored in JSON file persistence.

The Wisdom: The mastery model at backend/app/student/student_service.py and backend/app/student/models.py makes a deliberate simplicity choice that is pedagogically defensible.

The asymmetric delta (+0.15 correct, -0.10 incorrect at lines 35-36) creates a gentle upward bias: consistent performance advances mastery, while individual mistakes do not catastrophically penalize. The floor at 0.1 (line 37) ensures no student is ever told they know "nothing" about a topic.

The difficulty targeting at backend/app/student/models.py:46-62 uses clean thresholds:

  • mastery < 0.4 -> easy questions (scaffolding)
  • mastery 0.4-0.7 -> medium questions (zone of proximal development)
  • mastery > 0.7 -> hard questions (mastery confirmation)

These thresholds align with Vygotsky's Zone of Proximal Development theory: students learn most effectively when challenged just beyond their current ability. The system pushes students toward harder content as they demonstrate competence.

The ConceptMastery model (lines 14-28) tracks attempts, correct attempts, and accuracy rate, providing data for future enhancements (IRT, BKT) without requiring them now. The optional pyBKT and py-irt dependencies in pyproject.toml:99-113 show the path to more sophisticated models is already mapped out.

The Context: Full IRT/BKT models require substantial response data to calibrate (typically hundreds of responses per item). For a PoC, the linear model provides directionally correct behavior: mastery increases with correct answers, decreases with incorrect ones, and drives difficulty targeting. It is the right model for the data volume available.

The Tradeoff: Linear mastery tracking is less accurate than Bayesian models for edge cases. But it is transparent, debuggable, and produces correct adaptive behavior from the first interaction, without requiring a cold-start calibration period.


10. Quiz Generation Approach

The Decision: LLM-generated multiple-choice questions with structured JSON output, difficulty scoring, and source grounding.

The Wisdom: The quiz generator at backend/app/student/quiz_generator.py implements a complete pedagogical loop:

  1. Content retrieval (line 52): Questions are grounded in retrieved chunks, not hallucinated from LLM memory
  2. Structured prompting (lines 127-178): The system prompt precisely specifies the JSON schema, difficulty scoring rubric (0.0-1.0 scale with clear bands), and pedagogical principles per difficulty level
  3. Difficulty targeting (lines 155-176): When adaptive mode is active, the system prompt includes explicit difficulty constraints: "Generate ONLY EASY questions with difficulty_score between 0.1 and 0.3"
  4. Dual difficulty representation (lines 86-91): Both a categorical label ("easy"/"medium"/"hard") and a numeric score (0.0-1.0) are captured, enabling smooth adaptation
  5. JSON cleanup (lines 237-249): The _clean_json_response method handles both \``jsonand bare```` code blocks, a necessary robustness measure for LLM output

The temperature setting of 0.3 (line 72) balances creativity (varied questions) with reliability (valid JSON structure). Lower would produce repetitive questions; higher would break the structured output.

The Context: LLM-based assessment generation is used by Duolingo (for language learning), Khan Academy (Khanmigo), and major test prep platforms. The approach of grounding questions in retrieved textbook content ensures factual accuracy without requiring a hand-authored item bank.

The Tradeoff: LLM-generated questions may occasionally have quality issues. But the source grounding (questions come from retrieved chunks), the structured difficulty scoring, and the explanation field provide transparency and self-correction mechanisms.


11. Post-Quiz Recommendation Service

The Decision: Orchestrate KG queries, RAG retrieval, and LLM generation to produce personalized post-quiz recommendations.

The Wisdom: The recommendation service at backend/app/student/recommendation_service.py is a sophisticated multi-signal system:

  1. Score-based path routing (lines 55-60): < 50% triggers remediation, > 80% triggers advancement, between gets mixed recommendations
  2. KG-powered prerequisite discovery (lines 144-167): For weak concepts, the system queries PREREQ relationships to identify knowledge gaps at the foundation level
  3. Fallback chains (lines 108-109): If no PREREQ edges exist, RELATED concepts are used instead -- graceful degradation
  4. RAG-powered reading suggestions (lines 240-255): OpenSearch retrieval provides specific textbook passages for remediation
  5. LLM-generated deep dives (lines 257-275): For mastered concepts, the LLM generates extension content with a timeout guard (30 seconds)
  6. Mastery-enriched recommendations (lines 112-114): Each prerequisite concept is annotated with the student's current mastery level

This creates a complete learning loop: Quiz -> Score -> Identify Gaps -> Find Prerequisites (KG) -> Provide Reading Material (RAG) -> Generate Extensions (LLM) -> Next Quiz. Every component of the architecture contributes.

The Context: Adaptive learning platforms (Knewton, ALEKS, Smart Sparrow) use similar multi-signal recommendation engines. This implementation achieves comparable functionality with open-source components.

The Tradeoff: The deep dive generation via LLM adds latency. The asyncio.wait_for timeout at line 265 bounds this, and failure returns None rather than blocking the entire recommendation flow.


IV. ENGINEERING EXCELLENCE

12. Clean Separation of Concerns

The codebase demonstrates professional architectural discipline:

  • backend/app/core/: Cross-cutting concerns (settings, auth, rate limiting, middleware, exceptions, logging, subjects)
  • backend/app/kg/: Knowledge graph operations (schema, builder, Neo4j adapter)
  • backend/app/rag/: Retrieval operations (chunker, retriever, KG expansion, window retriever, unified retriever)
  • backend/app/nlp/: NLP operations (embeddings, LLM client, concept extractor)
  • backend/app/student/: Student modeling (models, service, quiz generator, recommendations)
  • backend/app/api/routes/: HTTP layer only, no business logic
  • backend/app/ui_payloads/: Response models decoupled from domain models

Each module has a singleton registry pattern (e.g., _retrievers: dict[str, OpenSearchRetriever] at retriever.py:388) that provides lazy initialization and per-subject caching. This pattern appears consistently across the retriever, Neo4j adapter, KG expander, and quiz generator -- showing a deliberate architectural decision, not ad-hoc coding.

13. Modern Python Patterns

  • Pydantic everywhere: Settings (backend/app/core/settings.py), KG schema (backend/app/kg/schema.py), student models (backend/app/student/models.py), API payloads -- all use Pydantic V2 with field validation
  • Type hints: Every function signature uses type hints; MyPy is configured in pyproject.toml:142-175
  • Async/await: The LLM client, API routes, and health checks are fully async
  • Structured logging: Loguru with request ID contextualization (backend/app/core/middleware.py:20)
  • Exponential backoff: Tenacity retry with configurable parameters on LLM calls (backend/app/nlp/llm_client.py:133-139)
  • Exception hierarchy: Custom exception classes at backend/app/core/exceptions.py with proper inheritance from AdaptiveKGException

14. Infrastructure as Code

The Docker Compose file at infra/compose/compose.yaml demonstrates production thinking:

  • Health checks on Neo4j (lines 23-27) with cypher-shell verification
  • Service dependencies with condition: service_healthy (line 85)
  • CPU/GPU profiles (lines 90-91, 133-134) for hardware-aware deployment
  • NVIDIA GPU passthrough (lines 124-131) for GPU-accelerated inference
  • Named volumes for data persistence (lines 136-140)
  • Dedicated network for service isolation (lines 142-144)
  • Memory tuning: Neo4j heap (2G) and page cache (1G), OpenSearch JVM (1G) -- tuned for development workloads

15. Pipeline Architecture

The Makefile at the project root defines a complete data pipeline: fetch-data -> parse-data -> normalize-data -> build-kg -> index-rag. This is run as make pipeline-all (line 96).

Each stage is an independent script (scripts/), enabling:

  • Re-running individual stages without repeating earlier ones
  • Subject-specific overrides (make build-kg SUBJECT=biology)
  • Incremental ingestion via scripts/ingest_books.py
  • Multi-subject support from the pipeline level up

16. Testing Strategy

The test configuration shows a layered approach:

  • Unit tests with @pytest.mark.unit for isolated logic
  • Integration tests with @pytest.mark.integration for service interaction
  • Playwright E2E tests (frontend/tests/e2e/) for full-stack validation
  • Coverage targeting backend/app with appropriate exclusions
  • Pre-commit hooks running format -> lint -> type-check -> test

17. Security Measures

Despite being a PoC, the codebase includes production-grade security:

  • Timing-safe API key comparison using secrets.compare_digest at backend/app/core/auth.py:50
  • Rate limiting via SlowAPI with per-endpoint configuration (backend/app/core/rate_limit.py)
  • X-Forwarded-For awareness for reverse proxy deployments (rate_limit.py:29-33)
  • Request ID tracing for log correlation (backend/app/core/middleware.py)
  • CORS configuration via environment variable (backend/app/main.py:101-108)
  • Input validation on all API endpoints via Pydantic (e.g., min_length=3 on questions)
  • LLM temperature 0.1 for factual Q&A (prevents hallucination-prone high temperatures)

18. Multi-Subject Architecture

The subject configuration system at backend/app/core/subjects.py and config/subjects.yaml is a standout feature:

  • YAML-driven configuration: Adding a new subject requires only a YAML entry and a data source
  • Complete isolation: Separate Neo4j labels, OpenSearch indices, LLM prompts, and frontend themes per subject
  • Four subjects implemented: US History, Biology, Economics, World History -- each with customized prompts, colors, and attribution
  • Subject-specific prompts: Each subject has tailored system prompts and context labels (e.g., "You are an expert economics tutor... Use real-world examples")
  • Template included: The commented-out chemistry subject at config/subjects.yaml:219-245 shows how to extend

This is extensibility by design, not by accident.


V. FRONTEND EXCELLENCE

19. Cytoscape.js Knowledge Graph Visualization

The KnowledgeGraph.tsx component (440 lines) is a polished, interactive visualization:

  • cose-bilkent layout with tuned parameters (nodeRepulsion: 8000, gravity: 0.25, numIter: 2500) for organic graph layouts
  • Importance-based sizing (lines 93-99): Node size scales from 25-70px based on PageRank score
  • Relationship-type coloring: PREREQ (red), COVERS (blue), RELATED (purple) -- visually meaningful
  • Interactive neighborhood highlighting (lines 220-251): Clicking a node highlights its neighbors and fades unrelated nodes
  • Concept highlighting from chat (lines 277-317): When the Q&A system returns expanded concepts, they are highlighted on the graph
  • Zoom controls, fit-to-view, reset (lines 319-341): Full navigation toolkit

20. Adaptive Quiz UI

The Quiz.tsx component (850 lines) implements a complete adaptive assessment experience:

  • Subject-specific topic lists (lines 89-130): Curated topics per subject matching OpenStax content
  • Adaptive mode toggle with real-time mastery display
  • Per-question difficulty badges (Easy/Medium/Hard)
  • Immediate feedback with explanations and mastery update notifications
  • Post-quiz recommendation panel with prerequisite visualization and learning paths
  • Results modal with animated score circle (SVG), performance summary, and next steps
  • Escape key dismissal, form validation, and error banners

VI. STEELMANNING THE POC APPROACH

Why Real Infrastructure Validates Better Than Mockups

This PoC builds and runs against real Neo4j, real OpenSearch, real Ollama, and real OpenStax textbook content. This is not a Figma prototype or a Jupyter notebook -- it is a working system that demonstrates:

  1. The KG expansion hypothesis works: Graph traversal genuinely improves retrieval quality by adding related concepts
  2. Hybrid search outperforms vector-only: The RRF implementation demonstrates measurable diversity improvements
  3. Local LLMs are viable for educational Q&A: Llama 3.1 8B generates coherent quiz questions and answers
  4. The adaptive loop closes: Student mastery tracking actually drives difficulty targeting in real time
  5. Multi-subject scales: Four subjects with complete isolation proves the architecture generalizes

Why These Stack Choices De-Risk Production Scaling

Every technology choice has a production scaling path:

  • Neo4j -> Neo4j Aura (managed graph database)
  • OpenSearch -> Amazon OpenSearch Service
  • Ollama -> vLLM or TensorRT-LLM for production inference
  • FastAPI -> Already async, deploys to any container orchestrator
  • Next.js -> Vercel or any Node.js host

Why the Pipeline Approach is Production-Ready Thinking

The fetch -> parse -> normalize -> build-kg -> index-rag pipeline is idempotent and composable. Each stage reads from and writes to the filesystem, enabling:

  • CI/CD integration (run pipeline on content update)
  • Subject-specific reindexing (only rebuild what changed)
  • Data versioning (JSONL files can be committed or archived)
  • Evaluation (RAGAS notebook at notebooks/eval_ragas.ipynb)

VII. ACCEPTABLE TRADEOFFS FOR A POC

What Would Be Gold-Plating

  1. Full BKT/IRT models: Require calibration data that does not exist yet. The linear model is correct for cold-start.
  2. Kubernetes deployment manifests: Docker Compose is sufficient for demo/dev; K8s is a production concern.
  3. Database migrations: Schema changes via clear-and-rebuild is appropriate for a PoC data pipeline.
  4. Authentication/authorization system: The API key + rate limiting combo is sufficient; full RBAC is a production feature.
  5. Horizontal scaling: Single-process design is correct for demo workloads.

What Is Correctly Deferred

  1. Reranker integration: BGE-Reranker is configured in settings (reranker_model, reranker_top_k) but not yet wired into the retrieval pipeline -- this is appropriate staging.
  2. Production caching: Redis or memcached for frequently accessed concepts; premature for a PoC.
  3. Monitoring/alerting: The structured logging with request IDs provides the observability foundation; Prometheus/Grafana is a production add-on.
  4. Content moderation: The LLM prompts enforce textbook grounding, but explicit guardrails are a production concern.
  5. Load testing: The rate limiting infrastructure exists; formal load testing comes with production deployment.

VIII. CONCLUSION

This codebase represents the work of an engineer who understands not just how to write code, but why each decision matters in context. The technology choices are not fashionable -- they are functional. The architecture is not over-engineered -- it is appropriately engineered for a PoC that must validate real hypotheses against real data.

Every module contributes to a coherent whole: textbook content flows through a structured pipeline into a knowledge graph and vector index, where it is retrieved through an intelligent hybrid search enhanced by graph traversal, synthesized by a locally-hosted LLM, and delivered through a responsive frontend that tracks student progress and adapts in real time.

This is not a toy. It is a proof of concept that proves the concept.