PPT-RAG is an end-to-end intelligent assistant designed specifically for understanding presentation files (PPT/PPTX) as multimodal learning material.
Unlike conventional plain-text document QA systems, this project treats each slide as a structured multimodal unit that may contain concise bullet text, dense visual cues, tables, mathematical equations, and layout-level semantics. The system augments multimodal RAG with PPT-specific preprocessing to recover hidden information that presenters often omit from slides but imply through short phrases, visual context, and page organization.
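As a rough sketch of what "structured multimodal unit" can mean in practice, the snippet below models a parsed slide as a list of typed content items with a page index and layout metadata. The class and field names are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class ContentType(Enum):
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"
    EQUATION = "equation"


@dataclass
class ContentItem:
    """One typed element extracted from a slide (hypothetical schema)."""
    type: ContentType
    page_index: int    # slide number, preserved for slide-level reasoning
    payload: Any       # raw text, image path, table rows, or LaTeX string
    metadata: dict = field(default_factory=dict)  # layout hints: placeholder role, bounding box, ...


@dataclass
class Slide:
    """All content items parsed from a single slide."""
    page_index: int
    items: list[ContentItem] = field(default_factory=list)

    def text(self) -> str:
        # Concatenate textual items, e.g. as input for topic extraction.
        return "\n".join(i.payload for i in self.items if i.type is ContentType.TEXT)
```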
The project targets two primary student-centered scenarios:
- Before-class preview: Quickly build a conceptual overview of what a lecture deck covers and establish connections between key topics.
- Before-exam review: Retrieve key points, relationships, and visual evidence across long slide decks with semantic understanding.
At runtime, users can interact through a web UI, QQ, or WeChat. Under the hood, an intelligent agent orchestrates multimodal retrieval and targeted image understanding to produce grounded, evidence-backed answers.
Core features of the system include:
- PPT-first multimodal RAG pipeline: Grounded QA across text, images, tables, and mathematical content with entity-relationship awareness.
- Hidden-information expansion for concise slides: Detects high-compression pages and expands their implicit content into grounded, explanatory text based on slide context (a minimal detection sketch follows this list).
- Page-topic extraction and structural linking: Extracts per-page topics and semantically links related slides to model section-level continuity and hierarchical structure in long decks.
- Multi-channel accessibility: One unified backend supports Web, QQ, and WeChat, making the assistant seamlessly accessible within familiar student study workflows.
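The hidden-information detection mentioned above can be approximated with simple density heuristics: a slide with very little text but at least one visual element is a candidate for context-grounded expansion. The thresholds and prompt wording below are assumptions for illustration only, reusing the `Slide` sketch from earlier.

```python
def is_high_compression(slide: Slide, max_words: int = 25, min_visuals: int = 1) -> bool:
    """Heuristic: flag slides whose sparse text likely implies unstated content."""
    words = len(slide.text().split())
    visuals = sum(
        1 for i in slide.items
        if i.type in (ContentType.IMAGE, ContentType.TABLE, ContentType.EQUATION)
    )
    return words <= max_words and visuals >= min_visuals


def expansion_prompt(slide: Slide, neighbors: list[Slide]) -> str:
    """Build an LLM prompt that grounds the expansion in the surrounding slides."""
    context = "\n---\n".join(s.text() for s in neighbors)
    return (
        "Expand the slide's bullet points into full explanatory text. "
        "Only state what is supported by the slide and its neighboring slides.\n\n"
        f"Slide {slide.page_index}:\n{slide.text()}\n\nNeighboring slides:\n{context}"
    )
```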
The architecture follows a top-down, retrieval-augmented agent design:
- Ingests PPT/PPTX documents and converts them into typed, structured content items (text, images, tables, mathematical equations).
- Preserves page indices and comprehensive structural metadata for downstream slide-level semantic reasoning.
- Applies project-specific semantic enhancement after initial parsing.
- Performs hidden-information candidate detection and context-grounded expansion to recover presenter intent.
- Extracts page-level topics and computes topic similarity to identify and connect related slide groups across sections (see the topic-linking sketch after this list).
- Inserts enriched multimodal content into a knowledge graph-backed storage system.
- Organizes entities, relationships, semantic chunks, and document references for hybrid retrieval that combines semantic and lexical matching (a rank-fusion sketch follows the list).
- Runs a tool-calling agent loop that decides when and how to apply retrieval, reasoning, and image understanding (see the agent-loop sketch after this list).
- Combines retrieval evidence with optional iterative image analysis and visual reasoning.
- Generates final answers with document-grounded evidence chains rather than free-form generation.
- Exposes the unified reasoning backend through multiple user interfaces:
  - Streamlit-based web application with interactive visualization.
  - QQ bot runtime pipeline for instant messaging integration.
  - WeChat bot runtime pipeline for social platform accessibility.
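For the topic-linking step noted in the enhancement bullets above, one plausible implementation embeds each page's extracted topic and connects pages whose cosine similarity exceeds a threshold. The embedding source and the threshold value here are placeholders, not the project's configuration.

```python
import numpy as np


def link_related_pages(topic_vectors: np.ndarray, threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return pairs of page indices whose topic embeddings are similar enough to link.

    topic_vectors: (num_pages, dim) array with one embedding per page topic.
    """
    # Normalize rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(topic_vectors, axis=1, keepdims=True)
    unit = topic_vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    return [
        (i, j)
        for i in range(len(sim))
        for j in range(i + 1, len(sim))
        if sim[i, j] >= threshold
    ]
```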
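The hybrid retrieval described above is commonly realized by running a lexical ranker (e.g. BM25) and a vector ranker in parallel and merging their outputs. The sketch below shows only a reciprocal-rank-fusion merge and assumes both rankers already return ordered chunk IDs; this is a generic technique, not necessarily the fusion rule the project uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs, favoring items ranked high in any list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# Example: the lexical and semantic rankers disagree; fusion balances both orderings.
fused = reciprocal_rank_fusion([["c3", "c1", "c7"], ["c1", "c9", "c3"]])
```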
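Finally, the agent loop referenced above can be pictured as a bounded iterate-until-answer cycle: at each turn the model either calls a tool (retrieval or image analysis) or emits a final, evidence-backed answer. The `llm_step` interface and tool names below are hypothetical.

```python
def agent_loop(question: str, llm_step, tools: dict, max_turns: int = 6) -> str:
    """Minimal tool-calling loop; llm_step returns either a tool call or a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        action = llm_step(history)              # e.g. {"tool": "retrieve", "args": {...}}
        if "final_answer" in action:
            return action["final_answer"]       # answer grounded in the collected evidence
        observation = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "name": action["tool"], "content": observation})
    return "No grounded answer found within the turn limit."
```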


