JiahaoZhang258/ppt-RAG

PPT-RAG

PPT-Centric Multimodal RAG for Study Preview & Exam Review



Overview

PPT-RAG is an end-to-end intelligent assistant designed specifically for understanding presentation files (PPT/PPTX) as multimodal learning material.

Unlike conventional plain-text document QA systems, this project treats each slide as a structured multimodal unit that may contain concise bullet text, dense visual cues, tables, mathematical equations, and layout-level semantics. The system augments multimodal RAG with PPT-specific preprocessing to recover hidden information that presenters often omit from slides but imply through short phrases, visual context, and page organization.

The project targets two primary student-centered scenarios:

  • Before-class preview: Quickly build a conceptual overview of what a lecture deck covers and establish connections between key topics.
  • Before-exam review: Retrieve key points, relationships, and visual evidence across long slide decks with semantic understanding.

At runtime, users can interact through a web UI, QQ, or WeChat. Under the hood, an intelligent agent orchestrates multimodal retrieval and targeted image understanding to produce grounded, evidence-backed answers.

Key Features

  • PPT-first multimodal RAG pipeline: Grounded QA across text, images, tables, and mathematical content, with entity-relationship awareness.
  • Hidden-information expansion for concise slides: Detects high-compression pages and expands their implicit content into grounded, explanatory text using surrounding slide context.
  • Page-topic extraction and structural linking: Extracts per-page topics and semantically links related slides to model section-level continuity and hierarchy in long decks.
  • Multi-channel accessibility: One unified backend serves Web, QQ, and WeChat, so the assistant fits into familiar student study workflows.

Architecture

The architecture follows a top-down, retrieval-augmented agent design:

Figure: PPT-RAG index-building workflow (RAG Index pipeline).

Figure: PPT-RAG agent runtime workflow (Agent Loop pipeline).

1. Parsing & Normalization Layer

  • Ingests PPT/PPTX documents and converts them into typed, structured content items (text, images, tables, mathematical equations).
  • Preserves page indices and comprehensive structural metadata for downstream slide-level semantic reasoning.
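As a rough illustration of what this layer produces, the sketch below models typed content items with per-slide metadata. The class and field names here are illustrative assumptions, not taken from the repository, and a real parser (e.g. one built on python-pptx) would supply the raw shapes.

```python
from dataclasses import dataclass, field
from enum import Enum

class ItemType(Enum):
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"
    EQUATION = "equation"

@dataclass
class ContentItem:
    kind: ItemType
    page_index: int          # slide number, preserved for slide-level reasoning
    payload: str             # raw text, image path, serialized table, or LaTeX
    metadata: dict = field(default_factory=dict)  # layout/structural info

def normalize_slide(page_index: int,
                    shapes: list[tuple[str, str]]) -> list[ContentItem]:
    """Convert raw (shape_type, content) pairs from a parser into typed items."""
    type_map = {"text": ItemType.TEXT, "picture": ItemType.IMAGE,
                "table": ItemType.TABLE, "math": ItemType.EQUATION}
    items = []
    for shape_type, content in shapes:
        kind = type_map.get(shape_type, ItemType.TEXT)
        items.append(ContentItem(kind, page_index, content,
                                 {"source_shape": shape_type}))
    return items
```

Keeping `page_index` on every item is what lets later layers reason about slides as units rather than as a flat text stream.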

2. PPT-Oriented Preprocessing Layer

  • Applies project-specific semantic enhancement after initial parsing.
  • Performs hidden-information candidate detection and context-grounded expansion to recover presenter intent.
  • Extracts page-level topics and computes topic similarity to identify and connect related slide groups across sections.
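The topic-similarity step can be sketched as follows, using a simple bag-of-words cosine similarity in place of whatever embedding model the project actually uses; the function names and the threshold are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link_related_slides(topics: list[str],
                        threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return pairs of slide indices whose per-page topics are similar
    enough to be linked into one section-level group."""
    vecs = [Counter(t.lower().split()) for t in topics]
    links = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                links.append((i, j))
    return links
```

In a production pipeline the `Counter` vectors would be replaced by dense topic embeddings, but the linking logic stays the same.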

3. Multimodal Knowledge Layer

  • Inserts enriched multimodal content into a knowledge graph-backed storage system.
  • Organizes entities, relationships, semantic chunks, and document references for hybrid retrieval with both semantic and lexical matching.
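Hybrid retrieval over such a store typically merges two score sources per chunk. The sketch below assumes the semantic and lexical retrievers each return a score dictionary; the weighting scheme and `alpha` default are illustrative, not the project's actual fusion rule.

```python
def hybrid_rank(semantic: dict[str, float],
                lexical: dict[str, float],
                alpha: float = 0.7) -> list[str]:
    """Merge semantic and lexical scores per chunk id and rank descending.

    alpha weights dense (semantic) similarity against sparse (lexical)
    keyword match; chunks found by only one retriever still participate.
    """
    ids = set(semantic) | set(lexical)
    combined = {cid: alpha * semantic.get(cid, 0.0)
                     + (1 - alpha) * lexical.get(cid, 0.0)
                for cid in ids}
    return sorted(combined, key=combined.get, reverse=True)
```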

4. Agent Runtime Layer

  • Runs a tool-calling agent loop that decides when and how to apply retrieval, reasoning, and image understanding.
  • Combines retrieval evidence with optional iterative image analysis and visual reasoning.
  • Generates final answers with document-grounded evidence chains rather than free-form generation.
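The agent loop described above can be sketched generically: a decision function either requests a tool call (retrieval, image analysis) or emits a final answer grounded in accumulated evidence. The `decide`/`tools` interfaces are hypothetical, standing in for the project's actual tool-calling protocol.

```python
from typing import Callable

def agent_loop(question: str,
               decide: Callable[[str, list], dict],
               tools: dict[str, Callable[[str], str]],
               max_steps: int = 5) -> str:
    """Tool-calling loop: `decide` inspects the question and evidence so far,
    then returns either {"type": "tool", ...} or {"type": "final", ...}."""
    evidence: list[str] = []
    for _ in range(max_steps):
        action = decide(question, evidence)
        if action["type"] == "final":
            return action["answer"]
        result = tools[action["tool"]](action["input"])
        evidence.append(result)   # evidence accumulates for the next decision
    return "No grounded answer found within the step budget."
```

The step budget bounds iterative image analysis, and forcing the final answer through `evidence` is what keeps generation document-grounded rather than free-form.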

5. Interaction Layer

  • Exposes the unified reasoning backend through multiple user interfaces:
    • Streamlit-based web application with interactive visualization.
    • QQ bot runtime pipeline for instant messaging integration.
    • WeChat bot runtime pipeline for social platform accessibility.
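One backend serving several channels usually comes down to thin per-channel adapters around a shared answer function, as in this sketch; the function names and reply formatting are assumptions, not the repo's actual adapter code.

```python
from typing import Callable

def answer(question: str) -> str:
    """Placeholder for the shared reasoning backend (retrieval + agent loop)."""
    return f"Answer to: {question}"

# Each channel wraps the same backend with channel-specific formatting.
CHANNELS: dict[str, Callable[[str], str]] = {
    "web": lambda q: answer(q),                   # rendered by the Streamlit UI
    "qq": lambda q: f"[QQ] {answer(q)}",          # plain-text reply for the QQ bot
    "wechat": lambda q: f"[WeChat] {answer(q)}",  # plain-text reply for the WeChat bot
}

def handle(channel: str, question: str) -> str:
    """Route an incoming message from any channel to the unified backend."""
    return CHANNELS[channel](question)
```

Adding a new channel then means adding one adapter entry, with no change to the reasoning backend.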
