[Feature]: Multi-Modal RAG (Video/Audio Ingestion via Whisper & Vision Models) #3684

@Saurav-Gupta-13

The Feature

I am proposing the implementation of a native Multi-Modal ingestion and retrieval pipeline for Quivr.

Proposed Architecture & Implementation Steps:

  1. Audio Extraction: When a video or audio file (e.g., .mp4, .mp3) is uploaded, use ffmpeg to extract the audio track.
  2. Transcription Pipeline: Pass the extracted audio through a Whisper model (local or via API) to generate timestamped text chunks.
  3. Vision Sampling (Optional/Phase 2): Sample video frames every N seconds and pass them through a Vision-Language Model (like LLaVA or GPT-4o) to generate descriptions of the visual context.
  4. Multi-Vector Embedding: Embed both the transcribed audio chunks and the visual descriptions into the vector database alongside standard text documents.
  5. Retrieval & Citation: Allow the LLM to synthesize answers using the transcribed chunks and return the exact timestamp of the video/audio as the source citation for the user.
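As a rough illustration of steps 2, 4 and 5, the sketch below groups timestamped transcript segments into embedding-sized chunks and formats a timestamp citation for each chunk. It assumes segments arrive as dicts with `start`, `end` (seconds) and `text` keys, which is the shape the open-source Whisper package returns under `result["segments"]`; `TimedChunk`, `chunk_segments`, `cite` and the `max_chars` limit are hypothetical names chosen for this example, not existing Quivr APIs.

```python
from dataclasses import dataclass


@dataclass
class TimedChunk:
    text: str      # concatenated transcript text
    start: float   # start of the first segment, in seconds
    end: float     # end of the last segment, in seconds
    source: str    # original media file name (used in the citation)


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions of a second are dropped)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def chunk_segments(segments, source, max_chars=500):
    """Greedily merge consecutive segments into chunks of about max_chars."""
    chunks, buf = [], []
    start, end, size = None, 0.0, 0
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue
        # Flush the current chunk if adding this segment would overflow it.
        if buf and size + len(text) + 1 > max_chars:
            chunks.append(TimedChunk(" ".join(buf), start, end, source))
            buf, size, start = [], 0, None
        if start is None:
            start = seg["start"]
        buf.append(text)
        size += len(text) + 1
        end = seg["end"]
    if buf:
        chunks.append(TimedChunk(" ".join(buf), start, end, source))
    return chunks


def cite(chunk: TimedChunk) -> str:
    """Build a human-readable source citation with a time range."""
    return (
        f"{chunk.source} @ "
        f"{format_timestamp(chunk.start)}-{format_timestamp(chunk.end)}"
    )
```

Each `TimedChunk` could then be embedded like any other text chunk, with `start`, `end` and `source` stored as metadata so the retriever can surface `cite(chunk)` next to the answer. Greedy merging is only one option; overlap between chunks or sentence-aware splitting may retrieve better.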

I would like to take ownership of building this feature. I plan to start by building the backend audio-transcription pipeline using Whisper as a Proof of Concept (PoC) pull request.
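For the PoC, the audio-extraction step might look something like the sketch below. It assumes `ffmpeg` is installed and on `PATH`, and it targets 16 kHz mono WAV as a common input choice for Whisper-style models; `build_ffmpeg_cmd` and `extract_audio` are hypothetical helper names, not part of Quivr.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst, sample_rate=16000):
    """Return the ffmpeg argument list that extracts a mono WAV track."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(src),           # input video or audio file
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # 16-bit PCM, i.e. plain WAV audio
        str(dst),
    ]


def extract_audio(src, out_dir):
    """Run ffmpeg and return the path of the extracted WAV file.

    Raises subprocess.CalledProcessError if ffmpeg exits with an error.
    """
    src = Path(src)
    dst = Path(out_dir) / (src.stem + ".wav")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True, capture_output=True)
    return dst
```

Keeping the command construction separate from the `subprocess` call makes the step easy to unit-test without ffmpeg installed. The resulting WAV path can then be handed to whichever Whisper backend (local or API) the pipeline uses.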

Motivation, pitch

Currently, Quivr is an exceptional "Second Brain" for text and PDF documents. However, a large share of modern enterprise knowledge is locked in audio and video formats. Companies have thousands of hours of Zoom meeting recordings, product demo videos, and internal podcasts that standard text-based RAG pipelines cannot index or search.

While users could rely on external SaaS tools to transcribe videos manually before uploading them to Quivr, building this ingestion pipeline natively makes Quivr a true Multi-Modal Second Brain.

This feature would help close the gap between Quivr and enterprise-grade multi-modal AI systems, allowing users to ask questions like "What was the Q3 revenue number John showed in last week's all-hands meeting recording?"

Twitter / LinkedIn details

No response


Labels: enhancement (New feature or request), epic (Used to tag the issue describing the whole epic)
