From raw footage to concise insights: an adaptive video-understanding pipeline that probes each video and automatically chooses the right mix of ASR, OCR, and a vision-language model. It processes videos in chunks (Decord), transcribes speech (Whisper), reads on-screen text (Tesseract), describes frames (Ovis2.5), and then condenses everything into a clear summary (Qwen2.5).
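The probing logic lives in the app itself; the snippet below is only a minimal sketch of how such a probe could work, assuming a hypothetical `probe_video` helper. The Decord, Whisper, and pytesseract calls are real library APIs, but the frame count and Whisper model size are illustrative choices, not the project's actual settings.

```python
# Minimal sketch of the probing step (hypothetical helper, not the repo's code).
import numpy as np
import pytesseract
import whisper
from decord import VideoReader, cpu
from PIL import Image

def probe_video(path: str, num_probe_frames: int = 8) -> dict:
    """Sample a few frames and the audio track to decide which extractors to run."""
    vr = VideoReader(path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_probe_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # shape (N, H, W, 3), uint8

    # OCR the sampled frames: any non-empty result means on-screen text is present.
    has_text = any(
        pytesseract.image_to_string(Image.fromarray(f)).strip() for f in frames
    )

    # Run Whisper once over the audio track: a non-empty transcript means speech.
    transcript = whisper.load_model("base").transcribe(path)["text"].strip()

    return {"needs_ocr": has_text, "needs_asr": bool(transcript), "transcript": transcript}
```

Based on these signals, the pipeline can skip extractors a given video does not need (e.g. no OCR pass on videos with no on-screen text) before running the vision-language and summarization models.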

Create a Conda Environment:

conda create --prefix D:\conda_env\video_understanding python=3.11 -y && conda activate D:\conda_env\video_understanding

Install requirements:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

pip install git+https://github.com/huggingface/transformers

pip install git+https://github.com/huggingface/accelerate

pip install -r requirements.txt

To remove the environment when done:

conda remove --prefix D:\conda_env\video_understanding --all

Run the App:

streamlit run app.py
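For reference, a minimal `app.py` along these lines might look like the sketch below; `run_pipeline` is a hypothetical placeholder standing in for the repository's actual probing, extraction, and summarization calls.

```python
# Minimal Streamlit front-end sketch; run_pipeline is a hypothetical stand-in.
import tempfile
import streamlit as st

def run_pipeline(video_path: str) -> str:
    # Placeholder: the real app would probe the video, run ASR/OCR/VLM as needed,
    # and condense the results with the summarizer.
    return f"(summary for {video_path})"

st.title("AI-Driven Video Understanding")
uploaded = st.file_uploader("Upload a video", type=["mp4", "mkv", "mov"])

if uploaded is not None:
    # Persist the upload to disk so frame/audio readers can open it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp:
        tmp.write(uploaded.read())
        video_path = tmp.name

    with st.spinner("Analyzing video..."):
        summary = run_pipeline(video_path)

    st.subheader("Summary")
    st.write(summary)
```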
