The app has a simple, single-page UI so that blind users can navigate it without difficulty.
The frontend gives explicit voice instructions at every step, so blind users can record videos easily.
The process used for caption generation is as follows:
- First, the audio and video streams are separated so each can be processed independently.
- Key frames are extracted with OpenCV: frames are converted to the LUV colour space, smoothed, and the frames that differ most from their neighbours are selected.
- Each key frame is captioned with Salesforce's image captioning model.
- Whisper transcribes the audio track.
- Gemini combines the frame captions and the transcript into a single summary.
The frontend uses BLoC state management to keep the user experience smooth.
Celery and Redis make backend processing asynchronous, so the app does not block while a video is being summarized.
The summary returned by Gemini is read aloud to the user with text-to-speech.
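The source does not show how the frame captions and the transcript are combined before being sent to Gemini; a minimal prompt-building sketch (the wording is assumed) might look like:

```python
def build_summary_prompt(frame_captions, transcript):
    """Merge key-frame captions and the Whisper transcript into one prompt.

    The prompt wording is illustrative; the app's actual prompt is not
    shown in the source.
    """
    caption_lines = "\n".join(f"- {c}" for c in frame_captions)
    return (
        "Summarize the following video for a blind listener.\n"
        "Key-frame captions:\n"
        f"{caption_lines}\n"
        "Audio transcript:\n"
        f"{transcript}"
    )
```

The resulting string would be sent to Gemini, and the model's response passed on to the text-to-speech step.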

- cd ./backend/video_summary_backend
- Create a virtual environment and install the dependencies listed in requirements.txt
- Start a Redis instance on port 6379
- Verify it is running with: redis-cli ping
- Start the Celery worker with: celery -A video_summary_backend worker --pool=solo -l info
- To set up the database, run: python manage.py makemigrations
- run python manage.py migrate
- run python manage.py runserver 0.0.0.0:8000
- To set up the keys in the .env file, refer to sample.env
- (Optional) To get the latest Flutter version, run: flutter upgrade
- Run: flutter pub get
- Run: flutter run