Towards Chain-of-Thought Reasoning for Video Understanding with Gemini: An Agentic Chain-of-Chain-of-Thoughts Framework
A comprehensive platform for analyzing video content using Google's Gemini models via the unified google-genai SDK, supporting both Vertex AI and Gemini API backends. This project enables downloading videos, preparing them (including critical optimizations like video slowdown), and generating intelligent responses using various prompting strategies. It notably introduces an Agentic Chain-of-Chain-of-Thoughts (CoCoT) framework, leveraging specifically formatted Chain-of-Thought prompts even with CoT-capable models, primarily showcased in Full_Inference_CoCoT_Generated_Questions.ipynb, for sophisticated context-aware reasoning.
This project provides a modular platform for advanced video understanding using Google's Gemini models. Key to its performance are several optimizations:
- Video Slowdown: Globally configurable (
VIDEO_SPEED_FACTOR), this technique (e.g., 0.5x speed) significantly improves the model's ability to capture temporal details and nuances within video content, leading to better comprehension and more accurate answers. This is applied during the video preparation stage usingffmpeg. - Optimized Chain-of-Thought (CoT) Prompting Formats: Even when using models designed for CoT, the specific structure and phrasing of our prompts are crucial. We employ carefully crafted templates (see
models/CoT_output_models.py) to guide the model through a detailed reasoning process, eliciting more comprehensive and accurate thought chains. - Agentic Chain-of-Chain-of-Thoughts (CoCoT) Framework: This advanced strategy, implemented in
Full_Inference_CoCoT_Generated_Questions.ipynb, uses pre-generated guideline questions and their CoT answers as conversational context. This allows the model to build a rich understanding of the video before tackling the original, often more complex, dataset questions.
The platform's capabilities are primarily accessed through the following Jupyter notebooks, each designed for specific tasks and incorporating the aforementioned optimizations:
-
Full_Inference_CoCoT_Generated_Questions.ipynb(Enhanced Agentic CoCoT - Recommended for Optimal Performance):- Purpose: Implements our latest, enhanced CoT architecture designed explicitly for updated competition requirements. This is the recommended primary solution due to its superior overall structure and reasoning effectiveness.
- Execution & Optimizations:
- Leverages context-aware reasoning by using conversational history (generated by
Generated_Questions_By_Videos.ipynb) from a series of heuristically ordered guideline questions. - Employs optimized CoT prompting formats (from
models/CoT_output_models.py) for both answering guideline questions and the final original dataset question, ensuring detailed thought processes. - Benefits significantly from video slowdown applied during video preparation, allowing the model to better analyze fine-grained temporal events.
- Optionally uses a summary model to distill the final CoT answer.
- Leverages context-aware reasoning by using conversational history (generated by
- Output: Detailed CSV results including full CoT reasoning, ideal for in-depth analysis.
-
Full_Inference.ipynb(Stricter, Question-by-Question Inference):- Purpose: Provides a robust alternative designed for explicit, isolated processing of each question, meticulously adhering to the most conservative interpretation of competition guidelines.
- Execution & Optimizations:
- Processes each original dataset question individually without leveraging external conversational context from generated questions.
- Can use Non-CoT models for direct answers or basic CoT models with specifically formatted CoT prompts for single-turn reasoning on the original question.
- Also benefits from video slowdown for improved detail capture.
- Output: CSV results, structured for simplicity and clear auditability.
-
Generated_Questions_By_Videos.ipynb(Context Generation for Advanced CoCoT):- Purpose: A crucial pre-processing step for the
Full_Inference_CoCoT_Generated_Questions.ipynbnotebook. - Execution & Optimizations:
- Generates a list of relevant guideline questions for each video, benefiting from video slowdown for more insightful question generation.
- Answers these generated questions sequentially using optimized CoT prompting formats, with the (slowed-down) video provided at each turn, to build a rich conversational history.
- Output:
questions.csv(generated questions) and JSON files inchat_history/(serialized conversation history per video).
- Purpose: A crucial pre-processing step for the
-
Testing_UI_Prompting.ipynb(Interactive Exploration & Prompt Engineering):- Purpose: Facilitates interactive testing of individual videos, questions, and the CoCoT mechanism.
- Execution & Optimizations:
- Allows users to select a video (which would have undergone video slowdown if
VIDEO_SPEED_FACTORis set) and an original question. - Demonstrates the CoCoT context-building process.
- Enables experimentation with different prompt formats and immediate observation of their impact on model responses.
- Allows users to select a video (which would have undergone video slowdown if
The platform streamlines video understanding tasks through:
- Fetching question/video metadata from HuggingFace (
lmms-lab/AISG_Challenge). - Downloading associated videos.
- Processing videos (including video slowdown using
ffmpeg). - Uploading prepared videos to Google Cloud Storage (Vertex AI) or Gemini File API.
- Performing bulk inference using the strategies outlined in the Core Notebooks section.
- Saving results to CSV and JSON (for chat histories) for analysis.
The platform's workflow:
- Data Source: HuggingFace
datasetsfor metadata; ZIP archive for videos. - Data Preparation: Video download, extraction,
ffmpegprocessing (speed adjustment), upload to GCS/File API, andvideo_metadata_*.csvcreation. - Storage: Processed videos in GCS or Gemini File API; metadata/results in local CSVs/JSONs.
- Inference Engine:
google-genaiSDK for Gemini models.- Vertex AI Backend: ADC/service account auth; requires
PROJECT_ID,LOCATION,GCS_BUCKET. - Gemini API Backend: API Key; uses File API (files expire ~1 day).
- Vertex AI Backend: ADC/service account auth; requires
- Execution & Control: Orchestrated by the Jupyter Notebooks detailed above.
- Results: Inference outputs in CSVs (e.g.,
results_*.csv); chat histories ingenerated_questions/as JSON.
- Dual Backend Support: Vertex AI & Gemini API via
USE_VERTEXflag. - Unified SDK: Modern
google-genailibrary. - Flexible Storage: GCS (Vertex) or File API (Gemini API).
- Efficient & Robust Processing: Skip flags for existing data,
asynciofor speed, basic API retries. - Comprehensive Metadata Tracking:
video_metadata_*.csvfor resource management. - Modular Model Configuration: Prompts and model settings defined in
models/. - Clear Logging & Resume Support.
- Python: 3.10+.
- Package Management:
pip,virtualenv. - Google Cloud Account: Required.
- Vertex AI: Billing, Vertex AI API, Cloud Storage API enabled.
- Gemini API: Gemini API Key from AI Studio.
- Google Cloud CLI (
gcloud): For Vertex AI auth. Install from official guide. ffmpeg: For video processing. Install from ffmpeg.org.- Linux:
sudo apt update && sudo apt install ffmpeg - macOS:
brew install ffmpeg - Windows:
winget install --id=Gyan.FFmpeg -e - Verify:
ffmpeg -version
- Linux:
If you plan to use the Vertex AI backend, follow these steps to configure your Google Cloud environment:
- Install the Google Cloud CLI.
- Initialize the Google Cloud CLI:
gcloud init - Log in to Google Cloud:
gcloud auth login - Configure Application Default Credentials (ADC):
gcloud auth application-default login - Set Your Default Project and Region:
gcloud config set project YOUR_PROJECT_ID,gcloud config set compute/region YOUR_REGION
- Clone the repository:
git clone https://github.com/Team-SeekDeep/TikTokSubmission && cd TikTokSubmission - Create and activate a virtual environment.
- Install dependencies:
pip install -r requirements.txt
Key settings in the Config Settings cell of each notebook:
PROJECT_ID,LOCATION,GCS_BUCKET(Vertex AI).USE_VERTEX(Backend choice).GEMINI_API_KEY(Gemini API, prefer env varGOOGLE_API_KEY).- File paths (
DATASET_CSV,METADATA_FILE,RESULTS_FILE,QUESTIONS_DIR,ANSWERS_DIR). SKIP_*flags for bypassing processed steps.MAX_VIDEOS_TO_PROCESSfor testing.MODEL_NAME,QUESTION_MODEL_NAME(for CoCoT/Summary models).VIDEO_SPEED_FACTOR(e.g.,0.5for half speed, crucial for performance).
Review and adjust configuration before running any notebook.
- Launch Jupyter (
jupyter laborjupyter notebook). - Open a Notebook based on your goal (see "Core Notebooks & Inference Strategies").
- Configure settings in that notebook.
- Run Cells Sequentially, paying attention to outputs and logs.
- Data Preparation: Ensure videos are downloaded, (slowed down if
VIDEO_SPEED_FACTOR < 1.0), and uploaded. - Context Generation (for CoCoT): Run
Generated_Questions_By_Videos.ipynbbeforeFull_Inference_CoCoT_Generated_Questions.ipynb. - Inference: Execute the chosen full inference notebook.
- Data Preparation: Ensure videos are downloaded, (slowed down if
Switching Backends (Vertex <-> Gemini API):
Re-run "Prepare Videos" step (SKIP_PREPARE = False) after changing USE_VERTEX. SKIP_DOWNLOAD_ZIP and SKIP_EXTRACT can be True if local videos exist.
ffmpegNot Found.- Google Cloud Authentication Errors (Vertex AI).
- Gemini API Key Errors.
- File API Not Found / Expired (Gemini API).
- GCS Errors (Vertex AI).
- Video Download/Extraction Issues.
- API Quota Errors (
ResourceExhausted). - Model Not Found / Invalid Model Name.
Python's logging module outputs to Jupyter cells. For long runs (bulk inference), consider redirecting logs to a file.
- Submission Strategy / Recommended Usage:
- For achieving the best benchmark scores and aligning with updated competition requirements that benefit from advanced reasoning, we strongly recommend using the
Full_Inference_CoCoT_Generated_Questions.ipynbnotebook. Its enhanced CoT architecture, context-aware reasoning using generated questions, optimized CoT prompting, and benefits from video slowdown provide superior performance. - The
Full_Inference.ipynbnotebook serves as an essential alternative for demonstrating strict compliance with guidelines requiring isolated, question-by-question analysis without complex contextual chaining. It offers methodological rigor and transparency.
- For achieving the best benchmark scores and aligning with updated competition requirements that benefit from advanced reasoning, we strongly recommend using the
- Virtual Environments: Always use.
- Configuration Management: Review settings carefully; use
SKIP_*flags effectively. SetSKIP_PREPARE=Falsewhen switching backends. Remember to setVIDEO_SPEED_FACTOR(e.g., to0.5) for better performance. - API Key Security: Never commit API keys. Use environment variables.
- Resource Cleanup & Cost Management: Be mindful of GCS/Vertex AI/Gemini API costs and File API expiration.
- Version Control (
.gitignore):# Python __pycache__/ *.py[cod] *$py.class # Environment venv/ .env* # Data / Cache / Videos downloads/ extracted_videos/ speed_videos/ hf_cache/ *.zip *.mp4 video_metadata_*.csv results_*.csv all_results/ generated_questions/ # Contains generated CSVs and JSONs # Jupyter .ipynb_checkpoints/ # Logs *.log # IDE/OS specific .DS_Store .vscode/
- Understand Backend Differences.
For detailed support, notebook-specific issues, or configuration questions, please contact Dylan at dadevchia@gmail.com, referencing the notebook name and specific configuration settings involved.
To help diagnose the problem effectively, please include the following information in your communication:
- Which Notebook?
- Which Cell?
- Configuration: (especially
USE_VERTEX,MODEL_NAME,VIDEO_SPEED_FACTOR, relevantSKIP_*flags). Remove API Keys. - Error Message:
- Goal:
- (If using Vertex AI):
gcloud infooutput.
MIT License