This system enables users to query the content of videos using natural language by automatically extracting structured information from video frames and storing it in a relational database. The system is adaptive: if a user's query cannot be answered with the current database schema, it automatically updates the schema and extracts the missing attributes as new tables and/or new columns.
- Frame Sampling: Input videos are sampled into individual frames at a configurable rate (a sketch of this step follows the list).
- Frame Captioning: Each frame is processed by a vision-language model, which generates detailed captions describing all visible objects, their categories, attributes (e.g., brand, color), actions, and locations.
- Table Generation: The system parses these captions and organizes the extracted information into SQL tables. Each object and its attributes are stored in a structured, queryable format. The database schema is generated dynamically based on the detected object types and their properties.
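A minimal sketch of the frame-sampling step, assuming OpenCV (`opencv-python`) is available; the `sample_frames` helper and its parameters are illustrative, not the repo's actual API:

```python
# Illustrative frame-sampling sketch; not the repo's actual implementation.
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 1.0):
    """Yield (timestamp_seconds, frame) pairs at a configurable sampling rate."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0    # fall back if FPS metadata is missing
    step = max(1, int(fps * every_n_seconds))  # frames between consecutive samples
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / fps, frame
        index += 1
    cap.release()

# Example: sample one frame per second from the configured input video.
for timestamp, frame in sample_frames('datasets/bdd/00afa5b2-c14a542f.mov'):
    print(f"frame at {timestamp:.1f}s, shape {frame.shape}")
```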
- Query Submission: Users submit natural language queries (e.g., "Find all frames where a white BMW is turning right") through the web frontend.
- Text-to-SQL Translation: The system uses a language model to translate the user's query into an SQL statement, leveraging the current database schema (see the sketch after this list).
- Result Retrieval: The SQL query is executed, and the relevant frames or information are returned to the user.
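A minimal sketch of the text-to-SQL and retrieval steps, assuming the OpenAI Python client and the SQLite integration; the `translate_to_sql` helper, the prompt wording, and the `outputs/frames.db` path are assumptions for illustration:

```python
# Illustrative text-to-SQL sketch; the repo's prompts and schema handling may differ.
import sqlite3
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_to_sql(question: str, schema_ddl: str) -> str:
    """Ask the language model for a single SQL query grounded in the current schema."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Translate the user's question into one SQLite query.\n"
                        f"Schema:\n{schema_ddl}\nReturn only SQL."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

# Usage: feed the live schema into the prompt, then run the generated SQL.
conn = sqlite3.connect("outputs/frames.db")  # illustrative path
schema_ddl = "\n".join(row[0] for row in conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"))
sql = translate_to_sql("Find all frames where a white BMW is turning right", schema_ddl)
print(conn.execute(sql).fetchall())
```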
- Schema Sufficiency Check: If the system determines that the current database schema is missing key attributes or tables needed to answer the query, it invokes an LLM-based "Judge" to assess what's missing.
- Schema Extension: If new object categories or attributes are required, the system will (see the sketch after this list):
  - Generate new tables if entirely new table(s) are needed.
  - Add new columns to existing tables if only additional attribute(s) are required.
- Pipeline Re-run: The system then re-runs the pipeline: it extracts the missing information from the video, updates the database, and automatically re-attempts the user's query.
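A hedged sketch of the schema-extension step against a SQLite backend; the `extend_schema` helper, the verdict format, and the table/column names are illustrative, not the repo's actual Judge interface:

```python
# Illustrative schema-extension sketch; the Judge's verdict format is assumed.
import sqlite3

conn = sqlite3.connect("outputs/frames.db")  # illustrative path

def extend_schema(verdict: dict) -> None:
    """Apply a Judge verdict: create new tables and/or add new columns."""
    for table, columns in verdict.get("new_tables", {}).items():
        cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (frame_id INTEGER, {cols})")
    for table, columns in verdict.get("new_columns", {}).items():
        for name, sql_type in columns.items():
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
    conn.commit()

# Example verdict: a query about a "damaged SUV" needs a vehicle-condition attribute.
extend_schema({"new_columns": {"vehicles": {"condition": "TEXT"}}})
```

After the schema is extended, the pipeline re-runs extraction to populate the new tables or columns before re-attempting the query.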
- `pipeline.py`: Backend entry point. Orchestrates the full video-to-table and query pipeline.
- `models/`: Submodules for vision-language models, language models, and text-to-SQL models. Includes configuration files and model wrappers.
- `datasets/`: References to all datasets (raw data) used for training, evaluation, and testing.
- `data_processing/`: Scripts and configs for preprocessing and sampling videos.
- `outputs/`: Contains the generated PostgreSQL tables and processed outputs.
- `frontend/`: Flask app and HTML/JS interface for uploading videos and querying data. WIP.
- `database_integration/`: Interfaces for SQLite and vector database integration.
- `pipelines/`: Modular pipeline components for frame extraction, captioning, text-to-SQL, text-to-table, and text-to-column.
- `requirements.txt`: Python dependencies for the project.
Install the Python dependencies:

```bash
pip install -r requirements.txt
```

If running on a Slurm cluster, request an interactive GPU session, for example:

```bash
srun --partition=gpu --gres=gpu:2 --cpus-per-task=4 --mem=16G --time=02:00:00 --pty bash
```
Before running the pipeline, edit `config/config.py` to set the appropriate options for your environment and task. Key settings include (a consolidated sketch follows the list):
- Input video path and filename:
  ```python
  video_path = 'datasets/bdd/'
  video_filename = '00afa5b2-c14a542f.mov'
  ```
- Model selection:
  ```python
  caption_model_name = 'OpenAI;gpt-4o-mini'
  text2table_model_name = 'OpenAI;gpt-4o-mini'
  text2sql_model_name = 'OpenAI;gpt-4o-mini'
  ```
- Database paths and names
- Pipeline toggles and parameters (e.g., batch size, precision, enabling/disabling modules)
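For orientation, a minimal sketch of what `config/config.py` might contain. The video and model settings are taken from above; the remaining field names and values (`batch_size`, `precision`) are assumptions, while `llm_judge` is referenced by the example in `pipeline.py` below:

```python
# Illustrative Config sketch; field names beyond the documented ones are assumptions.
class Config:
    # Input video
    video_path = 'datasets/bdd/'
    video_filename = '00afa5b2-c14a542f.mov'

    # Model selection ("provider;model" strings)
    caption_model_name = 'OpenAI;gpt-4o-mini'
    text2table_model_name = 'OpenAI;gpt-4o-mini'
    text2sql_model_name = 'OpenAI;gpt-4o-mini'

    # Pipeline toggles and parameters
    batch_size = 8        # frames captioned per batch (assumed name/value)
    precision = 'fp16'    # model precision (assumed name/value)
    llm_judge = True      # enable the LLM-based schema-sufficiency Judge
```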
Edit `pipeline.py` as needed for your use case (see the `__main__` section for examples). Example of processing a query that needs a missing table:
```python
# In pipeline.py (__main__ section)
import asyncio
import time

# Config is assumed to be imported at the top of pipeline.py
# (e.g., from config.config import Config)
query_pipeline = VideoQueryPipeline()

# Example: query that requires a table missing from the current schema (advanced use case)
question = "What is the first frame in which a damaged SUV stops at a red light?"

start_time = time.time()
result = asyncio.run(query_pipeline.process_query(language_query=question,
                                                  llm_judge=Config.llm_judge))
end_time = time.time()

print("SYSTEM RESPONSE:", result)
print(f"Time taken: {end_time - start_time:.2f}s")
```
Profile backend latency using Pyinstrument:

```bash
pyinstrument -o test_async_latency_text2table.html pipeline.py
```
For questions or contributions, please open an issue or contact the maintainers.