When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis (SeeUnsafe)
SeeUnsafe is an MLLM-integrated framework for traffic accident analysis.
SeeUnsafe relies on several open-source projects, including GroundingDINO, SAM, SAM2, and LLaVA-NeXT. The code has been tested on Ubuntu 22.04 with CUDA 11.8 and PyTorch 2.3.1+cu118.
- Install SeeUnsafe and create a new environment
git clone https://github.com/ai4ce/SeeUnsafe
conda create --name SeeUnsafe python=3.10
conda activate SeeUnsafe
cd SeeUnsafe
pip install -r requirements.txt
- Install PyTorch
pip install torch==2.3.1+cu118 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
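As a quick, optional sanity check (not part of the original instructions), you can confirm that the CUDA build of PyTorch was installed and that a GPU is visible:

```bash
# Optional sanity check: print the installed PyTorch version and whether CUDA is available.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```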
- Install GroundingDINO, SAM, SAM2 in the same environment
git clone https://github.com/IDEA-Research/GroundingDINO
git clone https://github.com/facebookresearch/segment-anything.git
git clone https://github.com/facebookresearch/segment-anything-2.git
Make sure each of these packages is installed in editable mode with `pip install -e .`, as sketched below.
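A minimal sketch of the editable installs, assuming the three repositories were cloned side by side as shown above:

```bash
# Install each dependency in editable mode inside the SeeUnsafe environment.
cd GroundingDINO && pip install -e . && cd ..
cd segment-anything && pip install -e . && cd ..
cd segment-anything-2 && pip install -e . && cd ..
```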
- The code still uses one checkpoint from segment-anything. Make sure you download it into the SeeUnsafe folder (a download sketch follows below). `default` or `vit_h`: ViT-H SAM model.
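A download sketch, using the ViT-H checkpoint URL listed in the segment-anything README; the target directory is an assumption based on the setup above:

```bash
# Download the ViT-H SAM checkpoint into the repository root (adjust the path if your layout differs).
cd SeeUnsafe
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```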
- Obtain an OpenAI API key and create a `key.py` file under the path `SeeUnsafe/`
touch key.py
echo 'projectkey="YOUR_OPENAI_API_KEY"' > key.py
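A quick, optional check that the key file is importable; `projectkey` is the variable name written by the `echo` command above:

```bash
# Optional check: import key.py and confirm the projectkey variable is set.
python -c "from key import projectkey; print('OpenAI key loaded:', bool(projectkey))"
```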
There are two critical parts of SeeUnsafe: the frame-wise information augmentation part and the task-specific mLLM evaluation part.
- Frame-Wise Information Augmentation

  This part integrates various computer vision models to extract the locations of cars and pedestrians, segment them, and add visual prompts in different colors to facilitate understanding by the multimodal large language model (mLLM).

  `track_objects.py`: takes an MP4 video file and the number of key frames as input. It adds visual prompts to cars and pedestrians in the video, uniformly samples the specified number of key frames, and saves the indices of the key frames along with the center coordinates of the detected objects in those frames. A sketch of a typical invocation follows this list.

  - `--input`: the input video path (mp4)
  - `--output`: output path to the video with augmented visual prompts (mp4)
  - `--num_key_frames`: number of keyframes for uniform sampling (int)
  - `--bbx_file`: file to store the center coordinates of the detected objects in keyframes (csv)
  - `--index_file`: file to store the indexes of keyframes (csv)
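  A sketch of a typical `track_objects.py` run, using the flags documented above; the file names and the number of key frames are placeholders, not values from the original instructions:

  ```bash
  # Example: add visual prompts to a crash video, sample 10 key frames,
  # and record object centers and key frame indexes for the mLLM stage.
  python track_objects.py \
      --input demo/accident.mp4 \
      --output demo/accident_augmented.mp4 \
      --num_key_frames 10 \
      --bbx_file demo/accident_bbx.csv \
      --index_file demo/accident_index.csv
  ```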
- Task-Specific mLLM Evaluation

  This part takes the visually augmented video and key frame indices obtained from the frame-wise information augmentation stage, segments the enhanced video into multiple key clips, and calls the multimodal large language model (mLLM) to analyze the severity of the accident in each clip.

  `vlm.py`: calls the OpenAI API (gpt-4o by default). A sketch of a typical invocation follows this list.

  - `--input`: input video path (mp4)
  - `--list`: list of key frame indexes (str)
  - `--output`: output file path of responses by the mLLM (txt)

  Note that the length of each clip can be specified in `def process_images`.

  `vlm.sh`: a script that batch-processes videos by calling `vlm.py`.
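  A sketch of a single-video `vlm.py` run, using the flags documented above; the file names and the key frame index string are placeholders, and the exact string format expected by `--list` should be checked against the script:

  ```bash
  # Example: send one augmented video to the mLLM, passing the key frame indexes
  # produced by track_objects.py, and save the responses to a text file.
  python vlm.py \
      --input demo/accident_augmented.mp4 \
      --list "12,48,96,144,192" \
      --output demo/accident_responses.txt
  ```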
- Main logic: Accident analysis pipeline using GPT-4o and GPT-4o mini
- Pipeline using other models: LLaVA-NeXT, VideoCLIP
- IMS calculation: Information match score
- Other experimental functionalities: RAG, key frame selection, trajectory-by-grounding, and maybe even more!
- Dataset preparation: Toyota Woven Traffic Safety Dataset
Stay tuned for updates!
If you find this work useful for your research, please cite our paper:
@article{zhang2025language,
title={When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis},
author={Zhang, Ruixuan and Wang, Beichen and Zhang, Juexiao and Bian, Zilin and Feng, Chen and Ozbay, Kaan},
journal={arXiv preprint arXiv:2501.10604},
year={2025}
}
The codebase builds upon these wonderful projects: SeeDo, SAM, GroundingDINO, and MASA. Feel free to play around with them!