When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis (SeeUnsafe)
SeeUnsafe is an MLLM-integrated framework for traffic accident analysis.
SeeUnsafe relies on several open-source projects, including GroundingDINO, SAM, SAM2, and LLaVA-NeXT. The code has been tested on Ubuntu 22.04 with CUDA 11.8 and PyTorch 2.3.1+cu118.
- Install SeeUnsafe and create a new environment
git clone https://github.com/ai4ce/SeeUnsafe
conda create --name SeeUnsafe python=3.10
conda activate SeeUnsafe
cd SeeUnsafe
pip install -r requirements.txt
- Install PyTorch
pip install torch==2.3.1+cu118 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
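As a quick, optional sanity check (not part of the original instructions), you can confirm that the CUDA build of PyTorch was installed and that a GPU is visible:

```bash
# Optional sanity check: print the installed PyTorch version and whether CUDA is available.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```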
- Install GroundingDINO, SAM, SAM2 in the same environment
git clone https://github.com/IDEA-Research/GroundingDINO
git clone https://github.com/facebookresearch/segment-anything.git
git clone https://github.com/facebookresearch/segment-anything-2.git
Make sure each of these packages is installed in editable mode with `pip install -e .`, as sketched below.
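A minimal sketch of the editable installs, assuming the three repositories were cloned side by side as shown above:

```bash
# Install each dependency in editable mode inside the SeeUnsafe environment.
cd GroundingDINO && pip install -e . && cd ..
cd segment-anything && pip install -e . && cd ..
cd segment-anything-2 && pip install -e . && cd ..
```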
- The code still uses one checkpoint from segment-anything. Make sure you download it into the SeeUnsafe folder (a download sketch follows below). `default` or `vit_h`: ViT-H SAM model.
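A download sketch, using the ViT-H checkpoint URL listed in the segment-anything README; the target directory is an assumption based on the setup above:

```bash
# Download the ViT-H SAM checkpoint into the repository root (adjust the path if your layout differs).
cd SeeUnsafe
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```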
- Obtain an OpenAI API key and create a `key.py` file under the path `SeeUnsafe/`
touch key.py
echo 'projectkey="YOUR_OPENAI_API_KEY"' > key.py
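A quick, optional check that the key file is importable; `projectkey` is the variable name written by the `echo` command above:

```bash
# Optional check: import key.py and confirm the projectkey variable is set.
python -c "from key import projectkey; print('OpenAI key loaded:', bool(projectkey))"
```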
There are two critical parts of SeeUnsafe: the frame-wise information augmentation part and the task-specific mLLM evaluation part.
- Frame-Wise Information Augmentation

  This part integrates various computer vision models to extract the locations of cars and pedestrians, segment them, and add visual prompts in different colors to facilitate understanding by the multimodal large language model (mLLM).

  `track_objects.py`: takes an MP4 video file and the number of key frames as input. It adds visual prompts to cars and pedestrians in the video, uniformly samples the specified number of key frames, and saves the indices of the key frames along with the center coordinates of the detected objects in those frames. A sketch of a typical invocation follows this list.

  - `--input`: the input video path (mp4)
  - `--output`: output path to the video with augmented visual prompts (mp4)
  - `--num_key_frames`: number of keyframes for uniform sampling (int)
  - `--bbx_file`: file to store the center coordinates of the detected objects in keyframes (csv)
  - `--index_file`: file to store the indexes of keyframes (csv)
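  A sketch of a typical `track_objects.py` run, using the flags documented above; the file names and the number of key frames are placeholders, not values from the original instructions:

  ```bash
  # Example: add visual prompts to a crash video, sample 10 key frames,
  # and record object centers and key frame indexes for the mLLM stage.
  python track_objects.py \
      --input demo/accident.mp4 \
      --output demo/accident_augmented.mp4 \
      --num_key_frames 10 \
      --bbx_file demo/accident_bbx.csv \
      --index_file demo/accident_index.csv
  ```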
- Task-Specific mLLM Evaluation

  This part takes the visually augmented video and key frame indices obtained from the frame-wise information augmentation stage, segments the enhanced video into multiple key clips, and calls the multimodal large language model (mLLM) to analyze the severity of the accident in each clip.

  `vlm.py`: calls the OpenAI API (gpt-4o by default). A sketch of a typical invocation follows this list.

  - `--input`: input video path (mp4)
  - `--list`: list of key frame indexes (str)
  - `--output`: output file path of responses by the mLLM (txt)

  Note that the length of each clip can be specified in `def process_images`.

  `vlm.sh`: a script that batch-processes videos by calling `vlm.py`.
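  A sketch of a single-video `vlm.py` run, using the flags documented above; the file names and the key frame index string are placeholders, and the exact string format expected by `--list` should be checked against the script:

  ```bash
  # Example: send one augmented video to the mLLM, passing the key frame indexes
  # produced by track_objects.py, and save the responses to a text file.
  python vlm.py \
      --input demo/accident_augmented.mp4 \
      --list "12,48,96,144,192" \
      --output demo/accident_responses.txt
  ```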
- Main logic: Accident analysis pipeline using GPT-4o and GPT-4o mini
- Pipeline using other models: LLaVA-NeXT, VideoCLIP
- IMS calculation: Information match score
- Other experimental functionalities: RAG, key frame selection, trajectory-by-grounding, and maybe even more!
- Dataset preparation: Toyota Woven Traffic Safety Dataset
Stay tuned for updates!
If you find this work useful for your research, please cite our paper:
@article{zhang2025language,
title={When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis},
author={Zhang, Ruixuan and Wang, Beichen and Zhang, Juexiao and Bian, Zilin and Feng, Chen and Ozbay, Kaan},
journal={arXiv preprint arXiv:2501.10604},
year={2025}
}
The codebase builds upon these wonderful projects: SeeDo, SAM, GroundingDINO, and MASA. Feel free to play around with them!