When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis (SeeUnsafe)

SeeUnsafe is an MLLM-integrated framework for traffic accident analysis.

Setup Instructions

SeeUnsafe relies on several open-source projects, including GroundingDINO, SAM, SAM2, and LLaVA-NeXT. The code has been tested on Ubuntu 22.04 with CUDA 11.8 and PyTorch 2.3.1+cu118.

  • Install SeeUnsafe and create a new environment
git clone https://github.com/ai4ce/SeeUnsafe
conda create --name SeeUnsafe python=3.10
conda activate SeeUnsafe
cd SeeUnsafe
pip install -r requirements.txt
  • Install PyTorch
pip install torch==2.3.1+cu118 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  • Install GroundingDINO, SAM, SAM2 in the same environment
git clone https://github.com/IDEA-Research/GroundingDINO
git clone https://github.com/facebookresearch/segment-anything.git
git clone https://github.com/facebookresearch/segment-anything-2.git

Make sure these projects are installed as editable packages with pip install -e . (a minimal sketch follows).
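For example, a sketch of the editable installs, assuming the default clone directory names from the commands above:
cd GroundingDINO && pip install -e . && cd ..
cd segment-anything && pip install -e . && cd ..
cd segment-anything-2 && pip install -e . && cd ..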

  • The code still uses one checkpoint from segment-anything.

Make sure you download it into the SeeUnsafe folder: use the default or vit_h checkpoint (ViT-H SAM model).
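A minimal download sketch, using the ViT-H checkpoint URL published in the segment-anything README:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth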

  • Obtain an OpenAI API key and create a key.py file under the SeeUnsafe/ directory
touch key.py
echo 'projectkey="YOUR_OPENAI_API_KEY"' > key.py
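As a quick sanity check, you can verify that the key file loads (a sketch; it only assumes the projectkey variable written above):
python -c "from key import projectkey; print('API key loaded:', bool(projectkey))"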

Pipeline

There are two critical parts of SeeUnsafe: a frame-wise information augmentation part and a task-specific mLLM evaluation part. An example end-to-end invocation is sketched after the list below.

  1. Frame-Wise Information Augmentation. This part integrates various computer vision models to extract the locations of cars and pedestrians, segment them, and add visual prompts in different colors to facilitate understanding by the multimodal large language model (mLLM).

    track_objects.py: takes an MP4 video file and the number of key frames as input. It adds visual prompts to cars and pedestrians in the video, uniformly samples the specified number of key frames, and saves the indices of the key frames along with the center coordinates of the detected objects in those frames.

    --input: the input video path (mp4)

    --output: output path for the video with augmented visual prompts (mp4)

    --num_key_frames: number of key frames to sample uniformly (int)

    --bbx_file: file to store the center coordinates of the detected objects in the key frames (csv)

    --index_file: file to store the key frame indices (csv)

  2. Task-Specific mLLM Evaluation. This part takes the visually augmented video and key frame indices obtained from the frame-wise information augmentation stage, segments the enhanced video into multiple key clips, and calls the multimodal large language model (mLLM) to analyze the severity of the accident in each clip.

    vlm.py: calls the OpenAI API (gpt-4o by default)

    --input: input video path (mp4)

    --list: list of key frame indices (str)

    --output: output file path for the mLLM responses (txt)

    Note that the length of each clip can be specified in the process_images function.

    vlm.sh: A script that can batch process videos by calling vlm.py.
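An example end-to-end invocation of the two stages (a sketch: the file names and the key frame index list are placeholders, and the comma-separated format of --list is an assumption based on the flag descriptions above):
python track_objects.py --input demo.mp4 --output demo_prompted.mp4 --num_key_frames 8 --bbx_file demo_bbx.csv --index_file demo_index.csv
python vlm.py --input demo_prompted.mp4 --list "0,60,120,180,240,300,360,420" --output demo_responses.txt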

To-Do List

  • Main logic: Accident analysis pipeline using GPT-4o and GPT-4o mini

  • Pipeline using other models: LLaVA-NeXT, VideoCLIP

  • IMS calculation: Information match score

  • Other experimental functionalities: RAG, key frame selection, trajectory-by-grounding, and maybe even more!

  • Dataset preparation: Toyota Woven Traffic Safety Dataset

Stay tuned for updates!

Citation

If you find this work useful for your research, please cite our paper:

@article{zhang2025language,
  title={When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis},
  author={Zhang, Ruixuan and Wang, Beichen and Zhang, Juexiao and Bian, Zilin and Feng, Chen and Ozbay, Kaan},
  journal={arXiv preprint arXiv:2501.10604},
  year={2025}
}

Acknowledgments

The codebase builds upon these wonderful projects: SeeDo, SAM, GroundingDINO, and MASA. Feel free to play around with them!
