Python 3.10.14
- deep_audio_features 0.2.18
- librosa 0.10.2.post1
- pyAudioAnalysis 0.3.14
- tensorflow 2.10.1
- pandas 2.2.2
- numpy 1.26.4
Python 3.10.16
- accelerate 1.2.1
- pytorchvideo 0.1.5
- scikit-image 0.24.0
- torchaudio 2.5.1
- torchvision 0.20.1
- transformers 4.48.0
- datasets 2.14.5
- fsspec 2023.6.0
- torch 2.5.1
- evaluate 0.4.3
- pandas 2.2.3
- numpy 1.26.4
Python 3.7.12
Help resources:
Dependencies:
- cmake 3.30.3
- dlib 19.12.0
- opencv-contrib-python 4.10.0.84
- numpy 1.21.6
Python 3.6.13
References:
Dependencies:
- cmake 3.28.4
- mtcnn 0.1.1
- pyannote.video 1.6.3
- ffmpeg 2.7
- numpy 1.19.2
- pandas 1.1.5
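A quick way to confirm that an environment matches its pinned versions is to compare them against what is actually installed. The following is a minimal sketch, not part of the project: the helper names `installed_version` and `find_mismatches` are hypothetical, and the `EXPECTED` table copies only the pins listed under Python 3.10.14. On the Python 3.6 and 3.7 environments, `importlib.metadata` is not in the standard library, so the sketch falls back to the `importlib_metadata` backport, which would have to be installed separately.

```python
"""Compare installed package versions against a table of expected pins."""
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.6/3.7: needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Pins copied from the Python 3.10.14 list; swap in another list as needed.
EXPECTED = {
    "deep_audio_features": "0.2.18",
    "librosa": "0.10.2.post1",
    "pyAudioAnalysis": "0.3.14",
    "tensorflow": "2.10.1",
    "pandas": "2.2.2",
    "numpy": "1.26.4",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the file inside an activated environment prints one line per mismatched or missing package and nothing when every pin is satisfied.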
To run the project end-to-end, follow the order below.
Detailed installation instructions for each component are provided later.
Use the scripts in the gsoc18_RedHenLab folder to turn raw videos into small, usable segments for the dataset.
Ensure proper configuration of:
gsoc18_RedHenLab/video_processing_pipeline/stages.py
Once configured, the pipeline should run without issues.
For additional help: GSoC18 RedHenLab Video Processing Pipeline
All scripts are located in the Scripts directory.
Follow this order for execution:
Outputs video segments (~10 seconds).
Example:
python 1.video_segmentation.py -c sober -o Scripts/test

Joins audio and video files.
Example:
python 2.fetch_audio_to_video.py \
-Vsrc "..\VIDEO_DATA\Segment_Output\drunk" \
-Ssrc "..\gsoc18_RedHenLab\video_processing_pipeline\4_face_cropping\audio_drunk_output" \
-Sdest "Scripts/test"

Outputs audio segments (~10 seconds). Example:
python 3.audio_segmentation.py -c drunk

Use the training split block to create a 70% train / 30% test data split.
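For orientation, a 70/30 split can be sketched in a few lines of standard-library Python. This is an illustration only and not the code from the training split block: the function name `split_train_test`, the seed, and the segment file names are all hypothetical, and the sketch splits at the level of individual segments. If segments cut from the same recording must stay together, the same function could be applied to recording IDs instead.

```python
import random


def split_train_test(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of `items` reproducibly and split it into (train, test)."""
    shuffled = sorted(items)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical segment file names, for illustration only.
segments = [f"drunk_{i:03d}.mp4" for i in range(10)]
train, test = split_train_test(segments)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible between runs, which matters when the resulting CSV is reused by later steps.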
Reorders audio segments based on the train-test split CSV. Example:
python 5.video_to_sound_splits.py \
-csv "Thesis/CSVs/Sound_Video_Train_Tests/video_sound_train_val_split_26_processed_v3.csv" \
-d "Thesis/Sound_For_Training" \
--drunk_dir "Thesis/Segment_Output_Audio_Test/drunk" \
--sober_dir "Thesis/Segment_Output_Audio_Test/sober"

The remaining scripts in the directory are helper modules that provide training and utility functions.
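The reordering of audio segments according to the split CSV can be pictured roughly as follows. This is a sketch under stated assumptions, not the implementation of 5.video_to_sound_splits.py: the CSV column names `filename`, `label`, and `split` are hypothetical (the real CSV layout is not described here), and so are the function name and the `<split>/<label>/` output layout.

```python
import csv
import shutil
from pathlib import Path


def copy_segments_by_split(csv_path, source_dirs, dest_root):
    """Copy each audio file listed in the CSV into dest_root/<split>/<label>/.

    Assumed (hypothetical) CSV columns: filename, label, split.
    `source_dirs` maps a label such as "drunk" to the folder holding its files.
    Returns the files that were listed in the CSV but not found on disk.
    """
    missing = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            source = Path(source_dirs[row["label"]]) / row["filename"]
            if not source.is_file():
                missing.append(source)
                continue
            target_dir = Path(dest_root) / row["split"] / row["label"]
            target_dir.mkdir(parents=True, exist_ok=True)
            # copy2 keeps timestamps; the originals stay in place
            shutil.copy2(source, target_dir / source.name)
    return missing
```

Copying rather than moving leaves the segmented audio untouched, so the step can be repeated with a different CSV without regenerating the segments.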
All notebooks are located in the Notebooks directory:
- LSTM_Model.ipynb: Trains an LSTM model using VGG-extracted features. Run the extractor first: Scripts/VGG_feature_extractor.py (dl environment)
- Ts_Audio.ipynb: Extracts CNN and MFCC features to train CNN and DNN-MFCC models using TensorFlow-Keras. (dl environment)
- Video_Transformer.ipynb & Audio_Transformer.ipynb: Contain the full code to train the video and audio transformers. Can be run locally or on Google Colab. (transformers environment)
- Predictions.ipynb & Late_Fusion.ipynb: Generate validation predictions and combine multimodal results. (transformers environment)
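To make the idea of combining multimodal results concrete, here is a minimal late-fusion sketch. The fusion rule used in Late_Fusion.ipynb is not stated here, so this example assumes a weighted average of per-class probabilities from the video and audio models, which is one common form of late fusion. The function name `late_fusion`, the weight, the class order, and the probabilities are all made up for illustration.

```python
def late_fusion(video_probs, audio_probs, video_weight=0.5):
    """Blend per-sample class probabilities from two models with a weighted average.

    Each argument is a list of rows: one row per sample, one probability per class.
    Returns (fused_probabilities, predicted_class_indices).
    """
    if len(video_probs) != len(audio_probs):
        raise ValueError("both models must score the same samples")
    audio_weight = 1.0 - video_weight
    fused, predictions = [], []
    for video_row, audio_row in zip(video_probs, audio_probs):
        row = [video_weight * v + audio_weight * a
               for v, a in zip(video_row, audio_row)]
        fused.append(row)
        # index of the largest fused probability is the predicted class
        predictions.append(max(range(len(row)), key=row.__getitem__))
    return fused, predictions


# Made-up probabilities for two samples; column 0 = sober, column 1 = drunk (assumed order).
video = [[0.80, 0.20], [0.40, 0.60]]
audio = [[0.60, 0.40], [0.10, 0.90]]
fused, labels = late_fusion(video, audio)
print(labels)
```

Because the fusion happens on model outputs, each modality can be trained and validated on its own, and `video_weight` could be tuned on the validation predictions.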
| Script / Notebook | Environment |
|---|---|
| video_segmentation.py | pyannote |
| fetch_audio_to_video.py | pyannote |
| audio_segmentation.py | pyannote |
| Train_Val_Split_Frames_Visualizer.ipynb | dl |
| video_to_sound_splits.py | pyannote |
| VGG_feature_extractor.py | dl |
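When scripting the pipeline, the table can be kept as a small lookup so that each step is run in the intended environment. This is only a convenience sketch: the dictionary copies the table, while the helper `environment_for` and its handling of numeric prefixes such as `1.` (as seen in the example commands) are assumptions for illustration.

```python
# Mapping copied from the script/environment table.
SCRIPT_ENVIRONMENT = {
    "video_segmentation.py": "pyannote",
    "fetch_audio_to_video.py": "pyannote",
    "audio_segmentation.py": "pyannote",
    "Train_Val_Split_Frames_Visualizer.ipynb": "dl",
    "video_to_sound_splits.py": "pyannote",
    "VGG_feature_extractor.py": "dl",
}


def environment_for(path):
    """Return the environment for a script, ignoring folders and a numeric prefix like '1.'."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]  # drop any folder part
    prefix, dot, rest = name.partition(".")
    if dot and prefix.isdigit():  # "1.video_segmentation.py" -> "video_segmentation.py"
        name = rest
    return SCRIPT_ENVIRONMENT[name]  # raises KeyError for unlisted files


print(environment_for("Scripts/1.video_segmentation.py"))
```

An unlisted file raises `KeyError`, which is a deliberate choice here: it is safer to stop than to guess an environment.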
To create the dataset, run the scripts in the gsoc18_RedHenLab folder.
Follow the detailed instructions here:
GSoC18 RedHenLab Pipeline Guide
⚠️ Note: The original codebase is relatively old. Following the instructions exactly may lead to dependency conflicts. To resolve these issues:
- Create one environment with Python 3.6
- Create another with Python 3.7 (required for OpenCV installation, especially on Windows)
Summary
- Use pyannote for segmentation tasks.
- Use dl for deep learning (VGG, LSTM, CNN).
- Use transformers for training the video and audio transformers, generating validation predictions, and late fusion.
- Ensure dataset generation is completed before running notebooks.