- Extract audio from a video file.
- Segment audio into 5-second chunks.
- Enhance audio quality and save the enhanced chunks.
- Perform speech-to-text transcription using Whisper.
- Conduct sentiment analysis on the transcribed text using TextBlob.
- Save results in CSV.
- Visualize:
- Histogram of voice activity per 5-second window
- Sentiment distribution chart showing the Positive / Neutral / Negative ratio
pip install -r requirements.txt
sudo apt install ffmpeg # Linux
brew install ffmpeg # macOS
winget install ffmpeg # Windows
pandas
numpy
opencv-python
matplotlib
tqdm
ipython
speechrecognition
pydub
textblob
nltk
seaborn
vosk
wave

Ensure you have downloaded the video.
jupyter notebook main.ipynb

🔹 Model: OpenCV
🔹 Input: Video files (.mp4 format)
🔹 Output: Metadata
| Metadata | Value |
|---|---|
| # of frames | 10691.0 |
| Frame Height | 10691.0 |
| Frame Width | 10691.0 |
| Frames per second | 29.964424428092247 |
🔹 Model: ffmpeg
🔹 Input: Video files (.mp4 format)
🔹 Output: Audio files (.wav format)
1️⃣ Invoke ffmpeg to extract audio from video.
2️⃣ "-ac", "1": Convert audio to single-channel
3️⃣ "-ar", "16000": Set audio sample rate to 16 kHz (16,000 Hz).
4️⃣ "-acodec", "pcm_s16le": Specify the audio codec as pcm_s16le, i.e. linear PCM, 16-bit little-endian (an uncompressed format commonly used for high-quality audio).
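The four flags above correspond to a single ffmpeg invocation; a sketch via `subprocess` (the function names and file paths are illustrative):

```python
import subprocess

def build_ffmpeg_cmd(video_path, wav_path):
    """Assemble the ffmpeg command described above."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,        # input video
        "-ac", "1",              # convert to mono (single channel)
        "-ar", "16000",          # resample to 16 kHz
        "-acodec", "pcm_s16le",  # 16-bit little-endian linear PCM
        wav_path,
    ]

def extract_audio(video_path, wav_path):
    subprocess.run(build_ffmpeg_cmd(video_path, wav_path), check=True)
```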
🔹 Model: AudioSegment
🔹 Input: Audio files (.wav format)
🔹 Output: chunk_i.wav
1️⃣ Split into 5-second segments for efficient processing.
🔹 Model: AudioSegment
🔹 Input: chunk_i.wav
🔹 Output: chunk_i_enhanced.wav
1️⃣ Enhance the volume
audio + 10: Boost the audio by 10 decibels (dB) to make it louder.
2️⃣ Human voice is usually between 300 Hz and 3 kHz.
high_pass_filter(louder_audio, cutoff=300): Frequencies below 300 Hz are filtered out, removing low-frequency noise such as wind and bass rumble.
low_pass_filter(filtered_audio, cutoff=3000): Frequencies above 3000 Hz are filtered out, removing high-frequency hiss.
3️⃣ Normalise to balance the volume
normalized_audio = normalize(filtered_audio): adjusts the overall gain so that the peak reaches a standard level (close to 0 dBFS), preventing the audio from being too loud or too quiet.
4️⃣ Save the processed audio
🔹 Model: Whisper (base model)
🔹 Input: chunk_i_enhanced.wav
🔹 Output: CSV file with timestamps and transcriptions
| Timestamp | Transcription |
|---|---|
| 0s | Okay, so the |
| 5s | |
| 10s | you're going to complete and use the key auto... |
| 15s | upgrade vehicle, and keep your needs off the ... |
| 20s | Okay, so when you see that some driver indica... |
The output file is named: whisper_video_1731617801_transcripts.csv
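A sketch of this transcription step with the `openai-whisper` package (the helper names and CSV layout mirror the table above; the 5-second timestamp rule is an assumption from the chunking step):

```python
import csv

def timestamp_label(i, chunk_s=5):
    """Chunk index -> '0s', '5s', '10s', ..."""
    return f"{i * chunk_s}s"

def transcribe_chunks(chunk_paths, csv_path):
    import whisper  # imported lazily: load_model downloads weights on first use
    model = whisper.load_model("base")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Timestamp", "Transcription"])
        for i, path in enumerate(chunk_paths):
            text = model.transcribe(path)["text"].strip()
            writer.writerow([timestamp_label(i), text])
```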
🔹 Model: TextBlob (A lexicon-based sentiment analysis tool)
🔹 Input: Transcribed text (from the Whisper CSV)
🔹 Output: CSV file with timestamps, transcriptions, Sentiment
| Timestamp | Transcription | Sentiment |
|---|---|---|
| 0 | | Neutral |
| 5 | Okay, so the | Positive |
| 10 | you're going to complete and use the key auto... | Positive |
| 15 | upgrade vehicle, and keep your needs off the ... | Positive |
| 20 | Okay, so when you see that some driver indica... | Positive |
The output file is named: video_1731617801_sentiments.csv
1️⃣ Voice Activity Histogram
- Histogram of voice activity per 5-second window
2️⃣ Sentiment Distribution Plot
- Bar chart of the Positive / Neutral / Negative ratio
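The two plots can be sketched with matplotlib (function names, axis labels, and output paths are illustrative; the `Agg` backend lets them render without a display):

```python
from collections import Counter
import matplotlib
matplotlib.use("Agg")  # render to image files, no GUI needed
import matplotlib.pyplot as plt

def plot_voice_activity(starts, voiced_seconds, path):
    """Bar chart: voiced seconds in each 5-second window."""
    plt.figure()
    plt.bar(starts, voiced_seconds, width=4.0)
    plt.xlabel("Window start (s)")
    plt.ylabel("Voiced duration (s)")
    plt.title("Voice activity per 5-second window")
    plt.savefig(path)
    plt.close()

def plot_sentiment_distribution(sentiments, path):
    """Bar chart of Positive / Neutral / Negative counts."""
    counts = Counter(sentiments)
    labels = ["Positive", "Neutral", "Negative"]
    plt.figure()
    plt.bar(labels, [counts.get(k, 0) for k in labels])
    plt.ylabel("Number of 5-second windows")
    plt.title("Sentiment distribution")
    plt.savefig(path)
    plt.close()
```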