Haley Egan
This project was conducted in collaboration with Nan Hauser at the Center for Cetacean Research & Conservation, who generously shared a wealth of humpback whale audio data, in the hope of gaining a deeper understanding of whale language behaviors through machine learning.
This project implements a deep learning pipeline to classify humpback whale vocalizations by geographic location using Convolutional Neural Networks (CNNs). The approach transforms raw audio recordings into spectrogram images, which are then processed by a CNN in TensorFlow for location-based classification.
Audio → Spectrogram → CNN → Location Classification → Model and Prediction Evaluation
- Audio Preprocessing: Raw whale audio files are loaded and preprocessed, e.g. cleaned and cropped (see the loading sketch after this list)
- Spectrogram Generation: Audio signals are converted to time-frequency representations (spectrograms)
- CNN Classification: A Convolutional Neural Network analyzes the spectrograms and learns patterns that distinguish whale locations
- Output: Location classification for each audio sample, and prediction of location for new audio samples
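As a rough illustration of the first step, here is a minimal sketch of loading and cropping a WAV recording with TensorFlow. The helper name, file path, and clip length are placeholders rather than the notebooks' actual code:

```python
import tensorflow as tf

def load_and_crop(path, clip_seconds=30):
    """Load a WAV file and crop it to a fixed-length clip (hypothetical helper)."""
    audio_bytes = tf.io.read_file(path)
    # decode_wav returns float32 samples in [-1, 1] plus the sample rate
    waveform, sample_rate = tf.audio.decode_wav(audio_bytes, desired_channels=1)
    waveform = tf.squeeze(waveform, axis=-1)                   # (samples, 1) -> (samples,)
    clip_len = clip_seconds * tf.cast(sample_rate, tf.int32)   # number of samples in one clip
    return waveform[:clip_len], sample_rate
```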
Location-Based Classification: Multi-class classification framework
- Classify each audio sample into one of multiple geographic locations
- Each location represents a distinct class in the classification problem
- Model outputs a probability distribution across all possible locations (see the sketch after this list)
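A minimal sketch of this multi-class setup in Keras follows. The layer sizes, input shape, and optimizer are illustrative assumptions, not the exact architecture used in the notebooks; the location names come from the dataset described below:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

LOCATIONS = ["Bermuda", "Cook Islands", "Hawaii", "Monterey"]  # class labels

# Small CNN over spectrogram "images"; the final softmax layer produces a
# probability distribution across all candidate locations.
model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),       # (freq bins, time frames, 1) - illustrative
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(len(LOCATIONS), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```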
Alternative Approaches:
- Multi-class: One model predicting among all locations (Location A, B, C, D...)
- Binary per location: Multiple binary models (one per location) asking "is this location X or not?" (see the sketch after this list)
- Hierarchical: Group locations by region, then classify within regions
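For contrast, a hedged sketch of the binary-per-location alternative, in which one sigmoid-output model per location answers "is this location X or not?"; shapes and layer choices are again illustrative:

```python
import tensorflow as tf

LOCATIONS = ["Bermuda", "Cook Islands", "Hawaii", "Monterey"]

def build_binary_model(input_shape=(128, 128, 1)):
    """One-vs-rest CNN for a single location (illustrative layers only)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(clip is from location X)
    ])

# One binary model per location, each trained on "X vs. everything else" labels.
binary_models = {loc: build_binary_model() for loc in LOCATIONS}
```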
- TensorFlow Audio Classification Tutorial - Official TensorFlow guide for audio processing
- CNNs for Audio Classification - Theory and implementation of CNNs for audio
- MNIST Audio Classification with Spectrograms - Practical Keras implementation example
- Custom Audio Classification with TensorFlow - Building custom audio classification models
- Audio Echo Processing - Audio augmentation and noise reduction techniques
Source: Center for Cetacean Research and Conservation (Nan Hauser's dataset) from Bermuda and the Cook Islands, and open source audio recordings from Hawaii and Monterey.
- Channels: Stereo audio (2 channels)
- Structure: Left and right audio channels are interleaved in single files (alternating left/right channel samples)
- Implication: Requires channel separation during preprocessing to access individual left/right audio streams
- Channel separation may be needed to analyze left vs. right audio independently (see the sketch after this list)
- Stereo format could provide spatial audio information useful for classification
- File format and sample rate specifications should be documented for consistent processing
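If the interleaved samples are handled as a flat array, channel separation can be done by striding. A minimal NumPy sketch is below; note that most audio libraries (including tf.audio.decode_wav) already return the channels as separate columns:

```python
import numpy as np

def split_interleaved_stereo(samples: np.ndarray):
    """Separate an interleaved [L, R, L, R, ...] sample array into two mono streams.

    Assumes `samples` is a flat 1-D array of alternating left/right samples.
    """
    left = samples[0::2]   # even-indexed samples -> left channel
    right = samples[1::2]  # odd-indexed samples  -> right channel
    return left, right
```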
The classification pipeline was tested on the original full-length audio files, as well as on shorter 75-second and 30-second clips. The 30-second clips proved as effective as, and occasionally better than, the longer clips at predicting location, and were significantly less computationally expensive, so 30-second clips were used for analysis and development of the pipeline. Further experimentation with audio file lengths is encouraged.
The process for segmenting the audio files can be found in the SplitAudio_30sec.ipynb and SplitAudio_75sec.ipynb notebooks.
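For illustration only (not the notebooks' exact code), a sketch of splitting one long recording into consecutive 30-second clips, assuming librosa and soundfile are available; the output file prefix is a placeholder:

```python
import librosa
import soundfile as sf

def split_into_clips(path, clip_seconds=30, out_prefix="clip"):
    """Split one long recording into consecutive fixed-length clips (illustrative)."""
    waveform, sr = librosa.load(path, sr=None, mono=True)   # keep the original sample rate
    samples_per_clip = int(clip_seconds * sr)
    for i in range(len(waveform) // samples_per_clip):
        clip = waveform[i * samples_per_clip:(i + 1) * samples_per_clip]
        sf.write(f"{out_prefix}_{i:03d}.wav", clip, sr)      # the final partial clip is dropped
```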
A verbose walkthrough of converting humpback audio files to spectrograms can be found in Audio_to_Specrogram.ipynb. The simplified version of the process is included in the main notebook, HumpbackWhale_SpectrogramCNN_30SecAudioClips.ipynb.
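A minimal sketch of the waveform-to-spectrogram conversion using tf.signal.stft; the frame length and step are illustrative parameters, not necessarily those used in the notebooks:

```python
import tensorflow as tf

def waveform_to_spectrogram(waveform, frame_length=1024, frame_step=512):
    """Convert a 1-D waveform tensor into a magnitude spectrogram (illustrative parameters)."""
    stft = tf.signal.stft(waveform, frame_length=frame_length, frame_step=frame_step)
    spectrogram = tf.abs(stft)                 # magnitude of each time/frequency bin
    # Add a trailing channel axis so the spectrogram can be fed to Conv2D layers.
    return spectrogram[..., tf.newaxis]
```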
Testing the CNN on the full audio files can be seen in Spectrogram_to_CNN_FullSong.ipynb.
The full classification notebook with model evaluation and predictions on 30-second audio segments can be found in the notebook HumpbackWhale_SpectrogramCNN_30SecAudioClips.ipynb.
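As a hypothetical end-to-end example of predicting the location of a single new clip, the sketch below reuses the helpers and model sketched earlier (load_and_crop, waveform_to_spectrogram, model, LOCATIONS); the file name and the resize to the model's input shape are assumptions:

```python
import numpy as np
import tensorflow as tf

waveform, _ = load_and_crop("new_recording.wav", clip_seconds=30)   # placeholder path
spectrogram = waveform_to_spectrogram(waveform)
spectrogram = tf.image.resize(spectrogram, [128, 128])              # match the model's input shape
probs = model.predict(spectrogram[tf.newaxis, ...])[0]              # add a batch dimension
predicted = LOCATIONS[int(np.argmax(probs))]
print(f"Predicted location: {predicted}", dict(zip(LOCATIONS, np.round(probs, 3))))
```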
Interactive Map of Pacific Ocean Humpback Whale Migration routes
The results and visuals below can be seen in the notebook HumpbackWhale_SpectrogramCNN_30SecAudioClips.ipynb. Further model evaluation metrics, including precision, recall, F1-score, accuracy, and loss, can also be found in the notebook.
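These metrics can be computed from the model's predictions with scikit-learn; the sketch below assumes placeholder test arrays x_test (spectrograms) and y_test (integer location labels), and is not the notebook's exact code:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_prob = model.predict(x_test)            # probability distribution per clip
y_pred = np.argmax(y_prob, axis=1)        # most likely location index per clip
print(classification_report(y_test, y_pred, target_names=LOCATIONS))
print(confusion_matrix(y_test, y_pred))
```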
Sample of Waveforms from Humpback Whale Audio Segments by Location

Example of a Waveform and Corresponding Spectrogram for an Audio Segment

Sample of Spectrograms from Humpback Whale Audio Segments by Location

Confusion Matrix of CNN Classification Results

Example of Model Prediction on New (never seen) Audio File

The class distribution in this notebook is imbalanced, with Bermuda having the least data. This is visible in the results: Bermuda has the highest number of misclassifications. This is something that can be adjusted and experimented with in the future. Ideally, all locations would have significantly more data, spanning many years, different types of recording equipment, and various recording locations within each region. Data quantity and diversity were constraints for this project, but with more data and expanded modeling techniques, the future possibilities are endless!
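One possible way to experiment with the imbalance is to weight each class inversely to its frequency during training, so that under-represented locations such as Bermuda count more in the loss. A sketch assuming placeholder training arrays x_train and y_train (integer labels 0..N-1):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))   # {class index: weight}
model.fit(x_train, y_train, epochs=20, class_weight=class_weight)
```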