Commit dbd68a1

cog-ified
1 parent 2a1bd63 commit dbd68a1

24 files changed, +162 -59 lines changed

README.md

+46 -2
@@ -1,12 +1,16 @@
 # Speech Emotion Recognition
 ## Introduction
+<a href="https://replicate.ai/x4nth055/emotion-recognition-using-speech"><img src="https://img.shields.io/static/v1?label=Replicate&message=Demo and Docker Image&color=darkgreen" height=20></a>
+
+
 - This repository handles building and training Speech Emotion Recognition System.
 - The basic idea behind this tool is to build and train/test a suited machine learning ( as well as deep learning ) algorithm that could recognize and detects human emotions from speech.
 - This is useful for many industry fields such as making product recommendations, affective computing, etc.
 - Check this [tutorial](https://www.thepythoncode.com/article/building-a-speech-emotion-recognizer-using-sklearn) for more information.
 ## Requirements
 - **Python 3.6+**
 ### Python Packages
+- **tensorflow**
 - **librosa==0.6.3**
 - **numpy**
 - **pandas**
@@ -38,7 +42,7 @@ Feature extraction is the main part of the speech emotion recognition system. It

 In this repository, we have used the most used features that are available in [librosa](https://github.com/librosa/librosa) library including:
 - [MFCC](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)
-- Chromagram
+- Chromagram
 - MEL Spectrogram Frequency (mel)
 - Contrast
 - Tonnetz (tonal centroid features)
@@ -102,6 +106,7 @@ print("Prediction:", rec.predict("data/tess_ravdess/validation/Actor_25/25_01_01
 Prediction: neutral
 Prediction: sad
 ```
+You can pass any audio file, if it's not in the appropriate format (16000Hz and mono channel), then it'll be automatically converted, make sure you have `ffmpeg` installed in your system and added to *PATH*.
 ## Example 2: Using RNNs for 5 Emotions
 ```python
 from deep_emotion_recognition import DeepEmotionRecognizer
@@ -143,6 +148,45 @@ true_neutral 3.846154 8.974360 82.051285 2.564103
 true_ps 2.564103 0.000000 1.282051 83.333328 12.820514
 true_happy 20.512821 2.564103 2.564103 2.564103 71.794876
 ```
+## Example 3: Not Passing any Model and Removing the Custom Dataset
+Below code initializes `EmotionRecognizer` with 3 chosen emotions while removing Custom dataset, and setting `balance` to `False`:
+```python
+from emotion_recognition import EmotionRecognizer
+# initialize instance, this will take a bit the first time executed
+# as it'll extract the features and calls determine_best_model() automatically
+# to load the best performing model on the picked dataset
+rec = EmotionRecognizer(emotions=["angry", "neutral", "sad"], balance=False, verbose=1, custom_db=False)
+# it will be trained, so no need to train this time
+# get the accuracy on the test set
+print(rec.confusion_matrix())
+# predict angry audio sample
+prediction = rec.predict('data/validation/Actor_10/03-02-05-02-02-02-10_angry.wav')
+print(f"Prediction: {prediction}")
+```
+**Output:**
+```
+[+] Best model determined: RandomForestClassifier with 93.454% test accuracy
+
+              predicted_angry  predicted_neutral  predicted_sad
+true_angry          98.275864           1.149425       0.574713
+true_neutral         0.917431          88.073395      11.009174
+true_sad             6.250000           1.875000      91.875000
+
+Prediction: angry
+```
+You can print the number of samples on each class:
+```python
+rec.get_samples_by_class()
+```
+**Output:**
+```
+         train  test  total
+angry      910   174   1084
+neutral    650   109    759
+sad        862   160   1022
+total     2422   443   2865
+```
+In this case, the dataset is only from TESS and RAVDESS, and not balanced, you can pass `True` to `balance` on the `EmotionRecognizer` instance to balance the data.
 ## Algorithms Used
 This repository can be used to build machine learning classifiers as well as regressors for the case of 3 emotions {'sad': 0, 'neutral': 1, 'happy': 2} and the case of 5 emotions {'angry': 1, 'sad': 2, 'neutral': 3, 'ps': 4, 'happy': 5}
 ### Classifiers
@@ -207,4 +251,4 @@ plot_histograms(classifiers=True)
 **Output:**

 <img src="images/Figure.png">
-<p align="center">A Histogram shows different algorithms metric results on different data sizes as well as time consumed to train/predict.</p>
+<p align="center">A Histogram shows different algorithms metric results on different data sizes as well as time consumed to train/predict.</p>

cog.yaml

+18
@@ -0,0 +1,18 @@
+build:
+  python_version: "3.6"
+  gpu: false
+  python_packages:
+    - pandas==1.1.5
+    - numpy==1.17.3
+    - wave==0.0.2
+    - sklearn==0.0
+    - librosa==0.6.3
+    - soundfile==0.9.0
+    - tqdm==4.28.1
+    - matplotlib==2.2.3
+    - pyaudio==0.2.11
+    - numba==0.48
+  system_packages:
+    - "ffmpeg"
+    - "portaudio19-dev"
+predict: "predict.py:EmoPredictor"

convert_wavs.py

+2 -1
@@ -17,10 +17,11 @@ def convert_audio(audio_path, target_path, remove=False):
             remove (bool): whether to remove the old file after converting
     Note that this function requires ffmpeg installed in your system."""

-    os.system(f"ffmpeg -i {audio_path} -ac 1 -ar 16000 {target_path}")
+    v = os.system(f"ffmpeg -i {audio_path} -ac 1 -ar 16000 {target_path}")
     # os.system(f"ffmpeg -i {audio_path} -ac 1 {target_path}")
     if remove:
         os.remove(audio_path)
+    return v


 def convert_audios(path, target_path, remove=False):
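
The hunk above makes `convert_audio` return the exit status of `os.system`, so a caller can tell whether `ffmpeg` succeeded. Below is a minimal sketch of the same conversion using an argument list with `subprocess.run`, which sidesteps shell quoting when a path contains spaces. `build_ffmpeg_command` and `convert_audio_checked` are hypothetical names that do not exist in the repository; only the `-ac 1 -ar 16000` flags come from the diff.

```python
import os
import subprocess


def build_ffmpeg_command(audio_path, target_path):
    """Return the ffmpeg argument list for a mono, 16000 Hz conversion."""
    return ["ffmpeg", "-i", audio_path, "-ac", "1", "-ar", "16000", target_path]


def convert_audio_checked(audio_path, target_path, remove=False):
    """Hypothetical variant of convert_audio that reports success as a bool."""
    # Passing a list (no shell) keeps paths with spaces or quotes intact.
    completed = subprocess.run(build_ffmpeg_command(audio_path, target_path))
    succeeded = completed.returncode == 0
    # Only delete the source file once the conversion is known to have worked.
    if remove and succeeded:
        os.remove(audio_path)
    return succeeded
```

Note that `subprocess.run` raises `FileNotFoundError` when `ffmpeg` is not on the PATH, whereas `os.system` only returns a non-zero status in that case.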

create_csv.py

+8 -6
@@ -69,18 +69,20 @@ def write_tess_ravdess_csv(emotions=["sad", "neutral", "happy"], train_name="tra

     for category in emotions:
         # for training speech directory
-        for i, path in enumerate(glob.glob(f"data/training/Actor_*/*_{category}.wav")):
+        total_files = glob.glob(f"data/training/Actor_*/*_{category}.wav")
+        for i, path in enumerate(total_files):
             train_target["path"].append(path)
             train_target["emotion"].append(category)
-        if verbose:
-            print(f"[TESS&RAVDESS] There are {i} training audio files for category:{category}")
+        if verbose and total_files:
+            print(f"[TESS&RAVDESS] There are {len(total_files)} training audio files for category:{category}")

         # for validation speech directory
-        for i, path in enumerate(glob.glob(f"data/validation/Actor_*/*_{category}.wav")):
+        total_files = glob.glob(f"data/validation/Actor_*/*_{category}.wav")
+        for i, path in enumerate(total_files):
             test_target["path"].append(path)
             test_target["emotion"].append(category)
-        if verbose:
-            print(f"[TESS&RAVDESS] There are {i} testing audio files for category:{category}")
+        if verbose and total_files:
+            print(f"[TESS&RAVDESS] There are {len(total_files)} testing audio files for category:{category}")
     pd.DataFrame(test_target).to_csv(test_name)
     pd.DataFrame(train_target).to_csv(train_name)
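
The change above stops reporting the last `enumerate` index as the file count and uses `len(total_files)` instead. A small sketch of why the two differ follows; the helper names and the file names are invented for illustration.

```python
def count_with_enumerate(paths):
    """Old approach: report the last enumerate index as the count."""
    i = None  # default so the empty case is visible instead of raising
    for i, _path in enumerate(paths):
        pass
    return i


def count_with_len(paths):
    """Approach used after the change: report len() of the globbed list."""
    return len(paths)


files = ["a_sad.wav", "b_sad.wav", "c_sad.wav"]
print(count_with_enumerate(files))  # 2, one short of the real count
print(count_with_len(files))        # 3
print(count_with_enumerate([]))     # None
```

In the pre-change loop there was no default for `i`, so an empty match would leave it unset or carrying a stale value from an earlier category. The added `and total_files` guard skips the message in that case.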

deep_emotion_recognition.py

+15 -27
@@ -3,26 +3,13 @@
 import sys
 stderr = sys.stderr
 sys.stderr = open(os.devnull, 'w')
-import keras
-sys.stderr = stderr
-# to use CPU uncomment below code
-os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" # see issue #152
-os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
-# disable tensorflow logs
-os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
 import tensorflow as tf

-config = tf.ConfigProto(intra_op_parallelism_threads=5,
-                        inter_op_parallelism_threads=5,
-                        allow_soft_placement=True,
-                        device_count = {'CPU' : 1,
-                                        'GPU' : 0}
-                       )
-from keras.layers import LSTM, GRU, Dense, Activation, LeakyReLU, Dropout
-from keras.layers import Conv1D, MaxPool1D, GlobalAveragePooling1D
-from keras.models import Sequential
-from keras.callbacks import ModelCheckpoint, TensorBoard
-from keras.utils import to_categorical
+from tensorflow.keras.layers import LSTM, GRU, Dense, Activation, LeakyReLU, Dropout
+from tensorflow.keras.layers import Conv1D, MaxPool1D, GlobalAveragePooling1D
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
+from tensorflow.keras.utils import to_categorical

 from sklearn.metrics import accuracy_score, mean_absolute_error, confusion_matrix

@@ -82,7 +69,7 @@ def __init__(self, **kwargs):
                 regression.
         """
         # init EmotionRecognizer
-        super().__init__(None, **kwargs)
+        super().__init__(**kwargs)

         self.n_rnn_layers = kwargs.get("n_rnn_layers", 2)
         self.n_dense_layers = kwargs.get("n_dense_layers", 2)
@@ -103,7 +90,7 @@ def __init__(self, **kwargs):

         # training attributes
         self.batch_size = kwargs.get("batch_size", 64)
-        self.epochs = kwargs.get("epochs", 1000)
+        self.epochs = kwargs.get("epochs", 500)

         # the name of the model
         self.model_name = ""
@@ -264,7 +251,7 @@ def train(self, override=False):
         model_filename = self._get_model_filename()

         self.checkpointer = ModelCheckpoint(model_filename, save_best_only=True, verbose=1)
-        self.tensorboard = TensorBoard(log_dir=f"logs/{self.model_name}")
+        self.tensorboard = TensorBoard(log_dir=os.path.join("logs", self.model_name))

         self.history = self.model.fit(self.X_train, self.y_train,
                         batch_size=self.batch_size,
@@ -335,8 +322,8 @@ def confusion_matrix(self, percentage=True, labeled=True):
                                     columns=[ f"predicted_{e}" for e in self.emotions ])
         return matrix

-    def n_emotions(self, emotion, partition):
-        """Returns number of `emotion` data samples in a particular `partition`
+    def get_n_samples(self, emotion, partition):
+        """Returns number data samples of the `emotion` class in a particular `partition`
         ('test' or 'train')
         """
         if partition == "test":
@@ -361,8 +348,8 @@ def get_samples_by_class(self):
         test_samples = []
         total = []
         for emotion in self.emotions:
-            n_train = self.n_emotions(self.emotions2int[emotion]+1, "train")
-            n_test = self.n_emotions(self.emotions2int[emotion]+1, "test")
+            n_train = self.get_n_samples(self.emotions2int[emotion]+1, "train")
+            n_test = self.get_n_samples(self.emotions2int[emotion]+1, "test")
             train_samples.append(n_train)
             test_samples.append(n_test)
             total.append(n_train + n_test)
@@ -396,9 +383,10 @@ def get_random_emotion(self, emotion, partition="train"):

         return index

-    def determine_best_model(self, train=True):
+    def determine_best_model(self):
         # TODO
-        raise TypeError("This method isn't supported yet for deep nn")
+        # raise TypeError("This method isn't supported yet for deep nn")
+        pass


 if __name__ == "__main__":

emotion_recognition.py

+18 -17
@@ -19,10 +19,11 @@
 class EmotionRecognizer:
     """A class for training, testing and predicting emotions based on
     speech's features that are extracted and fed into `sklearn` or `keras` model"""
-    def __init__(self, model, **kwargs):
+    def __init__(self, model=None, **kwargs):
         """
         Params:
-            model (sklearn model): the model used to detect emotions.
+            model (sklearn model): the model used to detect emotions. If `model` is None, then self.determine_best_model()
+                will be automatically called
             emotions (list): list of emotions to be used. Note that these emotions must be available in
                 RAVDESS_TESS & EMODB Datasets, available nine emotions are the following:
                 'neutral', 'calm', 'happy', 'sad', 'angry', 'fear', 'disgust', 'ps' ( pleasant surprised ), 'boredom'.
@@ -42,8 +43,6 @@ def __init__(self, model, **kwargs):
         Note that when `tess_ravdess`, `emodb` and `custom_db` are set to `False`, `tess_ravdess` will be set to True
         automatically.
         """
-        # model
-        self.model = model
         # emotions
         self.emotions = kwargs.get("emotions", ["sad", "neutral", "happy"])
         # make sure that there are only available emotions
@@ -79,6 +78,12 @@ def __init__(self, model, **kwargs):
         self.data_loaded = False
         self.model_trained = False

+        # model
+        if not model:
+            self.determine_best_model()
+        else:
+            self.model = model
+
     def _set_metadata_filenames(self):
         """
         Protected method to get all CSV (metadata) filenames into two instance attributes:
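
The new constructor decides between a caller-supplied model and `determine_best_model()` with `if not model:`, which tests truthiness, not `None`-ness. Objects that define `__len__` are evaluated through it, so a container-like estimator may be empty or may fail in that check. The sketch below illustrates the difference with a hypothetical `FakeEnsemble` class; whether a particular scikit-learn estimator behaves this way depends on that class and is an assumption here.

```python
class FakeEnsemble:
    """Hypothetical stand-in for an estimator whose __len__ needs fitted state."""

    def __len__(self):
        # Length is the number of fitted members, which do not exist before fit().
        return len(self.estimators_)


def pick_model_by_truthiness(model=None):
    """Mirrors the `if not model:` check from the diff."""
    return "auto" if not model else "given"


def pick_model_by_identity(model=None):
    """Alternative check that only treats None as 'no model passed'."""
    return "auto" if model is None else "given"


print(pick_model_by_identity(FakeEnsemble()))  # given
try:
    pick_model_by_truthiness(FakeEnsemble())
except AttributeError as exc:
    print("truthiness check failed:", exc)
```

Comparing with `is None` keeps the decision independent of how the passed object implements `__bool__` or `__len__`.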
@@ -182,7 +187,7 @@ def predict_proba(self, audio_path):
             feature = extract_feature(audio_path, **self.audio_config).reshape(1, -1)
             proba = self.model.predict_proba(feature)[0]
             result = {}
-            for emotion, prob in zip(self.emotions, proba):
+            for emotion, prob in zip(self.model.classes_, proba):
                 result[emotion] = prob
             return result
         else:
@@ -199,12 +204,10 @@ def grid_search(self, params, n_jobs=2, verbose=1):
         grid_result = grid.fit(self.X_train, self.y_train)
         return grid_result.best_estimator_, grid_result.best_params_, grid_result.best_score_

-    def determine_best_model(self, train=True):
+    def determine_best_model(self):
         """
         Loads best estimators and determine which is best for test data,
         and then set it to `self.model`.
-        if `train` is True, then train that model on train data, so the model
-        will be ready for inference.
         In case of regression, the metric used is MSE and accuracy for classification.
         Note that the execution of this method may take several minutes due
         to training all estimators (stored in `grid` folder) for determining the best possible one.
@@ -240,11 +243,9 @@ def determine_best_model(self, train=True):
             result.append((detector.model, accuracy))

         # sort the result
-        if self.classification:
-            result = sorted(result, key=lambda item: item[1], reverse=True)
-        else:
-            # regression, best is the lower, not the higher
-            result = sorted(result, key=lambda item: item[1], reverse=False)
+        # regression: best is the lower, not the higher
+        # classification: best is higher, not the lower
+        result = sorted(result, key=lambda item: item[1], reverse=self.classification)
         best_estimator = result[0][0]
         accuracy = result[0][1]
         self.model = best_estimator
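
The hunk above folds the two sort branches into one call by passing the boolean `self.classification` as `reverse`. A small sketch of that rule on plain tuples follows; `best_result` is a hypothetical helper, and the model names and scores are made up for illustration.

```python
def best_result(results, classification):
    """Return the best (name, score) pair under the ordering used in the diff."""
    # Classification: higher accuracy first. Regression: lower error first.
    ranked = sorted(results, key=lambda item: item[1], reverse=classification)
    return ranked[0]


accuracies = [("SVC", 0.81), ("RandomForestClassifier", 0.93), ("KNeighborsClassifier", 0.77)]
print(best_result(accuracies, classification=True))   # ('RandomForestClassifier', 0.93)

errors = [("SVR", 0.42), ("RandomForestRegressor", 0.35), ("KNeighborsRegressor", 0.51)]
print(best_result(errors, classification=False))      # ('RandomForestRegressor', 0.35)
```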
@@ -316,8 +317,8 @@ def draw_confusion_matrix(self):
         pl.imshow(matrix, cmap="binary")
         pl.show()

-    def n_emotions(self, emotion, partition):
-        """Returns number of `emotion` data samples in a particular `partition`
+    def get_n_samples(self, emotion, partition):
+        """Returns number data samples of the `emotion` class in a particular `partition`
         ('test' or 'train')
         """
         if partition == "test":
@@ -337,8 +338,8 @@ def get_samples_by_class(self):
         test_samples = []
         total = []
         for emotion in self.emotions:
-            n_train = self.n_emotions(emotion, "train")
-            n_test = self.n_emotions(emotion, "test")
+            n_train = self.get_n_samples(emotion, "train")
+            n_test = self.get_n_samples(emotion, "test")
             train_samples.append(n_train)
             test_samples.append(n_test)
             total.append(n_train + n_test)
1.02 MB
Binary file not shown.
-1.1 MB
Binary file not shown.
689 KB
Binary file not shown.
-771 KB
Binary file not shown.
5.35 MB
Binary file not shown.
-5.44 MB
Binary file not shown.
3.4 MB
Binary file not shown.
-3.48 MB
Binary file not shown.

grid/best_classifiers.pickle

440 KB
Binary file not shown.

grid/best_regressors.pickle

-328 KB
Binary file not shown.

grid_search.py

+5 -2
@@ -12,7 +12,10 @@
 from emotion_recognition import EmotionRecognizer
 from parameters import classification_grid_parameters, regression_grid_parameters

+# emotion classes you want to perform grid search on
 emotions = ['sad', 'neutral', 'happy']
+# number of parallel jobs during the grid search
+n_jobs = 4

 best_estimators = []

@@ -23,7 +26,7 @@
         params['n_neighbors'] = [len(emotions)]
     d = EmotionRecognizer(model, emotions=emotions)
     d.load_data()
-    best_estimator, best_params, cv_best_score = d.grid_search(params=params)
+    best_estimator, best_params, cv_best_score = d.grid_search(params=params, n_jobs=n_jobs)
     best_estimators.append((best_estimator, best_params, cv_best_score))
     print(f"{emotions} {best_estimator.__class__.__name__} achieved {cv_best_score:.3f} cross validation accuracy score!")

@@ -39,7 +42,7 @@
         params['n_neighbors'] = [len(emotions)]
     d = EmotionRecognizer(model, emotions=emotions, classification=False)
     d.load_data()
-    best_estimator, best_params, cv_best_score = d.grid_search(params=params)
+    best_estimator, best_params, cv_best_score = d.grid_search(params=params, n_jobs=n_jobs)
     best_estimators.append((best_estimator, best_params, cv_best_score))
     print(f"{emotions} {best_estimator.__class__.__name__} achieved {cv_best_score:.3f} cross validation MAE score!")

parameters.py

+1 -1
@@ -32,7 +32,7 @@
         'p': [1, 2, 3, 4, 5],
     },
     MLPClassifier(): {
-        'hidden_layer_sizes': [(200,), (300,), (400,)],
+        'hidden_layer_sizes': [(200,), (300,), (400,), (128, 128), (256, 256)],
         'alpha': [0.001, 0.005, 0.01],
         'batch_size': [128, 256, 512, 1024],
         'learning_rate': ['constant', 'adaptive'],
