Say What? Accent Detection using Machine Learning
Overview
This project was done as the final individual requirement for the Machine Learning class in the AIM MSc Data Science program. In it, I apply a series of transformations to raw audio signals to turn them into tabular features that traditional machine learning tools can consume.
Note:
Most of the following is written as a technical report, and as such contains a significant amount of code. If you're interested in the methodology and results, I highly recommend browsing through the presentation deck instead.
The presentation can be found here: pdf
Executive Summary:
Accent detection could be very beneficial for voice processing technologies. As big companies invest more in the voice AI assistant industry, detecting a user's accent could make their algorithms more accurate as well as unlock more information about the user. This study uses an XGBoost classifier to achieve 59.4% accuracy in detecting the accent of native and non-native English speakers. The participants in the dataset are American, Korean, or Chinese nationals who spoke the same English phrase. The analysis shows that the classifier achieved relatively high recall and precision for Korean speakers of English, whereas Chinese speakers were the hardest to identify, with low precision and recall.
# confusion matrix of the final XGBoost model (reproduced from the results below)
conf_mat = pd.DataFrame(np.array([[352, 73, 170], [113, 223, 161], [120, 86, 484]]))
conf_mat.columns = ['English', 'Chinese', 'Korean']
conf_mat.index = ['English', 'Chinese', 'Korean']
conf_mat
| | English | Chinese | Korean |
|---|---|---|---|
| English | 352 | 73 | 170 |
| Chinese | 113 | 223 | 161 |
| Korean | 120 | 86 | 484 |

Rows are actual classes; columns are predicted classes.
precision recall f1-score support
english 0.60 0.59 0.60 595
chinese 0.58 0.45 0.51 497
korean 0.59 0.70 0.64 690
This could be due to the similarity in the speaking styles of the two languages, which may make it hard for the classifier to distinguish Chinese speakers of English, whereas Korean speakers may have a more distinct way of speaking English. It may also stem from the pre-processing of the signals, as significant levels of noise in the data could skew the results. Aside from this, the signals were converted to MFCCs (Mel Frequency Cepstral Coefficients), which mimic the way humans naturally hear different frequencies (log-scaled). A different approach could also be explored in the future using neural networks, converting the signals into images that capture the overall speech pattern of each individual.
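For context, here is a minimal librosa sketch of the two representations mentioned above; the file path is hypothetical:

import librosa

# hypothetical file; any mono speech recording works
y, sr = librosa.load('speaker.wav')

# mel-scaled spectrogram: frequency bins spaced roughly logarithmically,
# closer to how humans perceive pitch than a plain FFT
mel = librosa.feature.melspectrogram(y=y, sr=sr)

# MFCCs compress the log-mel spectrum into a small number of coefficients
# per frame -- these are the features used throughout this study
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mfcc.shape)  # (40, number_of_frames)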
The Dataset:
Wildcat Corpus of Native- and Foreign-Accented English
- From Northwestern University
- Accessed on August 5, 2019 from: http://groups.linguistics.northwestern.edu/speech_comm_group/wildcat/
- 86 speakers of different nationalities saying a common phrase
- Phrase: “Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.”
Load Prerequisites
import pandas as pd
from collections import Counter
import librosa
import matplotlib.pyplot as plt
%matplotlib inline
import librosa.display as _display
import numpy as np
import glob
from tqdm.autonotebook import tqdm
pd.options.display.float_format = '{:,.5g}'.format
import warnings
from sklearn.exceptions import DataConversionWarning
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy.fftpack import fft, ifft
import thinkdsp  # from Allen Downey's Think DSP; fallback loader used in shade_silence below
import machumachine
from itertools import compress
from sklearn.model_selection import GridSearchCV
Load Prerequisite Functions For Signal Selection
findsilence
- finds the "silent" sections of the signal

merger
- merges the silent sections back into the original sample signal

shade_silence
- returns a plot of the signal with the "silent" areas shaded; also returns the index (start, stop) of each non-silent area in the signal
def findsilence(y, sr, ind_i):
    """Return (start, end) sample indices of low-energy windows in y."""
    hop = int(round(sr * 0.18))  # hop and width define the search window
    width = int(sr * 0.18)
    n_slice = int(len(y) / hop)
    starts = np.arange(n_slice) * hop
    ends = starts + width
    if hop != width:
        cutoff = np.argmax(ends > len(y))
        starts = starts[:cutoff]
        ends = ends[:cutoff]
        n_slice = len(starts)
    # a window is "silent" if its mean energy is below half the signal's mean energy
    mask = [np.dot(y[s:e], y[s:e]) / width < 0.5 * np.dot(y, y) / len(y)
            for s, e in zip(starts, ends)]
    starts = list(compress(starts + ind_i, mask))
    ends = list(compress(ends + ind_i, mask))
    return zip(starts, ends)
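A quick usage sketch (the file path is hypothetical): each returned pair is the (start, end) sample index of a window whose energy falls below half the signal's mean energy:

y, sr = librosa.load('speaker.wav')  # hypothetical path
silent_windows = list(findsilence(y, sr, 0))
print(silent_windows[:2])  # e.g. [(0, 3969), (3969, 7938)]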
def merger(tulist):
    """Drop any (start, end) window that appears more than once, then
    reshape the survivors back into (start, end) pairs."""
    tu = tuple(tulist)
    cnt = Counter(tu)
    # keep only windows that occur exactly once
    res = [w for w in tu if cnt[w] < 2]
    return [tuple(x) for x in np.array(res).reshape(len(res), 2)]
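Note that a duplicated window is removed entirely rather than de-duplicated; a toy example:

windows = [(0, 100), (100, 200), (100, 200), (200, 300)]
print(merger(windows))  # [(0, 100), (200, 300)]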
def shade_silence(filename, start=0, end=None, disp=True, output=False, itr='', save=None):
    """Find signal (as output) or silence (as shaded region in plot) in an audio file.
    filename: (filepath) works best with .wav format
    start/end: (float or int) start/end time of the duration of interest, in seconds (default = entire length)
    disp: (bool) whether to display a plot (default = True)
    output: (bool) whether to return an output (default = False)
    itr: (int) iteration index, used for debugging
    save: (str) filename to save the plot to
    """
    try:
        y, sr = librosa.load(filename)
    except Exception:
        # fall back to thinkdsp when librosa cannot read the file
        obj = thinkdsp.read_wave(filename)
        y = obj.ys
        sr = obj.framerate
        print(itr, ' : librosa.load failed for ' + filename)
    t = np.arange(len(y)) / sr
    i = int(round(start * sr))
    if end is not None:
        j = int(round(end * sr))
    else:
        j = len(y)
    fills = findsilence(y[i:j], sr, i)
    if disp:
        fig, ax = plt.subplots(dpi=200, figsize=(15, 8))
        ax.set_title(filename)
        ax.plot(t[i:j], y[i:j])
        ax.set_xlabel('Time (s)')
    if fills is not None:
        # clamp each silent window to the region of interest [i, j]
        shades = [(max(s, i), min(e, j)) for s, e in fills]
        if len(shades) > 0:
            shades = merger(shades)
        if disp:
            for s in shades:
                ax.axvspan(s[0] / sr, s[1] / sr, alpha=0.5, color='r')
            if save:
                fig.savefig(save)
        # the live (non-silent) regions are the gaps between silent windows
        if len(shades) > 1:
            live = map(lambda k: (shades[k][1], shades[k + 1][0]), range(len(shades) - 1))
        elif len(shades) == 1:
            a = [i, shades[0][0], shades[0][1], j]
            live = filter(lambda x: x is not None,
                          map(lambda x: tuple(x) if x[0] != x[1] else None,
                              np.sort(a).reshape((int(len(a) / 2), 2))))
        else:
            live = [(i, j)]
    if output:
        return [i for i in live], sr, len(y)
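Load all signal files

The original loading cell is not reproduced here; the next cell references a DataFrame final with a filepaths column and a target series of nationality labels. A minimal sketch of what that cell might look like, assuming the corpus is unpacked into one folder per nationality (folder names, paths, and the label mapping are hypothetical):

import os

# hypothetical layout: wildcat/<nationality>/<speaker>.wav
filepaths = sorted(glob.glob('wildcat/**/*.wav', recursive=True))
final = pd.DataFrame({'filepaths': filepaths})
# assume each file's label is its parent folder's name
final['nationality'] = [os.path.basename(os.path.dirname(p)) for p in filepaths]
# numeric labels 1..3 as they appear in the classification reports below
target = final['nationality'].map({'english': 1, 'chinese': 2, 'korean': 3})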
Load and Slice Each Audio Signal by Time to Generate New Data Points
# generate new y values with the silent sections removed
new_ys = []
for i in tqdm(final['filepaths']):
    new_y = []
    y, sr = librosa.load(i)
    # shade_silence returns the live (non-silent) regions as sample indices
    live_regions, sr_, total_len = shade_silence(i, output=True, disp=False)
    for start, stop in live_regions:
        if start != stop:
            new_y.extend(y[start:stop])
    new_ys.append(new_y)
Create a function splitter to aggregate the signal files

splitter
- slices each compressed (silence-free) signal into fixed-length intervals and converts each slice into MFCC features, so that every slice becomes a new data point
def splitter(y_list, target, interval=1, original_sr=22050, n_mfcc=20):
    '''
    Slices each audio signal in y_list into fixed-length intervals (based on
    original_sr and interval) and converts each slice to mfcc bands.
    Returns a 2D n x m array: n is the number of data points, and the m
    features are the mfcc band values for the interval or time step selected.
    '''
    import math
    # check for the shortest sample so every signal can be cut into the same bins
    newys_lengths = []
    for i in y_list:
        newys_lengths.append(len(i))
    max_length = int(np.array(newys_lengths).min() / original_sr)  # shortest sample, in whole seconds
    # create bins that match the interval length and sampling rate
    sec_bins = []
    steps = np.arange(0, (max_length * original_sr), (original_sr * interval))
    for i in range(len(steps) - 1):
        sec_bins.append([steps[i], steps[i + 1]])
    # create new slices of each audio signal in y_list
    print(f'Length of time bins = {len(sec_bins)}')
    print('Generating slices of X.')
    new_X = []
    for i in tqdm(sec_bins):
        for audio in y_list:
            new_X.append(np.array(audio[int(i[0]):int(i[1])]))
    new_X = np.array(new_X)
    print('Checking for empty arrays.')
    counter = 0
    shapes = [i.shape for i in new_X]
    clean_index = []
    for i in range(len(shapes)):
        if shapes[i] == (0,):
            counter += 1
        else:
            clean_index.append(i)
    proceed = 'y'
    if counter > 0:
        # input() already returns a string; the original eval(input(...)) was a bug
        proceed = input(f'Number of empty arrays is {counter}. Proceed and delete these arrays? (y/n?)')
    else:
        print('No empty arrays.\n')
    if proceed == 'y':
        new_X = new_X[clean_index]
    print('Slicing original audio files.')
    new_Xs = []
    for i in tqdm(range(len(new_X))):
        new_Xs.append(librosa.feature.mfcc(y=new_X[i], sr=original_sr, n_mfcc=n_mfcc))
    new_Xs = np.array(new_Xs)
    try:
        # flatten each slice's MFCC matrix into one feature row per data point
        new_Xs = new_Xs.reshape(len(new_X), (int(math.ceil((original_sr * interval) / 512)) * n_mfcc))
    except Exception:
        print(new_Xs.shape)
        print(len(new_Xs))
    # generate new targets: each original label is repeated once per time bin
    try:
        new_target = np.array(list(target.to_numpy()) * len(sec_bins))
    except AttributeError:
        new_target = np.array(list(target) * len(sec_bins))
    return new_Xs, new_target
X, target = splitter(new_ys, target, interval=0.1, n_mfcc=40)
Length of time bins = 99
Generating slices of X.
Checking for empty arrays.
No empty arrays.
Slicing original audio files.
X.shape
(7128, 200)
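As a sanity check on the 200 features per row: 0.1 s slices at 22,050 Hz give 2,205 samples per slice, and with librosa's default hop length of 512 samples each slice yields ceil(2205 / 512) = 5 MFCC frames, each with 40 coefficients:

import math
# 5 frames x 40 mfcc coefficients = 200 features per data point
print(math.ceil(22050 * 0.1 / 512) * 40)  # 200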
Classification Models
Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=1028)
Random Forest Classifier
param_grids = {'max_depth': [3, 4, 6],
               'min_samples_leaf': [2, 3, 4],
               'max_features': [3, 2, 5],
               'n_estimators': [200, 400, 600]}
est2 = RandomForestClassifier()
gs_cv2 = GridSearchCV(est2, param_grids, n_jobs=-1, verbose=10, cv=3).fit(X_train, y_train)
print(gs_cv2.best_params_)
Fitting 3 folds for each of 81 candidates, totalling 243 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 243 out of 243 | elapsed: 1.6min finished
{'max_depth': 6, 'max_features': 5, 'min_samples_leaf': 2, 'n_estimators': 600}
gs_cv2.score(X_test, y_test)
0.4618406285072952
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
new_y_pred2 = gs_cv.predict(X_test)
confusion_matrix(y_test, new_y_pred2)
array([[370, 27, 173],
[ 97, 198, 219],
[ 74, 15, 609]])
accuracy_score(y_test, new_y_pred2)
0.6604938271604939
print(classification_report(y_test, new_y_pred2))
precision recall f1-score support
1 0.68 0.65 0.67 570
2 0.82 0.39 0.53 514
3 0.61 0.87 0.72 698
micro avg 0.66 0.66 0.66 1782
macro avg 0.71 0.64 0.64 1782
weighted avg 0.70 0.66 0.65 1782

(In these reports, labels 1, 2, and 3 correspond to English, Chinese, and Korean respectively; compare the XGBoost confusion matrix below with the one in the executive summary.)
Gradient Boosting Classifier
param_grids = {'learning_rate': [0.2, 0.1, 0.05, 0.02, 0.01],
               'max_depth': [3, 4, 6, 8],
               'min_samples_leaf': [2, 3, 4],
               'max_features': [5, 3, 2],
               'n_estimators': [200, 400, 600]}
gbm = GradientBoostingClassifier()
gs_gbm = GridSearchCV(gbm, param_grids, n_jobs=-1, verbose=10, cv=3).fit(X_train, y_train)
print(gs_gbm.best_params_)
Fitting 3 folds for each of 540 candidates, totalling 1620 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1620 out of 1620 | elapsed: 34.5min finished
{'learning_rate': 0.05, 'max_depth': 8, 'max_features': 5, 'min_samples_leaf': 3, 'n_estimators': 600}
gs_gbm.score(X_test, y_test)
0.5841750841750841
XGBoost Classifier
import xgboost as xgb
data = xgb.DMatrix(X, target)  # only needed for XGBoost's native API; not used by the sklearn wrapper below
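Since the DMatrix is otherwise unused, here is a brief sketch of what XGBoost's native training API would look like with it; the parameter values are illustrative only, and the native API expects 0-based integer class labels:

# hypothetical native-API route (not used below)
dtrain = xgb.DMatrix(X_train, np.asarray(y_train) - 1)  # shift labels 1..3 down to 0..2
params = {'objective': 'multi:softmax', 'num_class': 3, 'max_depth': 6, 'eta': 0.2}
booster = xgb.train(params, dtrain, num_boost_round=600)

The grid search below uses the sklearn wrapper XGBClassifier instead.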
param_grids = {'learning_rate': [0.1, 0.05, 0.2],
               'max_depth': [3, 4, 6],
               'min_samples_leaf': [2, 3, 4],  # sklearn-style names; XGBClassifier does not
               'max_features': [5, 3, 2],      # use these, so they only inflate the grid
               'n_estimators': [200, 400, 600],
               'n_jobs': [-1]}
xgb_clf = xgb.XGBClassifier()
gs_xgb = GridSearchCV(xgb_clf, param_grids, n_jobs=-1, verbose=10, cv=3).fit(X_train, y_train)
print(gs_xgb.best_params_)
Fitting 3 folds for each of 243 candidates, totalling 729 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 729 out of 729 | elapsed: 238.2min finished
{'learning_rate': 0.2, 'max_depth': 6, 'max_features': 5, 'min_samples_leaf': 2, 'n_estimators': 600, 'n_jobs': -1}
gs_xgb.score(X_test, y_test)
0.5942760942760943
y_pred = gs_xgb.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[352, 73, 170],
[113, 223, 161],
[120, 86, 484]])
accuracy_score(y_test, y_pred)
0.5942760942760943
print(classification_report(y_test, y_pred))
precision recall f1-score support
1 0.60 0.59 0.60 595
2 0.58 0.45 0.51 497
3 0.59 0.70 0.64 690
micro avg 0.59 0.59 0.59 1782
macro avg 0.59 0.58 0.58 1782
weighted avg 0.59 0.59 0.59 1782
Save Model
import pickle
# pickle.dump(gs_xgb, open('XGBoost_model.sav', 'wb'))
# # load the model from disk
# loaded_model = pickle.load(open('XGBoost_model.sav', 'rb'))
# result = loaded_model.score(X_test, y_test)
# print(result)
References
- Peak Extraction Script. Originally from https://github.com/libphy/which_animal, adapted to Python 3 and heavily tweaked for this study’s purpose.
- Leon Mak An Sheng, Mok Wei Xiong Edmund. Deep Learning Approach to Accent Classification. Stanford CS229 Machine Learning Project.