Gender recognition using voice data — on cAInvas

Using features extracted from individuals' voice recordings to predict whether the speaker is male or female.


A voice recording can be analyzed for many inferences, such as the content spoken and the emotion, gender, and identity of the speaker.

Among the speaker characteristics that can be recognized from a recording, identity recognition requires a reference dataset and is limited to identifying the people in that dataset. In many cases, this level of specificity may not be needed.

Recognizing the speaker's gender is one such use case. A model for this task can be trained to learn patterns and features specific to each gender and apply them to individuals who were not part of the training dataset too!

It finds applications in automatic salutations, tagging audio recordings, and helping digital assistants tailor their responses to the speaker's gender.

In this article, we will be using acoustic features extracted from a voice recording to predict the speaker’s gender.

The implementation of the idea on cAInvas — here.

The dataset

(on Kaggle)

The dataset has 3168 voice samples recorded from male and female speakers. The samples were pre-processed using acoustic analysis packages in R, with an analyzed frequency range of 0 Hz to 280 Hz (the human vocal range), to extract around 20 acoustic features, which were written to a CSV file.

Snapshot of the dataset

A peek into the spread of samples across categories shows 1584 male and 1584 female samples.

This is a perfectly balanced dataset.
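As a quick sanity check, the class balance can be verified with pandas. A minimal sketch, assuming the Kaggle CSV is named voice.csv and the category column is named label:

```python
import pandas as pd

# Load the CSV of extracted acoustic features (file name assumed)
df = pd.read_csv('voice.csv')

# Count the samples per category: 1584 male and 1584 female
print(df['label'].value_counts())
```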

Preprocessing

Correlation

A few pairs of columns have high correlation values. For each pair with a correlation value > 0.95, one of the two columns is therefore dropped.

https://gist.github.com/Gunjan933/d98bdd957d59f6a3b359bcb51552a789
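For reference, a minimal sketch of this kind of correlation-based elimination, continuing from the dataframe df loaded above (the threshold matches the 0.95 mentioned; the notebook's exact code may differ):

```python
import numpy as np

# Absolute pairwise correlations between the numeric attributes
corr = df.drop(columns=['label']).corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair with correlation > 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```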

The dataset has been edited to include only the final list of columns. The final dataset now has 18 columns — 17 attributes and 1 category column.

Defining the input features and output formats

The input and output columns are defined to separate the dataset into X and y later in the notebook.

https://gist.github.com/Gunjan933/0d9034f9873d1da6d26c3825a2da34e0
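One possible way to define them, assuming the category column is named label:

```python
# The category column is the target; every other column is an input feature
output_columns = ['label']
input_columns = [col for col in df.columns if col not in output_columns]
```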

The category column is one-hot encoded as the final layer of the model defined below has two nodes with the softmax activation function and uses the categorical cross-entropy loss.

https://gist.github.com/Gunjan933/1cf8c4e611df4409a3a433c5460bfb5b
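A minimal sketch of the encoding step using pandas' get_dummies (the label_* column names follow from the assumed prefix):

```python
import pandas as pd

# One-hot encode the category column: 'male'/'female' become two 0/1 columns
one_hot = pd.get_dummies(df['label'], prefix='label').astype(int)
df = pd.concat([df, one_hot], axis=1)
```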

Alternatively, we could use a single node with the sigmoid activation function and the binary cross-entropy loss.

One-hot encoded columns

These columns are inserted into the dataset, bringing the total number of columns to 20: 17 attributes, 1 category column, and 2 one-hot encoded columns.

Snapshot of the dataset

Splitting into train, validation, and test sets

We use an 80–10–10 split to define the training, validation, and test sets. Each set is then split into its respective X and y arrays.

https://gist.github.com/Gunjan933/5a22badae0ffa51c9d517c4f21dacb68
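One way to obtain such a split is to call scikit-learn's train_test_split twice; the notebook's exact method and random seed may differ:

```python
from sklearn.model_selection import train_test_split

X = df[input_columns].values
y = df[['label_female', 'label_male']].values  # one-hot column names assumed

# Hold out 20% first, then split it evenly into validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
```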

Scaling the attribute values

Since the individual features have values in different ranges, the MinMaxScaler of the sklearn.preprocessing module is used to scale them to the range [0, 1].

https://gist.github.com/Gunjan933/0b8384bb8e619d920dc4a8a6a9600912
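A sketch of the scaling step; the scaler is fit on the training set alone so that no information leaks from the validation and test sets:

```python
from sklearn.preprocessing import MinMaxScaler

# Learn the per-feature min and max from the training set only
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)

# Apply the same transformation to the validation and test sets
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```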

The model

The model is a simple one with 3 dense layers: two with the ReLU activation function and one with the softmax activation function.

https://gist.github.com/Gunjan933/fbe2551215ecbca3e782139dcd2b8ef5
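A minimal sketch of such a model in Keras; the hidden-layer widths here are assumptions, not necessarily the notebook's exact values:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two ReLU hidden layers followed by a two-node softmax output
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(16, activation='relu'),
    layers.Dense(2, activation='softmax'),
])
```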

The Adam optimizer with a 0.01 learning rate is used along with categorical cross-entropy loss. The accuracy metric is monitored.

Two callback functions from the keras.callbacks module are used. EarlyStopping tracks a metric and stops training when it stops improving (increasing or decreasing, depending on the metric tracked, such as accuracy or loss). ModelCheckpoint saves the best model obtained so far, by default the one with the lowest val_loss; the saved model can be re-loaded later.

The number of epochs was set to 32, but the model stopped training earlier thanks to the EarlyStopping callback.
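Putting the compile and training steps together, a sketch of how this could look (the patience value and checkpoint file name are assumptions):

```python
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)

# Stop when val_loss stops improving; keep the best weights seen so far
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5),
    ModelCheckpoint('best_model.h5', save_best_only=True),
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=32,
    callbacks=callbacks,
)
```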

The metrics

The plot of accuracies
The plot of losses

Prediction

Let's perform predictions on random test samples:

https://gist.github.com/Gunjan933/447ecaa317a24ac0eda6f321bf86b214
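A sketch of what such a prediction cell could look like; the label order is assumed to match the one-hot columns:

```python
import random
import numpy as np

# Pick a random test sample and compare the prediction with the ground truth
i = random.randint(0, len(X_test) - 1)
probs = model.predict(X_test[i:i + 1])[0]

labels = ['female', 'male']  # order assumed to match the one-hot columns
print('Predicted:', labels[np.argmax(probs)])
print('Actual:   ', labels[np.argmax(y_test[i])])
```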

Prediction on a random sample

deepC

The deepC library, compiler, and inference framework is designed to enable and run deep neural networks on small form-factor devices like micro-controllers, eFPGAs, CPUs, and other embedded devices such as Raspberry Pi, Odroid, Arduino, SparkFun Edge, RISC-V boards, mobile phones, and x86 and ARM laptops, among others.

Compiling the model with deepC to get the .exe file:
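A sketch of this step as a notebook cell; the deepCC invocation shown is an assumption based on typical cAInvas usage, so refer to the notebook for the exact command:

```python
# Save the trained Keras model, then compile it with deepC's deepCC compiler
# (invocation assumed; see the cAInvas notebook for the exact command)
model.save('gender_voice.h5')
!deepCC gender_voice.h5
```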

Head over to the cAInvas platform (link to notebook given earlier) and check out the predictions by the .exe file!

Credit: Ayisha D