Spam text classification — on cAInvas

Identifying whether the given text is spam or not (ham).

Photo by Emanuele Colombo on Dribbble

Spam texts are unsolicited messages, usually sent for advertising purposes. While this helps a product reach more consumers, it can also be an unwanted intrusion for the recipient.

While reaching out to consumers en masse to increase the public presence of a product may seem harmless, it is important to remember that spam messages often include phishing attempts, non-commercial agenda propagation, or other prohibited content. These can put the consumer on the receiving end at risk.

There are many ways a spammer can reach you — SMS, calls, e-mails, web search results, etc.

Here, we classify text messages into two categories — spam or not, based on their content.

Implementation of the idea on cAInvas — here!

The dataset

On Kaggle by Team AI

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG’11), Mountain View, CA, USA, 2011. Website | UCI

The dataset is a CSV file with 5572 messages falling into one of two categories — ham and spam.

Snapshot of the dataset

Preprocessing

Dropping repeated rows

A few rows in the dataset are repeated. Let us drop them to get a dataset with unique rows.

https://gist.github.com/AyishaR/a9f184919e7fe63fc82e9998a73cfd4a

Before dropping duplicates
After dropping duplicates
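
The gist holds the original code; below is a minimal pandas sketch of this step. The file name, encoding, and column names ('Category', 'Message') are assumptions and may differ from the actual dataset.

    import pandas as pd

    # load the dataset; file name and encoding are assumptions
    df = pd.read_csv('spam.csv', encoding='latin-1')

    print(df.shape)            # row count before dropping duplicates
    df = df.drop_duplicates()  # keep the first occurrence of each repeated row
    print(df.shape)            # row count after dropping duplicates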

The dataset now has 5157 rows, of which 4516 are ham and 641 are spam. It is not a balanced dataset, but let us try training on it anyway.

Encoding the category values

The target column contains string labels that have to be encoded as integer values representing the classes before training.

https://gist.github.com/AyishaR/5c2a76b355ce03bd1bf8803af790a70d

Snapshot of the dataset
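
A one-line sketch of the encoding, assuming the label column is named 'Category' and mapping ham to 0 and spam to 1 (the column name and mapping are assumptions):

    # map string labels to integer class values; column name and mapping are assumptions
    df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})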

Data cleaning

The text in the messages needs to be preprocessed before being given as input to the model.

Defining a few functions to carry out the preprocessing —

  • removing HTML tags and URLs
  • keeping only alphabetic characters, i.e., removing punctuation, special characters, and digits
  • truncating recurring characters. When we text, we tend to extend characters for emphasis or out of excitement, for example, ‘heyyyyyy’. Naively reducing every repeated letter to a single occurrence would shorten this to ‘hey’, but English words like ‘wool’ and ‘hello’ legitimately contain consecutive recurring characters. We therefore truncate only characters that occur more than twice consecutively, as three or more consecutive repeats are rare in valid words.

The list of stop words is also defined to be used later.

https://gist.github.com/AyishaR/6b352bcebf6bfb8d08fa8dc883906a06
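
A minimal sketch of such cleaning helpers using Python's re module; the function names are hypothetical, and the exact patterns in the gist may differ:

    import re
    from nltk.corpus import stopwords  # may require nltk.download('stopwords')

    def remove_html_and_urls(text):
        text = re.sub(r'<[^>]+>', ' ', text)           # strip HTML tags
        return re.sub(r'http\S+|www\.\S+', ' ', text)  # strip URLs

    def keep_alphabets(text):
        # replace anything that is not a letter (digits, punctuation, specials)
        return re.sub(r'[^a-zA-Z]', ' ', text)

    def truncate_recurring(text):
        # reduce 3+ consecutive repeats of a character to 2: 'heyyyyyy' -> 'heyy'
        return re.sub(r'(.)\1{2,}', r'\1\1', text)

    stop_words = set(stopwords.words('english'))       # defined for later use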

The SnowballStemmer of the nltk.stem module is used to get the root of each word in the messages.

The words in both categories are stored separately for visualization later.

Here, we have skipped filtering out stop words, since the style of the text, along with its content, is also a contributing factor.

https://gist.github.com/AyishaR/3d3fbfcadb28612af9f2be2f4b590a2f
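
A sketch of the stemming step with nltk's SnowballStemmer, reusing the helpers above; the column names and the clean_sentence name are assumptions:

    from nltk.stem import SnowballStemmer

    stemmer = SnowballStemmer('english')

    def clean_sentence(text):
        text = remove_html_and_urls(text.lower())
        text = truncate_recurring(keep_alphabets(text))
        # stem every word; stop words are intentionally kept (see above)
        return ' '.join(stemmer.stem(w) for w in text.split())

    df['Cleaned'] = df['Message'].apply(clean_sentence)

    # collect the words of each category separately for the word clouds later
    spam_words = ' '.join(df[df['Category'] == 1]['Cleaned']).split()
    ham_words = ' '.join(df[df['Category'] == 0]['Cleaned']).split()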

The cleaned sentences are added as a column in the dataset.

Train-validation-test split

Using an 80–10–10 ratio to split the data frame into train, validation, and test sets. The train_test_split function of the sklearn.model_selection module is used for this. These are then divided into X and y (input and output) for further processing.

https://gist.github.com/AyishaR/55c42ad239e04a3cd7a3687b53d50884

The train set has 4125 samples, while the validation and test sets have 516 samples each.
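
A sketch of the split using two calls to train_test_split; the random_state and the column names are assumptions:

    from sklearn.model_selection import train_test_split

    # 80% train, then split the remaining 20% evenly into validation and test
    train_df, rest_df = train_test_split(df, train_size=0.8, random_state=42)
    val_df, test_df = train_test_split(rest_df, train_size=0.5, random_state=42)

    Xtrain, ytrain = train_df['Cleaned'], train_df['Category']
    Xval, yval = val_df['Cleaned'], val_df['Category']
    Xtest, ytest = test_df['Cleaned'], test_df['Category']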

TF-IDF vectorization

There are 2 ways of implementing the TF-IDF vectorization —

  • TfidfVectorizer
  • CountVectorizer followed by TfidfTransformer

The CountVectorizer class of the sklearn.feature_extraction.text module is used to convert the text into a matrix of token counts.

https://gist.github.com/AyishaR/706b48ab43168df59654087650de5626

The cleaned messages are given as input and a sparse matrix of counts is returned. The TfidfTransformer takes into account the frequency of words across the dataset; it is used to scale down the impact of words that occur very frequently in the corpus.

https://gist.github.com/AyishaR/3b9dd0dfe244e650073c22507e730e6c
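
A sketch of this second approach, fitting on the training text only and reusing the fitted objects for the validation and test sets:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    count_vec = CountVectorizer()
    tfidf = TfidfTransformer()

    # learn the vocabulary and IDF weights from the training set only
    Xtrain_tfidf = tfidf.fit_transform(count_vec.fit_transform(Xtrain))

    # transform validation and test sets with the same vocabulary and weights
    Xval_tfidf = tfidf.transform(count_vec.transform(Xval))
    Xtest_tfidf = tfidf.transform(count_vec.transform(Xtest))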

Defining the input and output to the model (X is the input and y is the output) —

https://gist.github.com/AyishaR/245e640ad6f53e240198063de8670260

Visualization

Looking at the words that are important/frequent in both categories.

The WordCloud function is used to visualize the words. The frequency of a word in the corpus is directly proportional to its size in the figure.

WordCloud for ‘spam’
WordCloud for ‘ham’
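
A sketch of generating one of these figures with the wordcloud package, assuming the spam_words list collected during cleaning:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # word size in the figure is proportional to its frequency in the text
    wc = WordCloud(width=800, height=400).generate(' '.join(spam_words))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()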

The model

The model used is a simple one with 3 Dense layers: the first 2 have ReLU activation functions, and the last has a sigmoid activation function that outputs a value in the range [0, 1].

Since this is a binary classification problem, the model is compiled with the binary cross-entropy loss function and the Adam optimizer, and the accuracy of the model is tracked over epochs.

The EarlyStopping callback function of the keras.callbacks module monitors the validation loss and stops the training if it doesn’t decrease for 5 epochs continuously. The restore_best_weights parameter ensures that the weights of the model with the least validation loss are restored to the model variable.

https://gist.github.com/AyishaR/f985787f718e43b2734d6f3ab8e75bbc
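
A sketch of such a model and callback in Keras; the layer widths and epoch count are assumptions:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(Xtrain_tfidf.shape[1],)),
        layers.Dense(32, activation='relu'),
        layers.Dense(1, activation='sigmoid'),  # outputs the probability of spam
    ])

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # stop when validation loss fails to improve for 5 epochs; keep the best weights
    es = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                       restore_best_weights=True)

    history = model.fit(Xtrain_tfidf.toarray(), ytrain,
                        validation_data=(Xval_tfidf.toarray(), yval),
                        epochs=64, callbacks=[es])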

The model was trained with a learning rate of 0.0001 and achieved an accuracy of ~98.8% on the test set.

Since this is an unbalanced dataset, the f1 score is calculated using the f1_score function of the sklearn.metrics module.

Test set evaluation

Plotting a confusion matrix to understand the results better —

Confusion matrix
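
A sketch of computing the f1 score and the confusion matrix with sklearn; the 0.5 decision threshold is an assumption:

    from sklearn.metrics import confusion_matrix, f1_score

    # threshold the sigmoid outputs to get hard class predictions
    ypred = (model.predict(Xtest_tfidf.toarray()) > 0.5).astype(int).ravel()

    print('F1 score:', f1_score(ytest, ypred))
    print(confusion_matrix(ytest, ypred))  # rows: true class, columns: predicted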

A significant percentage of the ham messages is classified as spam, i.e., false positives. A larger dataset with more instances of spam messages would help in avoiding such misclassifications.

The metrics

Plot of accuracies
Plot of losses

Prediction

Let us perform predictions on random test samples —

The text sample is taken from the original CSV file, cleaned, vectorized, and then given to the model, since an external sample would have to go through the same process.

https://gist.github.com/AyishaR/128e54959bd677549e8504180c905b7e
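
A sketch of that pipeline for a single sample, reusing the fitted vectorizer, transformer, and cleaning function from above:

    import random

    # pick a random raw message and push it through the same pipeline
    i = random.randint(0, len(df) - 1)
    raw = df['Message'].iloc[i]

    x = tfidf.transform(count_vec.transform([clean_sentence(raw)]))
    prob = model.predict(x.toarray())[0][0]

    print(raw)
    print('spam' if prob > 0.5 else 'ham', '(p = %.3f)' % prob)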

Prediction on a random sample

deepC

The deepC library, compiler, and inference framework are designed to enable and run deep learning neural networks on small form-factor devices like microcontrollers, eFPGAs, and CPUs, and on other embedded devices like the Raspberry Pi, Odroid, Arduino, SparkFun Edge, RISC-V boards, mobile phones, and x86 and ARM laptops, among others.

Compiling the model using deepC —

Code: deepC

Head over to the cAInvas platform (link to notebook given earlier) to run and generate your own .exe file!

Credits: Ayisha D