Hate Speech and Offensive Language Detection

Social media platforms, if not moderated carefully, can cause real harm. One of the problems these platforms face is the use of hate speech and offensive language, which often leads to fights, crimes, or, at worst, riots. Detecting such language is therefore essential, and because humans cannot monitor such large volumes of data, we can use AI to detect it and prevent users from posting it.

Photo by Lucien Leyh on Dribbble

Table of Contents

  • Introduction to cAInvas
  • Source of Data
  • Data Preprocessing
  • Model Training
  • Introduction to DeepC
  • Compilation with DeepC

Introduction to cAInvas

cAInvas is an integrated development platform for creating intelligent edge devices. Not only can we train our deep learning model using TensorFlow, Keras, or PyTorch, we can also compile it with the platform's edge compiler, DeepC, to deploy the working model on edge devices for production. The Hate Speech Detection model was also developed on cAInvas and is now part of the cAInvas Use-Case Gallery. All the dependencies needed for this project come pre-installed.

cAInvas also offers various other deep learning notebooks in its gallery, which one can use for reference or to gain insight into deep learning. It also has GPU support, which makes it the best of its kind.

Source of Data

While working on any use case from the cAInvas gallery, we don’t have to look for data manually. We can load the data into a dataframe using the pandas library. We just have to enter the following commands:

https://gist.github.com/Gunjan933/4a78123a78192ad1ffd86cef759e48f5

Running the above commands loads the data into a dataframe, which we will use for model training.

Data Preprocessing

This step involves cleaning and preprocessing the data, both to achieve good model performance and to make data visualization easier. We drop the serial-number column, as it is not required for model training, and add a new column for tweet length. Next, we segregate the data by tweet class for visualization. This can be achieved by running the following commands:

https://gist.github.com/Gunjan933/73226af2ca13645f022ef3c632d75e6e
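The gist itself is not reproduced here. As a rough sketch of the steps just described, the snippet below works on a toy dataframe. The column names `Unnamed: 0`, `class`, and `tweet`, the new `tweet_length` column, and the sample rows are all assumptions made for illustration and are not taken from the original notebook.

```python
import io
import pandas as pd

# Toy stand-in for the real dataset; column names and rows are assumed.
csv_text = """Unnamed: 0,class,tweet
0,2,have a nice day
1,1,you are so annoying
2,0,example of a hateful tweet
3,2,good morning
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop the serial-number column, which is not needed for training.
df = df.drop(columns=["Unnamed: 0"])

# Add a column holding the length of each tweet in characters.
df["tweet_length"] = df["tweet"].str.len()

# Split the rows by class so each group can be visualized separately.
tweets_by_class = {label: group for label, group in df.groupby("class")}

for label, group in tweets_by_class.items():
    print(label, len(group), group["tweet_length"].mean())
```

Keeping the per-class groups in a dictionary makes it easy to plot, for example, a tweet-length histogram for each class in turn.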

Our next step is to remove punctuation and stopwords from the tweets. Finally, we vectorize the words so that each word in the tweets is assigned a unique number. The resulting vectors are stored in a variable and passed to our hate speech detection model along with the labels.

https://gist.github.com/Gunjan933/d62f373c1006bd56bcbf9a890afe09ce
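To make the cleaning and vectorization steps concrete, here is a minimal pure-Python sketch. It is not the code from the gist: the tiny stopword list, the helper names (`clean_tweet`, `build_vocab`, `vectorize`), and the choice to pad every vector with zeros to a fixed length are assumptions for illustration only.

```python
import string

# A tiny illustrative stopword list (assumed); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "you", "so", "of", "to", "and"}

def clean_tweet(text):
    """Lowercase, strip punctuation, and drop stopwords; returns a token list."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]

def build_vocab(token_lists):
    """Assign each distinct word a unique integer starting at 1 (0 is kept for padding)."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def vectorize(tokens, vocab, max_len):
    """Map tokens to integer ids, then truncate or right-pad with 0 to max_len."""
    ids = [vocab[tok] for tok in tokens if tok in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))

tweets = ["You are SO annoying!!!", "Have a nice day :)", "So annoying, the day..."]
cleaned = [clean_tweet(t) for t in tweets]
vocab = build_vocab(cleaned)
vectors = [vectorize(tokens, vocab, max_len=4) for tokens in cleaned]
print(cleaned)
print(vectors)
```

Fixed-length integer vectors like these are the kind of input an embedding layer expects, which is why padding is included in the sketch.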

Model Training

After creating the dataset, the next step is to pass the training data to our deep learning model so that it learns to classify the various classes of tweets. The model architecture used was:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 32)          6400
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320
_________________________________________________________________
repeat_vector (RepeatVector) (None, 200, 32)           0
_________________________________________________________________
global_average_pooling1d (Gl (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 32)                1056
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 51
=================================================================
Total params: 16,355
Trainable params: 16,355
Non-trainable params: 0
_________________________________________________________________
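The parameter counts in the summary can be checked by hand with the standard formulas for each layer type. The sketch below does that arithmetic. The vocabulary size of 200 is an inference from the embedding row (6400 / 32), not a value stated in the article, and the helper function names are made up for this example.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = {
    "embedding": embedding_params(200, 32),  # 200 is inferred, see lead-in
    "lstm": lstm_params(32, 32),
    "dense": dense_params(32, 32),
    "dense_1": dense_params(32, 16),
    "dense_2": dense_params(16, 3),
}
total = sum(counts.values())
print(counts, total)
```

The RepeatVector and global average pooling layers have no weights, so they contribute nothing to the total.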

The loss function used was “categorical_crossentropy” and the optimizer used was “Adam”. The model was trained using the Keras API with TensorFlow as the backend.

Figure: Training plots for the model

Introduction to DeepC

The DeepC compiler and inference framework is designed to enable and run deep learning neural networks by focusing on the features of small form-factor devices such as microcontrollers, eFPGAs, and CPUs, as well as other embedded devices like Raspberry Pi, ODROID, Arduino, SparkFun Edge, RISC-V, mobile phones, and x86 and ARM laptops, among others.

DeepC also offers an ahead-of-time compiler that produces optimized executables, based on the LLVM compiler toolchain specialized for deep neural networks, with ONNX as the front end.

Compilation with DeepC

After training, the model was saved in H5 format using Keras, which stores the weights and model configuration in a single file.

Once the model is saved in H5 format, we can compile it using the DeepC compiler, which comes as part of the cAInvas platform, to convert the saved model into a format that can be easily deployed to edge devices. All of this can be done with a single command:

https://gist.github.com/Gunjan933/ac70d248bbeeb93ff9c70c6ad8506a88

And that’s it: our Hate Speech and Offensive Language Detection model is trained and ready for deployment on edge devices.

Link for the cAInvas Notebook : https://cainvas.ai-tech.systems/use-cases/hate-speech-and-offensive-language-detection-app/

Credit : Ashish Arya