Malicious URL Detection

A malicious website is a site that attempts to install malware (a general term for anything that will disrupt computer operation, gather your personal information, or, in a worst-case scenario, gain total access to your machine) onto your device.

It is therefore necessary to detect malicious websites and URLs, which can be done by training a deep learning model to classify URLs as malicious or non-malicious.

Photo by MOWE on Dribbble

Table of Contents

  • Introduction to cAInvas
  • Source of Data
  • Data Analysis
  • Model Training
  • Introduction to DeepC
  • Compilation with DeepC




Introduction to cAInvas

cAInvas is an integrated development platform for creating intelligent edge devices. Not only can we train deep learning models with TensorFlow, Keras, or PyTorch, we can also compile them with its edge compiler, DeepC, and deploy the working model on edge devices for production.

The Malicious URL Detection model was also developed on cAInvas, and all the dependencies needed for this project come pre-installed.

cAInvas also offers many other deep learning notebooks in its gallery, which can be used for reference or to gain insight into deep learning. It also has GPU support.

Source of Data

One of the key features of cAInvas is its UseCases gallery. Since the Malicious URL Detection model is part of the cAInvas gallery, we don’t have to look for data manually. We can load the data into a dataframe using the pandas library.

The resulting dataframe is what we will use for model training.
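As a rough illustration of the loading step, here is a minimal sketch using pandas. The article does not give the file name or column names, so the sample rows, the column names, and the helper `load_url_data` are all made up for this example.

```python
import io

import pandas as pd

# Made-up sample standing in for the gallery CSV; the real columns may differ.
SAMPLE_CSV = """url,url_length,count_dot,label
http://www.example.com/login/index.php?id=42,44,3,benign
http://192.168.0.1/a,20,3,malicious
"""


def load_url_data(source):
    """Read a CSV file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


df = load_url_data(io.StringIO(SAMPLE_CSV))
print(df.shape)   # (rows, columns)
print(df.head())  # first few rows, to check that the columns look right
```

With the real dataset, the same helper would be called with the path to the downloaded CSV file instead of the in-memory sample.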

Data Analysis

Data analysis involves checking the number of null values in the dataset, which is fortunately zero in our case, and checking for class imbalance, which is present in our dataset and can be seen in the graph:

Count, Number and Type of URLs
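The two checks described above (null values and class balance) could be sketched with pandas as follows. The small dataframe, its column names, and the label values are invented for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset.
df = pd.DataFrame({
    "url_length": [44, 20, 73, 31, 58, 25],
    "count_dot": [3, 1, 5, 2, 4, 1],
    "label": ["benign", "benign", "malicious", "benign", "benign", "benign"],
})

# Total number of missing values across every column.
total_nulls = int(df.isnull().sum().sum())

# Rows per class; a large gap between the counts signals class imbalance.
class_counts = df["label"].value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()

print("null values:", total_nulls)
print(class_counts)
print("majority/minority ratio:", imbalance_ratio)
```

A bar chart of `class_counts` gives the same picture as the graph referred to above.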

To correct the class imbalance, we oversampled the data using the SMOTE module of the imblearn library. We also found that some features had already been extracted from the URLs for classification and stored in our CSV file.

Length Features

  • Length of URL
  • Length of Hostname
  • Length of Path
  • Length of First Directory
  • Length of Top-Level Domain

Count Features

  • Count of ‘-’
  • Count of ‘@’
  • Count of ‘?’
  • Count of ‘%’
  • Count of ‘.’
  • Count of ‘=’
  • Count of ‘http’
  • Count of ‘www’
  • Count of Digits
  • Count of Letters
  • Count of Directories

Binary Features

  • Use of IP or not
  • Use of Shortening URL or not
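
The features listed above come pre-computed in the CSV file, so the article does not show how they were derived. The sketch below is one plausible way to compute features of the same kind from a raw URL using only the Python standard library. The exact definitions, the dictionary keys, and the `SHORTENERS` list are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of shortening services; the dataset's own list is not given.
SHORTENERS = ("bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly")


def extract_features(url):
    """Return length, count, and binary features for one URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""          # lower-cased host without the port
    path = parsed.path
    parts = path.split("/")
    first_dir = parts[1] if len(parts) > 1 else ""
    uses_ip = bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))
    # An IP address has no top-level domain.
    tld = host.rsplit(".", 1)[-1] if "." in host and not uses_ip else ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "path_length": len(path),
        "first_dir_length": len(first_dir),
        "tld_length": len(tld),
        "count_dash": url.count("-"),
        "count_at": url.count("@"),
        "count_question": url.count("?"),
        "count_percent": url.count("%"),
        "count_dot": url.count("."),
        "count_equals": url.count("="),
        "count_http": url.count("http"),
        "count_www": url.count("www"),
        "count_digits": sum(c.isdigit() for c in url),
        "count_letters": sum(c.isalpha() for c in url),
        "count_dirs": path.count("/"),
        "uses_ip": int(uses_ip),
        "uses_shortener": int(host in SHORTENERS),
    }


features = extract_features("http://www.example.com/login/index.php?id=42")
print(features)
```

Calling `extract_features` on every URL and stacking the results would give a table with one row per URL and one column per feature.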

Once we are done analyzing the data, we create the training set and test set, each containing the feature vectors along with their labels, for model training.
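
The article uses imblearn for oversampling, where the usual call is `SMOTE().fit_resample(X, y)`, and a split such as scikit-learn's `train_test_split` would be a natural fit for the second step. The sketch below is not that code: it is a small NumPy-only illustration of the idea behind SMOTE (interpolating between a minority sample and one of its nearest minority neighbours), followed by a shuffled split. The function names, the 80/20 ratio, and the random data are all assumptions.

```python
import numpy as np


def smote_like_oversample(X, y, minority_label, k=5, seed=0):
    """Add interpolated minority rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(X_min) - 1)
    # Pairwise distances between minority rows (quadratic; fine for a sketch).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)       # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_needed)
    picked = neighbours[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))       # position along the connecting line
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_out, y_out


def shuffled_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Random stand-in data: 90 rows of class 0, 10 rows of class 1, 16 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 16))
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = smote_like_oversample(X, y, minority_label=1)
X_train, X_test, y_train, y_test = shuffled_split(X_bal, y_bal)
print(np.bincount(y_bal), X_train.shape, X_test.shape)
```

This follows the order described in the article (oversample, then split). A common alternative is to split first and oversample only the training portion, so that no synthetic rows end up in the test set.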

Model Training

After creating the dataset, the next step is to pass the training data to our deep learning model so that it learns to classify URLs. The model architecture used was:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 32)                544       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
=================================================================
Total params: 1,217
Trainable params: 1,217
Non-trainable params: 0
_________________________________________________________________
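
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below works backwards from the 544 parameters of the first layer to the input width, then recomputes every row. The helper name `dense_params` is ours and is not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_units = [32, 16, 8, 1]   # unit counts taken from the summary
first_layer_params = 544       # "Param #" of the first Dense layer

# Solve (n_inputs + 1) * 32 = 544 for the input width.
n_inputs = first_layer_params // layer_units[0] - 1

counts = []
width = n_inputs
for units in layer_units:
    counts.append(dense_params(width, units))
    width = units              # this layer's output feeds the next layer

print(n_inputs, counts, sum(counts))
# prints: 16 [544, 528, 136, 9] 1217
```

The summary therefore implies 16 input features, while 18 features are listed earlier. The article does not say which columns were fed to the network.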

The loss function used was “binary_crossentropy” and the optimizer used was “Adam”. For training the model we used the Keras API with TensorFlow as the backend. Here are the training plots for the model:

Model Loss
Model Accuracy
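
A model with the layer sizes, loss, and optimizer described above could be defined in Keras roughly as follows. The article does not state the activation functions or the input width, so the ReLU and sigmoid activations and the 16-feature input (inferred from the first layer's 544 parameters) are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),                     # assumed 16 input features
    tf.keras.layers.Dense(32, activation="relu"),    # activations are assumed
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of one class
])

# Loss and optimizer as named in the article.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Training would then be a call to `model.fit` on the training features and labels, with the test set passed as validation data to produce loss and accuracy curves like the ones above.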

Introduction to DeepC

DeepC is a compiler and inference framework designed to run deep learning neural networks on small form-factor devices such as microcontrollers, eFPGAs, and CPUs, and on other embedded devices like Raspberry Pi, ODROID, Arduino, SparkFun Edge, RISC-V, mobile phones, and x86 and ARM laptops, among others.

DeepC also offers an ahead-of-time compiler that produces optimized executables. It is based on the LLVM compiler toolchain, specialized for deep neural networks, with ONNX as the front end.

Compilation with DeepC

After training, the model was saved in H5 format using Keras, which stores the weights and the model configuration in a single file.
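
Saving a Keras model to H5 and reading it back might look like the sketch below. The tiny stand-in model and the file name `malicious_url.h5` are hypothetical; in the article the saved model would be the trained URL classifier.

```python
import os
import tempfile

import tensorflow as tf

# Small stand-in model; not the article's trained classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Hypothetical file name, written to a temporary directory for this sketch.
h5_path = os.path.join(tempfile.mkdtemp(), "malicious_url.h5")
model.save(h5_path)  # the ".h5" suffix selects the HDF5 format

# Reload to confirm that architecture and weights are in the single file.
restored = tf.keras.models.load_model(h5_path)
print(os.path.exists(h5_path), restored.count_params())
```

The reloaded model has the same layers and weights as the original, which is what makes the single H5 file convenient as input to a later compilation step.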

With the model saved in H5 format, we can compile it using the DeepC compiler, which is part of the cAInvas platform. The compiler converts the saved model into a format that can be deployed to edge devices, and all of this is done with a single command.

And that’s it, our Malicious URL Detection model is trained and ready.

Link for the cAInvas Notebook: https://cainvas.ai-tech.systems/use-cases/malicious-url-detection-app/

Credit: Ashish Arya

