Fuel Efficiency Prediction using Deep Learning

Photo by Wawan Saputra on Dribbble

In a regression problem, we aim to predict a continuous value, like a price or a probability. Contrast this with a classification problem, where we aim to select a class from a list of classes (for example, given a picture containing an apple or an orange, recognizing which fruit is in the picture).

Predicting fuel efficiency is exactly this kind of regression problem: the output we want is a continuous value. In this article, I will take you through how we can predict fuel efficiency with deep learning.

Let’s begin!

Importing the necessary libraries

Let’s import the necessary libraries to get started with this task:

https://gist.github.com/jeetaf/156fd63167fc194de16eb40828180c89

About the dataset

We will be using the classic Auto MPG dataset and build a model to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, we’ll provide the model with a description of many automobiles from that time period. This description includes attributes like cylinders, displacement, horsepower, and weight.

The dataset is available from the UCI Machine Learning Repository.

It can be easily downloaded using the following code:

https://gist.github.com/jeetaf/ad6c95c3ed616c8646b933a775938eff

Now, let’s import the data using the pandas package:

https://gist.github.com/jeetaf/385fc9b24fbc6abb7faab345d4f13720

Viewing the contents of the imported dataset
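The gists above are not reproduced inline, but (assuming the standard layout of the UCI auto-mpg.data file: space-separated columns, a `?` for missing values, and the car name after a tab) the read step likely resembles the following sketch. To keep the snippet runnable offline, the UCI download is replaced here with a two-row inline sample in the same format:

```python
import io
import pandas as pd

column_names = ["MPG", "Cylinders", "Displacement", "Horsepower",
                "Weight", "Acceleration", "Model Year", "Origin"]

# In the article this text comes from the UCI "auto-mpg.data" file;
# a two-row inline sample is used here so the sketch runs offline.
sample = (
    "18.0   8   307.0      130.0      3504.      12.0   70  1\tchevrolet chevelle malibu\n"
    "25.0   4   98.00      ?          2046.      19.0   71  1\tford pinto\n"
)

dataset = pd.read_csv(io.StringIO(sample), names=column_names,
                      na_values="?",   # "?" marks an unknown value
                      comment="\t",    # the car name after the tab is ignored
                      sep=" ", skipinitialspace=True)
print(dataset)
```

The `na_values="?"` argument is what turns the unknown entries into NaN values, which we deal with in the next step.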

Cleaning and Preprocessing of data

The dataset contains a few unknown values.

https://gist.github.com/jeetaf/d5f4890e1fd767ccb27d3a58c6b17e8f

output:

Now let’s drop these rows:

https://gist.github.com/jeetaf/6ad1f91a46c5f1b7700d15763a584fc7
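As a sketch of these two steps (counting the unknown values with `isna()`, then dropping the affected rows), using a toy frame in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Auto MPG data; only "Horsepower"
# contains unknown values, as in the real dataset.
dataset = pd.DataFrame({
    "MPG": [18.0, 25.0, 30.0],
    "Horsepower": [130.0, np.nan, 88.0],
})

print(dataset.isna().sum())   # count unknown values per column
dataset = dataset.dropna()    # drop the rows that contain them
print(dataset.shape)
```

Dropping rows is the simplest option here; with more missing data, imputing values would be worth considering instead.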

The “origin” column in the dataset is categorical, so to move forward we need to use some one-hot encoding on it:

https://gist.github.com/jeetaf/a92bad9af2b39d8eb12182c62e696d63

Dataset after One-Hot Encoding
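In the Auto MPG data the “Origin” codes 1, 2, and 3 stand for USA, Europe, and Japan. A common way to one-hot encode this (a sketch, not necessarily the gist’s exact code) is to map the codes to names and then call pd.get_dummies:

```python
import pandas as pd

dataset = pd.DataFrame({"MPG": [18.0, 26.0, 32.0], "Origin": [1, 2, 3]})

# Map the numeric codes to region names, then one-hot encode.
dataset["Origin"] = dataset["Origin"].map({1: "USA", 2: "Europe", 3: "Japan"})
dataset = pd.get_dummies(dataset, columns=["Origin"], prefix="", prefix_sep="")
print(dataset.columns.tolist())
```

Each region becomes its own 0/1 column, so the model never treats “Japan” as numerically greater than “USA”.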

Now, let’s split the data into training and test sets:

https://gist.github.com/jeetaf/02588f290bb4c96838b669e011aee00c
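A common pandas idiom for this split (shown here on toy data; the gist may differ in details) samples 80% of the rows for training and keeps the rest for testing:

```python
import pandas as pd

dataset = pd.DataFrame({"MPG": range(10), "Weight": range(10)})

# 80% of the rows for training, the rest for testing.
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
print(len(train_dataset), len(test_dataset))
```

Fixing random_state makes the split reproducible from run to run.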

Before training a model to predict fuel efficiency, let’s visualize the data by using seaborn’s pairplot method:

https://gist.github.com/jeetaf/2ee68d5257cf5008ab45a5725158a0ff

Inspecting the data

Also, look at the overall statistics:

https://gist.github.com/jeetaf/1ba43481f5ad29ba2ca8abe68ee6ffab

Overall stats
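The statistics table above typically comes from describe(); transposing it puts one feature per row, which is easier to scan. A toy sketch:

```python
import pandas as pd

train_dataset = pd.DataFrame({"MPG": [18.0, 25.0, 32.0],
                              "Weight": [3504.0, 2046.0, 1985.0]})

# describe() gives count/mean/std/min/quartiles/max per column;
# transpose() puts one feature per row.
train_stats = train_dataset.describe().transpose()
print(train_stats[["mean", "std"]])
```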

Now, we will separate the target value, or “label”, from the features in the dataset. This label is the value that we will train the model to predict:

https://gist.github.com/jeetaf/b6a455beabf1fde811dde60a1b86b87d
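pop() is a convenient way to do this: it removes the label column from the features and returns it in one step. A toy sketch:

```python
import pandas as pd

train_dataset = pd.DataFrame({"MPG": [18.0, 25.0], "Weight": [3504.0, 2046.0]})

# pop() removes the "MPG" column from the features and returns it.
train_labels = train_dataset.pop("MPG")
print(train_dataset.columns.tolist(), train_labels.tolist())
```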

Normalize the data

It is recommended that you standardize features that use different scales and ranges. Although the model can converge without standardization, skipping it makes learning more difficult and makes the resulting model dependent on the choice of units used in the input. We also need the training statistics to project the test dataset into the same distribution the model was trained on:

https://gist.github.com/jeetaf/3e2d901ffa67508609871cd34b160571

This normalized data is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model, along with the one-hot encoding that we did earlier. That includes the test set as well as live data when the model is used in production.
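A sketch of such a normalization step on toy numbers, computing the statistics from the training data only and reusing them for the test data, as the caution above requires:

```python
import pandas as pd

train_dataset = pd.DataFrame({"Weight": [2000.0, 3000.0, 4000.0]})
test_dataset = pd.DataFrame({"Weight": [2500.0]})

# Compute mean and std from the TRAINING data only...
train_stats = train_dataset.describe().transpose()

def norm(x):
    # ...and reuse those same statistics for any data fed to the model.
    return (x - train_stats["mean"]) / train_stats["std"]

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
print(normed_train_data["Weight"].tolist())
```

Note that the test value 2500 is scaled with the training mean and standard deviation, not with statistics of the test set itself.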

Build The Model

Let’s build our model. Here, I will use the sequential API with two hidden layers and one output layer that returns a single value. The steps to build the model are encapsulated in a function, build_model, since we will be creating a second model later:

https://gist.github.com/jeetaf/05f300503decd5a87067484fc5fec099
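The gist’s exact code isn’t reproduced here, but a typical build_model for this setup looks like the sketch below. The choices of RMSprop, MSE loss, and 64-unit layers are assumptions based on common practice for this kind of tutorial, and the num_features default of 9 matches the Auto MPG feature count after one-hot encoding:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(num_features=9):
    # Two hidden layers of 64 units each, one linear output unit.
    model = keras.Sequential([
        keras.Input(shape=(num_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # single continuous output: predicted MPG
    ])
    model.compile(loss="mse",
                  optimizer=keras.optimizers.RMSprop(0.001),
                  metrics=["mae", "mse"])
    return model

model = build_model()
```

The single linear output unit is what makes this a regression model rather than a classifier.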

Use the .summary method to print a simple description of the model:

https://gist.github.com/jeetaf/8bef09a6adcab3717d00daa9c74c1464

Model summary

Now, before training the model to predict fuel efficiency, let’s try this model on the first 10 samples:

https://gist.github.com/jeetaf/2acd35519a128073086a12c3a26266d6

output:
array([[ 0.22689067],
[ 0.05828134],
[ 0.2640698 ],
[ 0.13235056],
[ 0.41513422],
[ 0.0909472 ],
[ 0.47577205],
[-0.11234067],
[ 0.24470758],
[ 0.541355 ]], dtype=float32)

Training Model To Predict Fuel Efficiency

Now, let’s train the model to predict fuel efficiency:

https://gist.github.com/jeetaf/c7afc809fc09cfb414ab8b0c15bb70e1

Displaying a dot for each completed epoch

Now, let’s visualize the model training:

https://gist.github.com/jeetaf/ccd558fc7126f5c26c4108a80ba9f92e

https://gist.github.com/jeetaf/f9664dd1a4400cfb3df152c3f23e6a9b

MPG vs Epoch visualization
MPG² vs Epoch visualization

The graph shows little improvement, or even degradation, in the validation error after about 100 epochs. Now, let’s update the model.fit call to stop training when the validation score stops improving. We’ll be using an EarlyStopping callback that tests a training condition after each epoch. If a set number of epochs pass without showing improvement, training stops automatically:

https://gist.github.com/jeetaf/5212133440774b4a93557f8b035fc25e

Epochs for the second model
MPG vs Epoch visualization for the second model
MPG² vs Epoch visualization for the second model
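Framework aside, the rule the EarlyStopping callback implements is simple: remember the best validation loss seen so far and stop once `patience` epochs pass without improvement. A plain-Python sketch of that logic (not the Keras callback itself):

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the epoch (0-based) at which training would stop,
    or the last epoch if patience is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:        # improvement: remember it, reset the counter
            best = loss
            wait = 0
        else:                  # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# Losses improve for 5 epochs, then plateau.
losses = [5.0, 4.0, 3.0, 2.5, 2.4] + [2.4] * 20
print(early_stopping_epoch(losses, patience=10))  # stops at epoch 14
```

With patience=10, training runs 10 epochs past the last improvement before giving up, which tolerates noisy validation curves.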

The graph shows that on the validation set, the average error is usually around +/- 2 MPG.

Let’s see how well the model generalizes using the test set, which we didn’t use when training the model. This tells us how well we can expect the model to predict when we use it in the real world:

https://gist.github.com/jeetaf/0f57880c68910a4d2baab47f833010a7

Printing the mean abs error for testing set
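The reported number is simply the mean absolute difference between the predicted and true MPG values. With hypothetical values for a few test cars:

```python
import numpy as np

# Hypothetical true and predicted MPG values for four test cars.
y_true = np.array([18.0, 25.0, 32.0, 21.0])
y_pred = np.array([16.5, 26.0, 30.0, 22.5])

mae = np.mean(np.abs(y_true - y_pred))
print(f"Testing set Mean Abs Error: {mae:.2f} MPG")  # 1.50 MPG here
```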

Now, let’s use the model to make fuel-efficiency predictions on the test data:

https://gist.github.com/jeetaf/8cd8c0f7b587d4e4c9cea4dd0332d52c

Predictions vs True values Scatter plot

Let’s take a look at the error distribution:

https://gist.github.com/jeetaf/6a005649e50058c4ad3f26ff534062fe

Error distribution

Conclusion

  • Mean Squared Error (MSE) is a common loss function used for regression problems (different loss functions are used for classification problems).
  • Similarly, evaluation metrics used for regression differ from classification. A common regression metric is the Mean Absolute Error (MAE).
  • When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
  • If there is not much training data, one technique is to prefer a small network with few hidden layers to avoid overfitting.
  • Early stopping is a useful technique to prevent overfitting.

Implementation of the project on cainvas here.

Credit: Jeet Chawla