Next day rain prediction — on cAInvas

Predict next day rain in Australia using weather data.

Photo by GRAMM on Dribbble

A weather forecast is a prediction of how the weather will be in the coming days. Air pressure, temperature, humidity, wind, and other measurements are used by meteorologists along with other methods to predict the weather.

Predicting weather requires keen observation skills and knowledge of weather patterns. With trained deep learning models, we can identify the patterns in data to make predictions for the coming days.

Here, we use present-day weather conditions in different cities of Australia to predict rain the next day.

Implementation of the idea on cAInvas — here!

The dataset

On Kaggle by Joe Young and Adam Young

Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data.
An example of the latest weather observations in Canberra: http://www.bom.gov.au/climate/dwo/IDCJDW2801.latest.shtml

Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml
Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data.

Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

The dataset is a CSV file with about 10 years of daily weather observations from many locations across Australia. The various features in the dataset indicate weather-related information for the given day and RainTomorrow is the target attribute.

Snapshot of the dataset

Some columns seem to have NaN values. Let's check how many there are in the data frame.

Count of NaN values

Too many NaN values! One option is to fill them in, but there are so many that the filled-in values could skew the dataset. Here, we drop the affected rows instead.

Code: Drop rows with NaN values
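
The two steps above (counting the NaN values, then dropping the affected rows) can be sketched in Pandas roughly as follows. The tiny data frame is made up for illustration and is not the real dataset; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the weather data frame (values are made up).
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "Rainfall": [0.6, np.nan, 0.0, 1.0],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# Count of NaN values in each column.
nan_counts = df.isna().sum()
print(nan_counts)

# Drop every row that contains at least one NaN value.
df_clean = df.dropna().reset_index(drop=True)
print(len(df), "rows before,", len(df_clean), "rows after")
```

On the real data frame, comparing the row counts before and after shows how much data the drop costs.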

Preprocessing

Input attributes

Let us look at the data types of the attributes to decide on the necessary processing:

Datatypes of the attributes

Location, WindGustDir, WindDir9am, and WindDir3pm are categorical columns whose values have no inherent order. They are one-hot encoded using the get_dummies() function of the Pandas library with the drop_first parameter set to True. This means that a column with n categories yields n-1 indicator columns instead of n, i.e., each value is represented as an array of n-1 values. The first category is represented by an array of all 0s, while each of the remaining n-1 categories has a single 1 at the index of its own indicator column.

After that, the 4 columns are removed as they won’t be needed anymore.

The RainToday values can be derived from the Rainfall column, and the Date value is not needed here either. Both columns can be removed.

https://gist.github.com/AyishaR/8eeb86d7b1d60ca53a51b9ca6d4d9503
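
For readers without access to the gist, here is a minimal sketch of the encoding and column removal described above. It is not the notebook's exact code: the sample rows are made up, and only one of the four categorical columns is included to keep the output small.

```python
import pandas as pd

# Made-up sample; the real data frame has many more columns and categories.
df = pd.DataFrame({
    "Date": ["2010-01-01", "2010-01-02", "2010-01-03"],
    "WindGustDir": ["E", "N", "W"],
    "Rainfall": [0.0, 2.4, 0.2],
    "RainToday": ["No", "Yes", "No"],
})

# The article also encodes Location, WindDir9am and WindDir3pm the same way.
categorical = ["WindGustDir"]

# n categories -> n-1 indicator columns; the first category (E) becomes all 0s.
dummies = pd.get_dummies(df[categorical], drop_first=True, dtype=int)

# Append the indicator columns, then drop the originals plus Date and RainToday.
df = pd.concat([df, dummies], axis=1).drop(columns=categorical + ["Date", "RainToday"])
print(df)
```

The three wind directions end up as two columns, WindGustDir_N and WindGustDir_W, and the row with direction E has 0 in both.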

Target attribute

RainTomorrow is a binary-valued column with text labels representing the two classes. These are converted to integers so that the column can be given to the model as the target.

https://gist.github.com/AyishaR/9c81c8bf59eed3d855b5e8c038850b0c
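
One way to do this conversion is with a dictionary mapping. The sketch below assumes the two labels are the strings "No" and "Yes"; check the actual values in the column before reusing it.

```python
import pandas as pd

# Made-up sample of the target column; the "No"/"Yes" labels are an assumption.
df = pd.DataFrame({"RainTomorrow": ["No", "Yes", "No", "Yes"]})

# Map the two class labels to integers: No -> 0, Yes -> 1.
df["RainTomorrow"] = df["RainTomorrow"].map({"No": 0, "Yes": 1}).astype("int64")
print(df["RainTomorrow"].tolist())
```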

Balancing the dataset

A peek into the spread of class labels across the data frame —

The spread of class labels

The dataset is unbalanced. There are two options for balancing it:

  • upsampling: resample rows of the under-represented class, with repetition, to increase their count in the dataset.
  • downsampling: pick n samples from each class label, where n is the number of samples in the class with the lowest count, i.e., reduce the count of the larger classes in the dataset.

Here, we will be upsampling. Resampling until both classes are equal in count would result in ~30k redundant rows, so we restrict class label 1 to 20k rows, resulting in only ~8k redundant rows.

https://gist.github.com/AyishaR/a058ccbf00a310d92728f2c2cc12179e
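
A capped upsampling step like the one described above can be sketched with the resample utility from sklearn.utils. This is an illustration under assumptions rather than the gist's code: the data is made up, and the target count of 5 stands in for the 20k used in the article.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced frame: 8 rows of class 0, 2 rows of class 1.
df = pd.DataFrame({
    "Humidity3pm": range(10),
    "RainTomorrow": [0] * 8 + [1] * 2,
})

majority = df[df["RainTomorrow"] == 0]
minority = df[df["RainTomorrow"] == 1]

# Sample the minority class with replacement up to a chosen target count.
# The article caps class 1 at 20k rows; 5 is used here to fit the toy data.
target_count = 5
minority_upsampled = resample(
    minority, replace=True, n_samples=target_count, random_state=0
)

# Recombine the two classes and shuffle the rows.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
print(balanced["RainTomorrow"].value_counts())
```

Because the target count is below the majority count, the classes stay somewhat unbalanced, which is the trade-off the article makes to limit duplicated rows.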

Train-validation-test split

The data frame is split into train, validation, and test sets using an 80–10–10 ratio. The train_test_split function of the sklearn.model_selection module is used for this. Each set is then divided into X and y (input and output) for further processing.

https://gist.github.com/AyishaR/52ea1b5c8c1104b4b7439ea1eb10fca4
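
Since train_test_split only produces two parts at a time, one common way to get an 80–10–10 split is to call it twice. The sketch below uses a made-up 100-row frame; the split_xy helper and the random_state value are illustrative choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows.
df = pd.DataFrame({
    "Humidity3pm": np.arange(100),
    "Pressure3pm": np.arange(100) * 0.5,
    "RainTomorrow": np.tile([0, 1], 50),
})

# First hold out 20% of the rows, then split that part half-and-half
# into validation and test sets.
train_df, rest_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)

def split_xy(frame, target="RainTomorrow"):
    """Hypothetical helper: separate the inputs (X) from the target (y)."""
    return frame.drop(columns=[target]), frame[target]

X_train, y_train = split_xy(train_df)
X_val, y_val = split_xy(val_df)
X_test, y_test = split_xy(test_df)
print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```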

Standardization

Using df.describe() shows that the standard deviation differs widely from one attribute to another. This may result in certain attributes being weighted higher than others. Each attribute is therefore scaled to have mean = 0 and standard deviation = 1, computed per column.

The StandardScaler class of the sklearn.preprocessing module is used to implement this. The instance is first fit on the training data and then used to transform the train, validation, and test data.

https://gist.github.com/AyishaR/0da4bc0f591933fd2952a6e285158e2e
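
The fit-on-train, transform-everything pattern described above looks roughly like this. The small arrays are made up for illustration; in the notebook they would be the X parts of the three splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices with two columns on very different scales.
X_train = np.array([[10.0, 1000.0], [20.0, 1010.0], [30.0, 1020.0]])
X_val = np.array([[15.0, 1005.0]])
X_test = np.array([[25.0, 1015.0]])

scaler = StandardScaler()

# Learn the per-column mean and standard deviation from the training data only.
X_train_s = scaler.fit_transform(X_train)

# Reuse the training statistics for the validation and test data.
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```

Fitting on the training set alone keeps information about the validation and test data from leaking into the preprocessing step.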

The model

The model is a simple one with 3 Dense layers, 2 of which have ReLU activation functions and the last one has a Sigmoid activation function that outputs a value in the range [0, 1].

As it is a binary classification problem, the model is compiled using the binary cross-entropy loss function. The Adam optimizer is used and the accuracy of the model is tracked over epochs.

The EarlyStopping callback of the keras.callbacks module monitors the validation loss and stops the training if it doesn't decrease for 5 consecutive epochs. The restore_best_weights parameter ensures that the weights from the epoch with the lowest validation loss are restored to the model.

https://gist.github.com/AyishaR/9a4aa7e5d4a9454c42ab5458183846d5
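
A model matching this description can be sketched in Keras as follows. The article does not state the layer widths, the number of input features, or the epoch count, so the values of n_features, 32, 16, and 100 below are placeholders chosen for illustration; the learning rate of 0.01 and the patience of 5 come from the text.

```python
import tensorflow as tf

n_features = 20  # hypothetical; use the number of columns in X_train

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),    # width is an assumption
    tf.keras.layers.Dense(16, activation="relu"),    # width is an assumption
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output in [0, 1]
])

# Binary cross-entropy loss, Adam optimizer, accuracy tracked over epochs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Stop when val_loss has not improved for 5 epochs and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

# Training call, assuming scaled X/y splits from the earlier steps
# (epochs=100 is a placeholder upper bound):
# history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
#                     epochs=100, callbacks=[early_stop])
```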

The model was trained with a learning rate of 0.01 and achieved an accuracy of ~84.5% on the test set.

Plotting a confusion matrix to understand the results better —

Confusion matrix
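
A confusion matrix like this can be computed with sklearn.metrics. In the sketch below the labels and sigmoid outputs are made up, and the 0.5 decision threshold is an assumption rather than something the article specifies.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and sigmoid outputs for six test samples.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.7, 0.8, 0.3, 0.9])

# Turn probabilities into class labels (0.5 threshold is an assumption).
y_pred = (y_prob > 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]
```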

A larger number of unique samples for the rain class would likely help achieve higher accuracy.

The metrics

Plot of the accuracies
Plot of the losses

Prediction

Let’s perform predictions on random test data samples —

https://gist.github.com/AyishaR/0766d3c452923c6c8a1dc315112ab80d

Prediction on a random test sample

deepC

The deepC library, compiler, and inference framework is designed to run deep learning neural networks on small form-factor devices, focusing on the features of microcontrollers, eFPGAs, CPUs, and other embedded devices such as Raspberry Pi, Odroid, Arduino, SparkFun Edge, RISC-V, mobile phones, and x86 and Arm laptops, among others.

Compiling the model using deepC —

Code: deepC compilation

Head over to the cAInvas platform (link to notebook given earlier) and check out the predictions by the .exe file!

Credits: Ayisha D