Financial distress prediction — on cAInvas

Predicting whether a company is under financial distress based on time-series data for a set of companies.

Photo by Shunya Koide on Dribbble

The financial stability of a company is dependent on various factors. Predicting financial distress is necessary to take appropriate steps to manage the situation and get the company back on track.

In this article, we predict whether a given set of companies is under financial distress based on 83 time-based factors.

Implementation of the idea on cAInvas — here!

The dataset

On Kaggle by Ebrahimi

The dataset is a CSV file containing financial distress data for a set of companies.

Along with companies and time periods, there are 83 factors denoted by x1 to x83 that define the financial and non-financial characteristics of the companies. Out of these, x80 is a categorical feature.

The ‘Financial Distress’ column is a continuous variable that can be converted into a two-value column — healthy (0) if value > -0.5, else distressed (1).

Snapshot of the dataset

Let us understand the dataset we are working with. Each company has 1 or more rows corresponding to various time periods. Looking into how many —

How many time periods for each company

There are 422 companies in the dataset. A few have fewer than 5 time periods!

Preprocessing

One-hot encoding the input variables

x80 is a categorical variable that has to be one-hot encoded, as its values do not have an ordinal relationship.

The drop_first parameter is set to True. This means that if there are n categories in the column, n-1 indicator columns are returned instead of n. The first category is represented by all 0s, while each of the remaining n-1 categories is represented by a 1 in its corresponding indicator column.
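A minimal sketch of this step using pandas.get_dummies (the CSV filename follows the Kaggle dataset and is an assumption):

```python
import pandas as pd

# Load the dataset (filename assumed from the Kaggle page)
df = pd.read_csv('Financial Distress.csv')

# One-hot encode the categorical column x80; drop_first=True turns
# n categories into n-1 indicator columns
x80_encoded = pd.get_dummies(df['x80'], prefix='x80', drop_first=True)

# Replace the original column with its encoded version
df = pd.concat([df.drop(columns=['x80']), x80_encoded], axis=1)
```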

Creating a time-based data frame

Since this is a time-based dataset, the features are appended to include values from previous timesteps of the same company group.

A time window of 5 is defined, i.e., attribute values from 5 timesteps are combined to create one row of the final dataset. Companies with fewer timesteps than the defined time window are discarded.
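One way to build such a windowed data frame, assuming the dataset's Company, Time, and Financial Distress columns and continuing from the sketch above:

```python
import numpy as np
import pandas as pd

TIME_WINDOW = 5

# Columns other than the identifiers and the target are features
feature_cols = [c for c in df.columns
                if c not in ('Company', 'Time', 'Financial Distress')]

rows, targets = [], []
for _, group in df.groupby('Company'):
    group = group.sort_values('Time')
    values = group[feature_cols].values
    # range() is empty for companies with fewer timesteps than the
    # window, so those companies are discarded automatically
    for end in range(TIME_WINDOW, len(group) + 1):
        # Flatten the last TIME_WINDOW timesteps into a single row
        rows.append(values[end - TIME_WINDOW:end].flatten())
        # The target corresponds to the most recent timestep
        targets.append(group['Financial Distress'].values[end - 1])

window_df = pd.DataFrame(rows)
window_df['Financial Distress'] = targets
```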

Binarizing the target variable

We are approaching this as a classification problem and so the continuous target feature is converted into a binary-valued one using the condition defined previously — healthy (0) if value > -0.5, else distressed (1).
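Continuing from the sketch above, the thresholding can be done in one line:

```python
import numpy as np

# healthy (0) if value > -0.5, else distressed (1)
window_df['Financial Distress'] = np.where(
    window_df['Financial Distress'] > -0.5, 0, 1)

# Inspect the class distribution
print(window_df['Financial Distress'].value_counts())
```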

Class value distribution

This is not a balanced dataset.

Balancing the dataset

Since each sample row already contains values from 5 timesteps, we can resample and train on this dataset without a time-series split.

It is an unbalanced dataset. In order to balance the dataset, there are two options,

  • upsampling — resample the values to increase their count in the dataset.
  • downsampling — pick n samples from each class label where n = number of samples in class with least count (here, 83), i.e., reducing the count of certain class values in the dataset.

Here, we will be upsampling.
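A sketch of the upsampling step with sklearn.utils.resample, continuing from the windowed data frame above (the random_state is an arbitrary choice):

```python
from sklearn.utils import resample
import pandas as pd

majority = window_df[window_df['Financial Distress'] == 0]
minority = window_df[window_df['Financial Distress'] == 1]

# Resample the minority class with replacement until its count
# matches the majority class
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

balanced_df = pd.concat([majority, minority_upsampled])
```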

Class-wise sample count

Defining the input and output columns for use later —

Code: Input and output columns
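The notebook shows this as an image; a plausible equivalent, continuing from the balanced data frame above:

```python
# All columns except the target are inputs
input_columns = [c for c in balanced_df.columns
                 if c != 'Financial Distress']
output_column = 'Financial Distress'

print(len(input_columns))  # 448 input columns
```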

There are 448 input columns and 1 output column.

Train-validation-test split

Using an 80–10–10 ratio to split the data frame into train, validation, and test sets. The train_test_split function of the sklearn.model_selection module is used for this. The splits are then divided into X and y (input and output) for further processing.
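A sketch of the two-step split (the random_state values are arbitrary):

```python
from sklearn.model_selection import train_test_split

# First split off 80% for training, then halve the remaining 20%
# into validation and test sets (10% each)
train_df, temp_df = train_test_split(balanced_df, test_size=0.2,
                                     random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5,
                                   random_state=42)

Xtrain, ytrain = train_df[input_columns].values, train_df[output_column].values
Xval, yval = val_df[input_columns].values, val_df[output_column].values
Xtest, ytest = test_df[input_columns].values, test_df[output_column].values
```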

Scaling the values

The attributes in the dataset have very different value ranges. This may result in certain attributes being weighted higher than others. The values of all attributes are scaled to the range [0, 1].

The MinMaxScaler function of the sklearn.preprocessing module is used to implement this concept. The instance is first fit on the training data and used to transform the train, validation, and test data.
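Continuing from the split above:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Fit only on the training data to avoid information leakage, then
# transform all three splits with the same learned parameters
Xtrain = scaler.fit_transform(Xtrain)
Xval = scaler.transform(Xval)
Xtest = scaler.transform(Xtest)
```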

The model

The model is a simple one with 4 Dense layers, 3 of which have ReLU activation functions and the last one has a Sigmoid activation function that outputs a value in the range [0, 1].

As it is a binary classification problem, the model is compiled using the binary cross-entropy loss function. The Adam optimizer is used and the accuracy of the model is tracked over epochs.
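A sketch of such a model in Keras; the layer widths are assumptions, since the article only specifies the layer count and activations:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 4 Dense layers: 3 with ReLU, 1 with sigmoid (widths assumed)
model = keras.Sequential([
    layers.Dense(64, activation='relu',
                 input_shape=(len(input_columns),)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # output in [0, 1]
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss='binary_crossentropy',
              metrics=['accuracy'])
```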

The EarlyStopping callback function of the keras.callbacks module monitors the validation loss and stops the training if it doesn’t decrease for 5 epochs continuously. The restore_best_weights parameter ensures that the model with the least validation loss is restored to the model variable.

The model was trained first with a learning rate of 0.01 and then with a learning rate of 0.001.
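The two-phase training could look like the following; the epoch counts are placeholders, since early stopping decides when training actually ends:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss hasn't improved for 5 epochs and roll
# back to the best weights seen so far
cb = EarlyStopping(monitor='val_loss', patience=5,
                   restore_best_weights=True)

# Phase 1: learning rate 0.01 (set at compile time above)
model.fit(Xtrain, ytrain, validation_data=(Xval, yval),
          epochs=256, callbacks=[cb])

# Phase 2: recompile with a lower learning rate of 0.001 (weights
# are kept across the recompile) and continue training
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy', metrics=['accuracy'])
model.fit(Xtrain, ytrain, validation_data=(Xval, yval),
          epochs=256, callbacks=[cb])
```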

The model achieved an accuracy of ~93% on the test set.

Accuracy

Plotting a confusion matrix to understand the results better —

Confusion matrix
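A minimal sketch for computing the matrix, continuing from the trained model above:

```python
from sklearn.metrics import confusion_matrix

# Threshold the sigmoid outputs at 0.5 to get hard class labels
ypred = (model.predict(Xtest) > 0.5).astype(int).ravel()

print(confusion_matrix(ytest, ypred))
```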

A larger dataset with more instances of financial distress would help in achieving a better test set accuracy. Feel free to play around with the time window to see the variation in results.

The metrics

The plot of accuracies

The plot of losses

Prediction

Let’s perform predictions on random test data samples —

Prediction on random test sample
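The notebook shows this step as an image; a plausible equivalent, continuing from the variables above:

```python
import random

# Pick a random sample from the test set
i = random.randint(0, len(Xtest) - 1)

# The model outputs a probability; threshold at 0.5 for the label
prob = model.predict(Xtest[i:i + 1])[0][0]
pred = int(prob > 0.5)

print('Predicted:', 'distressed' if pred == 1 else 'healthy')
print('Actual:   ', 'distressed' if ytest[i] == 1 else 'healthy')
```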

deepC

The deepC library, compiler, and inference framework is designed to enable deep learning neural networks on small form-factor devices such as microcontrollers, eFPGAs, CPUs, and other embedded devices like the Raspberry Pi, Odroid, Arduino, SparkFun Edge, RISC-V boards, mobile phones, and x86 and Arm laptops, among others.

Compiling the model using deepC —

Code: deepC compilation
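The compilation itself is shown as an image in the notebook; the sketch below follows the usual cAInvas workflow of saving the Keras model and invoking the deepC compiler (deepCC) from a notebook cell. The filename is hypothetical, and the exact invocation may differ on your setup:

```python
# Save the trained Keras model in H5 format (filename assumed)
model.save('financial_distress.h5')

# Invoke the deepC compiler on the saved model from the notebook;
# this produces a compiled binary for on-device inference
!deepCC financial_distress.h5
```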

Head over to the cAInvas platform (link to notebook given earlier) and check out the predictions by the .exe file!

Credits: Ayisha D

Also Read: Windmill Fault Prediction App