Handwritten Character Recognition is often considered as the “Hello World” of Modern Day Deep Learning. Handwritten Optical Character Recognition has been studied by researchers and Deep Learning practitioners for decades now. It is by far the most understood area in Deep Learning and pattern recognition. Anyone starting with Deep Learning encounters the MNIST dataset that contains highly processed images of handwritten digits. The MNIST dataset was created in 1998. Some Deep Learning methods have achieved a near-human level performance on this dataset. Researchers now use this dataset as a benchmark and baseline dataset for testing new Deep Learning and Machine Learning models.

According to Wikipedia —

Optical character recognitionoroptical character reader(OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast)

Most Handwritten Optical Character Recognition methods aim to effectively segment and recognise the handwritten characters in an image or document. This article takes one step forward by segmenting and recognising handwritten digits and some mathematical operators and calculating the value of the mathematical expression written.

The project aims at building a CNN model architecture and a pipeline for expression value calculation. Optimising the model to reduce the number of trainable parameters of the CNN model to below 250k is one of the project’s primary goals for easy deployment to edge or less computation efficient devices.

The Cainvas Platform is used for implementation, which provides seamless execution of python notebooks for building AI systems that can eventually be deployed on edge (i.e. an embedded system such as compact MCUs).

The notebook can be found here.

The flow of the article will be as follows —

- Description of the Project
- The Dataset
- Preprocessing Step
- Building the CNN model
- Training the Model
- Model Performance
- Testing on Images
- Building Pipeline for Segmentation and Expression Calculation
- Conclusion

### Description of the Project

The project aims at segmenting and recognising handwritten digits and mathematical operators in an image. Finally, creating a pipeline for calculating the value of the expression written. The current implementation recognises only four basic mathematical operators namely Add(+), Subtract(-), Multiply(x) and Divide(/). The CNN model contains around 160k trainable parameters, making it easily deployable on less computation efficient devices.

### The Dataset

The dataset is taken from Kaggle from this link except for images of the division sign. The images for the division are taken from this Kaggle Link.

The images of the dataset can be visualised from the following collage —

The data distribution can be seen in the following bar plot —

### Preprocessing Step

The preprocessing step includes the following sub-steps —

- Convert three-channel images to Grayscale images.
- Apply a threshold to all the images to convert the images to binary.
- Resize the thresholded images to a uniform size of (32x32x1)
- Encode the non-categorical labels like ‘add’, ‘sub’ to categorical labels.
- Split the dataset into train and test set in 80–20 ratio.

The implementation of the steps mentioned above is as follows —

https://gist.github.com/Yuvnish017/4869f8a51d2026fa00da49b9174a2f66

In **line 6**, OpenCV inbuilt function for thresholding is used. **Line 12 **contains the implementation of encoding the non-categorical labels using LabelEncoder class of sklearn. Finally, in **line 15**, the dataset is split into train and test sets.

The preprocessing step also includes converting the labels to one-hot vectors and normalising the images. The implementation is as follows —

https://gist.github.com/Yuvnish017/610a57156329b03e2b59cc62a1a54f23

### Building the CNN Model

The CNN model has the following characteristics —

- Three Convolutional layers with 32, 32, and 64 number of filters, respectively.
- A MaxPool2D layer follows each Convolutional layer.
- Three Fully Connected layers follow the convolutional layers for classification.

The Keras implementation is as follows —

https://gist.github.com/Yuvnish017/e372cffe6567e4583f9676af5eb50679

The L2 regulariser is used to avoid overfitting.

Model: "sequential"

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

conv1 (Conv2D) (None, 32, 32, 32) 320

_________________________________________________________________

act1 (Activation) (None, 32, 32, 32) 0

_________________________________________________________________

max_pooling2d (MaxPooling2D) (None, 16, 16, 32) 0

_________________________________________________________________

conv2 (Conv2D) (None, 16, 16, 32) 9248

_________________________________________________________________

act2 (Activation) (None, 16, 16, 32) 0

_________________________________________________________________

max_pooling2d_1 (MaxPooling2 (None, 8, 8, 32) 0

_________________________________________________________________

conv3 (Conv2D) (None, 8, 8, 64) 18496

_________________________________________________________________

act3 (Activation) (None, 8, 8, 64) 0

_________________________________________________________________

max_pooling2d_2 (MaxPooling2 (None, 4, 4, 64) 0

_________________________________________________________________

flatten (Flatten) (None, 1024) 0

_________________________________________________________________

dropout (Dropout) (None, 1024) 0

_________________________________________________________________

fc1 (Dense) (None, 120) 123000

_________________________________________________________________

fc2 (Dense) (None, 84) 10164

_________________________________________________________________

fc3 (Dense) (None, 14) 1190

=================================================================

Total params: 162,418

Trainable params: 162,418

Non-trainable params: 0

_________________________________________________________________

### Training the Model

https://gist.github.com/Yuvnish017/e0e5f5810ae355e713a73bdce489795b

Step Decay is used to decrease the value of the learning rate after every ten epochs. The initial learning rate is kept at 0.001. ImageDataGenerator class of Keras is used for data augmentation to provide a different image each time to the model. The batch size is saved at 128, and the model is trained for 100 epochs.

### Performance of the Model

The performance metrics used are as follows —

- Loss and Accuracy vs Epochs plot
- Classification report
- Confusion Matrix

Loss and Accuracy vs Epochs plot —

Classification Report —

precision recall f1-score support

0 0.94 0.97 0.96 113

1 0.99 0.93 0.96 115

2 0.99 0.99 0.99 97

3 1.00 0.97 0.98 116

4 0.94 0.99 0.97 101

5 1.00 0.85 0.92 82

6 0.89 0.99 0.94 120

7 0.93 1.00 0.97 86

8 0.98 0.97 0.97 127

9 0.98 1.00 0.99 102

10 1.00 0.96 0.98 113

11 1.00 1.00 1.00 84

12 0.99 0.99 0.99 114

13 1.00 0.99 1.00 150

accuracy 0.97 1520

macro avg 0.97 0.97 0.97 1520

weighted avg 0.97 0.97 0.97 1520

Confusion Matrix —

### Testing the Model on Images

https://gist.github.com/Yuvnish017/845056a2815f8b3e4670ebd704a3fcee

The pipeline includes reading the image, grayscale conversion, edge detection, contour detection, segmenting the digits and operators through detected contours, ROI extraction, preprocessing ROI, making predictions from the model on this ROI and displaying the results on the original image.

The result is of passing an image to the pipeline is shown below —

### Building Pipeline for Expression Calculation

https://gist.github.com/Yuvnish017/41a6982be1c0791db14e4b81c7bff0e8

The implementation is almost similar to the pipeline mentioned in the previous section. The differences are listed below —

**Line 2**initialises the list to store the predictions.**Line 37**appends the predictions to the ‘chars’ list.**Lines 47 to 58**construct the string representation of the expression.**Lines 59 and 60**evaluate the expression and print the value, respectively.

As mentioned earlier, the current implementation only recognises four basic arithmetic operators and does not recognise the brackets. So, For example, the expression to be solved is 22+16×16. This expression is interpreted as 22+(16×16), and a similar convention is used for the pipeline.

### Conclusion

The article summaries the Handwritten Optical Character Recognition. The implementation recognizes handwritten digits and four arithmetic operators and takes Handwriting Recognition one step forward. The pipeline can also evaluate the expression written. The level of the project can be increased further by including more mathematical operators and symbols and making the system more intelligent. The CNN model used has less than 250k parameters and can be easily deployed on edge devices. This deployment is possible through the Cainvas Platform by making use of their compiler called deepC. Thus effectively bringing AI out on edge — in actual and physical real-world use cases.

Notebook Link: Here

Credit: YUVNISH MALHOTRA