Tuesday, July 30, 2013

Part 3: Convolutional Neural Networks for Handwritten Digits Recognition

  Part 3 (Forward and Backward Propagation)


Forward Propagation

Each FM derives the values of its pixels by convolving its kernels with the FMs in the previous layer, as indicated in the first 3 layers of the figure below. Suppose layer J has 3 FMs; then each FM in layer J+1 will have 3 kernels. Each of these kernels is convolved with the corresponding FM in layer J to calculate the pixel values of the FM in layer J+1, as shown in the figure below. The 4th value added to each pixel in layer J+1 is the bias, which is added to every pixel of the FM. If there are more FMs in layer J+1, each would have 3 kernels of its own, which would be convolved with the FMs in layer J in a similar way to calculate the pixel values of those FMs.

After calculating the pixel values this way, we squash them into a suitable range (approximately -1.7 to +1.7) by applying an activation function (also called a squashing function or sigmoid). Here we use the activation function X[n] = 1.7159 tanh(0.66666667 Y[n]), where Y[n] and X[n] are the values of the pixel before and after passing through the activation function, respectively.
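
As a minimal C++ sketch (the function name here is mine, not necessarily the one used in the project), the activation can be written as:

#include <cmath>

// LeCun's scaled hyperbolic tangent: X = 1.7159 * tanh(2/3 * Y).
// Its output stays within about -1.7159 to +1.7159, and it maps
// Y = +1 to X = +1 and Y = -1 to X = -1.
double SIGMOID(double Y)
{
    return 1.7159 * tanh(0.66666667 * Y);
}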

Fig: Convolution of FMs with kernels and sub-sampling by a factor of 2 to get an FM in the next layer

Here a question can arise: the FMs calculated this way don't seem to have the size that convolution would produce. For example, convolution of a (13x13) array in layer 1 with the (5x5) kernels of layer 2 should result in (9x9) outputs, so why do we have (5x5) FMs in layer 2? In fact, the network proposed by Dr. LeCun has sub-sampling layers after layer 1 and layer 2, which simply discard alternate pixels in each row and column of each FM. Dr. Simard advocated an approach where, instead of first calculating the whole convolution output, he calculates only the pixels that will be retained after sub-sampling. This is equivalent to setting a step size by which the convolution window in the figure above jumps in both the horizontal and vertical directions, instead of moving by one pixel as in the general convolution shown in the figure in Part 1. The sketch below illustrates this.
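
As a minimal standalone sketch (function and variable names are my own, not from the project code), building the step size into the convolution loop looks like this:

// Convolve a square (inSize x inSize) input with a (kSize x kSize)
// kernel, moving the window by 'step' pixels horizontally and
// vertically, so that sub-sampling is built into the convolution.
// With inSize = 13, kSize = 5, step = 2, the output is 5 x 5,
// matching the FMs of layer 2.
void ConvolveWithStep(const double *input, int inSize,
                      const double *kernel, int kSize,
                      int step, double *output, int outSize)
{
    for (int r = 0; r < outSize; r++)
        for (int c = 0; c < outSize; c++)
        {
            double sum = 0.0;
            for (int i = 0; i < kSize; i++)
                for (int j = 0; j < kSize; j++)
                    sum += input[(r * step + i) * inSize + (c * step + j)]
                         * kernel[i * kSize + j];
            output[r * outSize + c] = sum;
        }
}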

Dr. LeCun, Dr. Simard, and Mike have all defined the last two layers as fully connected layers. For example, Mike defines layer 3 as consisting of 100 neurons, each connected to each of the 1250 neurons of layer 2. This is in fact the same as the way we have defined this layer: since layer 2 has (5x5) FMs and layer 3 has (5x5) kernels, each pixel in layer 3 is automatically connected to each pixel in layer 2. Mike defines the weights of a layer all together. For example, the 125,100 weights of layer 3 are defined in one array, and separate arrays are then used to define which weight is used for which connection. I think this makes the programming a bit difficult and heavy. In my implementation, the weights belong to the FMs, and therefore there is no need to define the connections separately. I have kept the same approach for layer 4: its FMs have (1x1) kernels and layer 3 has (1x1) FMs, which makes it fully connected automatically.

During backpropagation, which will be explained later, we will need the derivatives of the error at the output of each layer with respect to its pixel values and with respect to its weights. Therefore we need two additional arrays in each FM, equal in size to the FM and to the weights, respectively. These arrays and their sizes are also mentioned below each layer in the figure.

Backpropagation

Backpropagation is needed for training the CoNN. The error in the output, i.e., in the pixel values of the last layer, is propagated back through the layers, and their weights are adjusted. This is done in the following steps (a consolidated code sketch after step 6 illustrates the main calculations):
  1. First, we find the derivative of the error with respect to neuron values in the last layer. This is done simply by subtracting the ideal output values from the actual.

    dErrorX =  X – Xideal

    Note that only one neuron in the last layer, the one corresponding to the digit under investigation, has an ideal value of +1; all other neurons have an ideal value of zero.

  2. The derivative calculated this way is with respect to the output obtained after applying the activation function. We get the error with respect to the “output before applying the activation function” by applying the function InverseSIGMOID to the output and multiplying the result by the derivative calculated above.

    dErrorY = InverseSIGMOID(X) * dErrorX

    where

    InverseSIGMOID(X) = 0.66666667/1.7159 * (1.7159 + X) * (1.7159 – X)

    Despite its name, InverseSIGMOID is not a true inverse; it is the derivative of the activation function, expressed in terms of the output X rather than the input Y.

  3. Next we calculate the derivative with respect to the weights. All the neurons in the previous layer that shared a particular weight during forward propagation are used for calculating the derivative of the error with respect to that weight.

    Consider the forward propagation shown in the figure above. A (2x2) block of pixels in an FM in layer J is connected to a pixel in layer J+1 through a (2x2) kernel. Denoting by (X)j the pixel values of layer J (after activation), the contribution of these connections to the derivative of the error with respect to the weights in this kernel is calculated as

    dErrorW00 += (dErrorY00)j+1 * (X00)j
    dErrorW01 += (dErrorY00)j+1 * (X01)j
    dErrorW10 += (dErrorY00)j+1 * (X10)j
    dErrorW11 += (dErrorY00)j+1 * (X11)j

    The same kernel also connects the other pixels in layer J+1 to the other pixels of the same FM in layer J, and all those connections contribute their share to the total dErrorW of this kernel in the same way. The same procedure is repeated for all the other kernels belonging to each FM in layer J+1. The bias is added to each pixel of an FM during forward propagation, without having a connection to the previous layer. Therefore, the derivative of the error with respect to the bias is calculated simply by summing up dErrorY over all the pixels of the FM.

    Take the example of the first FM in layer 4. It has a bias and 100 weights, each of which is connected to a different neuron in layer 3. The FM has only one pixel, and its dErrorY is multiplied by the value of each of the 100 neurons in layer 3 to which it is connected (for the bias, this multiplier is +1); this gives dErrorW for all these weights.

    Now take another example: the first weight of the first kernel of the first FM of layer 2. It connects the 25 pixels of this FM to 25 different pixels of the first FM in layer 1. Therefore we have 25 multiplications, and the summation of their results gives the dErrorW for this weight.

    From the above explanation, it can be understood that backpropagation is very similar to the convolution used in forward propagation, and its code is easy to write, once the forward propagation has been implemented correctly.

  4. After calculating dErrorW, we can update the weights using the following formulas:

    (weight)new = (weight)old – eta * dErrorW
    where eta is the learning rate and a very important design parameter.

  5. Now we are done with this layer and move to the previous layer. For the last layer it was easy to find dErrorX in step 1, but it is not that easy for the other layers, because we do not know the ideal values of their outputs. We have to use a procedure similar to step 3: for each neuron in the previous layer, the dErrorY of all the neurons that received a contribution from it are multiplied by the corresponding weights, and the results are summed up. This gives dErrorX for the previous layer, and then steps 2-4 can be used to update its weights.

    Consider the figure above, where one pixel in layer J+1 gets a share from 4 pixels of an FM in layer J. During backpropagation, this pixel in layer J+1 contributes its share to the dErrorX of those 4 pixels in layer J in the following way:

    (dErrorX00)j += (dErrorY00)j+1 * (W00)j
    (dErrorX01)j += (dErrorY00)j+1 * (W01)j
    (dErrorX10)j += (dErrorY00)j+1 * (W10)j
    (dErrorX11)j += (dErrorY00)j+1 * (W11)j

    Note that the same pixel in layer J+1 is connected to the same 4 pixels in other FMs in layer J through different weights, and will contribute to their dErrorX in the same way. Similarly, if there were more FMs in layer J+1, they would also contribute to all pixels in each FM of layer J in the same way. The coding of this step is very similar to that of step 3.

    For further understanding of this step, consider the first pixel of the first FM of layer 1. It connects to the first pixel of each of the 50 FMs of layer 2 through 50 different weights (the first weight of the first kernel in each FM of layer 2). The dot product of these 50 weights with the dErrorY of these 50 neurons in layer 2 gives the dErrorX for the first neuron of layer 1.

  6. After step 5, we move back to step 2 and repeat the loop for all layers, except the first layer, which does not have any weights to update.
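
The following standalone sketch puts steps 2-5 together for a fully connected layer (like layers 3 and 4 of this network). It is a simplified illustration with my own names, not the member functions of the actual implementation:

#include <cmath>

// Derivative of the activation function, written in terms of its
// output X (the InverseSIGMOID of step 2).
double DSIGMOID(double X)
{
    return 0.66666667 / 1.7159 * (1.7159 + X) * (1.7159 - X);
}

// One backpropagation pass through a fully connected layer.
// w[o * nIn + i] is the weight from input i to output o.
void BackPropLayer(const double *Xprev,   // layer inputs (nIn values)
                   const double *X,       // layer outputs (nOut values)
                   double *w,             // weights (nOut x nIn)
                   double *bias,          // biases (nOut values)
                   const double *dErrX,   // dError wrt outputs (from step 1 or 5)
                   double *dErrXprev,     // out: dError wrt inputs (step 5)
                   int nIn, int nOut, double eta)
{
    for (int i = 0; i < nIn; i++)
        dErrXprev[i] = 0.0;

    for (int o = 0; o < nOut; o++)
    {
        // Step 2: error wrt the output before the activation function.
        double dErrY = DSIGMOID(X[o]) * dErrX[o];

        for (int i = 0; i < nIn; i++)
        {
            // Step 5: accumulate the error for the previous layer,
            // using the weight value before it is updated.
            dErrXprev[i] += dErrY * w[o * nIn + i];

            // Steps 3 and 4: dErrorW = dErrY * input, then update.
            w[o * nIn + i] -= eta * dErrY * Xprev[i];
        }

        // The bias multiplier is +1 (step 3).
        bias[o] -= eta * dErrY;
    }
}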


Ishtiaq Rasool Khan
Professor,
Department of Computer Science and AI
University of Jeddah
https://ishtiaqrasool.github.io/


Part 2: Convolutional Neural Networks for Handwritten Digits Recognition

Part 2 (CoNN: Theory and Implementation)

Ishtiaq Rasool Khan
Professor,
Department of Computer Science and AI
University of Jeddah
https://ishtiaqrasool.github.io/


For the implementation of a CoNN, an understanding of convolution and of forward and backward propagation is very important. In this part, I explain convolution and the basic structure of a CoNN. Then the three basic classes of my implementation are described. Readers will note that the implementation becomes quite convenient with these classes. Forward and backward propagation will be explained in Part 3.
 

Convolution


The convolution of an (NxN) image with a (KxK) kernel can be understood by sliding a (KxK) window over the input image iteratively. For each position of the window, an output pixel is generated by taking the dot product (sum of the multiplication of the corresponding pixels) of the kernel with the input pixels lying under the window. In the following figures we have shown the calculation of the first two pixels of the output image Y, generated by convolution of a (6x6) input image X with a (2x2) kernel W.


 Calculation of the 1st output pixel

Calculation of the 2nd output pixel

Fig: Pictorial description of 2D convolution

It can be noted that the size of the output image obtained by convolution of an (N1xN2) input image with a (K1xK2) kernel is (N1-K1+1) x (N2-K2+1).
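
As a minimal C++ sketch (a standalone function with assumed names, not part of the classes described below), the convolution described above can be written as:

// 2D convolution of an (n x n) image X with a (k x k) kernel W.
// Each output pixel is the dot product of the kernel with the image
// pixels under the window; the output Y is (n-k+1) x (n-k+1).
void Convolve2D(const double *X, int n, const double *W, int k, double *Y)
{
    int m = n - k + 1;                      // output size
    for (int r = 0; r < m; r++)
        for (int c = 0; c < m; c++)
        {
            double sum = 0.0;
            for (int i = 0; i < k; i++)
                for (int j = 0; j < k; j++)
                    sum += X[(r + i) * n + (c + j)] * W[i * k + j];
            Y[r * m + c] = sum;
        }
}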

Convolutional Neural Networks (CoNN)

The first layer has only one feature map, which is the input image itself. In each of the following layers, each feature map keeps a certain number of unique kernels (2D arrays of weights), equal to the number of feature maps in the previous layer. All kernels in a feature map have the same size, which is a design parameter. The pixel values in a feature map are derived by convolving its kernels with the corresponding feature maps in the previous layer. The number of feature maps in the last layer is equal to the number of output options. For example, in 0-9 digit recognition there will be 10 feature maps in the output layer, and the feature map with the highest pixel value gives the result.
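
For example, the recognized digit can be read off the output layer with a small helper like this (a sketch; the names are assumed):

// Return the index of the output feature map (one per digit, each of
// size 1x1 in the last layer) with the highest pixel value.
int Recognize(const double *output, int nClasses)
{
    int best = 0;
    for (int i = 1; i < nClasses; i++)
        if (output[i] > output[best])
            best = i;
    return best;
}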


Fig. Structure of CoNN designed in this blog


The CoNN implemented in this study is shown in the figure above. The number of feature maps (FMs) in each layer and their sizes are shown under each layer. For example, layer 1 has 6 feature maps, each of size 13x13. The number of kernels each feature map in a layer contains is also shown in the figure. For example, W[6][25] written under layer 2 indicates that each feature map in this layer has 6 kernels (equal to the number of FMs in the previous layer), each of which has 25 weights (a 5x5 array). In addition to these weights, each FM has a bias weight (for its importance, please refer to any NN text). Each layer has another parameter, SF (sampling factor), which will be explained in the forward propagation section. Three other parameters, dBias, dErrorW, and dErrorFM, will be explained in the backpropagation section. The last two lines below each layer in the figure give the total number of pixels (neurons) and weights in the whole layer.
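
As a quick check of these counts (using the numbers quoted elsewhere in this series: 6 FMs in layer 1, and 50 FMs of size 5x5 in layer 2), each FM of layer 2 has 6 x 25 = 150 kernel weights plus 1 bias, so the layer has 50 x 151 = 7,550 weights in total, and 50 x 5 x 5 = 1,250 pixels (neurons). This is the same 1,250 that appears in the discussion of the fully connected layer 3 in Part 3.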

Implementation

It can be understood from the above discussion that the topmost structure in our implementation would be the CoNN itself. We write a CCNN class as given below:

class CCNN
{
public:

    Layer *m_Layer;
    int m_nLayer;

    CCNN(void);
    ~CCNN(void);

    void ConstructNN();
    void DeleteNN();

    // load weights from a text file
    void LoadWeights(char *FileName);

    // initialize weights to random values
    void LoadWeightsRandom();

    // save current weights to a text file
    void SaveWeights(char *FileName);

    // forward propagate
    int Calculate(double *input, double *output);

    // backward propagate
    void BackPropagate(double *desiredOutput, double eta);

    // diagonal Hessian to speed up learning
    void CalculateHessian();
};

CCNN contains an array of layers, *m_Layer, and a variable m_nLayer, which keeps a record of the number of layers in the network. The member function LoadWeights() loads the weights of each layer from a text file, and SaveWeights() saves the current weights. LoadWeightsRandom() initializes the weights to random floating-point values; it is used before starting a new training session. Calculate() performs forward propagation using the current weights and gets the result. BackPropagate() is used for training the CoNN; it flows the error at the output of the last layer back through the previous layers and adjusts the weights to minimize the error. CalculateHessian() calculates the diagonal Hessian to speed up the learning of the CoNN. In fact, the last three functions simply call the corresponding functions of the Layer class, which is explained below.
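
Before moving to Layer, here is a hedged usage sketch of CCNN. The input size of 29x29, the eta value, the file name, and the meaning of Calculate()'s return value are assumptions for illustration; in the project these are driven by the GUI and the MNIST-format data files:

// Hypothetical driver code around CCNN (sizes and eta assumed).
CCNN cnn;
cnn.ConstructNN();
cnn.LoadWeightsRandom();        // start a fresh training session

double input[29 * 29];          // one input image
double output[10];              // one value per digit 0-9
double desired[10];             // ideal outputs: +1 for the true digit

// ... fill input[] and desired[] from the training data ...

int digit = cnn.Calculate(input, output);   // forward propagation
                                            // (return value assumed to be the recognized digit)
cnn.BackPropagate(desired, 0.001);          // one training step, eta = 0.001

cnn.SaveWeights("weights.txt");             // keep the trained weights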

class Layer
{
public:

    FeatureMap *m_FeatureMap;
    int m_nFeatureMap;
    int m_FeatureSize;
    int m_KernelSize;
    int m_SamplingFactor;
    Layer *pLayerPrev;

    void ClearAll();
    void Calculate();
    void BackPropagate(int dOrder, double etaLearningRate);
    void Construct(int nFeatureMap, int FeatureSize, int KernelSize, int SamplingFactor);
    void Delete();
};

Layer has a pointer to an array of feature maps *m_FeatureMap, and a variable m_nFeatureMap for the number of FMs in this layer. It also has variables for the size of the feature map (m_FeatureSize), the size of the kernel (m_KernelSize), and the sampling factor (m_SamplingFactor) which is the step size of the sliding window during convolution.

A layer derives the pixel values of its FMs from the FMs of the previous layer during forward propagation, and during backpropagation it sends the error back to the previous layer. Therefore a pointer to the previous layer, *pLayerPrev, is also provided in Layer. ClearAll() initializes all arrays to zero. Calculate() and BackPropagate() perform forward and backward propagation, respectively. The dOrder parameter of the latter defines the order of the derivative of the error that is to be propagated: for training, the first-order derivative is backpropagated, and for calculating the diagonal Hessian, the second-order derivative is backpropagated. These will be explained later.

Finally, we define a class FeatureMap as

class FeatureMap
{
public:

    double bias, dErr_wrtb, diagHessianBias;
    double *value, *dError;
    double **kernel, **diagHessian, **dErr_wrtw;
    Layer *pLayer;

    void Construct();
    void Delete();
    void Clear();
    void ClearDError();
    void ClearDiagHessian();
    void ClearDErrWRTW();

    double Convolute(double *input, int size, int r0, int c0, double *weight, int kernel_size);
    void Calculate(double *valueFeatureMapPrev, int idxFeatureMapPrev);
    void BackPropagate(double *valueFeatureMapPrev, int idxFeatureMapPrev, double *dErrorFeatureMapPrev, int dOrder);
};

FeatureMap defines the variables bias, dErr_wrtb (the derivative of the error with respect to the bias), and diagHessianBias (the diagonal Hessian for the bias). It defines arrays for holding the pixel values (*value) and the derivative of the error with respect to the pixel values (*dError). The weights are stored in the 2D array **kernel, and the derivatives of the error with respect to the weights in **dErr_wrtw. The diagonal Hessian for each weight is stored in **diagHessian. FeatureMap has a pointer to its parent Layer, *pLayer, which is used to access the size of the FM, its kernels, and other information required for the different calculations.
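
As an illustration, here is one plausible body for Convolute(), consistent with its signature above (a sketch, not necessarily the project's exact code). It computes a single output pixel as the dot product of a kernel with the block of the input whose top-left corner is at (r0, c0):

double FeatureMap::Convolute(double *input, int size, int r0, int c0,
                             double *weight, int kernel_size)
{
    // Dot product of the (kernel_size x kernel_size) kernel with the
    // corresponding block of the (size x size) input image.
    double sum = 0.0;
    for (int i = 0; i < kernel_size; i++)
        for (int j = 0; j < kernel_size; j++)
            sum += input[(r0 + i) * size + (c0 + j)] * weight[i * kernel_size + j];
    return sum;
}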

Some of the functions in these three classes will be explained in part 3.

Links

Part 1 (Introduction)
Part 2 (CoNN: Theory and Implementation)
Part 3 (Forward and Backward Propagation)
Source Code

Training Data Set
Training Data Set Labels
Test Data Set
Test DataSet Labels

Part 1: Convolutional Neural Networks for Handwritten Digits Recognition

Part 1 (Introduction)

Ishtiaq Rasool Khan
Professor,
Department of Computer Science and AI
University of Jeddah
https://ishtiaqrasool.github.io/

Introduction

Dr. Yann LeCun proposed a 5-layer CoNN for handwritten character recognition in his famous article:

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.

Later, Dr. Simard published his work in 2003:

Patrice Y. Simard, Dave Steinkraus, John Platt, "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis," International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, Los Alamitos, pp. 958-962, 2003.

He gave some practical tips regarding the software implementation of Dr. LeCun's work and suggested a few techniques for getting better output. Dr. LeCun did not program his CoNN in C/C++, and Dr. Simard did not make his code public.

Mike O'Neill programmed this ANN in Visual C++, and his code and excellent explanation are publicly available at
http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx

What is different in this implementation?

One problem with Mike's code is that he extensively used the Microsoft Foundation Classes (MFC) library, STL vectors, and unnecessary (in my opinion) multithreading, which makes it difficult to understand for people like me, who feel more comfortable without these things (especially Microsoft-specific strings of "random CAPITAL letters").

I have implemented the same CoNN in as simple and pure C/C++ as possible. MFC has been used for the interface and the display of results; however, its use has been kept to a minimum. There is only one dialog for all the interfaces, and it is implemented using MFC. The implementation of the CoNN is in two separate files (CNN.h and CNN.cpp) which contain three classes (CCNN, CLayer, and CFeatureMap), all written in standard C++. Anyone wishing to get rid of MFC completely can use these classes in his/her own code.

Mike's code has a bug in the implementation of the 2nd layer (as has been discussed in Q&A part of his notes), and this is the reason why he could not achieve higher accuracy. I have corrected that bug in my implementation and that is why it has been able to achieve a better performance.

In general, Mike's code is excellent work; however, it has one deficiency: it is written only for the NN proposed by Dr. LeCun and Dr. Simard and has not been generalized. It has five layers of fixed, pre-set sizes. If a user wants to change the number of layers or the number of neurons in a certain layer, he has to rewrite a significant part of the code. I have tried to generalize the code: a user can choose the number of layers and the number of neurons from the main routine, without modifying the NN implementation code.

Dr. LeCun defined sampling layers in between the convolutional layers, which create subsampled, reduced-size versions of the feature maps. Dr. Simard proposed a more efficient approach that eliminates these sampling layers and instead incorporates the subsampling into the operation of the convolutional layers. Mike followed this approach, and I have maintained it as well. Going one step further, I have eliminated the difference between the convolutional and the fully connected layers: a fully connected layer having an (NxN) feature map is simply defined as a convolutional layer having NxN feature maps, each of size (1x1). Details will come later, where the implementation is explained.

Training and Testing Data Sets

A database of handwritten patterns was prepared by the National Institute of Standards and Technology (NIST); the patterns were written by different individuals, including high school students and US census workers. Dr. LeCun modified this dataset and split it into training and test sets containing 60,000 and 10,000 images of handwritten digits, respectively. I have provided links to these datasets at the end of this page.

Dr. Simard, in his implementation, induced different kinds of distortions (scaling, rotation, Gaussian filtering, etc.) in the training images. This can be considered equivalent to increasing the number of training images several times.

According to Mike, distortion helps in training the neural network, since it forces the network to extract the intrinsic shapes of the patterns, rather than allowing the network to (incorrectly) focus on peculiarities of individual patterns. He mentions that he could not reach a certain level of accuracy without inducing distortion. In my implementation, I could achieve better results without inducing distortion in the training images, compared to what Mike could get with distortion-induced extended data. The reason might be the bug in Mike's code, as mentioned above.

In the current phase, I have trained the NN without distortion. In a future implementation, I will try to use distortion to find out whether it really helps.
(Update: It has been implemented. You can check the "Induce Distortion" box and the training will be done using distorted patterns.)

Description of GUI

A trained (or untrained) NN can be tested on individual samples, or on the whole set of 10,000 test samples, using the buttons on the left side of the interface dialog box. Individual samples can be tested by clicking the "Get", ">>", or "<<" buttons. The sample whose index is shown above these buttons is displayed in "Input". The result and the true result are shown in "Output" and "Label", respectively. If the "Induce Distortion" box at the bottom is checked, different kinds of distortion (scaling, rotation, Gaussian filtering, etc.) are induced in the input sample before testing. The distorted sample is shown in the box labeled "Distorted". The radio buttons at the bottom provide an option to choose between the training and test datasets on which the NN will be tested.


Fig: User Interface of Convolutional Neural Network for Handwritten Digit Recognition

"Load NN Weights" button is provided to load the weights obtained during a previous training session. At the start, the NN is loaded with random weights, and therefore it will not be able to recognize the samples correctly until it undergoes some training, or the weights from a previous training session are loaded.
The training session starts with a click on "Training". There is an option, provided with radio buttons, to start a new training from scratch, or resume from a previous training session. Mismatches in each epoch during training are listed in the "Training History" edit box. After each epoch of training, the weights are stored in a text file. Weights from a previous training session can be loaded by clicking on "Load NN Weights" button.

When clicked on "Training", an open-file dialog appears asking for the training dataset and training dataset labels (the correct results) files, if they are not opened already. Generally, the training should be carried out using the training dataset only, so that you can check the performance of the CoNN using the test dataset which has not been seen by the CoNN previously. However, you can experiment by training with both datasets and see how much improvement can be achieved in performance. The radio buttons on the left that provide a choice between training or test datasets are related only to the testing. If you want to train the CoNN using the test dataset, choose the appropriate file during the file open dialog.

The "Induce Distortion" checkbox applies to both testing and training. The NN can be trained using distorted samples, and also its performance can be tested on distorted samples by checking this box.
Counters provided on the top right give the number of misrecognized patterns, and the total number of patterns processed so far, during training or testing. They are reset to zero at beginning of each testing or training epoch.

Links

Part 1 (Introduction)
Part 2 (CoNN: Theory and Implementation)
Part 3 (Forward and Backward Propagation)

Source Code

Training Data Set
Training Data Set Labels
Test Data Set
Test DataSet Labels