Tuesday, July 30, 2013

Part 3: Convolutional Neural Networks for Handwritten Digits Recognition

  Part 3 (Forward and Backward Propagation)


Forward Propagation

Each FM (feature map) derives the values of its pixels by convolving its kernels with the FMs in the previous layer, as indicated in the first 3 layers in the figure below. Suppose layer J has 3 FMs; then each FM in layer J+1 will have 3 kernels. Each of these kernels is convolved with the corresponding FM in layer J to calculate the pixel values of the FM in layer J+1, as shown in the figure below. The 4th value added to each pixel in layer J+1 is the bias, which is added to every pixel of the FM. If there are more FMs in layer J+1, each would have 3 kernels of its own, which would be convolved with the FMs in layer J in a similar way to calculate the pixel values of those FMs.

After calculating pixel values this way, we squash them into a limited range by applying an activation function (also called the squashing function, or sigmoid). Here we use the activation function X[n] = 1.7159 tanh(0.66666667 Y[n]), whose output lies between -1.7159 and +1.7159; Y[n] and X[n] are, respectively, the values of the pixel before and after passing through the activation function.
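
A minimal sketch of this activation function (C++ is assumed for the code sketches in this post; the function name is mine, not necessarily that of the actual implementation):

    #include <cmath>

    // Squashing function: X = 1.7159 * tanh(0.66666667 * Y)
    double SIGMOID(double y)
    {
        return 1.7159 * std::tanh(0.66666667 * y);
    }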

Convolution of FMs with kernels and sub-sampling by a factor of 2 to get an FM in the next layer

Here a question may arise: the FMs calculated this way do not have the size that convolution would normally produce. For example, convolution of a (13x13) array in layer 1 with the (5x5) kernels of layer 2 should result in (9x9) outputs, so why do we have (5x5) FMs in layer 2? In fact, the network proposed by Dr. LeCun has sub-sampling layers after layer 1 and layer 2, which simply discard alternate pixels in each row and column of each FM. Dr. Simard advocated an approach in which, instead of first calculating the whole convolution output, only the pixels that would be retained after sub-sampling are calculated. This is equivalent to setting a step size of 2, by which the convolution window in the figure above jumps in both the horizontal and vertical directions, instead of moving by one pixel as in the general convolution shown in the figure in Part 1. With a step size of 2, a (13x13) FM convolved with a (5x5) kernel yields ((13-5)/2 + 1) = 5 pixels per row and column, i.e., a (5x5) FM.
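
The following sketch shows how one FM of the next layer can be computed this way, convolving with a step of 2 so that only the pixels retained after sub-sampling are calculated. It reuses the SIGMOID function above; the names and the flat row-by-row storage are illustrative assumptions, not the actual implementation.

    #include <vector>

    // Forward propagation of one FM of layer J+1 from the FMs of layer J.
    // prev    : the FMs of layer J, each prevSize x prevSize, stored row by row
    // kernels : one kernel per FM of layer J, each kSize x kSize, stored row by row
    // returns : the pixels of this FM, outSize x outSize with outSize = (prevSize - kSize) / step + 1
    std::vector<double> ForwardOneFM(const std::vector<std::vector<double>>& prev,
                                     const std::vector<std::vector<double>>& kernels,
                                     double bias, int prevSize, int kSize, int step = 2)
    {
        const int outSize = (prevSize - kSize) / step + 1;
        std::vector<double> out(outSize * outSize);
        for (int r = 0; r < outSize; r++)
            for (int c = 0; c < outSize; c++)
            {
                double sum = bias;                        // bias is added to every pixel
                for (size_t f = 0; f < prev.size(); f++)  // one kernel per FM of the previous layer
                    for (int u = 0; u < kSize; u++)
                        for (int v = 0; v < kSize; v++)
                            sum += kernels[f][u * kSize + v] *
                                   prev[f][(r * step + u) * prevSize + (c * step + v)];
                out[r * outSize + c] = SIGMOID(sum);      // squash to get the final pixel value
            }
        return out;
    }

With prevSize = 13, kSize = 5, and step = 2, this gives the (5x5) FMs of layer 2 mentioned above.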

Dr. LeCun, Dr. Simard, and Mike have all defined the last two layers as fully connected layers. For example, Mike defines layer 3 as consisting of 100 neurons, each connected to every one of the 1250 neurons of layer 2. This is in fact the same as the way we have defined this layer: since layer 2 has (5x5) FMs and layer 3 has (5x5) kernels, each pixel in layer 3 is automatically connected to every pixel in layer 2. Mike defines the weights of a layer all together; for example, the 125100 weights of layer 3 are defined in one array, and separate arrays are then used to define which weight is used for which connection. I think this makes the programming somewhat cumbersome. In my implementation, the weights belong to the FMs, so there is no need to define the connections separately. I have kept the same approach for layer 4: its FMs have (1x1) kernels and layer 3 has (1x1) FMs, which makes it fully connected automatically.

During backpropagation, which is explained below, we will need the derivatives of the error with respect to the pixel values and the weights of each layer. Therefore each FM needs two additional arrays, equal in size to the FM and to its weights, respectively. These arrays and their sizes are also mentioned below each layer in the figure.
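
As a rough illustration of this layout (the structure and member names are mine, not those of the actual implementation), each FM can carry its pixels, its weights, and the two derivative arrays:

    #include <vector>

    // Hypothetical per-FM storage, following the figure: the weights (kernels plus
    // bias) belong to the FM itself, and two extra arrays hold the derivatives
    // needed during backpropagation.
    struct FeatureMap
    {
        int size;                     // the FM is size x size pixels
        std::vector<double> pixels;   // X: pixel values after the activation function
        std::vector<double> weights;  // all kernels of this FM, followed by the bias
        std::vector<double> dErrorX;  // dError w.r.t. the pixel values (same size as pixels)
        std::vector<double> dErrorW;  // dError w.r.t. the weights      (same size as weights)
    };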

Backpropagation

Backpropagation is needed for training the CoNN. The error in the output, i.e., in the pixel values of the last layer, is propagated back through the layers, and their weights are adjusted. This is done in the following steps (a code sketch combining these steps is given after step 6):
  1. First, we find the derivative of the error with respect to neuron values in the last layer. This is done simply by subtracting the ideal output values from the actual.

    dErrorX = X - Xideal

    Note that the ideal value of only one neuron in the last layer, the one corresponding to the digit under investigation, is +1; all other neurons have an ideal value of zero.

  2. The derivative calculated this way is with respect to the output obtained after applying the activation function. We get the derivative of the error with respect to the output before the activation function by multiplying the derivative calculated above with the derivative of the activation function, which can be computed directly from the output X as shown below.

    dErrorY = InverseSIGMOID(X) * dErrorX

    where

    InverseSIGMOID(X) = 0.66666667/1.7159 * (1.7159 + X) * (1.7159 - X)

  3. Next we calculate the derivative of the error with respect to the weights. All the neurons in the previous layer that shared a particular weight during forward propagation are used in calculating the derivative of the error with respect to that weight.

    Consider the forward propagation shown in the figure above. A (2x2) block in an FM in layer J is connected to a pixel in layer J+1 through a (2x2) kernel. The contribution of these connections to the derivative of the error with respect to the weights in this kernel is calculated as

    dErrorW00 += dErrorY00 * Y00
    dErrorW01 += dErrorY00 * Y01
    dErrorW10 += dErrorY00 * Y10
    dErrorW11 += dErrorY00 * Y11

    The same kernel also connects the other pixels of this FM in layer J+1 to other pixels in the same FM in layer J, and all those connections contribute their share in the same way to the total dErrorW of this kernel. The same procedure is repeated for all the other kernels belonging to each FM in layer J+1. The bias is added to every pixel of an FM during forward propagation without a connection to the previous layer (its multiplier is +1); therefore, dError with respect to the bias is calculated simply by summing up dErrorY over all pixels of the FM.

    Take the example of the first FM in layer 4. It has a bias and 100 weights, each of which is connected to a different neuron in layer 3. The FM has only one pixel; its dErrorY is multiplied with the value of each of the 100 neurons to which it is connected in layer 3 (for the bias, this multiplier is +1), and we get dErrorW for all these weights.

    Now take another example: the first weight of the first kernel of the first FM of layer 2. It connects each of the 25 pixels of that FM to a different pixel of the first FM in layer 1. Therefore we have 25 multiplications, and the sum of their results gives the dErrorW for this weight.

    From the above explanation, it can be understood that backpropagation is very similar to the convolution used in forward propagation, and its code is easy to write, once the forward propagation has been implemented correctly.

  4. After calculating dErrorW, we can update the weights using the following formulas:

    (weight)new = (weight)old - eta * dErrorW
    where eta is the learning rate and a very important design parameter.

  5. Now we are done with this layer and need to move to the previous layer. For the last layer, it was easy to find dErrorX in step 1, but it is not that easy for the other layers because we do not know the ideal values of their outputs. We have to use a procedure similar to step 3: for each neuron of the previous layer, the dErrorY values of all the neurons in the current layer that received a contribution from it are multiplied with the corresponding weights and the results are summed up. This gives dErrorX for the previous layer, and then steps 2-4 can be used to update its weights.

    Consider the figure above, where one pixel in layer J+1 gets a share from 4 pixels of an FM in layer J. During backpropagation, this pixel in layer J+1 contributes its share to the dErrorX of those 4 pixels in layer J in the following way:

    (dErrorX00)j += (dErrorY00)j+1 * (W00)j
    (dErrorX01)j += (dErrorY00)j+1 * (W01)j
    (dErrorX10)j += (dErrorY00)j+1 * (W10)j
    (dErrorX11)j += (dErrorY00)j+1 * (W11)j

    Note that the same pixel in layer J+1 is also connected to the corresponding 4 pixels of the other FMs in layer J through different weights, and contributes to their dErrorX in the same way. Similarly, if there were more FMs in layer J+1, they would also contribute to all the pixels of each FM in layer J in the same way. The coding of this step is very similar to that of step 3.

    For further understanding of this step, consider the first pixel of the first FM of layer 1. It connects to the first pixel of each of the 50 FMs of layer 2 through 50 different weights (the first weight of the first kernel of each FM of layer 2). The dot product of these 50 weights with the dErrorY of these 50 neurons in layer 2 gives dErrorX for the first neuron in layer 1.

  6. After step 5, we move back to step 2 and repeat the loop for each layer, except the first layer, which does not have any weights to update.
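
To tie these steps together, here is a rough sketch of steps 1 to 5 for a fully connected layer such as layer 3 or layer 4 (for the convolutional layers, the same accumulations run over the strided convolution windows instead of over every neuron of the previous layer). All names are again my own illustrative choices, not those of the actual implementation.

    #include <vector>

    // Derivative of the activation function, expressed in terms of the output X
    double InverseSIGMOID(double x)
    {
        return 0.66666667 / 1.7159 * (1.7159 + x) * (1.7159 - x);
    }

    // Step 1 (last layer only): dErrorX = X - Xideal, where Xideal is +1 for the
    // neuron of the target digit and 0 for all the others.
    void OutputError(const std::vector<double>& X, int targetDigit, std::vector<double>& dErrorX)
    {
        dErrorX.resize(X.size());
        for (int n = 0; n < static_cast<int>(X.size()); n++)
            dErrorX[n] = X[n] - (n == targetDigit ? 1.0 : 0.0);
    }

    // Steps 2-5 for one fully connected layer with N neurons, connected to a
    // previous layer with M neurons.
    void BackpropLayer(const std::vector<double>& X,         // outputs of this layer (N values)
                       const std::vector<double>& dErrorX,   // dError w.r.t. X (already known)
                       const std::vector<double>& Xprev,     // outputs of the previous layer (M values)
                       std::vector<double>& dErrorXprev,     // dError w.r.t. Xprev (computed here)
                       std::vector<std::vector<double>>& w,  // w[n][m]: weight from neuron m to neuron n
                       std::vector<double>& bias,            // bias[n] of each neuron of this layer
                       double eta)                           // learning rate
    {
        const size_t N = X.size(), M = Xprev.size();
        std::vector<std::vector<double>> dErrorW(N, std::vector<double>(M, 0.0));
        std::vector<double> dErrorBias(N, 0.0);
        dErrorXprev.assign(M, 0.0);

        for (size_t n = 0; n < N; n++)
        {
            // Step 2: error w.r.t. the value before the activation function
            double dErrorY = InverseSIGMOID(X[n]) * dErrorX[n];
            dErrorBias[n] += dErrorY;                        // bias multiplier is +1

            for (size_t m = 0; m < M; m++)
            {
                dErrorW[n][m] += dErrorY * Xprev[m];         // step 3: derivative w.r.t. this weight
                dErrorXprev[m] += dErrorY * w[n][m];         // step 5: pass the error to the previous layer
            }
        }

        // Step 4: update the weights and biases with the learning rate eta
        for (size_t n = 0; n < N; n++)
        {
            for (size_t m = 0; m < M; m++)
                w[n][m] -= eta * dErrorW[n][m];
            bias[n] -= eta * dErrorBias[n];
        }
    }

Calling OutputError for the last layer and then BackpropLayer repeatedly from layer 4 down to layer 2, feeding each call's dErrorXprev to the next call as dErrorX, implements the loop described in step 6.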


Ishtiaq Rasool Khan
Professor,
Department of Computer Science and AI
University of Jeddah
https://ishtiaqrasool.github.io/

