Why Neural Networks?

According to the Universal Approximation Theorem, a neural network with a large enough hidden layer can approximate, learn, and represent any function to within a desired error margin. The way a neural network learns the true function is by building complex representations on top of simple ones. On each hidden layer, the network learns a new feature space by first computing an affine (linear) transformation of its inputs and then applying a non-linear function, which in turn becomes the input of the next layer. This process continues until we reach the output layer. Therefore, we can describe a neural network as information flowing from the inputs through the hidden layers towards the output. For a 3-layer neural network, the learned function would be f(x) = f_3(f_2(f_1(x))) where:

  • f_1(x): Function learned on first hidden layer
  • f_2(x): Function learned on second hidden layer
  • f_3(x): Function learned on output layer

Therefore, on each layer we learn a different representation that gets more complex with later hidden layers. Below is an example of a 3-layer neural network (we don’t count the input layer):

Figure 1: Neural Network with two hidden layers

For example, computers can’t understand images directly and don’t know what to do with pixel data. However, a neural network can build a simple representation of the image in the early hidden layers that identifies edges. Given the first hidden layer’s output, it can learn corners and contours. Given the second hidden layer’s output, it can learn parts such as a nose. Finally, it can learn the object’s identity.

Since the truth is never linear and representation is critical to the performance of a machine learning algorithm, neural networks let us build very complex models and leave it to the algorithm to learn such representations, without the feature engineering that takes practitioners a long time and a lot of effort to curate.

The post has two parts:

  1. Coding the neural network: This entails writing all the helper functions that allow us to implement a multi-layer neural network. While doing so, I’ll explain the theoretical parts whenever possible and give some advice on implementation.
  2. Application: We’ll apply the neural network we coded in the first part to an image recognition problem, to see if the network we built can detect whether an image contains a cat, and see it working :)

This post will be the first in a series of posts that cover implementing a neural network in numpy, including gradient checking, parameter initialization, L2 regularization, and dropout. The source code that created this post can be found here.

I. Coding The Neural Network

Forward Propagation

The input X provides the initial information that then propagates to the hidden units at each layer and finally produces the output ŷ. The architecture of the network entails determining its depth, width, and the activation functions used on each layer. Depth is the number of hidden layers. Width is the number of units (nodes) on each hidden layer, since the input and output layer dimensions are determined by the data and the task, not by us. There are quite a few activation functions, such as the Rectified Linear Unit, Sigmoid, and Hyperbolic Tangent. Research has shown that deeper networks tend to outperform shallower networks with more hidden units per layer. Therefore, it usually doesn’t hurt to train a deeper network (with diminishing returns).

Let’s first introduce some notation that will be used throughout the post:

Next, we’ll write down the dimensions of a multi-layer neural network in the general form to help us with the matrix multiplications, because one of the major challenges in implementing a neural network is getting the dimensions right.
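As a quick sketch of those shapes (assuming, as in the application below, that each column of the data matrix is one example, with n^l denoting the number of units in layer l and m the number of examples):

  • W^l: (n^l, n^{l-1})
  • b^l: (n^l, 1), broadcast across the m columns
  • Z^l and A^l: (n^l, m), with A^0 = X of shape (n^0, m)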

The two equations we need to implement forward propagation are:

Z^l = W^l A^{l-1} + b^l
A^l = g^l(Z^l), with A^0 = X

These computations will take place on each layer.

Parameter Initialization

We’ll first initialize the weight matrices and the bias vectors. It’s important to note that we shouldn’t initialize all the parameters to zero, because doing so makes the gradients equal, so on each iteration the output would be the same and the learning algorithm wouldn’t learn anything. Therefore, it’s important to initialize the parameters randomly (for example, to values between 0 and 1) and to multiply the random values by a small scalar such as 0.01, so that the activation units are active and lie in regions where the activation functions’ derivatives are not close to zero.
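Here is a minimal sketch of such an initializer. The helper name `initialize_parameters` and the choice of drawing weights from a standard normal (scaled by 0.01) while zeroing the biases are assumptions, not necessarily the exact code behind the original post:

```python
import numpy as np


def initialize_parameters(layers_dims):
    """Initialize weights with small random values and biases with zeros.

    layers_dims -- list of layer sizes including the input layer,
                   e.g. [12288, 20, 7, 5, 1].
    """
    np.random.seed(1)                      # reproducibility
    parameters = {}
    L = len(layers_dims)                   # number of layers, input included

    for l in range(1, L):
        # Small weights keep the activations in the region where the
        # activation functions' derivatives are not close to zero.
        parameters["W" + str(l)] = np.random.randn(
            layers_dims[l], layers_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters
```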

Activation Functions

There is no definitive guide for which activation function works best on a specific problem. It’s a trial-and-error process where one should try a different set of functions and see which works best on the problem at hand. We’ll cover 4 of the most commonly used activation functions:

  • Sigmoid function (σ): g(z) = 1 / (1 + e^{-z}). It’s recommended only for the output layer, so that we can easily interpret the output as probabilities, since its output is restricted between 0 and 1. One of the main disadvantages of using the sigmoid function on hidden layers is that the gradient is very close to zero over a large portion of its domain, which makes it slower and harder for the learning algorithm to learn.
  • Hyperbolic Tangent function: g(z) = (e^z - e^{-z}) / (e^z + e^{-z}). It’s superior to the sigmoid function in that the mean of its output is very close to zero; in other words, it centers the output of the activation units around zero, which makes learning faster. The disadvantage it shares with the sigmoid function is that the gradient is very small over a good portion of the domain.
  • Rectified Linear Unit (ReLU): g(z) = max{0, z}. Models that are close to linear are easy to optimize. Since ReLU shares a lot of the properties of linear functions, it tends to work well on most problems. The only issue is that the derivative is not defined at z = 0, which we can overcome by assigning the derivative the value 0 at z = 0. However, this means that for z ≤ 0 the gradient is zero and again the network can’t learn.
  • Leaky Rectified Linear Unit: g(z) = max{α*z, z}. It overcomes the zero-gradient issue of ReLU by assigning a small slope α for z ≤ 0.

If you’re not sure which activation function to choose, start with ReLU. Next, we’ll implement the above activation functions and draw a graph for each to make it easier to see the domain and range of each function.
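Below is a minimal sketch of these four activations in numpy. The function names and the convention of returning Z alongside the activation (it will serve as the cache used later in back-propagation) are my choices for this sketch; the plotting code is omitted:

```python
import numpy as np


def sigmoid(Z):
    """Sigmoid: squashes z into (0, 1)."""
    A = 1 / (1 + np.exp(-Z))
    return A, Z

def tanh(Z):
    """Hyperbolic tangent: output centered around zero, in (-1, 1)."""
    A = np.tanh(Z)
    return A, Z

def relu(Z):
    """Rectified linear unit: max(0, z) element-wise."""
    A = np.maximum(0, Z)
    return A, Z

def leaky_relu(Z, alpha=0.1):
    """Leaky ReLU: a small slope alpha for z <= 0 instead of a flat zero."""
    A = np.where(Z > 0, Z, alpha * Z)
    return A, Z
```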

Feed Forward

Given its inputs from the previous layer, each unit computes the affine transformation z = W^Tx + b and then applies an activation function g(z), such as ReLU, element-wise. During the process, we’ll store (cache) all the variables computed and used on each layer, to be used in back-propagation. We’ll first write two helper functions that will be used in the L-model forward propagation, to make it easier to debug. Keep in mind that each layer may have a different activation function.
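A sketch of those helpers might look like the following, building on the activation functions above (the exact names and the tuple layout of the cache are assumptions):

```python
def linear_forward(A_prev, W, b):
    """Affine transformation of a layer; cache the inputs for back-propagation."""
    Z = np.dot(W, A_prev) + b
    cache = (A_prev, W, b)
    return Z, cache


def linear_activation_forward(A_prev, W, b, activation_fn):
    """Affine transformation followed by an element-wise activation."""
    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation_fn == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation_fn == "tanh":
        A, activation_cache = tanh(Z)
    else:                                   # default to ReLU
        A, activation_cache = relu(Z)
    return A, (linear_cache, activation_cache)


def L_model_forward(X, parameters, hidden_fn="relu"):
    """Forward propagation through all layers; sigmoid on the output layer."""
    A = X
    caches = []
    L = len(parameters) // 2                # each layer has a W and a b

    for l in range(1, L):
        A, cache = linear_activation_forward(
            A, parameters["W" + str(l)], parameters["b" + str(l)], hidden_fn)
        caches.append(cache)

    AL, cache = linear_activation_forward(
        A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid")
    caches.append(cache)
    return AL, caches
```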

Cost

We’ll use the binary cross-entropy cost. It uses the log-likelihood method to estimate its error. The cost is:

J(W, b) = -(1/m) * Σ_{i=1}^{m} [y^i * log(ŷ^i) + (1 - y^i) * log(1 - ŷ^i)]

This cost is convex in the predictions ŷ; however, as a function of the network’s parameters it is not, so a neural network usually gets stuck in a local minimum and is not guaranteed to find the optimal parameters. We’ll use gradient-based learning here.
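A minimal implementation of that cost, assuming AL holds the sigmoid outputs ŷ with one example per column:

```python
def compute_cost(AL, y):
    """Binary cross-entropy cost averaged over the m examples (columns)."""
    m = y.shape[1]
    cost = -(1 / m) * np.sum(
        y * np.log(AL) + (1 - y) * np.log(1 - AL))
    return np.squeeze(cost)                 # make sure a scalar is returned
```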

Back-Propagation

Back-propagation allows the information to flow from the cost backward through the network in order to compute the gradient. We therefore loop over the nodes, starting at the final node in reverse topological order, and compute the derivative of the final output with respect to each node. Doing so tells us which parameters are responsible for the most error, so we can change them in that direction. The following derivative formulas will help us write the backward functions:

dA^L = -(y / A^L - (1 - y) / (1 - A^L))
dZ^l = dA^l * g^l'(Z^l)
dW^l = (1/m) * dZ^l * A^{l-1}T
db^l = (1/m) * Σ dZ^l (summed over the examples)
dA^{l-1} = W^lT * dZ^l

Since b^l is a vector, the sum for db^l runs across the columns of dZ^l (since each column is an example).
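A sketch of the corresponding backward helpers, mirroring the forward pass and cache layout above (the function names and the keys of the grads dictionary are my conventions):

```python
def sigmoid_gradient(dA, Z):
    """dZ for a sigmoid layer: dA * s * (1 - s)."""
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

def tanh_gradient(dA, Z):
    """dZ for a tanh layer: dA * (1 - tanh(z)^2)."""
    return dA * (1 - np.tanh(Z) ** 2)

def relu_gradient(dA, Z):
    """dZ for a ReLU layer: pass the gradient through only where z > 0."""
    return dA * (Z > 0)


def linear_backward(dZ, linear_cache):
    """Gradients of W, b, and the previous layer's activations from dZ."""
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]
    dW = (1 / m) * np.dot(dZ, A_prev.T)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)   # sum across columns
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db


def L_model_backward(AL, y, caches, hidden_fn="relu"):
    """Chain the layer-level backward steps from the output back to layer 1."""
    grads = {}
    L = len(caches)
    dAL = -(np.divide(y, AL) - np.divide(1 - y, 1 - AL))   # dJ/dA at the output

    # Output layer uses the sigmoid gradient.
    linear_cache, Z = caches[L - 1]
    dZ = sigmoid_gradient(dAL, Z)
    grads["dA" + str(L - 1)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_backward(dZ, linear_cache)

    # Hidden layers use the chosen hidden activation's gradient.
    for l in range(L - 1, 0, -1):
        linear_cache, Z = caches[l - 1]
        if hidden_fn == "tanh":
            dZ = tanh_gradient(grads["dA" + str(l)], Z)
        else:
            dZ = relu_gradient(grads["dA" + str(l)], Z)
        grads["dA" + str(l - 1)], grads["dW" + str(l)], grads["db" + str(l)] = \
            linear_backward(dZ, linear_cache)

    return grads
```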

II. Application

The dataset that we’ll be working with has 209 images. Each image is 64 x 64 pixels on the RGB scale. We’ll build a neural network to classify whether the image has a cat or not. Therefore, y^i ∈ {0, 1}.

  • We’ll first load the images.
  • Show a sample image of a cat.
  • Reshape the input matrix so that each column is one example. Since each image is 64 x 64 x 3, we end up with 12,288 features per image. Therefore, the input matrix would be 12,288 x 209.
  • Standardize the data so that the gradients don’t go out of control and the hidden units have a similar range of values. For now, we’ll divide every pixel by 255, which shouldn’t be an issue. However, it’s better to standardize the data to have a mean of 0 and a standard deviation of 1. A sketch of this preprocessing follows the sample image below.
Figure 3: Sample image
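Here is a sketch of the reshape-and-scale step described above. The loading step is omitted because it depends on how the 209 images are stored; the random array below just stands in for them:

```python
import numpy as np


def preprocess_images(images):
    """Flatten (num_images, 64, 64, 3) pixel arrays into columns scaled to [0, 1]."""
    flat = images.reshape(images.shape[0], -1).T    # shape: (12288, num_images)
    return flat / 255.0                             # simple pixel scaling


# Random data standing in for the 209 RGB training images.
fake_images = np.random.randint(0, 256, size=(209, 64, 64, 3))
X = preprocess_images(fake_images)
print(X.shape)                                      # (12288, 209)
```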

Now our dataset is ready to be used to test our neural network implementation. Let’s first write a multi-layer model function that implements gradient-based learning using a predefined number of iterations and learning rate.
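A sketch of that model function, wired together from the helpers above with plain batch gradient descent (the name L_layer_model and the default hyperparameter values are illustrative):

```python
def L_layer_model(X, y, layers_dims, learning_rate=0.01,
                  num_iterations=3000, hidden_fn="relu"):
    """Train the network with batch gradient descent for a fixed number of iterations."""
    parameters = initialize_parameters(layers_dims)

    for i in range(num_iterations):
        AL, caches = L_model_forward(X, parameters, hidden_fn)
        cost = compute_cost(AL, y)
        grads = L_model_backward(AL, y, caches, hidden_fn)

        # Gradient descent step: parameter := parameter - alpha * gradient.
        for l in range(1, len(layers_dims)):
            parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
            parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]

        if i % 100 == 0:
            print(f"Cost after iteration {i}: {cost:.4f}")

    return parameters
```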

Next, we’ll train two versions of the neural network, each using a different activation function on the hidden layers: one will use the rectified linear unit (ReLU) and the other the hyperbolic tangent function (tanh). Finally, we’ll use the parameters obtained from both networks to classify the training examples and compute the training accuracy rate for each version, to see which activation function works best on this problem.
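For the classification step, a small prediction helper along these lines can be used (the function name is illustrative; the 0.5 threshold on the sigmoid output is the usual choice):

```python
def predict(X, y, parameters, hidden_fn="relu"):
    """Label each column as cat (1) if the sigmoid output exceeds 0.5, else 0."""
    probs, _ = L_model_forward(X, parameters, hidden_fn)
    labels = (probs >= 0.5).astype(int)
    accuracy = np.mean(labels == y) * 100
    print(f"The training accuracy rate is: {accuracy:.2f}%")
    return labels
```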

Figure 4: Loss curve with tanh activation function
Figure 5: Loss curve with ReLU activation function

Please note that the accuracy rates above are expected to overestimate the generalization accuracy rates.

Conclusion

The purpose of this post was to code a deep neural network step by step and explain the important concepts along the way. We don’t really care about the accuracy rate at this point, since there are tons of things we could do to increase the accuracy, which will be the subject of following posts. Below are some takeaways:

  • Even if a neural network can represent any function, it may fail to learn it for two reasons:
  1. The optimization algorithm may fail to find the best values for the parameters of the desired (true) function. It can get stuck in a local optimum.
  2. The learning algorithm may find a functional form that differs from the intended function, due to overfitting.
  • Even though a neural network rarely converges and often gets stuck in a local minimum, it is still able to reduce the cost significantly and come up with very complex models with high test accuracy.
  • The neural network we used in this post is a standard fully connected network. However, there are two other kinds of networks:
  1. Convolutional NN: Not all nodes are connected to each other. It’s best in class for image recognition.
  2. Recurrent NN: There are feedback connections where the output of the model is fed back into itself. It’s used mainly in sequence modeling.
  • A fully connected neural network, unlike a recurrent one, has no memory of previous steps and receives no feedback from its own output.
  • There are a number of hyperparameters that we can tune using cross-validation to get the best performance from our network:
  1. Learning rate (α): Determines how big a step each parameter update takes.

A. Small α leads to slow convergence and may become computationally very expensive.

B. Large α may lead to overshooting, where our learning algorithm never converges.

2. Number of hidden layers (depth): The more hidden layers the better, but this comes at a computational cost.

3. Number of units per hidden layer (width): Research has shown that a huge number of hidden units per layer doesn’t add much improvement to the network.

4. Activation function: Which function to use on hidden layers differs among applications and domains. It’s a trial-and-error process: try different functions and see which one works best.

5. Number of iterations.

  • Standardizing the data helps the activation units have a similar range of values and keeps the gradients from going out of control.

Originally published at imaddabbura.github.io on April 1, 2018.
