Why Neural Networks?

According to the Universal Approximation Theorem, a neural network can approximate, learn, and represent any function to within a desired error margin, given a large enough hidden layer. The way a neural network learns the true function is by building complex representations on top of simple ones. On each hidden layer, the network learns a new feature space by first computing an affine (linear) transformation of its inputs and then applying a non-linear function, whose output in turn becomes the input of the next layer. This process continues until we reach the output layer. We can therefore think of a neural network as information flowing from the inputs, through the hidden layers, towards the output. For a 3-layer neural network, the learned function would be f(x) = f_3(f_2(f_1(x))), where f_1 and f_2 are the functions learned by the two hidden layers and f_3 is the function learned by the output layer. In other words, on each layer we learn a different representation that gets more complex with later hidden layers. Below is an example of a 3-layer neural network (we don't count the input layer).

For example, computers can't understand images directly and don't know what to do with raw pixel data. However, a neural network can build a simple representation of the image in the early hidden layers that identifies edges. Given the first hidden layer's output, it can learn corners and contours; given the second hidden layer's output, it can learn parts such as a nose; finally, it can learn the object's identity. Since the truth is rarely linear and representation is critical to the performance of a machine learning algorithm, neural networks let us build very complex models and leave it to the algorithm to learn such representations, without the feature engineering that takes practitioners a long time and great effort to curate.

This post will be the first in a series of posts that cover implementing a neural network in numpy, including gradient checking, parameter initialization, L2 regularization, and dropout. The source code that created this post can be found here. The post has two parts: (I) coding the neural network and (II) applying it to an image classification problem.
I. Coding The Neural Network
Forward Propagation

The input X provides the initial information that then propagates to the hidden units at each layer and finally produces the output ŷ. The architecture of the network entails determining its depth, its width, and the activation functions used on each layer. Depth is the number of hidden layers. Width is the number of units (nodes) on each hidden layer, since we control neither the input layer nor the output layer dimensions. There are quite a few activation functions, such as the Rectified Linear Unit (ReLU), sigmoid, and hyperbolic tangent. Research has shown that deeper networks tend to outperform shallower networks that have more hidden units per layer. Therefore, it's usually better to train a deeper network (with diminishing returns).

Let's first introduce some notation that will be used throughout the post: L is the number of layers (we don't count the input layer), n^l is the number of units in layer l, m is the number of training examples, W^l and b^l are the weight matrix and bias vector of layer l, Z^l is the affine output of layer l, and A^l is its activation, with A^0 = X and A^L = ŷ.

Next, we'll write down the dimensions of a multi-layer neural network in the general form to help us with the matrix multiplications, because one of the major challenges in implementing a neural network is getting the dimensions right: W^l has shape (n^l, n^(l-1)), b^l has shape (n^l, 1), and Z^l and A^l have shape (n^l, m).

The two equations we need to implement forward propagation are:

Z^l = W^l A^(l-1) + b^l
A^l = g^l(Z^l)

These computations will take place on each layer.
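To make the shape bookkeeping concrete, here is a small, purely illustrative sanity check. The hidden-layer sizes (5 and 3) are assumptions for demonstration only; the input size 64*64*3 and m = 209 match the dataset used later in the post.

```python
# Quick check of the expected shapes for an L-layer network.
layers_dims = [64 * 64 * 3, 5, 3, 1]   # input layer, two hidden layers, output layer
m = 209                                # number of training examples
for l in range(1, len(layers_dims)):
    print(f"W{l}: {(layers_dims[l], layers_dims[l - 1])}, "
          f"b{l}: {(layers_dims[l], 1)}, "
          f"Z{l}/A{l}: {(layers_dims[l], m)}")
```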
Parameters Initialization

We'll first initialize the weight matrices and the bias vectors. It's important to note that we shouldn't initialize all the parameters to zero, because doing so makes the gradients equal, so on each iteration the output would be the same and the learning algorithm wouldn't learn anything. Therefore, it's important to randomly initialize the parameters to small values. It's also recommended to multiply the random values by a small scalar such as 0.01 to keep the activation units in the regions where the activation functions' derivatives are not close to zero.
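Here is a minimal sketch of what this initialization step might look like (the function and argument names are illustrative, and the weights are drawn here from a standard normal before scaling):

```python
import numpy as np

def initialize_parameters(layers_dims, seed=1):
    """Randomly initialize the weight matrices (scaled by 0.01) and zero the biases.

    layers_dims is a list of layer sizes, e.g. [n_x, n_h1, ..., n_y],
    including the input layer.
    """
    np.random.seed(seed)
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, excluding the input layer
    for l in range(1, L + 1):
        # Small random weights keep the units in the regions where the
        # activation functions' derivatives are not close to zero.
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```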
Activation Functions

There is no definitive guide for which activation function works best on a specific problem. It's a trial-and-error process where one should try a set of different functions and see which works best on the problem at hand. We'll cover 4 of the most commonly used activation functions. If you're not sure which activation function to choose, start with ReLU. Next, we'll implement the above activation functions and draw a graph for each one to make it easier to see the domain and range of each function.
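A minimal sketch of these activation functions follows. Which four the post covers is an assumption here: sigmoid, hyperbolic tangent, ReLU, and leaky ReLU.

```python
import numpy as np

def sigmoid(Z):
    # Squashes inputs into the range (0, 1).
    return 1 / (1 + np.exp(-Z))

def tanh(Z):
    # Squashes inputs into the range (-1, 1).
    return np.tanh(Z)

def relu(Z):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0, Z)

def leaky_relu(Z, slope=0.01):
    # Like ReLU, but with a small slope for negative inputs.
    return np.where(Z >= 0, Z, slope * Z)
```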
Feed Forward

Given its inputs from the previous layer, each unit computes the affine transformation z = W^Tx + b and then applies an activation function g(z), such as ReLU, element-wise. During the process, we'll store (cache) all the variables computed and used on each layer, to be used in back-propagation. We'll first write two helper functions that will be used in the L-model forward propagation, to make it easier to debug. Keep in mind that each layer may have a different activation function.
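A minimal sketch of those helpers, assuming the parameter dictionary produced by the initialization sketch above (names are illustrative):

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation_fn):
    """One layer's forward step: affine transformation, then an element-wise
    activation. The cache holds the values the backward pass will need later."""
    Z = np.dot(W, A_prev) + b      # affine (linear) transformation
    A = activation_fn(Z)           # element-wise non-linearity
    cache = (A_prev, W, b, Z)
    return A, cache

def L_model_forward(X, parameters, hidden_fn, output_fn):
    """Propagate X through all L layers, caching intermediate values.
    hidden_fn is applied on the hidden layers and output_fn on the output
    layer, since each layer may use a different activation function."""
    caches = []
    A = X
    L = len(parameters) // 2       # each layer contributes a W and a b
    for l in range(1, L):
        A, cache = linear_activation_forward(
            A, parameters["W" + str(l)], parameters["b" + str(l)], hidden_fn)
        caches.append(cache)
    AL, cache = linear_activation_forward(
        A, parameters["W" + str(L)], parameters["b" + str(L)], output_fn)
    caches.append(cache)
    return AL, caches
```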
Cost

We'll use the binary cross-entropy cost. It uses the log-likelihood method to estimate the error. The cost is:

J = -(1/m) Σ_{i=1}^{m} [ y^i log(ŷ^i) + (1 - y^i) log(1 - ŷ^i) ]

The above cost function is convex in ŷ; however, as a function of the network's parameters it is not, so the neural network usually gets stuck in a local minimum and is not guaranteed to find the optimal parameters. We'll use gradient-based learning here.
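A minimal sketch of the cost computation, where AL is the output of the forward pass (one example per column). The small epsilon guarding against log(0) is an implementation detail added here, not something prescribed by the post.

```python
import numpy as np

def compute_cost(AL, Y):
    """Binary cross-entropy cost averaged over the m examples (one per column)."""
    m = Y.shape[1]
    eps = 1e-8  # avoids log(0); an assumption of this sketch
    cost = -np.sum(Y * np.log(AL + eps) + (1 - Y) * np.log(1 - AL + eps)) / m
    return float(np.squeeze(cost))
```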
Back-Propagation

Back-propagation allows the information from the cost to flow backward through the network in order to compute the gradient. We therefore loop over the nodes, starting at the final node, in reverse topological order, computing the derivative of the final node's output with respect to each edge's tail node. Doing so helps us know which parameters are responsible for most of the error, so we can change them in that direction. The following derivative formulas will help us write the back-propagation functions:

dZ^l = dA^l * g^l'(Z^l)
dW^l = (1/m) dZ^l (A^(l-1))^T
db^l = (1/m) Σ dZ^l
dA^(l-1) = (W^l)^T dZ^l

Since b^l is always a vector, the sum for db^l is taken over the examples (one example per column of dZ^l).
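A minimal sketch of the backward pass that implements the formulas above, reusing the caches from the forward sketch. The helper names and the derivative of the cross-entropy cost with respect to AL are spelled out here for completeness; the epsilon is again an added safeguard.

```python
import numpy as np

def relu_gradient(Z):
    return (Z > 0).astype(float)

def tanh_gradient(Z):
    return 1 - np.tanh(Z) ** 2

def sigmoid_gradient(Z):
    s = 1 / (1 + np.exp(-Z))
    return s * (1 - s)

def linear_activation_backward(dA, cache, activation_grad_fn):
    """One layer's backward step, implementing the derivative formulas above."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = dA * activation_grad_fn(Z)             # dZ^l = dA^l * g^l'(Z^l)
    dW = np.dot(dZ, A_prev.T) / m               # dW^l = (1/m) dZ^l (A^(l-1))^T
    db = np.sum(dZ, axis=1, keepdims=True) / m  # db^l: sum over the examples (columns)
    dA_prev = np.dot(W.T, dZ)                   # dA^(l-1) = (W^l)^T dZ^l
    return dA_prev, dW, db

def L_model_backward(AL, Y, caches, hidden_grad_fn=relu_gradient):
    """Walk the layers in reverse order, accumulating gradients layer by layer."""
    grads = {}
    L = len(caches)
    eps = 1e-8
    # Derivative of the binary cross-entropy cost with respect to the output AL.
    dAL = -(np.divide(Y, AL + eps) - np.divide(1 - Y, 1 - AL + eps))
    # Output layer (sigmoid), then the hidden layers in reverse order.
    grads["dA" + str(L - 1)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, caches[L - 1], sigmoid_gradient)
    for l in range(L - 1, 0, -1):
        grads["dA" + str(l - 1)], grads["dW" + str(l)], grads["db" + str(l)] = \
            linear_activation_backward(grads["dA" + str(l)], caches[l - 1], hidden_grad_fn)
    return grads
```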
II. Application

The dataset that we'll be working on has 209 images. Each image is 64 x 64 pixels in RGB scale. We'll build a neural network to classify whether the image has a cat or not; therefore, y^i ∈ {0, 1}. Now our dataset is ready to be used to test our neural network implementation. Let's first write a multi-layer model function that implements gradient-based learning using a predefined number of iterations and learning rate. Next, we'll train two versions of the neural network, each using a different activation function on the hidden layers: one will use the rectified linear unit (ReLU) and the other the hyperbolic tangent function (tanh). Finally, we'll use the parameters we get from both neural networks to classify the training examples and compute the training accuracy rate for each version, to see which activation function works best on this problem. Please note that these accuracy rates are expected to overestimate the generalization accuracy rates.
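A minimal sketch of that multi-layer model function, tying the earlier sketches together. The default learning rate and iteration count are placeholders, not values from the post, and the function names assume the snippets above.

```python
import numpy as np

def L_layer_model(X, Y, layers_dims, hidden_fn, hidden_grad_fn,
                  learning_rate=0.01, num_iterations=2500):
    """Gradient-descent training loop built from the sketches above
    (initialize_parameters, L_model_forward, compute_cost, L_model_backward,
    sigmoid on the output layer)."""
    parameters = initialize_parameters(layers_dims)
    for i in range(num_iterations):
        AL, caches = L_model_forward(X, parameters, hidden_fn, sigmoid)
        cost = compute_cost(AL, Y)
        grads = L_model_backward(AL, Y, caches, hidden_grad_fn)
        # Gradient-descent update for every layer's weights and biases.
        L = len(parameters) // 2
        for l in range(1, L + 1):
            parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
            parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
        if i % 100 == 0:
            print(f"Cost after iteration {i}: {cost:.4f}")
    return parameters

def accuracy(X, Y, parameters, hidden_fn):
    """Classification accuracy (%) of the trained network on (X, Y)."""
    AL, _ = L_model_forward(X, parameters, hidden_fn, sigmoid)
    predictions = (AL > 0.5).astype(float)
    return float(np.mean(predictions == Y) * 100)
```

To train the two versions described above, one would call this once with hidden_fn=relu, hidden_grad_fn=relu_gradient and once with hidden_fn=tanh, hidden_grad_fn=tanh_gradient, then compare the two training accuracies.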
Conclusion

The purpose of this post is to code a deep neural network step by step and explain the important concepts while doing so. We don't really care about the accuracy rate at this moment, since there are tons of things we could have done to increase the accuracy; that will be the subject of following posts. Below are some takeaways:
1. Learning rate (α):
A. A small α leads to slow convergence and may become computationally very expensive.
B. A large α may lead to overshooting, where the learning algorithm may never converge.
2. Number of hidden layers (depth): More hidden layers generally help, but they come at a computational cost.
3. Number of units per hidden layer (width): Research has shown that a huge number of hidden units per layer doesn't add much to the improvement of the network.
4. Activation function: Which function to use on the hidden layers differs among applications and domains. It's a trial-and-error process of trying different functions and seeing which one works best.
5. Number of iterations.
- Standardizing the data helps the activation units have a similar range of values and keeps the gradients from going out of control.
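As an illustration of that last takeaway for image data like the cat dataset above, a simple scaling to [0, 1] might look like this (the `images` array is a random placeholder, not the real dataset):

```python
import numpy as np

# Flatten each 64x64x3 image into a column vector and scale pixel values to [0, 1].
images = np.random.randint(0, 256, size=(209, 64, 64, 3))
X = images.reshape(images.shape[0], -1).T / 255.0   # shape (12288, 209)
```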
Originally published at imaddabbura.github.io on April 1, 2018.