IMHO there are a ton of resources on deep learning, but none of them, and not even practitioners, are able to articulate what deep learning is to a layperson. Hence this post!
We are living in an awesome time. One of the reasons is that it is not normal for a field to lie dormant for about 30 years and then suddenly explode. As James Somers puts it, “AI today is deep learning, and deep learning is backprop”. It is more than obligatory for everyone to understand what deep learning is, and here is my humble attempt to bring it to you. Please don’t be put off by the length of this post. I have made sure it is accessible to all kinds of audiences, so hang on with me and I promise everything will make sense.
What you will get out of this post
- You will develop a strong intuition for neural networks and how they work internally.
- You will know where the buzzword deep learning hails from.
- You will have good follow-up reads and references to deepen your knowledge of this subject matter.
- As a follow-up to this post, you will have a working example of deep learning code using a popular deep learning library called Keras.
What you need to follow along
- You DON’T have to be a rockstar at machine learning, but you should have a basic intuition for general-purpose machine learning techniques like classification to make the best use of this post.
- Motivation, because applied deep learning is relatively shallow and you can pick it up if you have the required motivation.
- High school math.
If it gives you any comfort, here is a little secret: “Deep Learning is at best an experimental science. Do not let all the mathematics fool you into believing that the theorists have a handle on what is going on.” No, I am not the one saying this; it comes straight from the field (5).
Excited? Let’s get started.
Imagine a decision maker who can make decisions based on stimuli and evidence. Say, she comes with some simple guarantees. Let’s look at a very simplified version of a decision maker and build on it as we proceed.
- Is purpose-built, i.e. is “designed” to make only one kind of decision.
- Can take inputs or stimuli for a given situation and help make a decision.
- Can tune her decision-making logic based on the significance of the inputs/stimuli.
- Can sway the decision based on an arbitrary threshold.
- Inputs can be multiple and are binary.
- There is ONLY one output, and it is binary.
- The significance of each input is a real number.
- The threshold for the output is a real number.
We want to make a decision about “going on a vacation”:
- Input 1: Cost of air tickets: low/high (1/0)
- Input 2: Can I get leave: yes/no (1/0)
- Input 3: How is the weather: nice/not nice (1/0)
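The decision maker multiplies each input by its significance, adds the results up, and says yes (1) when that sum exceeds the threshold. In this rule, n is the total number of inputs, i is the index of the current input, X represents the inputs and W represents their significance.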
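The vacation decision can be sketched in a few lines of Python. This is only an illustration, assuming the usual perceptron rule (output 1 when the weighted sum of the inputs exceeds the threshold). The function name `perceptron`, the weights and the threshold are all made-up values, not numbers from this post.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Hypothetical significances for: cheap tickets, leave approved, nice weather.
weights = [4.0, 6.0, 2.0]
# Hypothetical threshold: how hard it is to convince the decision maker.
threshold = 7.0

# Cheap tickets and leave approved, bad weather: 4 + 6 = 10, which exceeds 7.
print(perceptron([1, 1, 0], weights, threshold))
# Cheap tickets and nice weather, but no leave: 4 + 2 = 6, which does not exceed 7.
print(perceptron([1, 0, 1], weights, threshold))
```

Raising a weight makes that input matter more, and raising the threshold makes the decision maker harder to convince.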
What if we have to build a decision-making engine to answer questions like this? Frank Rosenblatt proposed the idea of developing artificial neurons inspired by human nerve cells (though they have nothing in common beyond the name; human neurons are awesome). The stick figure in the above diagram is the rough equivalent of an artificial neuron. An artificial neuron, or simply neuron hereinafter, is the smallest tactical unit of a much bigger decision-making network called a neural network.
While there are different styles of neurons, the one we will start with is called the perceptron, which is essentially the equivalent of a binary classifier in the supervised machine learning world. (The latest NN models use more modern neuron models like sigmoid neurons or rectifier (ReLU) neurons. We will get to that shortly.)
Now imagine the diagram with a perceptron
What if we chain the output of one perceptron into another to make a more sophisticated or complex decision? It will be something like below: a network of perceptrons, or a network of neurons if you will. Hence the name neural networks. Each perceptron giving an output of 1 or 0 corresponds, in a biological sense, to a neuron firing or not. (The symbol shown inside the perceptron is the shape of a step function, which is why it looks like a step or staircase. The step function is essentially the logic that the neuron executes to get the output.) Each vertical silo of neurons you see in a neural network is called a layer, and layers are the largest tactical units of the neural network. More on layers to follow.
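To make the chaining concrete, here is a small sketch of three perceptrons wired into two layers. The decision it makes ("exactly one of the two inputs is on") and all the weights and thresholds are hypothetical values picked for illustration.

```python
def perceptron(inputs, weights, threshold):
    """Step-function neuron: fires (1) when the weighted sum exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

def exactly_one(a, b):
    """Tiny two-layer network that fires only when exactly one input is on."""
    # First layer: two perceptrons look at the same pair of inputs.
    either = perceptron([a, b], [1.0, 1.0], 0.5)        # fires if a or b is on
    not_both = perceptron([a, b], [-1.0, -1.0], -1.5)   # fires unless both are on
    # Second layer: fires only if both first-layer perceptrons fired.
    return perceptron([either, not_both], [1.0, 1.0], 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, exactly_one(a, b))
```

The outputs of the first layer become the inputs of the second, which is all that "a network of perceptrons" means here.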
Now, this is all good, but we have to do a few things before moving forward.
- Refine our equation.
- Take stock of all the neural network terminology.
- Revisit the idea of binary inputs and binary outputs, as they are kind of restrictive. Consider the case of “I want an output which gives me a confidence score or probability estimate, say 0.8 or 80% sure that this image is an apple”, as opposed to “is this image an apple or not (1 or 0)?”
- The logic each neuron executes to get the output is called the “activation function”. In fact, the terms perceptron, sigmoid, and rectifier each signify a style of activation function. It’s called an activation function because it governs the threshold at which the neuron is activated, “fired” or excited.
- The threshold is technically expressed as a bias (hence B), which basically dictates how easily a neuron will fire. By the usual convention, the bias is the negative of the threshold.
- The significance is technically called a weight.
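The relationship between threshold and bias can be checked with a short sketch. It assumes the common convention that the bias equals the negative of the threshold, so "weighted sum > threshold" becomes "weighted sum + bias > 0". The function names and numbers are hypothetical.

```python
from itertools import product

def fires_with_threshold(inputs, weights, threshold):
    """Original form: compare the weighted sum against a threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

def fires_with_bias(inputs, weights, bias):
    """Refined form: add a bias to the weighted sum and compare against zero."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

weights = [4.0, 6.0, 2.0]   # hypothetical weights
threshold = 7.0             # hypothetical threshold
bias = -threshold           # assumed convention: bias = -threshold

# Both forms agree on every combination of three binary inputs.
for inputs in product((0, 1), repeat=3):
    assert fires_with_threshold(inputs, weights, threshold) == fires_with_bias(inputs, weights, bias)
print("threshold form and bias form agree")
```

A large positive bias makes the neuron easy to fire, and a large negative bias makes it hard to fire.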
We will shortly see some more terminology, hang in there. Let’s make the above neural network more complicated; below comes the dreaded neural network model. As you can see, there can be many layers of neurons. While any network has an input layer and an output layer, networks can vary in the number of hidden layers and in the number of neurons per layer.
- This is technically called a multilayer perceptron (strangely, even if you use a sigmoid activation function).
- A network in which neurons process inputs and forward the result to the next layer is called a “feedforward” network. While feeding forward is the predominant case, in some networks the output of neurons is fed back into the network, i.e. to themselves, through loops. These are called RNNs, or recurrent neural networks, and they are a widely used model for input data which are sequences rather than fixed vectors.
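A feedforward pass can be sketched as a loop over layers, where each layer's outputs become the next layer's inputs. This is a minimal illustration using sigmoid neurons. The helper names (`layer_forward`, `feedforward`) and the layer sizes and weights are hypothetical.

```python
import math

def sigmoid(z):
    """Smooth activation that squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of all inputs plus its own bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def feedforward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 3 inputs -> hidden layer of 2 neurons -> 1 output neuron.
hidden = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])
output = ([[1.0, -1.0]], [0.2])

print(feedforward([1.0, 0.0, 1.0], [hidden, output]))
```

Adding more hidden layers only means adding more entries to the `layers` list.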
Where do neural networks meet deep learning?
Now we have neural networks which can be stitched together from layers and layers of artificial neurons. But more often than not in real-world cases, we need to experiment with weights and biases to tune the NN and improve accuracy.
Two things we need now are:
- A mechanism for tuning values without manual intervention (as manual tuning can be a huge effort depending on the problem we are trying to solve).
- The ability to fine-tune. Think about it: with binary inputs there isn’t much we can do to fine-tune. In essence, instead of 0 or 1, we need the ability to send inputs as values between 0 and 1.
Fortunately, we can build learning models using algorithms which automatically tune/adjust weights and biases and run multiple iterations without programmer intervention until the model gives the desired accuracy. These kinds of complex artificial neural networks have been around for a long time, but there were limitations on how easily we could build and tune such large networks. Due to those limitations, they largely remained shallow neural networks. In fact, the kinds of problems that perceptrons could solve were very limited.
Today it’s the mixture of new, faster and cheaper hardware (GPUs), highly optimised open source libraries and powerful new activation functions/techniques that has made creating much larger and deeper neural networks possible, which led to the fascinating new field called deep learning. Deep learning has quickly emerged as a field that works well on a broad range of problems and has become very accessible to anyone with the determination to master it.
Now switching gears: when we started with perceptrons, I mentioned that in the real world the most commonly used activation functions are sigmoid and ReLU. But why? A perceptron exposes a step function, so fine-tuning values in fractions isn’t possible: the output can only jump between 0 and 1. A sigmoid function is a smoother curve, so finer tuning is possible.
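The difference is easy to see by nudging the weighted sum slightly around zero. This is a small sketch, assuming the standard definitions of the step and sigmoid functions; the sample values are arbitrary.

```python
import math

def step(z):
    """Perceptron activation: jumps from 0 to 1 as soon as z is positive."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    """Sigmoid activation: moves smoothly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Nudge the weighted sum from just below zero to just above zero.
for z in (-0.2, -0.01, 0.01, 0.2):
    print(f"z={z:+.2f}  step={step(z):.0f}  sigmoid={sigmoid(z):.4f}")
```

With the step function a tiny nudge either does nothing or flips the output completely, while with the sigmoid a tiny nudge produces a tiny change in the output. That proportional response is what lets a learning algorithm adjust weights a little at a time.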
So what happens during learning?
The learning algorithm exposes one row of data at a time to the network, and the network processes the input forward, activating neurons. The actual output of the network is compared to the expected output. If there is an intolerable error, the delta is propagated back through the network one layer at a time, and the weights are updated to narrow the error. This idea of converging to a minimal error is an optimisation problem. Gradient descent (stochastic gradient descent, to be more precise) is the commonly used optimisation algorithm, paired with a technique called backpropagation. Let’s consider these as a black box and move forward.
Now, let’s break it down. The idea remains simple: we have to identify the weights that contribute to the error and use that as an input to choose better weights. This involves two steps:
- Propagation
- Weight update
Propagation: When an input is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer. The output of the network is then compared to the desired output, using a cost function. (A cost function is simply a mathematical way to quantify the error.) The resulting error value is calculated for each of the neurons in the output layer. The error values are then propagated from the output back through the network until each neuron has an associated error value that reflects its contribution to the original output. Backpropagation uses these error values to calculate the gradient of the cost function. (The word gradient means a slope, and the intuition behind gradient descent is to descend the slope to reach the lowermost point, a.k.a. minimize the cost or reduce the error in this context.)
Weight update: In the second phase, this gradient is fed to the optimization method, which in turn uses it to update the weights, in an attempt to minimize the cost function.
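The "descend the slope" intuition can be sketched with a single weight. This is a toy example, not a neural network: the cost C(w) = (w - 3)^2, its gradient 2(w - 3), the starting point and the step size are all hypothetical choices.

```python
def gradient_descent(gradient, start, step_size, steps):
    """Repeatedly move the weight a small step against the gradient."""
    w = start
    for _ in range(steps):
        w -= step_size * gradient(w)
    return w

# Hypothetical cost C(w) = (w - 3)^2, which is smallest at w = 3.
def cost_gradient(w):
    return 2.0 * (w - 3.0)

w = gradient_descent(cost_gradient, start=0.0, step_size=0.1, steps=100)
print(w)  # ends up very close to 3, the bottom of the slope
```

In a real network the same update is applied to every weight and bias at once, with backpropagation supplying the gradient for each of them.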
Backpropagation is repeated until you have trained on all the data. One full round of updating your network over the entire data set is called an epoch. Updating the weights from the errors usually doesn’t happen one input at a time; the inputs are usually batched, and this is called batch learning. The number of inputs shown before the weights in the network are updated is called the batch size. How far each update moves the weights is controlled by a parameter called the learning rate, which is usually small.
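Here is how epochs, batches and the learning rate fit together in a training loop. This is a minimal sketch for a single sigmoid neuron with a quadratic cost, not a full network. In it, `batch_size` is the number of inputs shown before each weight update and `learning_rate` scales how far each update moves the weights. The data set (a logical OR) and all parameter values are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(data, epochs, batch_size, learning_rate, seed=0):
    """Mini-batch stochastic gradient descent for one sigmoid neuron."""
    rng = random.Random(seed)
    data = list(data)
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):                      # one epoch = one full pass over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = [0.0] * len(weights)
            grad_b = 0.0
            for x, y in batch:
                a = predict(x, weights, bias)
                # Slope of the cost 0.5 * (a - y)^2 with respect to the weighted sum.
                delta = (a - y) * a * (1.0 - a)
                for i, xi in enumerate(x):
                    grad_w[i] += delta * xi
                grad_b += delta
            # One update per batch, scaled by the learning rate.
            weights = [w - learning_rate * g / len(batch) for w, g in zip(weights, grad_w)]
            bias -= learning_rate * grad_b / len(batch)
    return weights, bias

# Hypothetical data set: the neuron should learn a logical OR of two inputs.
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias = train(or_data, epochs=3000, batch_size=2, learning_rate=1.0)

for x, y in or_data:
    print(x, y, round(predict(x, weights, bias), 3))
```

A smaller learning rate makes each step more cautious but needs more epochs, while a larger batch size gives a steadier gradient at the cost of fewer updates per epoch.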
Let’s look at this in detail.
- We need a cost function, and the objective is to minimize the cost (here cost refers to the overall error).
- We need the backpropagation technique to calculate the gradient of the cost function.
- We need an optimisation algorithm like SGD that takes the gradient as input and comes up with weights that minimize the cost function.
We need an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we’re achieving this goal we define a quadratic cost function.
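Written out, and assuming the standard form of the quadratic cost that matches the terms described next (the factor of 1/2 is a common convention and an assumption here), the cost function looks like this:

```latex
C(w, b) = \frac{1}{2n} \sum_{x} \left\| y(x) - a \right\|^{2}
```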
Here, w denotes the collection of all weights in the network, b denotes all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x. The notation ‖v‖ just denotes the usual length function for a vector v. We’ll call C the quadratic cost function; it’s also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that C(w,b) is non-negative since every term in the sum is non-negative. Furthermore, the cost C(w,b) becomes small, i.e., C(w,b)≈0, precisely when y(x) is approximately equal to the output, a, for all training inputs, x. So our training algorithm has done a good job if it can find weights and biases so that C(w,b)≈0. So the aim of our training algorithm will be to minimize the cost C(w,b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. An algorithm known as stochastic gradient descent will be used to minimize the cost.
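The quadratic cost is simple to compute by hand. The sketch below assumes the usual form, the sum of squared lengths ‖y(x) - a‖² over all training inputs divided by 2n. The function name `quadratic_cost` and the sample outputs are hypothetical.

```python
def quadratic_cost(outputs, targets):
    """Quadratic cost: sum of ||y(x) - a||^2 over all inputs, divided by 2n."""
    n = len(targets)
    total = 0.0
    for a, y in zip(outputs, targets):
        # Squared length of the difference between desired and actual output vectors.
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

targets = [[1.0], [0.0]]          # desired outputs y(x) for two training inputs

perfect = [[1.0], [0.0]]          # network outputs that match exactly
unsure = [[0.5], [0.5]]           # network outputs that sit on the fence

print(quadratic_cost(perfect, targets))  # 0.0: nothing left to fix
print(quadratic_cost(unsure, targets))   # 0.125: (0.25 + 0.25) / (2 * 2)
```

The cost is never negative, and it shrinks towards zero as the outputs approach the desired values, which is exactly what training tries to achieve.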
Take this quick tutorial on partial derivatives before moving forward (2). To compute the gradient of the cost function we need a special algorithm called the backpropagation algorithm. At the heart of backpropagation is an expression for the partial derivative ∂C/∂w of the cost function C with respect to any weight w (or bias b) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b for every weight and bias in the network. (This is explained much more intuitively in 1.)
‘Backprop is a procedure for rejiggering the strength of every connection in the network so as to fix the error for a given training example. The way it works is that you start with the last two neurons, and figure out just how wrong they were: how much of a difference is there between what the excitement numbers should have been and what they actually were? When that’s done, you take a look at each of the connections leading into those neurons, the ones in the next lower layer, and figure out their contribution to the error. You keep doing this until you’ve gone all the way to the first set of connections, at the very bottom of the network. At that point, you know how much each individual connection contributed to the overall error, and in a final step, you change each of the weights in the direction that best reduces the error overall. The technique is called “backpropagation” because you are “propagating” errors back (or down) through the network, starting from the output.’
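To see the partial derivatives in action, here is a minimal sketch of backpropagation for a tiny network with two inputs, two hidden sigmoid neurons and one sigmoid output, using the quadratic cost for a single training example. All names, weights and the training example are hypothetical. The final lines compare one backpropagated derivative against a brute-force estimate obtained by nudging that weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: returns the hidden activations and the output activation."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    a = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, a

def cost(x, y, W1, b1, W2, b2):
    _, a = forward(x, W1, b1, W2, b2)
    return 0.5 * (a - y) ** 2

def backprop(x, y, W1, b1, W2, b2):
    """Return dC/dW1, dC/db1, dC/dW2, dC/db2 for one training example."""
    h, a = forward(x, W1, b1, W2, b2)
    # Error at the output neuron: how wrong it was, scaled by the sigmoid slope.
    delta_out = (a - y) * a * (1.0 - a)
    grad_W2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # Push the error back one layer through the connecting weights.
    delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2

# Hypothetical training example and starting parameters.
x, y = [1.0, 0.0], 1.0
W1, b1 = [[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]
W2, b2 = [0.7, -0.4], 0.05

grad_W1, grad_b1, grad_W2, grad_b2 = backprop(x, y, W1, b1, W2, b2)

# Brute-force check: nudge one weight up and down and measure the change in cost.
eps = 1e-6
numeric_W2 = (cost(x, y, W1, b1, [W2[0] + eps, W2[1]], b2)
              - cost(x, y, W1, b1, [W2[0] - eps, W2[1]], b2)) / (2 * eps)
print(grad_W2[0], numeric_W2)
```

The two printed numbers should agree to many decimal places. Backpropagation gets every derivative from a single backward sweep, whereas the brute-force check needs two extra forward passes per weight.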
That’s it for now. I will follow this up with a post on some practical deep learning using Python and Keras, as promised at the start.
Further reads and references
- A non-technical introduction to deep neural nets and deep learning.
- Partial derivatives for dummies
- Surprisingly accessible wiki link on backpropagation
- Backprop step by step
- The Black Magic and Alchemy of Deep Learning