There are plenty of awesome blog posts out there introducing neural networks, the most popular being Michael Nielsen’s and Christopher Olah’s.
My intention is to complement the above resources with a different perspective, and perhaps answer those questions which most people are afraid to ask because they seem silly.
Let us start by examining the conventional way of writing code in a programming language such as C++. Supposing we want to code the equation y = 2x, the structure would be:
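Sketched in Python for brevity (the function name is illustrative), the conventional version is just a rule written out by hand:

```python
# Conventional ("Software 1.0") approach: a human writes the rule explicitly.
def f(x):
    # The relationship y = 2x is hard-coded by the programmer.
    return 2 * x

print(f(3))  # 6
```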
The usual structure for an ML program would be:
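A minimal sketch of that structure, assuming we fit the single weight of y = Wx by closed-form least squares (used here only to keep the example dependency-free):

```python
# ML approach: we supply (input, output) examples; the program finds the rule.
inputs = [1.0, 2.0, 3.0, 4.0]
outputs = [2.0, 4.0, 6.0, 8.0]

# Fit y = W * x by minimising the squared error over all examples.
W = sum(x * y for x, y in zip(inputs, outputs)) / sum(x * x for x in inputs)
print(W)  # 2.0 -- the learned "function" now lives in a weight, not in code
```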
Here we specify the input and output and the algorithm figures out how to predict the output for a given input in the future. Unlike conventional software in which functions are written in the language of code, this function is represented in the language of weights. In fact, Andrej Karpathy calls it Software 2.0. In order to understand weights, let us first examine the line equation.
For all those who are already familiar with the line equation, we are just going to make some changes to the naming convention. Don’t worry if you are not: it is just a set of numbers that mathematically describes any line drawn on a 2D plane such as a sheet of paper. For the line y = 2x + 3, it turns out that the parameters slope=2 and intercept=3 are enough to uniquely describe it. What is usually called the slope m is called the weight W here, and the intercept c is called the bias b here. But what is the role of the weight and bias in neural networks? Actually, what is their role in the line equation?
Slope / Weight: It scales the input value to get really close to, or sometimes exactly hit, the desired value. For instance, given input=3, no nice round value of W produces output=8 exactly, but W=2.666 gives output=7.998, which is a pretty good approximation. We do have to note, though, that changing W changes the nature of the function, as demonstrated below.
Intercept / Bias: In cases such as the above, we need to add a suitable number which allows us to obtain any output value. In our particular case, (3 x 2) + 2 = 8.
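Putting the two parameters together, a small sketch (values taken from the example above):

```python
x = 3

W, b = 2.666, 0   # weight only: gets close to the target of 8
print(W * x + b)  # roughly 7.998 -- close, but not exact

W, b = 2, 2       # weight plus bias: (3 * 2) + 2 = 8 exactly
print(W * x + b)  # 8
```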
Now let us discuss the actual learning part. Remember that in deep learning, we specify both input=3 and output=6, but the computer has to figure out what weight to multiply by and what bias to add to obtain the output value. At the beginning, our only option is to initialize the weight as a random number, as we have no idea what the actual value is [of course we know that W=2 for this simple case, but remember we want the network to find it out, even for large networks which may contain millions of weights]. Let us say we initialize it as W=23. The network outputs y = 23 x 3 = 69, which is completely wrong. Obviously, we want to modify W in such a way that we obtain y=6, but the question is how to do it.
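The same situation in code (W=23 is just the arbitrary starting guess from above):

```python
x, target = 3, 6

W = 23                 # arbitrary starting guess for the weight
prediction = W * x
print(prediction)      # 69: wildly wrong, so W must be adjusted somehow
```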
The immediate concern then becomes: why waste time pondering over the problem when you can just compute 6 / 3 = 2 and get it over with? The answer is that this would give us a value that satisfies the equation only for that particular (input, output) pair. There are a bunch of reasons why that is a problem:
My favourite interpretation: Consider studying a single question and answer for an algebra exam and trying to answer all other questions using only that knowledge. Rest assured that we would answer that particular question perfectly, but all other answers would be disastrous. By studying a variety of questions and answers, we would be able to correctly answer not only the questions we had studied but also, to some extent, totally new questions, because we could build mental relationships between individual questions and answers since they concern the same general subject. If we had studied questions from totally different topics such as algebra and geology, we wouldn’t be able to answer any new algebra question using the knowledge gathered from the geology questions. In the same way, we train neural networks not on a single example but on a range of different examples belonging to the same category [lots of audio recordings for voice detection, lots of photos for image recognition, etc.] because what we care about is prediction and generalization for new real-world examples. If we just wanted to correctly predict only the examples the network was trained with, then the whole idea of a neural network could be replaced by a series of if statements.
This one’s good too: The network might even learn the wrong representation. Assume that all goes well and the network does learn that W=2 from the data input=3 and output=6. However, if we give input=4 and ask the network to predict the output, there is a good chance it might give output=7 instead of the expected output=8, because the representation it learned might be y = x + 3 instead of y = 2x. The network isn’t at fault, because both of these representations are correct for input=3 and output=6. We are at fault for not providing sufficient training data.
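We can check this ambiguity numerically; here y = x + 3 stands in for a hypothetical additive rule the network might latch onto:

```python
def multiplicative(x):   # the intended rule: y = 2x
    return 2 * x

def additive(x):         # a hypothetical alternative rule: y = x + 3
    return x + 3

# Both rules agree on the single training example...
print(multiplicative(3), additive(3))  # 6 6
# ...but disagree as soon as we ask about a new input.
print(multiplicative(4), additive(4))  # 8 7
```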
Multiple weights: In the test case discussed above, we have considered only one weight. In all practical cases, neural nets have multiple weights W1, W2, W3, etc. because we want the input data to spread out through the network. To make it more concrete, let us consider our audio processing example. We mentioned earlier that the most important feature, frequency, has to be specified by hand in old ML methods, but deep learning automatically determines that it is the best feature.
A DJ uses instruments which have a variety of options to control each feature of the music output. Similarly, we have multiple weights in the neural net whose magnitudes tell us how important each input feature is. Multiple weights allow learning of complex function representations which wouldn’t be possible with just one weight, much like a guitar has 6 strings to allow for a range of sound outputs which wouldn’t be possible with just one string.
Multiple layers: Theoretically, a sufficient number of weights in a single layer is enough to learn any function, but remember that our purpose is not to have a network finely tuned for one specific example. If we want any sort of generalization, we have to make sure that the network learns multiple layers of representations. Consider that we are building an image classifier. We could train a single-layer network to learn a function mapping each input image to an output category, but such a network could never generalize to a new input image. To make things more concrete, consider preparing for a math exam. You are absolutely sure of the question coming for 10 marks but have no idea of the associated topic. The solution to such a problem would have a lot of steps, and you decide to memorize each one of them without learning any of the underlying concepts. If the same question comes, well and good, but any other question would most probably yield zero marks. However, if you had understood the theory behind each step and learned to solve the problem sequentially, you would easily have been able to answer any question of the same type.
Multiple inputs: Since input data rarely comes as a single feature in the real world, the network will need several inputs to describe it.
Audio files: Usually come in 128 or more amplitude channels.
Image files: Each image file is a matrix of pixels; a 720p image has 720 x 1280 = 921,600 pixels.
Neural nets need to have as many inputs as there are features in the input data. Also, each input needs to be connected to a weight.
Remember where we left off: we initialized the weight as a random number W=23 and wanted the network to learn that the weight is actually W=2 from the given input=3 and output=6. Such a model can accurately predict the output for any new input, but how do we get it to learn?
“Failures are the stepping stones to success” is one of the most common proverbs out there. In fact, neural networks need to make a series of failures before we can get them to work. Consider that we want to play a game on a mobile phone but do not know what the controls are. We might press random controls at the beginning to see what happens, which may lead to failure in the game, but we eventually figure out the correct controls. Putting it another way, we need to calculate how far apart the predicted value and the actual output were. In ML terminology, we call this the loss, and the function used to measure the loss is called a loss function. Let us see how we might design a basic loss function:
The simplest and most intuitive way to find the difference between the actual and predicted value is to calculate Loss = actual - predicted.
The problem with this function is that it outputs a negative loss when actual < predicted. However, we care only about the magnitude of the loss, because the sign does not add any extra information. To make this right, we modify the function to L = (actual - predicted)^2. Squaring not only makes the loss always positive but also amplifies large losses.
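A quick sketch of both versions of the loss:

```python
def naive_loss(actual, predicted):
    return actual - predicted          # can go negative

def squared_loss(actual, predicted):
    return (actual - predicted) ** 2   # always >= 0, amplifies large errors

print(naive_loss(6, 9))     # -3: the sign carries no useful information here
print(squared_loss(6, 9))   # 9
print(squared_loss(6, 69))  # 3969: big mistakes are punished heavily
```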
This gives us a concrete measure of how well the network performs, and the question now becomes how to decrease the loss. This is the right time to get some insights into the gradient of a function. Before that, let us examine the derivative of a function.