Hello.

The best way to think about ReLU is to ask "what are the alternatives?" and "what problems does ReLU solve that they have?" because, unfortunately, there is rarely a perfect answer for *why* things are done a certain way in deep learning.

Consider another common activation function: the sigmoid. What are its issues? First, the practical ones: computing $e^x$ is far more expensive than computing $\max(0, x)$, and sigmoid quickly saturates arbitrarily close to 0 or 1, which can cause representation issues that ReLU avoids. More important are the gradient properties: sigmoids suffer from the Vanishing Gradient Problem. During backpropagation, the gradient flowing through a sigmoid is scaled by its derivative $\sigma(x)(1 - \sigma(x))$, which is at most $0.25$, so the "share" of the error reaching early layers shrinks geometrically with depth; ReLU's gradient is exactly 1 for any positive input, so this happens far less often. Finally, ReLUs simply seem to perform better than sigmoids in practice, but I am not up-to-date with the cutting edge and cannot comment further as to *why* they are better.
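To make the vanishing-gradient point concrete, here's a minimal sketch in plain NumPy (the helper names `sigmoid_grad` and `relu_grad` are my own, made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25, at x == 0

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)  # exactly 1 for positive inputs

# Chained through 10 layers, the sigmoid's gradient factor
# shrinks geometrically (at best 0.25 per layer):
print(sigmoid_grad(0.0) ** 10)  # 0.25**10, roughly 9.5e-07
print(relu_grad(3.0) ** 10)     # ReLU's factor on an active path stays 1.0
```

This is the mechanism behind "the share of the error gets even closer to zero": each sigmoid layer multiplies the backpropagated gradient by at most 0.25, while an active ReLU path multiplies it by 1.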

I quickly skimmed a Medium post which seems to cover this in more detail, and the Wikipedia page is quite good too.

In summary: ReLUs have nice computational properties and seem to work better.

Re: 6.9 - 6.11. This is a matrix representation of the neural network given in Figure 6.2. Matrices are a very common, and useful, representation of neural networks (and of other machine learning techniques such as linear regression). In 6.8 the input vector is multiplied by the first layer's weight matrix, which will feed into the second layer, *but* first the bias term must be added (fairly common in neural networks); that is what Equation 6.9 is doing. In 6.10 those pre-activation values go through the activation function (ReLU) before they are fed into the next layer. 6.11 is just a repeat of 6.8 but for the final layer, giving us our output.
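As a sketch of what those equations compute (the layer sizes here are made up, not the ones from Figure 6.2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
x = rng.normal(size=(3,))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4,))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2,))

h_pre = W1 @ x + b1          # multiply by first-layer weights, then add the bias
h = np.maximum(0.0, h_pre)   # pre-activations go through ReLU
y = W2 @ h + b2              # repeat for the final layer to get the output

print(y.shape)
```

The whole forward pass is just matrix-vector products with a bias add and an elementwise nonlinearity between layers.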

Edit 02/03/2019 1535: Small typo changes.