Black Swans

RafayAK

History

Recent History

Study Group for Goodfellow's "Deep Learning" (Pages 110 - 140) (Week 7)

Hey Fam, here are your tasks for the week commencing 12/12/2018:

  • Read pages 110 - 140 of the Deep Learning book. Here we'll be introduced to one of the most fundamental concepts in ML, Maximum Likelihood Estimation. The proofs and justifications of many algos are derived from it, so make sure to go over it thoroughly.
  • Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 110 - 140.
  • No bonus activity for this week either 😲. Use your extra time to help answer the questions others leave 🙏, this may go a long way to solidify your understanding of many concepts, too.

As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy 🍻, see you next week.

Yeah, this can look a bit weird, as it's hard to read due to the mathematical syntax.
If you expand it out and then read it aloud, it makes a lot more sense.

Here $\mathbf{x}$ (our unsupervised problem) can be written as a set of supervised problems $\mathbf{x} = \{ x_{1}, x_{2}, x_{3}, x_{4} \}$

So, expanding with the chain rule (product rule) of probability we get:
$$
P(\mathbf{x}) = P(x_{1}, x_{2}, x_{3}, x_{4}) = \prod_{i=1}^{4}p(x_{i}|x_{1},...,x_{i-1})
= p(x_{1})\; p(x_{2}| x_{1})\; p(x_{3}|x_{1}, x_{2})\; p(x_{4}|x_{1},x_{2}, x_{3})
$$

Now reading the expanded probability from left to right:

  1. Probability of being $x_{1}$,
  2. Probability of being $x_{2}$ given $x_{1}$,
  3. Probability of being $x_{3}$ given $x_{1}$ and $x_{2}$,
  4. Probability of being $x_{4}$ given $x_{1}$ and $x_{2}$ and $x_{3}$.
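
For anyone who wants to convince themselves numerically, here is a minimal sketch of the same factorization. It assumes a made-up joint distribution over three binary variables (three instead of four just to keep it short); the table `joint` and the helper `chain_rule_product` are hypothetical names, not anything from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P(x1, x2, x3) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalize so the table sums to 1


def chain_rule_product(joint, x1, x2, x3):
    """Return p(x1) * p(x2 | x1) * p(x3 | x1, x2), all read off the joint table."""
    p_x1 = joint[x1].sum()             # marginal p(x1)
    p_x1_x2 = joint[x1, x2].sum()      # marginal p(x1, x2)
    p_x2_given_x1 = p_x1_x2 / p_x1
    p_x3_given_x1_x2 = joint[x1, x2, x3] / p_x1_x2
    return p_x1 * p_x2_given_x1 * p_x3_given_x1_x2


# Largest difference between the chain-rule product and the joint entry.
max_gap = max(
    abs(chain_rule_product(joint, a, b, c) - joint[a, b, c])
    for a in range(2)
    for b in range(2)
    for c in range(2)
)
print(max_gap)
```

The gap should be zero up to floating-point error for every assignment, which is all the chain rule claims.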

Putting this in a more concrete example, we can look at this as... maybe creating a cat detector given a picture without actually creating a specific cat detector.

Expanding with the chain rule of probability again, we get:
$$
\begin{align}
P(\text{cat}) & = P(\text{animal detected}, \text{paws detected}, \text{tail detected}, \text{whiskers detected}, \dots) \\
& = p(\text{animal detected})\; p(\text{paws detected} \mid \text{animal detected})\; p(\text{tail detected} \mid \text{animal detected}, \text{paws detected}) \\
& \quad\; p(\text{whiskers detected} \mid \text{animal detected}, \text{paws detected}, \text{tail detected}) \cdots
\end{align}
$$

Using the probability chain rule, we have turned the unsupervised problem of cat detection into a series of supervised problems which together lead to a cat detector.

Since the upcoming topics delve a bit into Maximum Likelihood Estimation and Maximum A Posteriori Estimation, if anyone has some resources for understanding these topics, please share.

If you have an unlabelled dataset of cats and dogs and want a model that can classify both, what do you do? Simple: label the data and then train a model. But if your unlabelled dataset contains millions of examples, labelling everything is impractical, and this is where semi-supervised learning comes in handy.
Initially, you label a small number of examples (say 100), train on them, and check how the model does on the rest of the dataset. Using a clever technique called active learning, you then guide your labelling process toward a few of the examples the model struggles with, then a few more in the next iteration, until a satisfactory accuracy is reached. This process significantly reduces the quantity of labelled data required.

Before I begin the answer, one important thing to note is that the gradient is always perpendicular to the contours of the function.

So, when the condition number is high, the contours of the objective function are very elliptical (skewed circles), as shown below. The magnitude of the gradient in one direction is much larger than in the other.

[Image: elliptical contours with blue gradient arrows, the minimum marked in red, and the gradient descent path in orange]

Here the blue arrows show the direction of the gradient at arbitrary points, perpendicular to the contours of the function. As you can see, they never line up with the minimum (red point). This is why gradient descent (orange line) keeps bouncing up and down in a to-and-fro motion, slowing down the rate of convergence: the step direction is almost never aimed directly at the minimum.

To remedy this, we normalize the data when using gradient descent (or any first-order optimization method) and use batch normalization layers within neural networks, so that all our data is on the same scale. In a function where all the contours are perfectly circular, i.e. the condition number is 1 (its smallest possible value), the descent direction at any arbitrary point leads straight to the minimum, as shown below.

[Image: circular contours with the descent direction pointing straight at the minimum]

Also note that, on a quadratic bowl, Newton's method jumps directly to the minimum in either case (elliptical or circular) because, intuitively, it changes the coordinate system so that everything is perfectly circular. But Newton's method has its own drawbacks: it is attracted to saddle points, and the Hessian is expensive to calculate and store for models with many parameters.
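
Here is a rough sketch of the same behaviour in code, assuming a toy quadratic objective $f(x) = \frac{1}{2} x^{T} A x$ with a hand-picked Hessian `A` whose condition number is 100. The step size, starting point, and function names are my own choices for illustration only.

```python
import numpy as np

# Hessian of the toy objective f(x) = 0.5 * x^T A x (assumed for illustration).
# Its eigenvalues are 1 and 100, so the condition number is 100.
A = np.diag([1.0, 100.0])


def grad(x):
    """Gradient of f at x."""
    return A @ x


def gradient_descent(x0, lr, steps):
    """Run plain gradient descent and return every iterate as rows of an array."""
    path = [x0]
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
        path.append(x)
    return np.array(path)


def newton_step(x):
    """One Newton update: x - H^{-1} * gradient, with H = A."""
    return x - np.linalg.solve(A, grad(x))


x0 = np.array([1.0, 1.0])
gd_path = gradient_descent(x0, lr=0.019, steps=50)
newton_x = newton_step(x0)

print("condition number:", np.linalg.cond(A))
print("gradient descent after 50 steps:", gd_path[-1])
print("Newton after 1 step:", newton_x)
```

With this step size, the second coordinate of `gd_path` flips sign on every step (the bouncing) while the first coordinate shrinks slowly, whereas the single Newton step lands on the minimum at the origin. That one-step behaviour is specific to quadratics.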

P.S. sorry for the crappy images, made 'em in Paint. Hopefully, you get the point.

Really enjoyed this chapter, it was a nice refresher on Lagrange multipliers and introduced me to their generalized form, the KKT conditions.

The numerical stability part, early on, piqued my interest to actually sit down and create a numerically stable version of the Softmax activation function from scratch. For anyone else also interested, check out these two blogs:

http://saitcelebi.com/tut/output/part2.html
https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
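
For reference, a minimal sketch of the usual trick (subtracting the maximum before exponentiating), written with NumPy. This is my own short version, not code taken from either blog.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = np.asarray(z, dtype=float)
    # Subtracting the max leaves the result unchanged mathematically,
    # but keeps np.exp from overflowing on large inputs.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# A naive exp(1000) would overflow to inf; the shifted version stays finite.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs, probs.sum())
```

Because softmax is invariant to adding a constant to every input, `softmax([1000, 1001, 1002])` gives the same probabilities as `softmax([0, 1, 2])`.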

Oh yeah, that's even better. I completely forgot about that.

Exactly what I am planning to do. Since Softplus just looks like a smoothed-out version of a ReLU, putting one in place of the other seems like a simple job.

Two simple networks of the form:

1- (Linear->Softplus)->(Linear->Softplus)->(Linear->Softplus)->Linear->Sigmoid(output)

2- (Linear->ReLU)->(Linear->ReLU)->(Linear->ReLU)->Linear->Sigmoid(output)

and a dataset of, say, cat/non-cat images will hopefully be sufficient to run this experiment.
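
A rough sketch of how the swap could look with NumPy, forward pass only. The layer sizes, the weight initialization, and the helper names `init_params` and `forward` are placeholders I made up, and no training loop is included.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def softplus(x):
    # log(1 + exp(x)), computed via logaddexp so large x does not overflow.
    return np.logaddexp(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(layer_sizes, rng):
    """Small random weights and zero biases for each Linear layer (placeholder init)."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(x, params, activation):
    """(Linear -> activation) for every hidden layer, then Linear -> Sigmoid."""
    h = x
    for W, b in params[:-1]:
        h = activation(h @ W + b)
    W, b = params[-1]
    return sigmoid(h @ W + b)


rng = np.random.default_rng(0)
params = init_params([64, 16, 16, 16, 1], rng)  # three hidden layers, one output
x = rng.normal(size=(5, 64))                    # five fake flattened "images"

out_softplus = forward(x, params, softplus)
out_relu = forward(x, params, relu)
print(out_softplus.ravel(), out_relu.ravel())
```

Passing the activation in as an argument means both networks share the same weights and code path, so any difference in behaviour comes from the activation alone.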

Enjoyed a bit of a deep dive into probability and statistics, as it cleared up the questions of where the cross-entropy function comes from and what a logit is (turns out it's just the inverse of the logistic function).

A weird itch I've got after reading this chapter is to check how my toy networks will behave if I replace the ReLU activations with Softplus activations. Maybe I'll try it out this weekend. If someone has any empirical knowledge on this, I would love to hear it.

See you next week fam.

enter image description here

A bit confused about this part.

The Dirac delta function ($\delta$) is infinitely high at $x = x^{(i)}$.
If it is infinitely high at $x = x^{(i)}$, then how does multiplying by $\frac{1}{m}$ put a probability mass of $\frac{1}{m}$ on each of the $m$ points?

I've recently become a huge fan of 3Blue1Brown

A cool thing I learned this week is that the determinant of a matrix is the product of its eigenvalues.

I wonder why this info eluded me throughout the countless math classes in high school and college, could have used it.
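
A quick NumPy sanity check of this, on a random 4x4 matrix that I picked arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))  # arbitrary square matrix for the check

# Eigenvalues of a real matrix can be complex, but they come in conjugate
# pairs, so their product is real up to rounding error.
eigenvalues = np.linalg.eigvals(A)
det_from_eigs = np.prod(eigenvalues)
det_direct = np.linalg.det(A)

print(det_from_eigs, det_direct)
```

The two numbers should agree up to floating-point error, with any imaginary part of the product being negligible.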

Edit: added a pic instead of LaTeX code, as the matrices were being squashed into a vector.
enter image description here

Note that if $AA^T$ and $A^TA$ are valid operations then $\operatorname{Tr}(A^TA) = \operatorname{Tr}(AA^T)$.

Check out equation 2.51 under the Trace operator section.
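
A small check of the trace identity with a non-square matrix, so that the two products have different shapes (the 3x5 shape is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))  # arbitrary non-square matrix

trace_AtA = np.trace(A.T @ A)   # trace of a 5x5 matrix
trace_AAt = np.trace(A @ A.T)   # trace of a 3x3 matrix
sum_of_squares = np.sum(A ** 2)  # squared Frobenius norm of A

print(trace_AtA, trace_AAt, sum_of_squares)
```

Both traces come out equal, and both match the sum of the squared entries of `A`.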

Yep, you got that right. It is read "for all x and for all y..."

For those who are not sure what the implications of the $L_{1}$ (Manhattan distance), $L_{2}$, and $L_{\infty}$ norms will be later on with respect to the cost function, a good intuition to develop at this stage is to visualize the unit circles of all three.

[Image: unit circles of the $L_{1}$, $L_{2}$, and $L_{\infty}$ norms]

Note that $\left \| x \right \|_{1} \geq \left \| x \right \|_{2} \geq \left \| x \right \|_{\infty }$
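
The ordering is easy to check on a concrete vector; the values below are just an example I picked.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])  # example vector

l1 = np.linalg.norm(x, ord=1)         # |3| + |-4| + |1| = 8
l2 = np.linalg.norm(x, ord=2)         # sqrt(9 + 16 + 1) = sqrt(26)
linf = np.linalg.norm(x, ord=np.inf)  # max(|3|, |-4|, |1|) = 4

print(l1, l2, linf)
```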

This short video provides further explanation on the subject

Here's my favorite book on Linear Algebra. It single-handedly got me through my undergrad Linear Algebra course: Kolman's Elementary Linear Algebra with Applications.
https://drive.google.com/open?id=0B2mB_TmBtVqFMk9EVVBvcm1tRTA

Hey, great question. A funny story about this: after AlexNet's groundbreaking result in the 2012 ImageNet competition, each successive winning convolutional network kept decreasing the error rate in the competition. Researchers early on believed that a trained human's accuracy on ImageNet data could be assumed to be 100%, or very close to it, though no actual studies had been carried out to prove this. Mind you, the ImageNet data has 1000 different classes, and as I have recently learnt, a big portion of them are various dog breeds (weird I know, LOL).

Around 2014, Andrej Karpathy set out to measure how well a human performs compared to a trained neural network. Surprisingly, his study found that the human error rate was around 5.1%, compared to GoogLeNet's 6.8% (networks have since passed human accuracy). This study earned Andrej Karpathy the witty moniker "The Reference Human" among AI researchers.

Check out his blog post, "What I learned from competing against a ConvNet on ImageNet".

This entire introduction section reminds me of Andrew Ng's interview with Geoffrey Hinton where, in the later part, he talks about why the early researchers in AI were misguided. The early misconceptions were based on the reasoning that "What comes in is a string of words, and what comes out is a string of words. And because of that, strings of words are the obvious way to represent things. So they thought what must be in between was a string of words, or something like a string of words."

Reading the parts about Cyc, where the researchers tediously compiled a knowledge-base for the inference engine and yet it failed, shows the pitfalls in relying upon symbolic expressions to create intelligence. As Hinton puts it "I think what's in between is nothing like a string of words. I think the idea that thoughts must be in some kind of language is as silly as the idea that understanding the layout of a spatial scene must be in pixels, pixels come in."

I've created a bunch of small networks over the past few months to test out certain hypotheses and experimental ideas. Once you get the hang of the math and the dimensions of the matrices and vectors, you can build small networks pretty quickly.
My advice would be to first create a single-layer network to mimic simple logic gates such as AND/OR, then a two-layer network for the slightly more complicated XOR gate.
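
To make the gate idea concrete, here is a minimal sketch that uses hand-set weights and a step activation instead of training. The specific weights and helper names are just one choice that works, picked for illustration.

```python
import numpy as np


def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)


# All four input combinations for a two-input gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])


def single_layer(X, w, b):
    """One linear unit followed by the step activation."""
    return step(X @ w + b)


and_out = single_layer(X, np.array([1.0, 1.0]), -1.5)  # fires only when both are 1
or_out = single_layer(X, np.array([1.0, 1.0]), -0.5)   # fires when at least one is 1


def xor_net(X):
    """Two-layer network: XOR = (x1 OR x2) AND NOT (x1 AND x2)."""
    W1 = np.array([[1.0, 1.0], [1.0, 1.0]])  # columns: OR unit, AND unit
    b1 = np.array([-0.5, -1.5])
    h = step(X @ W1 + b1)
    w2 = np.array([1.0, -1.0])               # OR minus AND
    b2 = -0.5
    return step(h @ w2 + b2)


xor_out = xor_net(X)
print(and_out, or_out, xor_out)
```

A single linear unit cannot produce the XOR outputs because they are not linearly separable, which is why the hidden layer is needed. In practice you would learn these weights with gradient descent and a differentiable activation rather than set them by hand.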

On a side note, I remember reading the Batch Normalization paper where the authors first tried out their idea on a simple 3-layer neural network.

Exactly what I was looking for! I am already on the 5th chapter and was looking for people to discuss the different topics and help me better understand some topics. Count me in.
