Black Swans

RafayAK

32 points

History

Study Group for Goodfellow's "Deep Learning" (Pages 140 - 165) (Week 8)

Hey Fam, I was out for a while, sad to see no one picked up the slack. We had a good thing going on. Anyways, I have moved on quite a bit since the last "study week", but a dedicated study group, however small, did wonders for not only motivating me but also clearing the fog when trying to learn new concepts or explain them.

For the sake of continuity, this study week will start from where we left off.

• Read pages 140 - 165 of the Deep Learning book. Here we’ll be introduced to basic Supervised and Unsupervised Learning algos and, most importantly, the workhorse of modern deep learning advances: the Stochastic Gradient Descent algo.
Make sure to go over it thoroughly.
• Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 140 - 165.
• Go over Andrew Trask's brilliant A Neural Network in 11 lines of Python (Part 1) and
A Neural Network in 13 lines of Python (Part 2 - Gradient Descent) posts to solidify your concepts for this week and help prepare you for the next.

Use your extra time to help answer the questions others leave 🙏, this may go a long way to improve your understanding of many concepts, too.
As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy 🍻, see you next week.

Merry Christmas to you too, Jack. Interviews seem like a good idea for increasing traffic to the website. Eagerly waiting for the Ian Goodfellow interview, that'll be a big one for the site.
As for the bugs, there is a lingering LaTeX rendering issue.

On a side note, what do you plan on doing with the current points system?

Hey fam, if anyone is finding it difficult to understand Probability and Bayesian statistics parts of this week's reading be sure to check out the following resources:

1. A Short Introduction to Entropy, Cross-Entropy and KL-Divergence
2. How Bayes Theorem works
3. Where did the least-square come from? (Detailed derivation of the mean squared error cost function using MLE, similar to the one in the book)

If you know any resources that helped you out this week please share below 🍻

This excerpt from the wiki on Estimation of Covariance Matrices may help you understand this better:

Statistical analyses of multivariate data often involve exploratory
studies of the way in which the variables change in relation to one
another and this may be followed up by explicit statistical models
involving the covariance matrix of the variables. Thus the estimation
of covariance matrices directly from observational data plays two
roles:

1. to provide initial estimates that can be used to study the inter-relationships;
2. to provide sample estimates that can be used for model checking.

Hey hsm, sorry for the late reply, been a bit too busy this past week.

Anyway, what the author is trying to convey in the 'entire' passage is that for all machine learning tasks that map from x (input) ---> y (label/output), there is a theoretical ceiling on their accuracy, set by the Bayes optimal error (or simply Bayes error).

Any ML task can be seen as follows (the figure referenced in the book passage is looking at the error, instead of accuracy.):

Initially, with more and more training data, the accuracy of your ML system increases (generalization error reduces) till it reaches human-level performance, after which significant gains are much harder to achieve, regardless of how much more data you throw at the problem. Note that human-level performance is, in many AI tasks such as computer vision and speech synthesis, pretty close to the theoretical Bayes optimal error.

Hope this clears it up.

Study Group for Goodfellow's "Deep Learning" (Pages 110 - 140) (Week 7)

• Read pages 110 - 140 of the Deep Learning book. Here we'll be introduced to one of the most fundamental concepts in ML, the Maximum Likelihood Estimation. A lot of the proofs and justifications of many algos are derived from it. Make sure to go over it thoroughly.
• Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 110 - 140.
• No bonus activity for this week either 😲. Use your extra time to help answer the questions others leave 🙏, this may go a long way to solidify your understanding of many concepts, too.

As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy 🍻, see you next week.

Yeah, this can look a bit weird as it's hard to read due to the mathematical syntax.
If you can expand this out and then read it out, it makes a lot more sense.

Here x (our unsupervised problem) can be written as a set of supervised problems $\mathbf{x} = \{ x_{1}, x_{2}, x_{3}, x_{4} \}$

So, expanding the probability product rule we get:
$$P(\mathbf{x}) = P(x_{1}, x_{2}, x_{3}, x_{4}) = \prod_{i=1}^{4}p(x_{i}|x_{1},...,x_{i-1}) = p(x_{1})\; p(x_{2}| x_{1})\; p(x_{3}|x_{1}, x_{2})\; p(x_{4}|x_{1},x_{2}, x_{3})$$

Now reading the expanded probability from left to right:

1. Probability of being $x_{1}$,
2. Probability of being $x_{2}\; given\; x_{1}$,
3. Probability of being $x_{3}\; given\; x_{1}\;and \;x_{2}$,
4. Probability of being $x_{4}\; given\; x_{1}\;and \;x_{2}\; and\; x_{3}$,
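For anyone who wants to convince themselves numerically, here is a minimal sketch of the expansion above. The joint table `joint` over four binary variables is made up (random numbers, normalized), and `marginal` and `chain_rule_prob` are just illustrative helper names, not anything from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P(x1, x2, x3, x4) over four binary variables.
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()


def marginal(joint, k):
    """Marginal over the first k variables: sum out the remaining axes."""
    m = joint
    for _ in range(joint.ndim - k):
        m = m.sum(axis=-1)
    return m


def chain_rule_prob(joint, x):
    """p(x1) * p(x2|x1) * p(x3|x1,x2) * p(x4|x1,x2,x3) for one assignment x."""
    prob, prev = 1.0, 1.0
    for k in range(1, joint.ndim + 1):
        current = marginal(joint, k)[tuple(x[:k])]  # p(x1, ..., xk)
        prob *= current / prev                      # p(xk | x1, ..., x_{k-1})
        prev = current
    return prob


x = (1, 0, 1, 1)
print(chain_rule_prob(joint, x), joint[x])  # the two numbers should agree
```

Each factor in the loop is one of the "supervised" conditionals from the list, and their product recovers the joint probability.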

Putting this in a more concrete example, we can look at this as... maybe creating a cat detector given a picture without actually creating a specific cat detector.

So, expanding the probability product rule we get:
\begin{align} P(\mathbf{cat}) & = P(animal\;detected, paws\;detected, tail\;detected, whiskers\;detected, \;...) \\ & = p(animal\;detected)\; p( paws\;detected\;|\;animal\;detected)\; p(tail\;detected\;|\;animal\;detected, paws\;detected)\; \\ & p(whiskers\;detected\;|\;animal\;detected,paws\;detected,tail\;detected) ... \end{align}

Using the probability chain rule, we have turned the unsupervised problem of cat detection into a series of supervised problems which lead to a cat detector.

Since the upcoming topics delve a bit into Maximum Likelihood Estimation and Maximum A Posteriori Estimation, if anyone has some resources for understanding these topics please share.

If you have an unlabelled dataset of cats and dogs and want a model that can classify both of them, well what do you do? Simple, label them and then train a model. But if your unlabelled dataset contains millions of examples, then this is where semi-supervised learning comes in handy.
Initially, you label a small number of examples (say 100) and then check how accurate your model is on the rest of the dataset. Using a clever technique called Active Learning, you then guide your labelling process to label a few of the examples the model struggles with (for instance, the ones it is least confident about), and then a few more in the next iteration, till a satisfactory accuracy is reached. This process significantly reduces the quantity of labelled data required.

Before I begin the answer, one important thing to note is that the gradient is always perpendicular to the contours of the function.

So, when the condition number is high, the contours of the objective function are very elliptical (skewed circles), as shown below. The magnitude of the gradient in one direction is much larger than in the other.

Here the blue arrows show the direction of the gradient at arbitrary points, perpendicular to the contours of the function. As you can see, they always point away from the minimum (red point), but almost never along the straight line through it. This is the reason why gradient descent (orange line) keeps bouncing up and down in a to-and-fro motion, slowing down the rate of convergence. All because the negative gradient is almost never pointing directly at the minimum.
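To put rough numbers on this, here is a small sketch on a toy quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, \kappa)$, so $\kappa$ is the condition number and the minimum sits at the origin. The function names, the starting point and the step size $1/\kappa$ are my own choices for illustration.

```python
import numpy as np


def descent_alignment(kappa, x):
    """Cosine of the angle between the descent direction (-gradient)
    and the straight line from x to the minimum at the origin."""
    A = np.diag([1.0, kappa])
    grad = A @ x
    to_min = -x
    return ((-grad) @ to_min) / (np.linalg.norm(grad) * np.linalg.norm(to_min))


def gd_steps(kappa, x0, tol=1e-6, max_steps=100_000):
    """Number of gradient descent steps until ||x|| < tol."""
    A = np.diag([1.0, kappa])
    x = x0.copy()
    lr = 1.0 / kappa  # step size set by the steepest direction
    for step in range(max_steps):
        if np.linalg.norm(x) < tol:
            return step
        x = x - lr * (A @ x)
    return max_steps


x0 = np.array([1.0, 1.0])
for kappa in (1.0, 100.0):
    print(kappa, descent_alignment(kappa, x0), gd_steps(kappa, x0))
```

With circular contours ($\kappa = 1$) the descent direction lines up exactly with the direction to the minimum and one step is enough. With $\kappa = 100$ the direction is well off target and it takes more than a thousand steps.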

To remedy this, we normalize the data when using gradient descent (or any other first-order optimization method) and use Batch Normalization layers within neural networks, so that all our data is on the same scale. In a function where all the contours are perfectly circular, i.e. the condition number is 1, the negative gradient at any arbitrary point will point directly at the minimum, as shown below.

Also note that, for a quadratic function, Newton's method jumps directly to the minimum in either case (elliptical or circular) because, intuitively, Newton's method changes the coordinate system so that everything is perfectly circular. But Newton's method has its own drawbacks: it is attracted to saddle points, and the Hessian is expensive to calculate and store for models with many parameters.
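A quick sketch of that claim on a toy quadratic $f(x) = \frac{1}{2} x^T A x$, where the gradient is $Ax$ and the Hessian is $A$. The matrix, starting point and function name are made up for illustration.

```python
import numpy as np


def newton_step(A, x):
    """One Newton update for f(x) = 0.5 * x^T A x (gradient A x, Hessian A)."""
    grad = A @ x
    # Solve H d = grad instead of forming the inverse Hessian explicitly.
    return x - np.linalg.solve(A, grad)


x0 = np.array([1.0, 1.0])
for kappa in (1.0, 100.0):
    A = np.diag([1.0, kappa])
    print(kappa, newton_step(A, x0))
```

For both the circular and the elliptical case a single step lands on the minimum at the origin. This only holds exactly for quadratics; on a general function Newton's method still needs several iterations.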

P.S. sorry for the crappy images, made'em in paint. Hopefully, you get the point.

Really enjoyed this chapter, it was a nice refresher on Lagrange multipliers and introduced me to its generalized form, KKT.

The numerical stability part, early on, piqued my interest to actually sit down and create a numerically stable version of the Softmax activation function from scratch. For anyone else also interested, check out these two blogs:
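For reference, here is a minimal NumPy sketch of the commonly used trick, which is to subtract the maximum before exponentiating. This is my own short version under that assumption, not necessarily the code from the blogs.

```python
import numpy as np


def softmax(z):
    """Softmax along the last axis, stabilized by subtracting the max."""
    z = np.asarray(z, dtype=float)
    # Shifting by a constant does not change the result, but it keeps
    # np.exp from overflowing on large inputs.
    shifted = z - z.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)


print(softmax([0.0, 1.0, 2.0]))
print(softmax([1000.0, 1001.0, 1002.0]))  # a naive exp(1000) would overflow
```

Both calls should print the same probabilities, since the two inputs differ only by a constant shift.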

Oh yeah, that's even better. I completely forgot about that.

Exactly, what I am planning to do. Since Softplus just looks like a smoothed out version of a ReLU, putting one in place of the other seems like a simple job.

Two simple networks of the form:

1- (Linear->Softplus)->(Linear->Softplus)->(Linear->Softplus)->Linear->Sigmoid(output)

2- (Linear->ReLU)->(Linear->ReLU)->(Linear->ReLU)->Linear->Sigmoid(output)

and a dataset of, say, cat/non-cat images will hopefully be sufficient to run this experiment.
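Before running the full experiment, a quick sanity check of the "smoothed-out ReLU" intuition. This is just a sketch comparing the two activations on a grid of inputs; `np.logaddexp(0, x)` is used as a numerically stable way to compute $\log(1 + e^{x})$.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def softplus(x):
    # log(1 + exp(x)) without overflow for large x.
    return np.logaddexp(0.0, x)


xs = np.linspace(-10.0, 10.0, 201)
gap = softplus(xs) - relu(xs)
print(gap.max(), np.log(2.0))   # the largest gap is at x = 0
print(gap[0], gap[-1])          # the gap nearly vanishes far from 0
```

Softplus sits slightly above ReLU everywhere, with the difference peaking at $\log 2$ around zero, so swapping one for the other mostly changes the behaviour near the kink.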

Enjoyed a bit of a deep dive into probability and statistics, as it cleared up the question of where the cross-entropy function emerged from and what a logit is (turns out it's just the inverse of the logistic function).
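The logit part is easy to check numerically. A tiny sketch, with `sigmoid` and `logit` written out by hand:

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logit(p):
    """Log-odds: maps a probability in (0, 1) back to a real number."""
    return np.log(p / (1.0 - p))


z = np.linspace(-5.0, 5.0, 11)
print(np.allclose(logit(sigmoid(z)), z))
```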

A weird itch I've got after reading this chapter is to check how my toy networks will behave if I supplant the ReLU activations with Softplus activations. Maybe I'll try it out this weekend. If someone has any empirical knowledge on this I would love to hear.

See you next week fam.

The Dirac delta function($\delta$) will be infinitely high when $x = x^{(i)}$
If it is infinitely high at $x = x^{(i)}$ then how come multiplying by $\frac{1}{m}$ will put probability mass of $\frac{1}{m}$ on $m$ points?
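One possible way to resolve this, as a sketch: what matters for a delta is the area under it, not its height. Assuming the empirical distribution is written as $\hat{p}(x) = \frac{1}{m}\sum_{i=1}^{m}\delta(x - x^{(i)})$ with distinct points $x^{(i)}$, and using the defining property that each delta integrates to 1:

$$\int \delta(x - x^{(i)})\,dx = 1 \quad\Rightarrow\quad \int \frac{1}{m}\,\delta(x - x^{(i)})\,dx = \frac{1}{m}, \qquad \int \hat{p}(x)\,dx = \frac{1}{m}\sum_{i=1}^{m} 1 = 1$$

So the spike stays infinitely high, but scaling it by $\frac{1}{m}$ scales the mass under it to $\frac{1}{m}$, and the $m$ spikes together carry a total mass of 1.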

I've recently become a huge fan of 3Blue1Brown

A cool thing I learned this week is that the determinant of a matrix is the product of the eigenvalues of the matrix.
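A quick numerical check, sketched with a made-up random matrix. The eigenvalues of a non-symmetric matrix can be complex, but they come in conjugate pairs, so the product is real up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))   # arbitrary example matrix

eigvals = np.linalg.eigvals(A)
product = np.prod(eigvals)
det = np.linalg.det(A)

print(product.real, det)      # the two numbers should agree
```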

I wonder why this info eluded me throughout the countless math classes in high school and college, could have used it.

edit: added a pic instead of LaTeX code, the matrices were being squashed to a vector

Note that if $A.A^T$ and $A^T.A$ are valid operations then $Tr(A^T.A) = Tr(A.A^T)$.

Check out equation 2.51 under the Trace operator section.
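A small sketch to check the trace identity on a made-up non-square matrix, where $A^T A$ and $A A^T$ do not even have the same shape:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))   # arbitrary 3x5 example

left = np.trace(A.T @ A)      # trace of a 5x5 matrix
right = np.trace(A @ A.T)     # trace of a 3x3 matrix

print(left, right, np.sum(A ** 2))
```

Both traces equal the sum of the squared entries of $A$, which is one way to see why they must match.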

Yep, you got that right. It is read "for all x and for all y..."

For those who are not sure what the implications of the $L_{1}$ (Manhattan distance), $L_{2}$, and $L_{\infty}$ norms will be later on with respect to the cost function, a good intuition to develop at this stage would be to visualize the unit circles of all three.

Note that $\left \| x \right \|_{1} \geq \left \| x \right \|_{2} \geq \left \| x \right \|_{\infty }$
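The inequality is easy to see on a concrete vector. A minimal sketch using `np.linalg.norm`, with an example vector picked by hand:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.linalg.norm(x, 1)         # |3| + |-4| + |1| = 8
l2 = np.linalg.norm(x, 2)         # sqrt(9 + 16 + 1) = sqrt(26)
linf = np.linalg.norm(x, np.inf)  # max(|3|, |-4|, |1|) = 4

print(l1, l2, linf)
```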

This short video provides further explanation on the subject

Here's my favorite book on Linear Algebra. It single handedly got me through my undergrad Linear Algebra course. Kolman's Elementary Linear Algebra with Applications

Hey, great question. A funny story about this: after AlexNet's groundbreaking result in the 2012 ImageNet competition, each successive winning convolutional net kept on decreasing the error rate in the competition. Researchers early on believed that a trained human's accuracy on ImageNet data could be assumed to be 100%, or very close to it, though no actual studies were carried out to prove this. Mind you, the ImageNet data has 1000 different classes, and as I have recently learnt, a big portion of them are various dog breeds (weird I know, LOL).

Around 2014, Andrej Karpathy set out to measure how well a human performs compared to a trained neural network. Surprisingly, his study found that the human error rate was around 5.1% compared to GoogLeNet's 6.8% (ConvNets have since passed human accuracy). This study earned Andrej Karpathy the witty moniker "The Reference Human" among AI researchers.

Check out his blog What I learned from competing against a ConvNet on ImageNet

This entire introduction section reminds me of Andrew Ng's interview with Geoffrey Hinton where, in the later part, he talks about why the early researchers in AI were misguided. The early misconceptions came from the reasoning that "What comes in is a string of words, and what comes out is a string of words. And because of that, strings of words are the obvious way to represent things. So they thought what must be in between was a string of words, or something like a string of words."

Reading the parts about Cyc, where the researchers tediously compiled a knowledge-base for the inference engine and yet it failed, shows the pitfalls in relying upon symbolic expressions to create intelligence. As Hinton puts it "I think what's in between is nothing like a string of words. I think the idea that thoughts must be in some kind of language is as silly as the idea that understanding the layout of a spatial scene must be in pixels, pixels come in."

I've created a bunch of small networks over the past few months to test out certain hypotheses and experimental ideas. Once you get the hang of the math and the dimensions of the matrices and vectors, you can build small networks pretty quickly.
My advice would be to first create a single-layer network to mimic simple logic gates such as AND/OR. Then a two-layer network to create a slightly more complicated XOR gate.
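As a rough starting point, here is a minimal NumPy sketch of that exercise. It uses a step activation and the classic perceptron update for AND/OR, and hand-picked (not trained) weights for the two-layer XOR network. All names and numbers here are my own illustrative choices.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])


def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return np.where(z > 0, 1, 0)


def train_perceptron(X, y, lr=1.0, epochs=20):
    """Single-layer network trained with the perceptron learning rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - step(w @ xi + b)
            w += lr * err * xi
            b += lr * err
    return w, b


def xor_net(X):
    """Two-layer network: hidden units compute OR and AND, output is OR and not AND."""
    W1 = np.array([[1.0, 1.0], [1.0, 1.0]])  # columns: OR unit, AND unit
    b1 = np.array([-0.5, -1.5])
    h = step(X @ W1 + b1)
    w2 = np.array([1.0, -1.0])
    b2 = -0.5
    return step(h @ w2 + b2)


w_and, b_and = train_perceptron(X, np.array([0, 0, 0, 1]))
w_or, b_or = train_perceptron(X, np.array([0, 1, 1, 1]))

and_pred = step(X @ w_and + b_and)
or_pred = step(X @ w_or + b_or)
xor_pred = xor_net(X)
print(and_pred, or_pred, xor_pred)
```

A single layer is enough for AND and OR because they are linearly separable. XOR is not, which is why it needs the hidden layer.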

On a side note, I remember reading the Batch Normalization paper where the authors first tried out their idea on a simple 3-layer neural network.

Exactly what I was looking for! I am already on the 5th chapter and was looking for people to discuss the different topics and help me better understand some topics. Count me in.