RafayAK


History


Study Group for Goodfellow's "Deep Learning" (Chapter 8) (Week 11)

Here is your task for this week. Keep on the grind, we'll be done in no time... if you're here... hello... hellooooooo... anybody out there?

This week we'll be going over Optimization for Deep Learning, another important aspect of modern-day AI research. Many cool concepts are explained here, but I'd recommend first reading Karpathy's cs231n lecture note #3 and maybe also watching lecture 6 and lecture 7. If you have even a cursory knowledge of the concepts presented this week, you can dive straight into the book.

- Read pages 274 - 329 of the Deep Learning book.
- Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 274 - 329.
- Use your extra time to help answer the questions others leave; this may go a long way toward improving your understanding of many concepts, too.
- As always, if you're feeling lost, mention it in the comments and ask for help!

Enjoy, see you next week.

So another week went by; sadly, this time no one showed up.

Anyways, regularization is the topic that intrigues me the most.

The small paragraph on setting regularization separately for each layer piqued my curiosity. Some quick research shows this idea was first implemented by Snoek et al. (Practical Bayesian Optimization of Machine Learning Algorithms, 2012), where the researchers optimized separate regularization parameters for each layer in a neural network and found that it improved performance. Some ideas are brewing in my head; I think I'ma work on this. Anyone up for some research? Not promising anything, but I think I've got something.
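Just to sketch what I mean (my own toy formulation, not from the Snoek et al. paper): a per-layer weight-decay penalty is simply a sum of squared weights with its own coefficient for each layer, instead of one global coefficient.

```python
import numpy as np

# Toy sketch: an L2 penalty with a separate coefficient per layer,
# rather than one global weight-decay term for the whole network.
def l2_penalty(weights, lambdas):
    """weights: list of weight matrices; lambdas: one coefficient per layer."""
    return sum(lam * np.sum(W ** 2) for W, lam in zip(weights, lambdas))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((3, 1))]

# Regularize the first layer more heavily than the second.
penalty = l2_penalty(weights, lambdas=[1e-2, 1e-4])
```

Tuning those per-layer coefficients is exactly the kind of thing the Bayesian optimization in the paper automates.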

As with the previous few chapters, I feel like understanding the content here is much easier if you look it up first and then come back to the book; Goodfellow's explanations are very dense and most of the time go over my head.

To the next one!

Study Group for Goodfellow's "Deep Learning" (Chapter 7) (Week 10)

Great seeing some new people joining in and helping each other out.

This week we'll be going over Regularization, one of the most important topics in Deep Learning; it also happens to be one of my favorites.

- Read pages 228 - 273 of the Deep Learning book.
- Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 228 - 273.
- Use your extra time to help answer the questions others leave; this may go a long way toward improving your understanding of many concepts, too.

As always, if you're feeling lost, mention it in the comments and ask for help!

Enjoy, see you next week.

So, another week has gone by and down goes another chapter. This chapter in particular has been one of the harder ones to follow; in many places the author sacrificed comprehensibility for the sake of brevity.

The highlights of this chapter for me have been (in no particular order):

- The benefit the 'log' function provides to the numerical stability of our cost and activation functions, preventing saturation of gradients
- Why cross-entropy trumps mean-squared error loss
- The small para on Absolute Value Rectification (AVR) particularly piqued my interest. AVR helps with object detection in images even when the illumination polarity is reversed. Could an activation function be made that is rotation-invariant? Something to research, I guess
- In one of our previous study weeks, I asked how supplanting ReLU with Softplus units would affect my toy networks, and in this chapter Goodfellow gave the answer. Turns out, like many things in Deep Learning, the answer is empirical rather than theoretical, and in experiments ReLU seems to perform better.

(*From my basic research on this topic, I have found some theoretical justification: the derivative of the Softplus unit is $\frac{1}{1+e^{-x}}$, which is the Sigmoid activation. Its range is $(0, 1)$, so the gradient passing through a Softplus unit is always attenuated. The ReLU unit's derivative, by contrast, is exactly $1$ in the positive linear regime and $0$ in the negative regime, so an active ReLU unit passes the gradient through completely unchanged. In conclusion, a greater magnitude of gradient can backprop through a ReLU unit, helping the network learn more in a shorter period compared to Softplus units.*)

- The discussion on deep vs. wider networks also helped clear up some particular architectural choices.
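A rough numerical illustration of that gradient argument (my own sketch, not from the book):

```python
import numpy as np

# Softplus'(x) is the sigmoid, strictly between 0 and 1, so it always
# shrinks the backpropagated gradient; an active ReLU has derivative
# exactly 1 and passes the gradient through untouched.
def relu_grad(x):
    return (x > 0).astype(float)

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))  # the sigmoid function

x = np.linspace(-5.0, 5.0, 11)
```

Evaluating both on a grid makes the attenuation obvious: `softplus_grad` never reaches 1, even for large positive inputs.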

On to the next one.

Sure thing!

There are two types of dimensionality reduction algorithms: linear and non-linear.

Linear dimensionality reduction algorithms, like PCA, project high-dimensional data to a lower dimension using orthogonal projections. *(Whipped the image below up in MS Paint, sorry for the crappy quality.)*

The drawback of linear dimensionality reduction algorithms is that they can only help you remove features that are linearly correlated. Thus, they work best when your data lies along a straight (linear) line (as the red dots above do).

Non-linear dimensionality reduction techniques take this one step further, generalizing the idea behind PCA beyond linear subspaces. Simply put, these algorithms can find linear and non-linear correlations in data, helping you find a much better representation in a lower-dimensional subspace.

For example, in the image below, where the data is placed in an *'S'* shape, PCA (blue line) simply projects the data onto a 1-D straight line, clearly a poor representation of the original data. On the other hand, SOM (Self-Organizing Map), a manifold learning technique, finds the true representation in 1-D.
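To make the linear half concrete, here is a tiny numpy sketch of 1-D PCA (my own toy code, standing in for the missing images):

```python
import numpy as np

# Minimal 1-D PCA: project the data onto its top principal component,
# i.e. an orthogonal projection onto the best-fitting straight line.
def pca_1d(X):
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    direction = eigvecs[:, -1]              # top principal component
    return Xc @ direction                   # 1-D coordinates along the line

# Points lying almost exactly on a straight line: PCA represents them well.
t = np.linspace(0.0, 1.0, 50)
X = np.column_stack([t, 2 * t + 0.01 * np.sin(20 * t)])
coords = pca_1d(X)
```

On the S-shaped data above, this same projection would collapse distant points on the curve onto nearby 1-D coordinates, which is exactly the failure the nonlinear methods fix.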

I hope this clears it up. Cheers

I didn't quite get your second question.

*"Did anyone find any intuitive explanation for Relu activation function?"*

In terms of what? How to apply it? Are you looking for a theoretical justification?

Hey James92, yeah this derivation is a bit hand-wavy. I'll derive it in more detail.

Update: replaced the LaTeX code with a pic. Vector $w$ is $w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$, not $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$; the calculation part is correct though. LaTeX rendering here is a bit wonky.

Study Group for Goodfellow's "Deep Learning" (Pages 168 - 227) (Week 9)

LOL. No one showed up to the last study week. Anyways, I'ma continue on. We're finally getting to the meat and bones of the subject.

From now on we'll be finishing entire chapters each week.

- Read pages 168 - 227 of the Deep Learning book. Here, nuances in the architectural design of simple feed-forward NNs are discussed.
- Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 168 - 227.
- Use your extra time to help answer the questions others leave; this may go a long way toward improving your understanding of many concepts, too.

As always, if you're feeling lost, mention it in the comments and ask for help!

Enjoy, see you next week.

End of the week.

The coolest concept of this week has to be Manifold Learning. Just the concept of a learning algorithm that can sift through data in a multi-dimensional space to figure out a low-dimensional representation is very intriguing.

Since Manifold Learning is presented as a dimensionality reduction algorithm in the same vein as PCA, but not hamstrung to linear subspaces like PCA, I can personally see how useful it could have been in my past projects.

Study Group for Goodfellow's "Deep Learning" (Pages 140 - 165) (Week 8)

Hey Fam, I was out for a while, sad to see no one picked up the slack. We had a good thing going on. Anyways, I have moved on quite a bit since the last "study week", but a dedicated study group, however small, did wonders for not only motivating me but also clearing the fog when trying to learn new concepts or explain them.

For the sake of continuity, this study week will start from where we left off.

- Read pages 140 - 165 of the Deep Learning book. Here we'll be introduced to basic Supervised and Unsupervised Learning algos and, most importantly, the workhorse of modern deep learning advances, the Stochastic Gradient Descent algo. Make sure to go over it thoroughly.
- Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 140 - 165.
- Go over Andrew Trask's brilliant A Neural Network in 11 lines of Python (Part 1) and A Neural Network in 13 lines of Python (Part 2 - Gradient Descent) posts to solidify your concepts for this week and help prepare you for the next.
- Use your extra time to help answer the questions others leave; this may go a long way toward improving your understanding of many concepts, too.

As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy, see you next week.

replied to
Merry Christmas

Merry Christmas to you too, Jack. Interviews seem like a good idea for increasing traffic to the website. Eagerly waiting for the Ian Goodfellow interview; that'll be a big one for the site.

As for the bugs, there is a lingering LaTeX rendering issue.

On a side note, what do you plan on doing with the current points system?

Hey fam, if anyone is finding it difficult to understand the Probability and Bayesian statistics parts of this week's reading, be sure to check out the following resources:

- A Short Introduction to Entropy, Cross-Entropy and KL-Divergence
- How Bayes Theorem works
- Where did the least-square come from? (Detailed derivation of the mean squared error cost function using MLE, similar to the one in the book)

If you know any resources that helped you out this week, please share below.

This excerpt from the wiki on Estimation of Covariance Matrices may help you understand this better:

> Statistical analyses of multivariate data often involve exploratory studies of the way in which the variables change in relation to one another, and this may be followed up by explicit statistical models involving the covariance matrix of the variables. Thus the estimation of covariance matrices directly from observational data plays two roles:

- to provide initial estimates that can be used to study the inter-relationships;
- to provide sample estimates that can be used for model checking.

You can read more here

Hey hsm, sorry for the late reply, been a bit too busy this past week.

Anyway, what the author is trying to convey in the 'entire' passage is that for every machine learning task that maps from x (input) ---> y (label/output), there is a theoretical limit/ceiling to its accuracy, called the Bayes optimal error (or simply Bayes error).

Any ML task can be seen as follows (the figure referenced in the book passage looks at the error instead of the accuracy):

Initially, with more and more training data, the accuracy of your ML system increases (generalization error decreases) until it reaches human-level performance, after which significant gains are much harder to achieve, regardless of how much more data you throw at the problem. Note that human-level performance is, in many AI tasks such as computer vision and speech recognition, pretty close to the theoretical Bayes optimal error.

Hope this clears it up.

Study Group for Goodfellow's "Deep Learning" (Pages 110 - 140) (Week 7)

Hey Fam, here are your tasks for the week commencing 12/12/2018:

- Read pages 110 - 140 of the Deep Learning book. Here we'll be introduced to one of the most fundamental concepts in ML: Maximum Likelihood Estimation. A lot of the proofs and justifications of many algos are derived from it. Make sure to go over it thoroughly.
- Leave at least one comment below, e.g. a question about something you don't understand or a link to a brilliant resource. It must somehow relate to pages 110 - 140.
- No bonus activity for this week either.
- *Use your extra time to help answer the questions others leave*; this may go a long way to solidify your understanding of many concepts, too.

As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy, see you next week.

Yeah, this can look a bit weird, as it's hard to read due to the mathematical syntax.

If you can expand this out and then read it out, it makes a lot more sense.

Here **x** (our unsupervised problem) can be written as a set of supervised problems: $\mathbf{x} = \{ x_{1}, x_{2}, x_{3}, x_{4} \}$

So, expanding the probability product rule we get:

$$
P(\mathbf{x}) = P(x_{1}, x_{2}, x_{3}, x_{4}) = \prod_{i=1}^{4} p(x_{i} \mid x_{1}, \ldots, x_{i-1}) = p(x_{1})\; p(x_{2} \mid x_{1})\; p(x_{3} \mid x_{1}, x_{2})\; p(x_{4} \mid x_{1}, x_{2}, x_{3})
$$

Now reading the expanded probability from left to right:

- Probability of $x_{1}$,
- Probability of $x_{2}$ given $x_{1}$,
- Probability of $x_{3}$ given $x_{1}$ and $x_{2}$,
- Probability of $x_{4}$ given $x_{1}$, $x_{2}$, and $x_{3}$.
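The chain-rule expansion above is easy to verify numerically on a tiny example (a made-up 2×2 joint distribution over two binary variables, nothing from the book):

```python
import numpy as np

# Numerical check of the probability chain rule:
# P(x1, x2) = p(x1) * p(x2 | x1).
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])           # joint[i, j] = P(x1=i, x2=j)

p_x1 = joint.sum(axis=1)                 # marginal P(x1)
p_x2_given_x1 = joint / p_x1[:, None]    # conditional P(x2 | x1)

# Multiplying marginal by conditional recovers the joint exactly.
reconstructed = p_x1[:, None] * p_x2_given_x1
```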

Putting this into a more concrete example, we can look at this as building up a cat detector for a picture out of a chain of simpler detections, without training one monolithic cat detector.

So, expanding the probability product rule we get:

$$
\begin{align}
P(\mathbf{cat}) & = P(animal\;detected,\; paws\;detected,\; tail\;detected,\; whiskers\;detected,\; \ldots) \\
& = p(animal\;detected)\; p(paws\;detected \mid animal\;detected)\; p(tail\;detected \mid animal\;detected,\; paws\;detected)\; \\
& \quad\; p(whiskers\;detected \mid animal\;detected,\; paws\;detected,\; tail\;detected) \ldots
\end{align}
$$

Using the probability chain rule, we have turned the unsupervised problem of cat detection into a series of supervised problems that lead to a cat detector.

Since the upcoming topics delve a bit into Maximum Likelihood Estimation and Maximum A Posteriori Estimation, if anyone has some resources for understanding these topics, please share.

If you have an unlabelled dataset of cats and dogs and want a model that can classify both of them, what do you do? Simple: label them and then train a model. But if your unlabelled dataset contains millions of examples, this is where semi-supervised learning comes in handy.

Initially, you label a small number of examples (say 100) and then test the accuracy of your model on the rest of the dataset. Using a clever technique called **Active Learning**, you then guide your labelling process to label a few of the examples the model got wrong, and then a few more in the next iteration, until a satisfactory accuracy is reached. This process significantly reduces the quantity of labelled data required.
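To make the loop concrete, here's a toy numpy sketch of an active-learning loop (an entirely hypothetical setup: a 1-D pool, a crude threshold "model", and an oracle we pay to query, just to show the query-the-least-confident-point idea):

```python
import numpy as np

pool = np.linspace(-1.0, 1.0, 501)               # unlabelled 1-D examples
true_label = lambda x: int(x > 0.13)             # oracle we pay to query

# Start with just a handful of labels instead of labelling everything.
labelled_x = [float(x) for x in pool[::125]]     # 5 seed points
labelled_y = [true_label(x) for x in labelled_x]

for _ in range(10):
    xs, ys = np.array(labelled_x), np.array(labelled_y)
    # Crude "model": a threshold halfway between the two class means.
    threshold = (xs[ys == 0].mean() + xs[ys == 1].mean()) / 2
    # Query the pool point the model is least confident about,
    # i.e. the one closest to its current decision boundary.
    idx = int(np.argmin(np.abs(pool - threshold)))
    labelled_x.append(float(pool[idx]))
    labelled_y.append(true_label(pool[idx]))
```

After a few queries the threshold homes in on the oracle's true boundary while only a handful of labels have been purchased.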

Before I begin the answer, one important thing to note is that the *gradient is always perpendicular to the contours of the function.*

So, when the condition number is high, the contours of the objective function are very elliptical (skewed circles), as shown below. The magnitude of the gradient in one direction is much larger than in the other.

Here the blue arrows show the direction of the gradient at arbitrary points, perpendicular to the contours of the function. As you can see, they always point away from the minimum (red point). This is why gradient descent (orange line) keeps bouncing up and down in a to-and-fro motion, slowing the rate of convergence: the gradient never points directly at the minimum.

To remedy this, we use normalization techniques on the data when using gradient descent (or any first-order optimization method) and Batch Normalization layers within the neural network, so that everything is on the same scale. In a function where all the contours are perfectly circular, i.e. the condition number is very small, the gradient at any arbitrary point points directly toward the minimum, as shown below.

Also note that Newton's method jumps directly to the minimum in either case (elliptical or circular) because, intuitively, Newton's method changes the coordinate system so that everything is perfectly circular. But Newton's method has its own drawbacks: it is susceptible to saddle points, and the Hessian is expensive to compute and store for large problems.
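The zigzag is easy to reproduce numerically (my own toy example: a quadratic whose Hessian eigenvalues differ by 50x, i.e. condition number 50):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 * x^T A x with elliptical contours.
# Along the steep axis the iterate overshoots and flips sign each step
# (the zigzag); along the shallow axis progress is painfully slow.
A = np.diag([1.0, 50.0])
grad = lambda x: A @ x

x = np.array([10.0, 1.0])
lr = 1.9 / 50.0                  # must stay below 2 / largest eigenvalue
path = [x.copy()]
for _ in range(100):
    x = x - lr * grad(x)
    path.append(x.copy())
```

The second coordinate multiplies by (1 - lr·50) = -0.9 each step, so it oscillates in sign while shrinking, which is exactly the bouncing orange line in the figure.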

*P.S. sorry for the crappy images, made 'em in Paint. Hopefully you get the point.*

Really enjoyed this chapter, it was a nice refresher on Lagrange multipliers and introduced me to its generalized form, KKT.

The numerical stability part, early on, piqued my interest enough to actually sit down and create a numerically stable version of the Softmax activation function from scratch. For anyone else interested, check out these two blogs:

http://saitcelebi.com/tut/output/part2.html

https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
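The trick both posts build on boils down to a few lines (my own sketch): subtracting the max before exponentiating leaves the result mathematically unchanged but prevents overflow.

```python
import numpy as np

# Numerically stable softmax: a naive np.exp(1000.0) overflows to inf,
# but shifting by max(x) makes the largest exponent exp(0) = 1.
def stable_softmax(x):
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
```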

Oh yeah, that's even better. I completely forgot about that.

Exactly what I am planning to do. Since Softplus just looks like a smoothed-out version of ReLU, putting one in place of the other seems like a simple job.

Two simple networks of the form:

1- (Linear->Softplus)->(Linear->Softplus)->(Linear->Softplus)->Linear->Sigmoid(output)

2- (Linear->ReLU)->(Linear->ReLU)->(Linear->ReLU)->Linear->Sigmoid(output)

and a dataset of, suppose, cat/non-cat will hopefully be sufficient to run this experiment.
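For reference, here's a minimal numpy sketch of the two activations being swapped (my own quick version, not from the book):

```python
import numpy as np

# Softplus is a smooth approximation of ReLU; the two agree for large |x|.
def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # log(1 + e^x), written via logaddexp for numerical stability
    return np.logaddexp(0.0, x)
```

Dropping one in for the other in the `(Linear->activation)` blocks above is a one-line change, which is what makes this such a cheap experiment.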

Enjoyed a bit of a deep dive into probability and statistics, as it cleared up the question of where the cross-entropy function emerged from and what a logit is (turns out it's just the inverse of the logistic function).

A weird itch I've got after reading this chapter is to check how my toy networks will behave if I supplant the ReLU activations with Softplus activations. Maybe I'll try it out this weekend. If someone has any empirical knowledge on this, I would love to hear it.

See you next week fam.

A bit confused about this part.

The Dirac delta function ($\delta$) will be **infinitely high** when $x = x^{(i)}$.

If it is infinitely high at $x = x^{(i)}$, then how come multiplying by $\frac{1}{m}$ puts a probability mass of $\frac{1}{m}$ on each of the $m$ points?

I've recently become a huge fan of 3Blue1Brown

A cool thing I learned this week is that the *determinant of a matrix is the product of its eigenvalues*.

I wonder why this info eluded me throughout countless math classes in high school and college; could have used it.
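It's a fun one to sanity-check numerically (toy matrix of my own choosing):

```python
import numpy as np

# det(A) equals the product of A's eigenvalues.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigenvalues = np.linalg.eigvals(A)
det_from_eigs = np.prod(eigenvalues)
```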

*edit: added a pic instead of the LaTeX code, the matrices were being squashed into a vector*

Note that if $AA^{T}$ and $A^{T}A$ are valid operations, then $Tr(A^{T}A) = Tr(AA^{T})$.

Check out equation 2.51 under the Trace operator section.
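Quick numerical check of that identity (random non-square matrix, my own example):

```python
import numpy as np

# Tr(A^T A) = Tr(A A^T) holds even when A is not square,
# since both equal the sum of squares of A's entries.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
t1 = np.trace(A.T @ A)
t2 = np.trace(A @ A.T)
```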

Yep, you got that right. It is read "for all x and for all y..."

For those who are not sure what the implications of the $L_{1}$ (Manhattan distance), $L_{2}$, and $L_{\infty}$ norms will be later on with respect to the cost function, a good intuition to develop at this stage is to visualize the unit circles of all three.

Note that $\left \| x \right \|_{1} \geq \left \| x \right \|_{2} \geq \left \| x \right \|_{\infty }$
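That ordering is easy to confirm numerically (my own quick check on a random vector):

```python
import numpy as np

# ||x||_1 >= ||x||_2 >= ||x||_inf for any vector x.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
l1 = np.linalg.norm(x, 1)
l2 = np.linalg.norm(x, 2)
linf = np.linalg.norm(x, np.inf)
```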

**This short video provides further explanation on the subject**

Here's my favorite book on Linear Algebra; it single-handedly got me through my undergrad Linear Algebra course: Kolman's Elementary Linear Algebra with Applications.

https://drive.google.com/open?id=0B2mB_TmBtVqFMk9EVVBvcm1tRTA

Hey, great question. A funny story about this: after AlexNet's groundbreaking result in the 2012 ImageNet competition, each successive winning convolutional net kept consistently decreasing its error rate in the competition. Researchers early on believed that a trained human's accuracy on ImageNet could be assumed to be 100%, or very close to it, though no actual studies were carried out to prove this. Mind you, the ImageNet data has 1000 different classes, and as I have recently learnt, a big portion of them are various dog breeds (weird, I know, LOL).

Around 2014, Andrej Karpathy set out to measure how well a human performs compared to a trained neural network. Surprisingly, his study found that the human error rate was around 5.1%, compared to GoogLeNet's 6.8% (GoogLeNet has since surpassed human accuracy). This study earned Andrej Karpathy the witty moniker "The Reference Human" among AI researchers.

Check out his blog What I learned from competing against a ConvNet on ImageNet

This entire introduction section reminds me of Andrew Ng's interview with Geoffrey Hinton, where, in the later part, he talks about why the early researchers in AI were misguided. The early misconceptions were based on the reasoning that *"What comes in is a string of words, and what comes out is a string of words. And because of that, strings of words are the obvious way to represent things. So they thought what must be in between was a string of words, or something like a string of words."*

Reading the parts about Cyc, where the researchers tediously compiled a knowledge base for the inference engine and yet it failed, shows the pitfalls of relying on symbolic expressions to create intelligence. As Hinton puts it: *"I think what's in between is nothing like a string of words. I think the idea that thoughts must be in some kind of language is as silly as the idea that understanding the layout of a spatial scene must be in pixels, pixels come in."*

I've created a bunch of small networks over the past few months to test out certain hypotheses and experimental ideas. Once you get the hang of the math and the dimensions of the matrices and vectors, you can build small networks pretty quickly.

My advice would be to first create a single-layer network to mimic simple logic gates such as AND/OR, then a two-layer network for the slightly more complicated XOR gate.
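As a concrete starting point, here's a toy sketch of a single sigmoid neuron hand-wired to compute AND (weights picked by hand; learning them with gradient descent is the actual exercise):

```python
import numpy as np

# One sigmoid "neuron": output = sigmoid(x . w + b).
def neuron(x, w, b):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# All four inputs to a 2-input gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-chosen weights: the pre-activation is positive only when both inputs
# are 1, so thresholding at 0.5 reproduces the AND truth table.
w = np.array([10.0, 10.0])
b = -15.0
preds = (neuron(X, w, b) > 0.5).astype(int)
```

XOR needs the two-layer version because no single line separates its positive and negative cases, which is exactly why it's the classic next step.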

On a side note, I remember reading the Batch Normalization paper where the authors first tried out their idea on a simple 3-layer neural network.

replied to
Study Group for Goodfellow's "Deep Learning"

Exactly what I was looking for! I am already on the 5th chapter and was looking for people to discuss the different topics with and help me better understand some of them. Count me in.