Black Swans

hsm

History

I'm in.

This is from page 136:

I have a question about the last sentence in that paragraph:

How does providing a covariance matrix show how likely all the different values of $\boldsymbol{w}$ are? I'm confused because, as far as I know, there is no way to interpret a covariance directly as a probability.
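My current guess (an assumption on my part, not something stated in the book at this point): a covariance matrix on its own is not a probability, but together with a mean it pins down a complete Gaussian density over $\boldsymbol{w}$, and that density does assign a likelihood to every value of $\boldsymbol{w}$:

```latex
% A multivariate Gaussian over the weight vector w is fully determined
% by its mean mu and covariance Sigma, so specifying the covariance
% (with the mean) fixes how probable each value of w is.
p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}; \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sqrt{\frac{1}{(2\pi)^{n} \det \boldsymbol{\Sigma}}}
    \exp\!\left(-\tfrac{1}{2}\,(\boldsymbol{w}-\boldsymbol{\mu})^{T}
    \boldsymbol{\Sigma}^{-1}(\boldsymbol{w}-\boldsymbol{\mu})\right)
```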

On page 114, it is mentioned that:

Training and generalization error vary as the size of the training set varies. Expected generalization error can never increase as the number of training examples increases.

Why can the expected generalization error never increase as the number of training examples increases?

How does semi-supervised learning work? According to the book, in a semi-supervised learning setting, some examples include a supervision target but others do not. Since the loss function will exclude the examples without the target, what is the point of including them in the training set?
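My own guess at the answer (an assumption, not the book's formulation): the unlabeled examples enter through an additional unsupervised term, so the total loss does not simply skip them. A minimal sketch, where `reconstruct` and the `0.1` weight are hypothetical names I made up:

```python
import numpy as np

def semi_supervised_loss(predict, reconstruct, xs, ys, weight=0.1):
    # Supervised squared error is computed only over labeled examples...
    sup = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys) if y is not None)
    # ...while an unsupervised reconstruction term uses every example,
    # labeled or not, so the unlabeled data still shapes the model.
    unsup = sum(float(((reconstruct(x) - x) ** 2).sum()) for x in xs)
    return sup + weight * unsup

# Toy usage: a sum "model" and an identity "reconstruction" on two
# examples, the second of which is unlabeled (y is None).
xs = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
ys = [3.0, None]
loss = semi_supervised_loss(lambda x: x.sum(), lambda x: x, xs, ys)
```

With a perfect reconstruction the unsupervised term vanishes; a worse reconstruction raises the loss even though the second example has no target.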

What are the $n$ supervised learning problems that we can derive from equation 5.1 that will allow us to solve the unsupervised problem of modelling $p(\boldsymbol{x})$?
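For reference, my reading of equation 5.1 (which may be wrong) is that it is the chain rule of probability, and each of the $n$ factors is one supervised problem, namely predicting $x_i$ from the preceding components:

```latex
% Chain-rule decomposition: each conditional is a supervised problem
% of predicting x_i given x_1, ..., x_{i-1}.
p(\boldsymbol{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})
```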

Thank you, Rafayk, for your very lucid explanation!

What is the interpretation of equation 4.15?

Does the order of the min and max terms matter, i.e. is it saying we need to find an $\alpha$ that maximizes $L(x, \lambda, \alpha)$ first, followed by $\lambda$, and then finally find an $x$ that minimizes it?
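My current reading of the ordering (my own interpretation, happy to be corrected): for each fixed $x$, the inner maximizations over $\lambda$ and $\alpha$ are performed first, and the outer minimization over $x$ is applied to the resulting function of $x$:

```latex
% Generalized Lagrangian min-max: the inner maxima are functions of x,
% and the outer min is taken over those maxima.
\min_{x} \; \max_{\lambda} \; \max_{\alpha,\, \alpha \geq 0} \; L(x, \lambda, \alpha)
```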

I don't understand the point that gradient descent is unaware of changes in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.

I thought gradient descent always picks the direction opposite to the gradient, i.e. the direction in which the function decreases the most. So how does having a poor condition number cause gradient descent to perform poorly?
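To build intuition for myself I tried a toy experiment (my own construction, not from the book): a quadratic whose Hessian has condition number 100. The step size has to be small enough for the steep direction, which leaves the flat direction crawling:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T H x with eigenvalues 1 and 100
# (condition number 100).
H = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])
lr = 1.0 / 100.0  # the step size is limited by the largest eigenvalue

for _ in range(100):
    grad = H @ x       # gradient of the quadratic is Hx
    x = x - lr * grad  # steepest-descent step

# The steep coordinate (eigenvalue 100) converges immediately, but the
# flat coordinate (eigenvalue 1) only shrinks by a factor 0.99 per step,
# so after 100 steps it has barely moved.
```

Even though each step is in the locally steepest direction, the single shared step size cannot serve both curvatures at once, which (I think) is the poor-conditioning problem the book describes.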

I'm not sure how (4.10) simplifies to $\frac{1}{\lambda_{max}}$.

My understanding is that $g^THg$ simplifies to $\lambda_{max}$ because the expression is the directional second derivative in the direction of $g$ and we are told $g$ is the eigenvector corresponding to the maximal eigenvalue of $H$ (page 86).

But I don't know how $g^Tg$ simplifies to 1. What properties are being used to make this simplification?
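Thinking about it more, one possible resolution (my own derivation, so it may be wrong): $g^Tg$ does not need to equal 1, because it cancels. If $g$ is an eigenvector of $H$ with eigenvalue $\lambda_{max}$, then $Hg = \lambda_{max} g$ and

```latex
% The g^T g factors cancel regardless of the norm of g:
\varepsilon^{*} = \frac{g^{T}g}{g^{T}Hg}
  = \frac{g^{T}g}{\lambda_{max}\, g^{T}g}
  = \frac{1}{\lambda_{max}}
```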

This is from page 64:

I don't understand how this formula works because $\boldsymbol{x}-\boldsymbol{x}^{(i)}$ to me means a vector minus a scalar, which isn't a defined operation.

Is there a typo in this equation?
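Edit: rereading the book's notation conventions, I now believe $\boldsymbol{x}^{(i)}$ denotes the $i$-th training example, which is itself a vector, not a scalar component, so the subtraction is vector minus vector. Under that reading the formula would be:

```latex
% Empirical distribution: a sum of Dirac deltas, each centered on one
% training example x^(i) (a vector of the same dimension as x).
\hat{p}(\boldsymbol{x}) = \frac{1}{m} \sum_{i=1}^{m}
  \delta\!\left(\boldsymbol{x} - \boldsymbol{x}^{(i)}\right)
```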

This is from page 70:

I don't understand how to get to equation 3.45 from 3.44 and to 3.46 from 3.45.

Can someone please show the intermediate steps?

I feel lost when reading section 3.9.2, Multinoulli Distribution, on page 60.

The questions I have are:

1. What does $\boldsymbol{p}\in[0,1]^{k-1}$ mean?

From context, I think it means the vector p is a k-1 dimensional vector where each element is a real number between 0 and 1 (inclusive). But I don't think the meaning of this notation has been described in the book.

2. What does the $\boldsymbol{1}^{T}$ in $\boldsymbol{1}^{T}\boldsymbol{p}$ mean?

Is $\boldsymbol{1}^{T}$ a k-1 dimensional vector where each element is the number $1$?
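If my reading is right (an assumption on my part), $\boldsymbol{1}$ is the all-ones vector, so $\boldsymbol{1}^{T}\boldsymbol{p}$ is just the sum of the entries of $\boldsymbol{p}$. A quick numerical check with made-up probabilities:

```python
import numpy as np

k = 4                                # number of states
p = np.array([0.2, 0.3, 0.1])        # k-1 = 3 explicit probabilities
ones = np.ones(k - 1)                # the all-ones vector "1"
inner = ones @ p                     # 1^T p: the sum of the entries of p
# The probability of the final, k-th state is then 1 - 1^T p.
last = 1.0 - inner
```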

3. I don't understand how multinoulli distributions are used to refer to distributions over categories of objects and why we don't usually assume that state 1 has numerical value 1, and so on.

What are some examples that illustrate the concept that the above sentence is trying to convey?

4. I don't understand why we don't usually need to compute the expectation or variance of multinoulli-distributed variables.

I don't see how this conclusion follows from the reason given.

I'm not sure how to read equation 3.7 on page 58:

Only $x$ has the universal quantifier. Is it implied that $y$ also has the universal quantifier which means we should read it as 'For all $x$ and for all $y$...'?

This is from page 49:

Does anyone know any online resources that walk through this proof?

I don't understand equation 2.74 on page 49:

This is equation 2.49 (page 44):

To me, it looks like equation 2.74 is a wrong application of 2.49 because it says $Tr(A^T A)$ instead of $Tr(AA^T)$ where $A = X-Xdd^T$.
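While waiting for an answer, I convinced myself numerically that the two forms agree, since $Tr(AB) = Tr(BA)$ should imply $Tr(A^TA) = Tr(AA^T)$ (my own check, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # an arbitrary rectangular matrix
t1 = np.trace(A.T @ A)            # Tr(A^T A): trace of a 5x5 product
t2 = np.trace(A @ A.T)            # Tr(A A^T): trace of a 3x3 product
# Both equal the sum of squared entries of A (the squared Frobenius
# norm), which is why the two ways of writing it are interchangeable.
frob = float((A ** 2).sum())
```

So even if the book wrote the factors in the other order, the value would be the same; my question is really whether the derivation intends one particular form.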

The following is from page 39:

I don't understand the part where it says "If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other" because if I have the following vectors:

• a = [1, 0]
• b = [1, 1]

then both vector a and vector b have nonzero norm, but they are not at a 90 degree angle to each other. The vectors a and b are at a 45 degree angle to each other.

Did I misunderstand the text?
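After rereading, I think the claim only applies when the dot product is zero, which my example doesn't satisfy (my understanding, happy to be corrected). A quick check, adding a hypothetical pair that is actually orthogonal:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])   # a made-up vector orthogonal to a

# The 90-degree statement is conditioned on the dot product being zero.
ab = float(a @ b)          # nonzero, so a and b need not be at 90 degrees
ac = float(a @ c)          # zero, and both norms are nonzero -> 90 degrees
```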

Who recommended the Radeon Pro 580 for deep learning work?

interested

Here's a 2017 version of the Computer Vision intro from Stanford.

The book was published in 2016. It is now 2018.

Are people still just as excited about deep learning, or are there alternative machine learning algorithms that can achieve better results?

For example, Microsoft recently open-sourced Infer.NET, and TensorFlow has TensorFlow Probability, which to me suggests that graphical models are starting to gain traction in industry... but I have not come across any state-of-the-art results achieved with these kinds of models yet.

Count me in too!