Black Swans




i'm in.

This is from page 136:

enter image description here

I have a question about the last sentence in that paragraph:

How does providing a covariance matrix show how likely all the different values of $\mathbb{w}$ are? I'm confused because as far as I know, there is no way to interpret covariance as a probability.

On page 114, it is mentioned that:

Training and generalization error vary as the size of the training set varies.
Expected generalization error can never increase as the number of training examples increases.

Why can the expected generalization error never increase as the number of training examples increases?

How does semi-supervised learning work? According to the book, in a semi-supervised learning setting, some examples include a supervision target but others do not. Since the loss function will exclude the examples without the target, what is the point of including them in the training set?

enter image description here

What are the $n$ supervised learning problems that we can derive from equation 5.1 that will allow us to solve the unsupervised problem of modelling $p(\mathbb{x})$?
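
A sketch of the decomposition in question, assuming equation 5.1 is the chain rule of probability applied to a vector $\mathbf{x} \in \mathbb{R}^n$ (the image is not reproduced here, so the exact form is an assumption):

```latex
p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
% Factor i can be read as one supervised problem:
%   inputs: x_1, \dots, x_{i-1}      target: x_i
% For i = 1 the conditioning set is empty, so the "problem" is just estimating p(x_1).
```

Under that reading, the $n$ supervised problems would be "predict $x_i$ from $x_1, \dots, x_{i-1}$" for $i = 1, \dots, n$, and multiplying the $n$ learned conditionals gives a model of $p(\mathbf{x})$.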

Thank you Rafayk for your very lucid explanation!

enter image description here

What is the interpretation of equation 4.15?

Does the order of the min and max terms matter, i.e., is it saying we need to find an $\alpha$ that maximizes $L(x, \lambda, \alpha)$ first, followed by $\lambda$, and then finally find an $x$ that minimizes it?

enter image description here

I don't understand the point that gradient descent is unaware of the change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.

I thought gradient descent always picks the direction opposite to the one in which the function increases the most. So how does having a poor condition number cause gradient descent to perform poorly?
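
A minimal sketch that may make the effect concrete, assuming a toy quadratic $f(x) = \frac{1}{2}\sum_i \lambda_i x_i^2$ whose Hessian is diagonal with eigenvalues $\lambda_i$. The function names, eigenvalues, and learning rates are made up for illustration and are not from the book. The step size has to stay below $2/\lambda_{max}$ to avoid diverging along the steep direction, which leaves progress along the flat direction slow:

```python
def gradient_descent(eigenvalues, start, learning_rate, steps):
    """Minimize f(x) = 0.5 * sum(l_i * x_i**2) with plain gradient descent."""
    x = list(start)
    for _ in range(steps):
        # Gradient of the quadratic: component i is l_i * x_i.
        grad = [l * xi for l, xi in zip(eigenvalues, x)]
        x = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
    return x


def distance_to_minimum(x):
    """Euclidean distance to the minimizer, which is the origin here."""
    return sum(xi * xi for xi in x) ** 0.5


# Condition number 1: a large step size is stable in every direction.
well = gradient_descent([1.0, 1.0], [1.0, 1.0], learning_rate=0.9, steps=50)

# Condition number 100: the step size must stay below 2 / 100 to be stable,
# so the direction with eigenvalue 1 shrinks by only 1.5% per step.
poor = gradient_descent([1.0, 100.0], [1.0, 1.0], learning_rate=0.015, steps=50)

print("well-conditioned distance:", distance_to_minimum(well))
print("poorly conditioned distance:", distance_to_minimum(poor))
```

In this toy setup both runs take the steepest-descent direction at every step, yet after the same number of steps the poorly conditioned run is still far from the minimum, because a single step size cannot suit both the steep and the flat direction.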

enter image description here

I'm not sure how (4.10) simplifies to $\frac{1}{\lambda_{max}}$.

My understanding is that $g^THg$ simplifies to $\lambda_{max}$ because the expression is the directional second derivative in the direction of $g$ and we are told $g$ is the eigenvector corresponding to the maximal eigenvalue of $H$ (page 86).

But I don't know how $g^Tg$ simplifies to 1. What are the properties being used to make this simplification?
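
A small numeric check, assuming (4.10) is the step size $\epsilon^* = \frac{g^Tg}{g^THg}$. The matrix `H` and vector `g` below are made-up values, not from the book. The sketch suggests $g^Tg$ does not have to equal 1: if $Hg = \lambda_{max} g$, then $g^THg = \lambda_{max}\, g^Tg$ and the factor $g^Tg$ cancels in the ratio.

```python
def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def matvec(m, x):
    """Matrix-vector product m x for a matrix stored as a list of rows."""
    return [dot(row, x) for row in m]


# Hypothetical symmetric Hessian with eigenvalues 3 and 1.
H = [[2.0, 1.0],
     [1.0, 2.0]]
lambda_max = 3.0

# g points along the eigenvector [1, 1] for lambda_max, deliberately NOT unit length.
g = [4.0, 4.0]

g_dot_g = dot(g, g)            # g^T g
g_H_g = dot(g, matvec(H, g))   # g^T H g
step_size = g_dot_g / g_H_g

print("g^T g   =", g_dot_g)
print("g^T H g =", g_H_g)
print("ratio   =", step_size, "vs 1 / lambda_max =", 1.0 / lambda_max)
```

Here $g^Tg$ is 32 rather than 1, and the ratio still comes out as $1/\lambda_{max}$. Normalizing $g$ to unit length would give $g^Tg = 1$ and $g^THg = \lambda_{max}$, which is the special case described in the question.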

This is from page 64:

enter image description here

I don't understand how this formula works because $\boldsymbol{x}-\boldsymbol{x}^{(i)}$ to me means a vector minus a scalar, which isn't a defined operation.

Is there a typo in this equation?

This is from page 70:

enter image description here

I don't understand how to get to equation 3.45 from 3.44 and to 3.46 from 3.45.

Can someone please show the intermediate steps?

I feel lost when reading section 3.9.2 Multinoulli Distribution on page 60.

enter image description here

The questions I have are:

  1. What does $\boldsymbol{p}\in[0,1]^{k-1}$ mean?

    From context, I think it means the vector p is a k-1 dimensional vector where each element is a real number between 0 and 1 (inclusive). But I don't think the meaning of this notation has been described in the book.

  2. What does the $\boldsymbol{1}^{T}$ in $\boldsymbol{1}^{T}\boldsymbol{p}$ mean?

    Is $\boldsymbol{1}^{T}$ a k-1 dimensional vector where each element is the number $1$?

  3. I don't understand how multinoulli distributions are used to refer to distributions over categories of objects and why we don't usually assume that state 1 has numerical value 1, and so on.

    What are some examples that illustrate the concept that the above sentence is trying to convey?

  4. I don't understand why we don't usually need to compute the expectation or variance of multinoulli-distributed variables.

    I don't see how this conclusion follows from the reason given.

Thanks in advance for your help.
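
A minimal sketch of one reading of the notation, assuming $\boldsymbol{p}$ holds the probabilities of the first $k-1$ states and the last state gets whatever is left over, $1 - \boldsymbol{1}^{T}\boldsymbol{p}$. The category labels and probabilities are hypothetical:

```python
import random


def full_probabilities(p):
    """Expand a (k-1)-dimensional parameter vector p into all k state probabilities."""
    ones = [1.0] * len(p)
    # 1^T p is the inner product with an all-ones vector, i.e. the sum of the entries of p.
    ones_dot_p = sum(o * pi for o, pi in zip(ones, p))
    if not all(0.0 <= pi <= 1.0 for pi in p) or ones_dot_p > 1.0:
        raise ValueError("p must lie in [0, 1]^(k-1) with 1^T p <= 1")
    # The k-th state takes the remaining probability mass.
    return list(p) + [1.0 - ones_dot_p]


# Hypothetical categories: the states are labels, not numbers.
categories = ["cat", "dog", "bird"]
p = [0.5, 0.3]  # k - 1 = 2 parameters for k = 3 states

probs = full_probabilities(p)
sample = random.choices(categories, weights=probs, k=1)[0]

print("state probabilities:", dict(zip(categories, probs)))
print("one sample:", sample)
```

With labels like these, an "average" of cat, dog, and bird has no meaning, which may be what the remark about not needing the expectation or variance is getting at: numbering the states 1 to $k$ would be an arbitrary choice.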

I'm not sure how to read equation 3.7 on page 58:

enter image description here

Only $x$ has the universal quantifier. Is it implied that $y$ also has the universal quantifier which means we should read it as 'For all $x$ and for all $y$...'?

This is from page 49:

enter image description here

Does anyone know of any online resources that walk through this proof?

I don't understand equation 2.74 on page 49:

enter image description here

This is equation 2.49 (page 44):

enter image description here

To me, it looks like equation 2.74 is a wrong application of 2.49 because it says $Tr(A^T A)$ instead of $Tr(AA^T)$ where $A = X-Xdd^T$.

Please help me understand the steps to get to equation 2.74 from equation 2.73.
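
A quick numeric check, assuming equation 2.49 is the identity $||A||_F = \sqrt{Tr(AA^T)}$. The matrix `A` is a made-up example and the helper functions are written out only to keep the sketch free of dependencies. It suggests that $Tr(AA^T)$ and $Tr(A^TA)$ are the same number, namely the sum of the squared entries of $A$, so either form could be used:

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product a b for matrices stored as lists of rows."""
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


# Hypothetical non-square matrix, so A A^T (2x2) and A^T A (3x3) differ in shape.
A = [[1, 2, 3],
     [4, 5, 6]]

sum_of_squares = sum(x * x for row in A for x in row)   # squared Frobenius norm
trace_a_at = trace(matmul(A, transpose(A)))             # Tr(A A^T)
trace_at_a = trace(matmul(transpose(A), A))             # Tr(A^T A)

print(sum_of_squares, trace_a_at, trace_at_a)
```

The two products have different shapes, but their diagonals add up to the same total, which is why swapping the order inside the trace leaves the squared Frobenius norm unchanged.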

The following is from page 39:

enter image description here

I don't understand the part where it says "If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other" because if I have the following vectors:

  • a = [1, 0]
  • b = [1, 1]

then both vector a and vector b have nonzero norm, but they are not at a 90 degree angle to each other. Vectors a and b are at a 45 degree angle to each other.

Did I misunderstand the text?
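
A small check that may help, assuming the quoted sentence sits inside a definition of orthogonality that requires $x^Ty = 0$ (the image is not reproduced here, so that premise is an assumption). The vector `c` is an extra made-up example:

```python
import math


def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def norm(x):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(dot(x, x))


def angle_degrees(x, y):
    """Angle between two nonzero vectors, from cos(theta) = x^T y / (|x| |y|)."""
    return math.degrees(math.acos(dot(x, y) / (norm(x) * norm(y))))


a = [1.0, 0.0]
b = [1.0, 1.0]
c = [0.0, 1.0]  # hypothetical third vector with a^T c = 0

print("a^T b =", dot(a, b), "angle:", angle_degrees(a, b))
print("a^T c =", dot(a, c), "angle:", angle_degrees(a, c))
```

Under that assumption, a and b are simply not orthogonal, since $a^Tb = 1 \neq 0$, so the 90 degree statement would not apply to them. It would apply to a and c, whose inner product is 0.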

Who recommended the Radeon Pro 580 for deep learning work?

Here's a 2017 version of the Computer Vision intro from Stanford.

The book was published in 2016. It is now 2018.

Are people still just as excited about deep learning, or are there alternative machine learning algorithms that can achieve better results?

For example, Microsoft recently open-sourced, and TensorFlow has TensorFlow Probability, which to me suggests that graphical models are starting to gain traction in industry... but I have not come across any state-of-the-art results achieved using these kinds of models yet.