Black Swans

mossbanay

33 points

Interested in reinforcement learning, Bayesian approaches and quantitative finance.
History

As you say, $g$ here is a gradient, but I don't think it is necessarily unit-norm (i.e. that $g^Tg=1$). Rather, I think there is a constant factor that cancels out in the numerator and the denominator. However, I'm not sure of this myself either.

Study Group for Goodfellow's "Deep Learning" (Pages 98 - 110) (Week 6)

• Read pages 98 - 110 of Deep Learning. Here we'll be beginning to introduce the broader classes of machine learning problems and look closer at linear regression.
• Leave at least one comment below, e.g., a question about something you don't understand or a link to a resource that you found helpful relating to this week's content.
• There is no bonus activity for this week. Use any extra time to go back over last week's content since it's been the densest section yet and there's still some room for discussion.

As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy 🍻
MB

Why don't you try using the TensorFlow tutorial from Week 2 as a baseline and go from there? Make sure to share your results in a new post when you finish! You can also use Google Colaboratory for this.

I think this would be interesting too. You could try creating two networks with different activation functions and look at how well they perform after a fixed number of epochs. There are some problems with logistic activation functions that slow down learning in some cases (the gradient for large positive and negative inputs approaches 0, which slows the rate at which weights are updated), but it's always good to bridge the gap between theory and practice.
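To make the saturation point concrete, here is a minimal sketch that evaluates the derivative of the logistic function at a few inputs. ReLU is included only as a contrasting example of my own choosing, and the helper names are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU (assumed comparison): 1 for positive inputs, else 0."""
    return 1.0 if z > 0 else 0.0

# The logistic gradient peaks at z = 0 and shrinks towards 0 in both tails.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:+6.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```

A unit whose pre-activation sits far from zero passes back almost no gradient through the logistic function, which is the slowdown described above.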

Study Group for Goodfellow's "Deep Learning" (Pages 80 - 97) (Week 5)

Hey all. Since it seems like there have been a few people absent in the last week, we'll allow an additional week for people to finish the reading. Here are your tasks for the week commencing 19/11/2018:

• Read pages 80 - 97 of Deep Learning. This finishes the chapter on numerical computation 🎉
• Leave at least one comment below, e.g., a question about something you don't understand or a link to a resource that you found helpful relating to this week's content.
• For the bonus activity this week, watch this lecture by Andrew Ng to try to get a better understanding of gradient descent, which you'll find is a crucial concept for neural networks.

As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy 🍻
MB

1. That's the correct interpretation.
2. Also correct
3. Have a look into Markov chains, they might help you understand why the author describes this as a distribution over states.
4. See 3

See the notation section at the very start of the book that outlines things like $[0,1]$ and $\pmb{1}$.

Here $\pmb{x}$ and $\pmb{x}^{(i)}$ are vectors since the argument is generalising to multivariate distributions, so the operation is valid.

For anyone looking into more about the links between measure theory and probability theory I found this online text quite useful. It also goes into depth across other topics like common distributions and conditional probability. Highly recommended!

Using the Dirac delta function to define an empirical pdf was fascinating. I think this is the first time I've seen the function used, and it's a really neat and quite intuitive use case.
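For reference, the empirical distribution puts a point mass of $\frac{1}{m}$ on each of the $m$ training points $\pmb{x}^{(1)}, \dots, \pmb{x}^{(m)}$. I'm writing this from memory, so check the notation against the book:

$$\hat{p}(\pmb{x}) = \frac{1}{m} \sum_{i=1}^{m} \delta(\pmb{x} - \pmb{x}^{(i)})$$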

Finally the discussion of Shannon entropy and KL divergence was great. I'd be curious to learn more about what bits of entropy or nats represent in the context of machine learning. From my understanding, they can be used to look at the efficiency of learning algorithms (i.e. how many samples are required to achieve a benchmark) and most likely in measuring the uncertainty of estimates also. If anyone knows of a good information theory textbook I'd greatly appreciate it.
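On bits versus nats specifically, the two differ only in the base of the logarithm, so one converts to the other by dividing by $\ln 2$. Here is a minimal sketch with two made-up coin distributions (the function names and example values are my own, not from the book):

```python
import math

def entropy_nats(p):
    """Shannon entropy H(P) = -sum p log p, using the natural log (nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_nats(p, q):
    """KL divergence D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def to_bits(nats):
    """Convert a quantity measured in nats to bits."""
    return nats / math.log(2)

fair = [0.5, 0.5]     # hypothetical fair coin
biased = [0.9, 0.1]   # hypothetical biased coin

print("H(fair)   nats:", entropy_nats(fair), " bits:", to_bits(entropy_nats(fair)))
print("H(biased) nats:", entropy_nats(biased), " bits:", to_bits(entropy_nats(biased)))
print("KL(biased || fair) nats:", kl_nats(biased, fair))
```

A fair coin carries exactly one bit of entropy, and the biased coin carries less because its outcome is easier to predict.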

To all those doing exams this week, good luck! Hopefully you'll have time to join back next week.

Study Group for Goodfellow's "Deep Learning" (Pages 60 - 79) (Week 4)

• Read pages 60 - 79 of Deep Learning. This finishes the chapter on probability and information theory 🎉
• Leave at least one comment below, e.g., a question about something you don't understand or a link to a resource that you found helpful relating to this week's content.
• For the bonus activity this week, have a go at working through this PyMC3 tutorial. It'll give you a bit more insight into some of the Bayesian approaches and graphical models that are touched on by Goodfellow.

As always, if you're feeling lost mention it in the comments and ask for help!

Enjoy 🍻

The best probability course I've seen online is most likely Statistics 110 from Harvard taught by Joe Blitzstein (link here). It's likely that the first few lectures will be more than enough to bring you up to speed on basic manipulation of random variables and common distributions that are later discussed in the book.

Does anyone have some good resources for matrix calculus?

Also, as a suggestion, perhaps we should read to the end of this chapter next week so we can discuss it fully in next week's post (instead of having the topic split up).

There's an identity (2.53) that shows that you can swap the order of matrices inside the trace operator.
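If you want to convince yourself numerically, here is a small sketch that checks $\text{Tr}(AB) = \text{Tr}(BA)$ for a non-square pair, where $AB$ is 2x2 and $BA$ is 3x3. The matrices and helper functions are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

A = [[1, 2, 3],
     [4, 5, 6]]          # 2x3
B = [[7, 8],
     [9, 10],
     [11, 12]]           # 3x2

print(trace(matmul(A, B)))  # trace of the 2x2 product
print(trace(matmul(B, A)))  # trace of the 3x3 product
```

The two products have different shapes, but their traces agree.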

Just worked through the tutorial.

I think, as someone in the Week 2 post mentioned, the hello world dataset has changed from MNIST to another 28x28 image dataset 😋. I wanted to try out Google Colaboratory so I didn't have to get all the dependencies installed for TensorFlow, and it worked like a charm.

The syntax for building layers seems to have improved a lot over the last few years. Instead of having to wrangle your data into the right shape, using tf.layers.Flatten is so much simpler. If anyone else was curious you can find a listing of all the standard layers in Keras here.

I definitely want to take the tutorial and play around with some other simple image datasets to see how well it works with Google Colaboratory.

You can also try the video lectures from his class on MIT OCW.

There's a link to MIT 18.06 above in my reply, highly recommended!

For when we get to it later, 3Blue1Brown also has a calculus course and a few short lectures on neural networks.

Something new I learnt in this section was about the idea of max norms. I've seen $\ell_1$ and $\ell_2$ crop up in many loss functions previously but the explanation of max norm really clarified what was meant by it. I'd be curious if there's any usage of other higher (but finite) order norms or if there's little benefit over $\ell_1$ and $\ell_2$.
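One way to see how the max norm relates to the others is to compute the general $\ell_p$ norm for increasing $p$ and watch it approach the largest absolute entry. This is a minimal sketch with a made-up vector and helper names of my own:

```python
def p_norm(x, p):
    """General l_p norm: (sum |x_i|^p)^(1/p), for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def max_norm(x):
    """Max norm: the largest absolute entry."""
    return max(abs(v) for v in x)

x = [3.0, -4.0, 1.0]  # hypothetical example vector

for p in (1, 2, 4, 8, 32):
    print(f"p={p:2d}  norm={p_norm(x, p):.6f}")
print(f"max   norm={max_norm(x):.6f}")
```

As $p$ grows the largest entry dominates the sum, so the higher-order norms sit between $\ell_2$ and the max norm.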

Since someone's already linked to the computational LA course, I'll share Gilbert Strang's Linear Algebra lectures from MIT 18.06 here. They cover everything from elementary vectors, Gram-Schmidt, PCA, SVD, applications with Markov chains and ODEs and much more. Strang is an excellent lecturer and has some of the clearest explanations I've heard for these topics. It's well worth watching if you want to go into more depth on this area of the content.

See you all next week 📒

### Alice

Let $x = \mathbb{E}[HT]$ be the expected number of flips to get HT, let $\mathbb{E}[HT|H]$ be the expected number of further flips to get HT given the last flip was H, and similarly let $\mathbb{E}[HT|T]$ be the expected number of further flips to get HT given the last flip was T.

Assuming a fair coin we have

$$x = \mathbb{E}[HT] = 0.5 \times (1 + \mathbb{E}[HT|H]) + 0.5 \times (1 + \mathbb{E}[HT|T])$$

If we flip a tail then this is the same as starting from scratch, so $\mathbb{E}[HT|T] = x$.

If we flip a head then we either flip a T and we are done or we flip a head again. Therefore $\mathbb{E}[HT|H] = 0.5 \times (1) + 0.5 \times (1 + \mathbb{E}[HT|H])$. Solving this we find $\mathbb{E}[HT|H] = 2$.

Plugging these results back into the original equation we get

$$x = \mathbb{E}[HT] = 0.5 \times (1 + 2) + 0.5 \times (1 + x)$$
$$x = 1.5 + 0.5 + 0.5 \times x$$
$$x = 4$$

Therefore on average Alice makes four tosses.
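As a sanity check on the algebra, here is a quick Monte Carlo sketch that flips a simulated fair coin until HT appears and averages the number of flips. The function names, seed and trial count are arbitrary choices of mine.

```python
import random

def flips_until_ht(rng):
    """Flip a fair coin until an H is immediately followed by a T; return the flip count."""
    count = 0
    prev = None
    while True:
        flip = "H" if rng.random() < 0.5 else "T"
        count += 1
        if prev == "H" and flip == "T":
            return count
        prev = flip

def estimate(trials=100_000, seed=0):
    """Average number of flips over many independent runs."""
    rng = random.Random(seed)
    return sum(flips_until_ht(rng) for _ in range(trials)) / trials

print(estimate())
```

The estimate should land close to the value of 4 derived above, with a small sampling error that shrinks as the number of trials grows.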

Probabilistic programming does seem to have picked up some steam in the last few years (PyMC3, Stan, Pyro). I hadn't heard of TensorFlow Probability before; I may look into it, thanks!

I think that neural networks scale better at the moment compared to some of those approaches, although they aren't mutually exclusive! (see this post by Thomas Wiecki)

As a few others have mentioned, I found it very interesting to get a feel for how the field as a whole has grown and developed over a surprisingly long history. I'd heard about the peak in interest in simple neural networks in the 90s but I'd never heard about the cybernetics movement.

Something else I found thought-provoking was the different ideas for defining the depth of a network. When I began reading that section I expected it to talk a little about the number of layers in a network and how that may correspond to the number of abstractions or representations the model learns, but instead it goes as far as asking what constitutes an operation (especially since modern instruction sets can parallelize some operations).

I'll share one of the papers that came out of DeepMind earlier this year on a possible extension to the set of functions that neural networks can compute. It's mentioned in this chapter that neural networks suffered early critiques for failing to learn XOR, and another simple task that they failed to learn was addition. You could teach models to add numbers in a given range, but when tested on numbers outside the range of the training data, neural nets failed miserably. Neural Arithmetic Logic Units are one approach to solving this.

Sounds cool, is there a Slack or similar we can set up?