roboticminstrel

History

You'd have to mention the exact section if you want a specific answer, but risk is probably referring to the expected loss (E[L]). You've got your loss function defined in a particular way, but obviously each random draw from some stochastic process could be all over the place. Say y is Gaussian distributed and you're using a squared-error loss, $(y - \hat y)^2$. In this case you minimize the expected loss by predicting that y will be the mean, since on average that gets you closest. But sometimes you might end up with a really unlikely value with a z-score of 3 or 4 or something, so your loss will be really high on that particular draw. It was just bad luck though. Over many draws, predicting the mean gives you a lower average loss than predicting anything else (in this particular case). So the 'risk' is convenient to use, since you're basically taking luck out of the equation and talking about the expected loss (the risk) instead.
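If it helps to see that, here's a minimal simulation sketch. The distribution parameters (mean 2.0, std 1.5), the number of draws, and the grid of candidate predictions are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target distribution: Gaussian with mean 2.0 and std 1.5.
mu, sigma = 2.0, 1.5
y = rng.normal(mu, sigma, size=200_000)

# Candidate constant predictions on a 0.1-spaced grid that includes the mean.
candidates = np.linspace(0.0, 4.0, 41)

# Empirical risk: the squared loss averaged over many draws, per candidate.
risk = np.array([np.mean((y - c) ** 2) for c in candidates])

best = candidates[np.argmin(risk)]
print(f"prediction with the lowest average loss: {best:.2f}")
print(f"sample mean of the draws: {y.mean():.2f}")

# A single unlucky draw can still give a big loss, even for the best guess.
worst = np.max((y - mu) ** 2)
print(f"worst single-draw loss when predicting the mean: {worst:.1f}")
```

The average loss bottoms out at the candidate closest to the sample mean, while the worst single draw is way off. That gap between one draw and the average is the 'luck' that the risk takes out.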

Hey man, not sure if you're still worried about it... but you're going to get ALL the time you need working through and deeply understanding that concept in 2.3. In the meantime though... you know in the normal Gaussian univariate equation, you have:

$$(2\pi\sigma^2)^{-1/2}e^{-x^2/(2\sigma^2)}$$

so that piece out front with the 2 pi sigma squared... what is that? It's just the normalization term. It makes sure the density integrates to 1 over the whole real line. You can look at it like... as sigma goes up, the spread gets 'fatter', so you need to squish the whole thing down a bit to keep the total area the same.

Well, what about in a multivariate case? If the covariance matrix is equal to the identity matrix, your distribution is basically spherically symmetrical around the mean. But what happens if you start squishing and warping things? You spread things out, so in the same way... you need to account for the extra volume by shrinking the normalization factor to keep the total equal to 1.

Think about your linear algebra... what's the determinant of a matrix? It's the product of all the eigenvalues. For a covariance matrix, each eigenvalue is the variance along one eigenvector axis, so the square root of the determinant is the factor by which the total volume gets stretched across all the eigenvector axes, and you need to shrink things by the reciprocal amount. That's the $|\Sigma|^{-1/2}$ out front, playing the same role $(\sigma^2)^{-1/2}$ plays in the univariate case. Covariance determinant, there you go. You can safely shelve this question until section 2.3 though, it won't be relevant in chapter 1. That's why Bishop didn't clearly explain it... it's not relevant yet. As a heads-up, there's a section in curve fitting revisited looking at a full Bayesian treatment with some pretty gnarly looking equations. Those are also not intended to be used or fully understood yet, so you can safely gloss over them too. It's more a sneak peek of what's coming than a rigorous treatment of a new concept.
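If you want to poke at the determinant idea numerically, here's a rough sketch. The 2x2 covariance matrix is made up, and the integral is just a crude Riemann sum on a grid, so treat it as a sanity check and not a proof.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix (symmetric, positive definite).
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

# The determinant equals the product of the eigenvalues.
eigvals = np.linalg.eigvalsh(cov)
det = np.linalg.det(cov)
print(np.isclose(det, np.prod(eigvals)))

def density(x, cov):
    """Zero-mean multivariate Gaussian density at each row of x."""
    d = cov.shape[0]
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(cov) ** -0.5
    quad = np.einsum("ni,ij,nj->n", x, np.linalg.inv(cov), x)
    return norm * np.exp(-0.5 * quad)

# Riemann sum over a grid wide enough to hold essentially all of the mass.
grid = np.linspace(-12.0, 12.0, 601)
step = grid[1] - grid[0]
xx, yy = np.meshgrid(grid, grid)
points = np.column_stack([xx.ravel(), yy.ravel()])
total = density(points, cov).sum() * step ** 2
print(total)  # should come out very close to 1
```

Drop the determinant factor from `norm` and the total stops being 1, which is the whole reason it's there.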

Also, fuck reddit's math formatting. I love that we can actually use latex here.

Curve fitting revisited still doesn't give an explicit method for finding a best-fit polynomial given a dataset. The algorithm isn't hard though, it boils down to a matrix inversion (the normal equations), so anyone interested in doing the polynomial curve fitting example in code already has more or less enough to make it happen using what's covered in pages 1-28... or at least, there isn't any more help coming later in the book that's going to practically help with solving that problem.
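For anyone who wants a starting point, here's a minimal sketch of the normal-equation approach in NumPy. The function name `fit_polynomial` and the test data are made up for illustration, this isn't code from the book.

```python
import numpy as np

def fit_polynomial(x, t, order):
    """Least-squares polynomial coefficients, lowest power first."""
    # Design matrix: column j holds x**j.
    A = np.vander(x, order + 1, increasing=True)
    # Normal equations (A^T A) w = A^T t; solve() avoids forming the inverse.
    return np.linalg.solve(A.T @ A, A.T @ t)

# Made-up data lying exactly on 1 + 2x - 3x^2, so the fit should recover it.
x = np.linspace(0.0, 1.0, 8)
t = 1 + 2 * x - 3 * x ** 2
w = fit_polynomial(x, t, order=2)
print(np.round(w, 6))
```

One caveat: for high polynomial orders the matrix `A.T @ A` gets badly conditioned, so `np.linalg.lstsq` or `np.polyfit` is usually the safer call in practice.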

Kicking things off... making a little Jupyter notebook duplicating figure 1.4 is probably the most relevant project. So a little polynomial curve fitting comparison against generated data, using the given sine wave + noise parameters.
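A rough sketch of what that notebook could start from is below. The specific numbers are assumptions on my part (10 points, noise std 0.3, orders 0, 1, 3 and 9), so check them against the book before trusting the comparison. Plotting is left out, but each fitted curve can be drawn with matplotlib from `np.polyval`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: 10 inputs on [0, 1], targets sin(2*pi*x) plus Gaussian noise.
# The noise level (std 0.3) is an assumption; adjust it to match the book.
n_points, noise_std = 10, 0.3
x = np.linspace(0.0, 1.0, n_points)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n_points)

def rms_error(coeffs, x, t):
    """Root-mean-square error of a fitted polynomial on (x, t)."""
    return np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))

# Polynomial orders assumed to match the panels of the figure.
train_rms = {}
for order in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, order)  # least-squares fit, highest power first
    train_rms[order] = rms_error(coeffs, x, t)
    print(f"M = {order}: training RMS error = {train_rms[order]:.4f}")
```

With 10 points, the order 9 polynomial passes through every one of them, so its training error drops to essentially zero. That's the overfitting case the figure is illustrating.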

There was some suggestion of small coding projects we could all do as a group to help reinforce what we're learning. I feel like there aren't a ton of obvious ideas in chapter one (or two), but I figured I'd start a thread just in case anyone else has any good ideas.

Reading through this section, does anyone have any cool little coding project ideas that come to mind relating to what's being covered?

Hey man, yeah... that would be a confusing jump. The big difference... the Brilliant and the Bishop versions are kind of using a different framework.

In the Brilliant version, A and B are 'subsets', and they use a set theory approach to build up. So like... maybe you've got a bag with 10 pieces of paper in it, and each has a unique number on it, so one paper might have 1, another 2, etc. up to 10.

Now, take a piece of paper out. Let's make two subsets of papers... 'A' is if you pick a paper with a number less than 5 on it ({1,2,3,4}) and 'B' is if you pick a number between 3 and 6 ({3,4,5,6}).

So... you could make a little Venn diagram if you wanted, hopefully you understand what those probability rules would mean in this context. Maybe for comparison, think about how the Venn diagram changes if you had disjoint subsets: (A = {1,2,3,4,5}, B = {6,7,8,9,10}). However you cut it, you're talking about groups of events. Note that those events can be in a high dimensional space (maybe there are 30 pieces of paper, 10 each of red, white, and green) but it doesn't change the set theory approach. Maybe A = {green:1, red:1, white:1} and B = {green:1, green:2, green:3} and you can ask questions like... which events belong to both A and B? What's the chance of an outcome being in A or B? Or what's the chance of an outcome not being in either A or B? You can add even more dimensions (maybe there are also big and little pieces of paper, or whatever else) but the set perspective doesn't care about that, it's just asking about groups of events.
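The set view maps pretty directly onto Python sets, if that's easier to play with. This is just a sketch for the numbered-paper example, assuming every slip is equally likely to be drawn. The helper `prob` is made up for illustration.

```python
from fractions import Fraction

# Sample space: ten slips of paper numbered 1..10, each equally likely.
omega = set(range(1, 11))
A = {1, 2, 3, 4}   # number less than 5
B = {3, 4, 5, 6}   # number between 3 and 6

def prob(event):
    """Probability of an event (a subset of omega) under a uniform draw."""
    return Fraction(len(event), len(omega))

p_and = prob(A & B)                 # outcome is in both A and B
p_or = prob(A | B)                  # outcome is in A or B
p_neither = prob(omega - (A | B))   # outcome is in neither

print(p_and, p_or, p_neither)  # 1/5 3/5 2/5

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
print(p_or == prob(A) + prob(B) - p_and)  # True
```

Swap in the disjoint sets and `p_and` goes to zero, so the 'or' probability is just the plain sum.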

Bishop's on the other hand... those rules as they're written aren't written from the perspective of set theory, they're written more as a relation between outcomes with two variables. In the above description, Bishop's equations would be easiest to use when thinking about the second scenario, relating color with numbers. So maybe X = color, Y = number. In this conception, P(green) = sum over i of P(green, i) (add the chance of a green one with 1 on it, a green one with 2 on it, etc). In the Brilliant conception though, X might be {green:1, green:2, green:3} and Y might be {red:1, red:2, red:3}. So... X and Y show up in both, but they mean completely different things. Bishop's equations don't contradict the Brilliant ones, but they're getting at something very different... Bishop's relate the joint/marginal/conditional probabilities to each other, whereas the Brilliant equations are way more basic and fundamental than even the concepts of joint/marginal/conditional probability distributions.
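Here's a small sketch of the two-variable view for the colored-paper scenario. It assumes one slip per (color, number) pair, all equally likely, and the variable names are made up.

```python
import numpy as np

colors = ["red", "white", "green"]
numbers = list(range(1, 11))

# Joint distribution P(X = color, Y = number): 30 equally likely slips,
# one per (color, number) pair.
joint = np.full((len(colors), len(numbers)), 1 / 30)

# Sum rule: marginalize out the number to get P(color).
p_color = joint.sum(axis=1)
green = colors.index("green")
p_green = p_color[green]

# Product rule rearranged: P(number | green) = P(green, number) / P(green).
p_number_given_green = joint[green] / p_green

print(p_green)
print(p_number_given_green)
```

The rows and columns of `joint` are the two variables, and the sum and product rules are just operations on that table. No subsets of events show up anywhere.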

For what it's worth, most intro stats books will introduce these concepts with set theory, so the ones you linked are the more standard way to introduce the ideas. Bishop's is pretty gnarly, and they assume you already have some basic familiarity with probability theory, so... this is more like 'hey, remember this thing you probably already know' so they aren't bothering to work up to it. It might be worth pounding through a primer on probability theory if you haven't yet... Brilliant has one on games of chance you could probably barrel through relatively quickly, you'll definitely want to be very comfortable with probability ideas like this before Bishop's is going to make much sense.

@mbrc - yes, if there was only one point (or two points that were in the same spot) then the MLE would be that point. What it's saying is that it imagines 3 distinct datasets of two points each (shown in the three different graphs), where each dataset is two random draws from the Gaussian distribution shown in red.

So like... imagine heights of people in the US. Take a small group of people, find the average height, and find the variance of the sample. In this case, maybe you accidentally pick some people that are all taller than average, another group is roughly average, another group skews short. The MLE for the average height, averaged across all three groups, will be roughly the population mean (it's an unbiased estimate), but the variance? Each group's variance is measured around that group's own sample mean and not the true mean, so wherever the data happens to sit relative to the distribution, the estimated variance tends to come out too low.

All it's doing is pointing out that the MLE of the variance is biased low: on average it comes out to $(N-1)/N$ times the population variance, so it only approaches the population variance after a fairly large number of observations.
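You can see the bias in a quick simulation. This is a sketch with made-up settings: a standard Gaussian population and lots of datasets of two points each, like the two-point datasets in the graphs.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: standard Gaussian, so the true variance is 1.
true_var = 1.0
n_per_dataset, n_datasets = 2, 200_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(n_datasets, n_per_dataset))

# MLE per dataset: the sample mean, and the variance around that sample mean.
mu_ml = samples.mean(axis=1)
var_ml = samples.var(axis=1, ddof=0)  # ddof=0 divides by N, which is the MLE

expected = (n_per_dataset - 1) / n_per_dataset * true_var
print(f"average MLE mean: {mu_ml.mean():.3f}")
print(f"average MLE variance: {var_ml.mean():.3f}")
print(f"(N - 1) / N * true variance: {expected:.3f}")
```

With N = 2 the MLE variance averages about half the true variance. Bump `n_per_dataset` up and the gap shrinks.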

Right on... the Bayesian stuff is super cool. I started Bishop's a little bit ago (I'm early on in chapter 2) and the Bayesian approach is used more and more as the book goes on, so you'll get plenty of practice using those techniques. One really cool piece though... 1.67 in Bishop's uses some Bayesian reasoning to explicitly show the assumptions being made when you're using ridge regression. Really interesting interpretation of a rule that's normally taken as a rule of thumb... 'why' use ridge regression? Well, there you go.

Sounds good. I started a little while ago with a local friend, we're up early into chapter 2, so the timing's working out fine. It'd be cool to have more people to talk with through problem spots. I've been doing every problem, so if anyone wants to do more than just the one star problems, it'd be cool to have a little side conversation about the harder parts if there's interest.