Hi,

I haven't done much statistics up till now, but it seems like the normal distribution 🔔 is used all the time. Why is this? Is it for simplicity, or because it often provides a good fit for the problem?

Thanks.

6


Well, first of all, the normal distribution has many convenient properties (e.g. any linear combination of independent normals is also normal, and many results can be derived analytically) that make it pleasant to work with.

It also provides a good approximation to other distributions, largely thanks to the central limit theorem, which states that for iid random variables $X_1, X_2, \dots, X_n$ with finite mean $\mu$ and variance $\sigma^2$, the standardized sum $\frac{X_1+X_2+\dots+X_n-n\mu}{\sigma\sqrt{n}}$ converges in distribution to a standard normal as $n\rightarrow \infty$ (it turns out these conditions can be relaxed a bit and the limit is still normal).
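To make this concrete, here's a minimal simulation sketch (standard library only) of the CLT in action: standardized sums of Uniform(0, 1) variables, which individually look nothing like a bell curve, should behave approximately like a standard normal for moderately large $n$.

```python
import random
import statistics

random.seed(0)

n = 100          # number of summands per sample
trials = 10_000  # number of standardized sums to draw

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the sum of n of
# them has mean n/2 and standard deviation sqrt(n/12).
mu, sigma = n / 2, (n / 12) ** 0.5

samples = [
    (sum(random.random() for _ in range(n)) - mu) / sigma
    for _ in range(trials)
]

print(round(statistics.mean(samples), 2))   # close to 0
print(round(statistics.stdev(samples), 2))  # close to 1

# For a standard normal, roughly 68% of draws fall within one
# standard deviation of the mean:
within_one = sum(abs(x) < 1 for x in samples) / trials
print(round(within_one, 2))
```

Swapping the uniform for almost any other distribution with finite variance gives the same picture, which is exactly the point of the theorem.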

So, for example, if you have a lot of independent measurements (say, the hemoglobin concentration in the blood of a certain group of people), then, under some extra assumptions that are often true or at least close to true, the sample mean (here, the mean hemoglobin concentration of the group) is approximately normally distributed. That in turn lets us use a wide variety of methods tailored to the normal.
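One such method is the familiar normal-based confidence interval. A sketch, using made-up hemoglobin values purely for illustration (the numbers are hypothetical, not real data):

```python
import math
import random
import statistics

# Hypothetical hemoglobin measurements (g/dL) for a group of 50
# people; simulated here just to have something to work with.
random.seed(1)
measurements = [random.gauss(14.0, 1.2) for _ in range(50)]

mean = statistics.mean(measurements)
# Standard error of the mean: sample std dev / sqrt(n).
sem = statistics.stdev(measurements) / math.sqrt(len(measurements))

# Because the sample mean is approximately normal, a 95% confidence
# interval is roughly the mean plus or minus 1.96 standard errors.
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The 1.96 comes from the standard normal quantiles; without the approximate normality of the mean, this simple recipe wouldn't be justified.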

0

Thank you so much. This is such a great explanation! I hadn't even heard of the central limit theorem before now - it's a pretty amazing result. It baffles me how you'd go about proving something like that.

simon
·

3

As cirpis very accurately said, the CLT plays a great part in its ubiquity. Many events in nature are assumed to be the result of independent random variables (e.g. fluctuations in measurement due to noise, or external influences on a process, like the collisions of a massive particle with lighter ones).

A further point is that the normal distribution is the distribution of maximum entropy for a continuous random variable $X\in\mathbb{R}$ with specified finite mean and variance. It is therefore natural to assume that distribution when nothing else is known. Entropic arguments crop up in other applications as well.
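For the curious, the maximum-entropy claim can be stated as a variational problem; here is a sketch of the statement and its well-known solution:

```latex
% Maximize differential entropy over densities f on the real line,
% subject to normalization and fixed first two moments:
\max_{f \ge 0}\; H(f) = -\int_{-\infty}^{\infty} f(x)\,\log f(x)\,\mathrm{d}x
\quad\text{s.t.}\quad
\int f(x)\,\mathrm{d}x = 1,\quad
\int x\,f(x)\,\mathrm{d}x = \mu,\quad
\int (x-\mu)^2 f(x)\,\mathrm{d}x = \sigma^2 .

% Lagrange multipliers force the optimum to have the exponential-
% quadratic form f(x) = \exp(\lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2);
% matching the constraints then yields the Gaussian density
f^*(x) = \frac{1}{\sqrt{2\pi\sigma^2}}
         \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
H(f^*) = \tfrac{1}{2}\log\!\left(2\pi e\,\sigma^2\right).
```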

We are by no means done! As the eloquent Shalizi points out in this post, Gauss Is Not Mocked. Even when the effects are multiplicative rather than additive, taking logarithms turns the product into a sum, so the CLT still applies: the logarithm of the product is approximately normal, and the product itself follows a log-normal distribution. The log-normal bears a good resemblance to the Pareto (power-law) distribution but has lighter tails, and as a result the two are often confused (which is also the point of that blog post).
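A minimal sketch of the multiplicative case, with made-up Uniform(0.9, 1.1) "growth factors" standing in for the multiplicative effects:

```python
import math
import random
import statistics

random.seed(2)

def product_sample(n=200):
    """One product of n iid Uniform(0.9, 1.1) growth factors."""
    p = 1.0
    for _ in range(n):
        p *= random.uniform(0.9, 1.1)
    return p

# log(product) = sum of log-factors, so by the CLT it is
# approximately normal; the product itself is log-normal.
logs = [math.log(product_sample()) for _ in range(5000)]

# A normal distribution is symmetric, so about half the draws
# should land above the sample mean:
m = statistics.mean(logs)
above = sum(x > m for x in logs) / len(logs)
print(round(above, 2))
```

The products themselves are right-skewed, which is exactly why plotting them on a log axis (or taking logs first) is the standard trick for spotting a log-normal.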

An excerpt from the post:

As Hacking notes, on further consideration Galton was even more impressed by the central limit theorem, and accordingly replaced the sentence about savages with "The law would have been personified by the Greeks and deified, if they had known of it." Whether deified by Hellenes or savages, however, the CLT has a message for those doing data analysis, and the message is:

Thou shalt have no other distribution before me, for I am a jealous limit theorem.

webdrone
·

1

@webdrone

I am a Computer Science master's student with an interest in ML. How can I get to your level of stats? What books would you recommend as must-reads?

Thanks for the explanation!

th3owner
·