1. The statement
Take any reasonable distribution — uniform, exponential, the lopsided distribution of household incomes, the discrete distribution of dice rolls. Sample from it $n$ times, independently. Compute the average of those $n$ samples. Now do this over and over: each time, you get a new average. The Central Limit Theorem says that the distribution of those averages looks like a bell curve.
If $X_1, X_2, \ldots, X_n$ are independent samples from any distribution with finite mean $\mu$ and finite variance $\sigma^2$, then for large enough $n$ the sample mean
$$ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i $$
is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$. The shape of the original distribution doesn't matter.
There are two parts to digest. First, the mean of the average is $\mu$: averages don't drift away from the true mean. Second, the spread shrinks: the variance of $\bar{X}_n$ is $\sigma^2/n$, not $\sigma^2$. Averages of more data are more tightly clustered around the truth. That tightening is the law of large numbers at work. The new thing the CLT adds is the shape: the cluster is bell-shaped, regardless of where you started.
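Both claims about the center and the spread are easy to check by simulation. The sketch below is only an illustration under assumed settings: the exponential source, the values of `n` and `reps`, and the helper name `sample_means` are choices made here, not part of the theorem.

```python
import random
import statistics

def sample_means(draw, n, reps, seed=0):
    """Return `reps` sample means, each averaging `n` independent draws."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]

# Exponential with rate 1: mu = 1, sigma^2 = 1 (a deliberately skewed source).
n, reps = 40, 5000
means = sample_means(lambda rng: rng.expovariate(1.0), n, reps)

print("mean of sample means:    ", statistics.fmean(means))      # theory: mu = 1
print("variance of sample means:", statistics.pvariance(means))  # theory: sigma^2 / n = 0.025
```

The two printed values should land close to $\mu = 1$ and $\sigma^2/n = 0.025$, up to simulation noise.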
2. Why it's startling
Stop and notice how strange this is. You pick a distribution. Maybe it's the uniform distribution on $[0, 1]$ — a flat rectangle. Maybe it's the exponential distribution — a sharply decaying curve. Maybe it's a coin flip — two spikes at 0 and 1. These three shapes have nothing in common. Yet if you average $n$ samples from any of them, the histogram of those averages looks like a bell.
The underlying distribution leaves only two fingerprints on the average: its mean $\mu$ and its variance $\sigma^2$. Everything else — the asymmetry, the discreteness, the multiple peaks, the heavy tails (within reason) — gets washed away by the addition. Add enough independent things together and the sum forgets where it came from.
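One way to see the forgetting in action is to standardize the sample means from the three shapes mentioned above and count how often they fall inside $\pm 1.96$, the range that holds about 95% of a standard normal. This is a rough sketch: `n = 100`, the replicate count, and the helper `coverage_within` are assumptions chosen for illustration.

```python
import math
import random
import statistics

# (name, sampler, mu, sigma) for the three shapes named above.
SOURCES = [
    ("uniform[0,1]", lambda rng: rng.random(), 0.5, math.sqrt(1 / 12)),
    ("exponential(1)", lambda rng: rng.expovariate(1.0), 1.0, 1.0),
    ("fair coin", lambda rng: float(rng.random() < 0.5), 0.5, 0.5),
]

def coverage_within(z, sampler, mu, sigma, n=100, reps=4000, seed=1):
    """Fraction of standardized sample means that land in [-z, z]."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)  # standard error of the sample mean
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean(sampler(rng) for _ in range(n))
        if abs((xbar - mu) / se) <= z:
            hits += 1
    return hits / reps

results = {name: coverage_within(1.96, s, mu, sigma) for name, s, mu, sigma in SOURCES}
for name, frac in results.items():
    print(f"{name:15s} fraction within +/-1.96: {frac:.3f}")  # a normal gives about 0.95
```

All three fractions should come out near 0.95, even though the sources look nothing alike. The coin-flip figure tends to sit slightly lower because its sample mean can only take a discrete set of values.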
The CLT in its modern, fully general form is a 20th-century result: Lyapunov (1901) and Lindeberg (1922) supplied the rigorous proofs. But the idea is older. De Moivre noticed in 1733 that the binomial distribution approaches a bell shape for large $n$, and Laplace generalized this in the early 1800s. Throughout the 19th century, statisticians relied on the theorem well before anyone had proved its general form.
3. The formal version
The informal statement is what you'll use day-to-day. The formal version is what you cite when you need to know exactly what's being claimed. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with
$$ \mathbb{E}[X_i] = \mu, \qquad \mathrm{Var}(X_i) = \sigma^2 < \infty. $$
Define the sample mean $\bar{X}_n = \tfrac{1}{n}\sum_{i=1}^{n} X_i$. Then as $n \to \infty$:
$$ \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1). $$
The arrow $\xrightarrow{d}$ means convergence in distribution: the cumulative distribution function of the left side converges, pointwise, to the CDF of the standard normal. In plain English, the standardized sample mean (sample mean minus its expected value, divided by its standard deviation) gets arbitrarily close in distribution to a standard normal as you average more samples.
Three ingredients in the standardization:
- Center: subtract $\mu$. We want the bell centered at zero.
- Scale: divide by $\sigma/\sqrt{n}$ — the standard error of the sample mean. This $\sqrt{n}$ in the denominator is the entire story of why averages tighten up.
- Limit: the resulting random variable converges to $\mathcal{N}(0, 1)$, the standard normal.
Equivalently, without standardizing:
$$ \bar{X}_n \;\overset{\text{approx}}{\sim}\; \mathcal{N}\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{for large } n. $$
Use whichever form fits the problem. The standardized form is the one to plug into a $z$-table; the un-standardized form is the one to picture in your head.
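As a small worked example of the standardized form, the sketch below replaces the $z$-table with the error function. The numbers ($\mu = 100$, $\sigma = 15$, $n = 36$, threshold 105) are hypothetical, and `prob_mean_exceeds` is a helper name invented here.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_mean_exceeds(threshold, mu, sigma, n):
    """CLT approximation to P(sample mean > threshold)."""
    z = (threshold - mu) / (sigma / math.sqrt(n))  # center, then scale by the standard error
    return 1.0 - normal_cdf(z)

# Hypothetical numbers, chosen only for illustration.
p = prob_mean_exceeds(threshold=105, mu=100, sigma=15, n=36)
print(f"P(mean > 105) is approximately {p:.4f}")
```

Here the standard error is $15/\sqrt{36} = 2.5$, so the threshold sits at $z = 2$ and the approximate probability is about 0.023.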
4. Implications
Why the bell curve is everywhere
Many quantities in the world are themselves sums of many small independent factors. Adult human height is roughly the sum of many genetic and environmental nudges; measurement error in a physical instrument is the sum of many tiny noise sources; the total error in a navigation system is the sum of many small contributing errors. Each individual factor can be wildly non-normal — but their sum, by the CLT, comes out normal. The bell curve isn't a metaphysical default; it's what you get when many small independent things add up.
The foundation of statistical inference
Almost every method you'll meet in introductory statistics leans on the CLT, often invisibly:
- Confidence intervals. The classic "sample mean $\pm 1.96 \cdot \mathrm{SE}$" interval treats the sample mean as if it were normal. It is — approximately — by the CLT.
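- Hypothesis tests ($z$-tests, $t$-tests). These compute a test statistic and ask whether it's "extreme" under a normal reference (or, for $t$-tests, a Student's $t$ reference that approaches the normal as $n$ grows). The reference is justified because the CLT makes the sample mean approximately normal.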
- Linear regression. The standard errors on regression coefficients, the $t$-statistics, the $F$-test for overall fit — all rest on the assumption that coefficient estimates are approximately normal in large samples, which is a CLT-like result for sums.
- Normal approximation to the binomial. A binomial$(n, p)$ random variable is a sum of $n$ Bernoulli trials, so for large $n$ it's approximately $\mathcal{N}(np,\, np(1-p))$. This is the CLT applied to the simplest possible building block.
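The binomial case in the last bullet is the easiest one to check numerically, because the exact answer is a finite sum. In this sketch the values of `n`, `p`, and `k` are hypothetical. The `+0.5` continuity correction is a standard refinement for discrete counts that goes beyond the plain statement above.

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """CLT approximation to P(X <= k), with a +0.5 continuity correction."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    z = (k + 0.5 - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p, k = 100, 0.3, 35  # hypothetical values
exact = binom_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

For these values the two numbers should agree to roughly two decimal places. Pushing `p` toward 0 or 1, or shrinking `n`, makes the gap grow.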
The $\sqrt{n}$ behind all of these standard errors comes from a one-line variance calculation. If $X_1, \ldots, X_n$ are independent with variance $\sigma^2$, then $\mathrm{Var}(\sum X_i) = n\sigma^2$ (variances add for independent things). The sample mean divides the sum by $n$, so $\mathrm{Var}(\bar{X}_n) = n\sigma^2 / n^2 = \sigma^2/n$, and the standard deviation is $\sigma/\sqrt{n}$. That's why doubling your sample size only shrinks your error bar by a factor of $\sqrt{2} \approx 1.41$, not 2, a discount that has frustrated every empirical researcher who ever lived.
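The $\sqrt{2}$ discount can be observed directly by comparing the spread of sample means at $n = 50$ and $n = 100$. The uniform source, the sample sizes, and the helper `sd_of_sample_mean` are assumptions made for this sketch.

```python
import math
import random
import statistics

def sd_of_sample_mean(n, reps=4000, seed=2):
    """Empirical standard deviation of the mean of n uniform[0,1] draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.random() for _ in range(n)) for _ in range(reps)]
    return statistics.pstdev(means)

sigma = math.sqrt(1 / 12)  # standard deviation of uniform[0,1]
sd_50 = sd_of_sample_mean(50, seed=2)
sd_100 = sd_of_sample_mean(100, seed=3)

print("n=50 :", sd_50, "theory", sigma / math.sqrt(50))
print("n=100:", sd_100, "theory", sigma / math.sqrt(100))
print("ratio:", sd_50 / sd_100)  # theory: sqrt(2), about 1.41
```

Doubling the data shrinks the empirical standard deviation by a factor near 1.41, not 2, matching the calculation above.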
5. Caveats and when it fails
The CLT is robust, but not magic. Three things matter.
Finite variance is required
The classical CLT needs $\sigma^2 < \infty$. Distributions with infinite or undefined variance — the Cauchy distribution is the canonical example — do not obey the CLT. In fact, the sample mean of i.i.d. Cauchy variables has the same distribution as a single Cauchy variable: averaging does nothing. If you ever find yourself working with very heavy-tailed data (financial returns, network packet sizes, social-media follower counts), the CLT may be lying to you about how fast your error bars shrink.
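The Cauchy claim is striking enough to be worth checking. The sketch below measures the spread of sample means with the interquartile range, since the variance is not defined. The standard Cauchy has quartiles at $-1$ and $+1$, so its IQR is 2. The sample sizes and helper names are choices made here for illustration.

```python
import math
import random
import statistics

def cauchy(rng):
    """One standard Cauchy draw, by inverting the CDF of a uniform draw."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_means(n, reps=2000, seed=3):
    """Interquartile range of `reps` sample means of n Cauchy draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(cauchy(rng) for _ in range(n)) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

iqrs = {n: iqr_of_means(n) for n in (1, 10, 200)}
for n, iqr in iqrs.items():
    print(f"n = {n:3d}: IQR of sample means = {iqr:.2f}")  # stays near 2 for every n
```

With a finite-variance source the IQR would shrink like $1/\sqrt{n}$. Here it should stay near 2 no matter how many draws go into each average.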
Convergence speed depends on the source
"Approximately normal" is a statement about the limit. For finite $n$, how close the sample mean's distribution is to a true normal depends on how non-normal the source distribution is. Symmetric, light-tailed distributions converge quickly — sometimes by $n = 5$ or $10$. Heavily skewed distributions (exponential, log-normal) converge much more slowly: for the exponential, you typically want $n$ in the hundreds before the normal approximation is reliable in the tails.
"$n > 30$" is folklore
The rule of thumb you'll see in textbooks — "the CLT kicks in around $n = 30$" — is a useful piece of intuition for moderately well-behaved distributions, but it's not a theorem. For mildly skewed data it's conservative; for badly skewed or discrete-with-rare-values data, $n = 30$ is laughably small. The honest answer: there is no single threshold. The right $n$ depends on the source distribution and on which part of the tail you care about.
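To make the dependence on the source and on the tail concrete, the sketch below standardizes sample means at a small $n$ and counts how often they fall beyond $\pm 2$. A standard normal puts about 0.0228 in each of those tails. The choice of `n = 10`, the two sources, and the helper `tail_fractions` are assumptions made for illustration.

```python
import math
import random
import statistics

def tail_fractions(sampler, mu, sigma, n, reps=20000, seed=4):
    """Fractions of standardized sample means below -2 and above +2."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    lower = upper = 0
    for _ in range(reps):
        z = (statistics.fmean(sampler(rng) for _ in range(n)) - mu) / se
        if z < -2:
            lower += 1
        elif z > 2:
            upper += 1
    return lower / reps, upper / reps

n = 10
uni = tail_fractions(lambda rng: rng.random(), 0.5, math.sqrt(1 / 12), n)
expo = tail_fractions(lambda rng: rng.expovariate(1.0), 1.0, 1.0, n)

print("normal limit : lower 0.0228, upper 0.0228")
print(f"uniform, n=10: lower {uni[0]:.4f}, upper {uni[1]:.4f}")
print(f"expon.,  n=10: lower {expo[0]:.4f}, upper {expo[1]:.4f}")
```

The uniform source should already sit close to the normal values in both tails. The exponential source should show a clearly heavier upper tail and a much lighter lower tail, so a symmetric normal interval misstates both at this sample size.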
If the underlying distribution has a heavy tail — a tail that decays as a power law rather than exponentially — even very large $n$ may not be enough. The mean of 10,000 samples from a distribution like daily stock returns can still misbehave. When in doubt, plot the sampling distribution of your statistic via bootstrap or simulation rather than trusting the CLT blindly.
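A bootstrap check of the kind suggested above might look like the following minimal sketch. The log-normal stand-in data, the replicate count, and the helper names `bootstrap_means` and `percentile_interval` are all assumptions. Note also that the bootstrap is itself unreliable for the mean when the variance is infinite, so it is a diagnostic rather than a cure.

```python
import random
import statistics

def bootstrap_means(data, reps=2000, seed=5):
    """Resample `data` with replacement `reps` times; return the resampled means."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(reps)]

def percentile_interval(values, level=0.95):
    """Percentile interval: the middle `level` share of the sorted values."""
    ordered = sorted(values)
    lo = ordered[int((1 - level) / 2 * len(ordered))]
    hi = ordered[int((1 + level) / 2 * len(ordered)) - 1]
    return lo, hi

# Stand-in data: 200 draws from a skewed log-normal (an assumption, not real data).
rng = random.Random(6)
data = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]

boot = bootstrap_means(data)
lo, hi = percentile_interval(boot)
print(f"sample mean = {statistics.fmean(data):.3f}")
print(f"95% bootstrap percentile interval = ({lo:.3f}, {hi:.3f})")
```

If a histogram of `boot` looks visibly skewed, or the interval is far from symmetric around the sample mean, that is a sign the normal approximation has not yet taken hold for this data and sample size.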