The Central Limit Theorem

Average enough independent random samples and the average itself becomes normally distributed — no matter what shape the underlying distribution had. It's the reason the bell curve is everywhere in nature, and the silent engine behind almost every confidence interval, p-value, and standard error in applied statistics.

What you'll leave with

  • An intuition for why averaging "washes out" the original shape of a distribution.
  • The formal statement: $(\bar{X}_n - \mu)/(\sigma/\sqrt{n}) \to \mathcal{N}(0, 1)$.
  • Why this single result underwrites confidence intervals, hypothesis tests, and regression.
  • The conditions where the CLT quietly fails — and why "$n > 30$" is folklore, not theorem.
  • Hands-on: a playground for watching different source distributions converge as $n$ grows.

1. The statement

Take any reasonable distribution — uniform, exponential, the lopsided distribution of household incomes, the discrete distribution of dice rolls. Sample from it $n$ times, independently. Compute the average of those $n$ samples. Now do this over and over: each time, you get a new average. The Central Limit Theorem says that the distribution of those averages looks like a bell curve.

Central Limit Theorem (informal)

If $X_1, X_2, \ldots, X_n$ are independent samples from any distribution with finite mean $\mu$ and finite variance $\sigma^2$, then for large enough $n$ the sample mean

$$ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i $$

is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$. The shape of the original distribution doesn't matter.

Two parts to digest. First, the mean of the average is $\mu$ — averages don't drift away from the true mean. Second, the spread shrinks: the variance of $\bar{X}_n$ is $\sigma^2/n$, not $\sigma^2$. Averages of more data are more tightly clustered around the truth. That's the law of large numbers. The new thing the CLT adds is the shape: the cluster is bell-shaped, regardless of where you started.
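Both claims are easy to check numerically. The sketch below assumes an exponential source with rate 1 (so $\mu = 1$ and $\sigma^2 = 1$); the function name, seed, and repetition count are illustrative choices, not part of the theorem.

```python
import random
import statistics

def simulate_sample_means(n, reps=4000, seed=0):
    """Draw `reps` samples of size n from Exponential(1) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

n = 25
means = simulate_sample_means(n)
mean_of_means = statistics.fmean(means)
var_of_means = statistics.variance(means)
print("mean of sample means:", round(mean_of_means, 3))     # theory: mu = 1
print("variance of sample means:", round(var_of_means, 4))  # theory: sigma^2 / n = 0.04
```

The printed values should sit close to the theoretical ones, with small run-to-run differences that depend on the seed.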

2. Why it's startling

Stop and notice how strange this is. You pick a distribution. Maybe it's the uniform distribution on $[0, 1]$ — a flat rectangle. Maybe it's the exponential distribution — a sharply decaying curve. Maybe it's a coin flip — two spikes at 0 and 1. These three shapes have nothing in common. Yet if you average $n$ samples from any of them, the histogram of those averages looks like a bell.

The underlying distribution leaves only two fingerprints on the average: its mean $\mu$ and its variance $\sigma^2$. Everything else — the asymmetry, the discreteness, the multiple peaks, the heavy tails (within reason) — gets washed away by the addition. Add enough independent things together and the sum forgets where it came from.

[Figure: three source distributions (uniform, exponential, bimodal), each averaged over $n$ samples, all arriving at the same normal distribution $\mathcal{N}(\mu, \sigma^2/n)$ for $\bar{X}_n$.]
Three radically different source distributions, one common destination. The CLT says only $\mu$ and $\sigma^2$ survive the averaging.

Historical aside

The CLT in its modern, fully general form is a 20th-century result — Lyapunov (1901) and Lindeberg (1922) supplied the rigorous proofs. But the idea is older: De Moivre noticed in 1733 that the binomial distribution approaches a bell shape for large $n$, and Laplace generalized this in the early 1800s. For most of the 19th century, statisticians used the theorem long before anyone had proved its general form.

3. The formal version

The informal statement is what you'll use day-to-day. The formal version is what you cite when you need to know exactly what's being claimed. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with

$$ \mathbb{E}[X_i] = \mu, \qquad \mathrm{Var}(X_i) = \sigma^2 < \infty. $$

Define the sample mean $\bar{X}_n = \tfrac{1}{n}\sum_{i=1}^{n} X_i$. Then as $n \to \infty$:

$$ \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1). $$

The arrow $\xrightarrow{d}$ means convergence in distribution: the cumulative distribution function of the left side converges, pointwise, to the CDF of the standard normal. In plain English, the standardized sample mean — sample mean minus its expected value, divided by its standard deviation — gets arbitrarily close in distribution to a standard normal as you average more samples.

Three ingredients in the standardization:

  • Center: subtract $\mu$. We want the bell centered at zero.
  • Scale: divide by $\sigma/\sqrt{n}$ — the standard error of the sample mean. This $\sqrt{n}$ in the denominator is the entire story of why averages tighten up.
  • Limit: the resulting random variable converges to $\mathcal{N}(0, 1)$, the standard normal.

Equivalently, without standardizing:

$$ \bar{X}_n \;\overset{\text{approx}}{\sim}\; \mathcal{N}\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{for large } n. $$

Use whichever form fits the problem. The standardized form is the one to plug into a $z$-table; the un-standardized form is the one to picture in your head.
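Convergence in distribution can be made tangible by measuring the largest gap between the empirical CDF of the standardized mean and the standard normal CDF. This is a rough sketch, assuming an Exponential(1) source; `kolmogorov_distance_to_normal` is a hypothetical helper, and part of the measured gap at larger $n$ is simulation noise rather than genuine non-normality.

```python
import math
import random
from statistics import NormalDist, fmean

def kolmogorov_distance_to_normal(n, reps=4000, seed=1):
    """Max gap between the empirical CDF of the standardized mean and Phi."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0                      # Exponential(1): mean 1, sd 1
    z = sorted(
        (fmean(rng.expovariate(1.0) for _ in range(n)) - mu) / (sigma / math.sqrt(n))
        for _ in range(reps)
    )
    phi = NormalDist().cdf
    # The empirical CDF jumps from i/reps to (i+1)/reps at the i-th sorted point.
    return max(
        max(abs((i + 1) / reps - phi(x)), abs(i / reps - phi(x)))
        for i, x in enumerate(z)
    )

for n in (1, 5, 50):
    print(n, round(kolmogorov_distance_to_normal(n), 3))
```

The gap should shrink as $n$ grows, which is the pointwise CDF convergence the arrow $\xrightarrow{d}$ describes.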

4. Implications

Why the bell curve is everywhere

Many quantities in the world are themselves sums of many small independent factors. Adult human height is roughly the sum of many genetic and environmental nudges; measurement error in a physical instrument is the sum of many tiny noise sources; the total error in a navigation system is the sum of many small contributing errors. Each individual factor can be wildly non-normal — but their sum, by the CLT, comes out normal. The bell curve isn't a metaphysical default; it's what you get when many small independent things add up.

The foundation of statistical inference

Almost every method you'll meet in introductory statistics leans on the CLT, often invisibly:

  • Confidence intervals. The classic "sample mean $\pm 1.96 \cdot \mathrm{SE}$" interval treats the sample mean as if it were normal. It is — approximately — by the CLT.
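  • Hypothesis tests ($z$-tests, $t$-tests). These compute a test statistic and ask whether it's "extreme" under a normal reference (or a Student's $t$ reference, which approaches the normal as $n$ grows). The reference is approximately normal because the CLT says the sample mean is.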
  • Linear regression. The standard errors on regression coefficients, the $t$-statistics, the $F$-test for overall fit — all rest on the assumption that coefficient estimates are approximately normal in large samples, which is a CLT-like result for sums.
  • Normal approximation to the binomial. A binomial$(n, p)$ random variable is a sum of $n$ Bernoulli trials, so for large $n$ it's approximately $\mathcal{N}(np,\, np(1-p))$. This is the CLT applied to the simplest possible building block.
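The confidence-interval bullet above can be checked by simulation: build many "mean $\pm 1.96 \cdot \mathrm{SE}$" intervals and count how often they contain the true mean. The sketch assumes an exponential source with mean 1 and $n = 50$, and estimates the standard error from each sample; `ci_coverage` and its defaults are illustrative.

```python
import math
import random
from statistics import fmean, stdev

def ci_coverage(n=50, reps=2000, true_mean=1.0, seed=2):
    """Fraction of nominal 95% intervals (mean +/- 1.96 * SE) containing the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
        se = stdev(sample) / math.sqrt(n)     # estimated standard error
        center = fmean(sample)
        if center - 1.96 * se <= true_mean <= center + 1.96 * se:
            hits += 1
    return hits / reps

coverage = ci_coverage()
print("empirical coverage:", coverage)
```

For a skewed source at moderate $n$ the empirical coverage tends to land near, and often a little under, the nominal 95%. That shortfall is the "approximately" in the CLT showing through.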

Why $\sqrt{n}$, not $n$?

If $X_1, \ldots, X_n$ are independent with variance $\sigma^2$, then $\mathrm{Var}(\sum X_i) = n\sigma^2$ (variances add for independent things). The sample mean divides the sum by $n$, so $\mathrm{Var}(\bar{X}_n) = n\sigma^2 / n^2 = \sigma^2/n$, and the standard deviation is $\sigma/\sqrt{n}$. That's why doubling your sample size only shrinks your error bar by a factor of $\sqrt{2} \approx 1.41$, not 2 — a discount that has frustrated every empirical researcher who ever lived.
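The $\sqrt{2}$ factor shows up directly in simulation. This sketch assumes fair six-sided die rolls as the source and compares the spread of sample means at $n = 50$ and $n = 100$; the helper name and seeds are arbitrary.

```python
import random
from statistics import fmean, stdev

def sd_of_means(n, reps=3000, seed=3):
    """Empirical standard deviation of the mean of n fair die rolls."""
    rng = random.Random(seed)
    means = [fmean(rng.randint(1, 6) for _ in range(n)) for _ in range(reps)]
    return stdev(means)

sd_n = sd_of_means(50)
sd_2n = sd_of_means(100, seed=4)
print("ratio of error bars:", round(sd_n / sd_2n, 3))   # theory: sqrt(2), about 1.414
```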

5. Caveats and when it fails

The CLT is robust, but not magic. Four things matter.

Finite variance is required

The classical CLT needs $\sigma^2 < \infty$. Distributions with infinite or undefined variance — the Cauchy distribution is the canonical example — do not obey the CLT. In fact, the sample mean of i.i.d. Cauchy variables has the same distribution as a single Cauchy variable: averaging does nothing. If you ever find yourself working with very heavy-tailed data (financial returns, network packet sizes, social-media follower counts), the CLT may be lying to you about how fast your error bars shrink.
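The Cauchy claim is easy to see in a simulation. The sketch below draws standard Cauchy values by the inverse-CDF method and measures the spread of sample means by their interquartile range; the helper names are illustrative.

```python
import math
import random
from statistics import fmean, quantiles

def cauchy(rng):
    """One standard Cauchy draw via the inverse-CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_means(n, reps=2000, seed=5):
    """Interquartile range of `reps` sample means, each over n Cauchy draws."""
    rng = random.Random(seed)
    means = [fmean(cauchy(rng) for _ in range(n)) for _ in range(reps)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1

for n in (1, 10, 100):
    print(n, round(iqr_of_means(n), 2))   # theory: IQR stays at 2 for every n
```

The interquartile range is used instead of the standard deviation because the Cauchy distribution has no finite variance to estimate. For a finite-variance source the same statistic would shrink like $1/\sqrt{n}$.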

Convergence speed depends on the source

"Approximately normal" is a statement about the limit. For finite $n$, how close the sample mean's distribution is to a true normal depends on how non-normal the source distribution is. Symmetric, light-tailed distributions converge quickly — sometimes by $n = 5$ or $10$. Heavily skewed distributions (exponential, log-normal) converge much more slowly: for the exponential, you typically want $n$ in the hundreds before the normal approximation is reliable in the tails.

"$n > 30$" is folklore

The rule of thumb you'll see in textbooks — "the CLT kicks in around $n = 30$" — is a useful piece of intuition for moderately well-behaved distributions, but it's not a theorem. For mildly skewed data it's conservative; for badly skewed or discrete-with-rare-values data, $n = 30$ is laughably small. The honest answer: there is no single threshold. The right $n$ depends on the source distribution and on which part of the tail you care about.
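One way to see that no single threshold exists is to compute an exact tail probability. For an exponential source the sum of $n$ draws follows an Erlang (gamma) distribution whose tail has a closed form, so no simulation is needed. The sketch assumes rate 1 and asks how likely the sample mean is to land more than two standard errors above $\mu$; `exact_upper_tail` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def exact_upper_tail(n, z=2.0):
    """P(mean of n Exponential(1) draws > mu + z * SE), from the Erlang CDF."""
    x = n + z * math.sqrt(n)          # threshold on the sum: n*mu + z*sigma*sqrt(n)
    # P(sum > x) = sum_{k=0}^{n-1} exp(-x) * x**k / k!, computed in log space.
    return sum(math.exp(-x + k * math.log(x) - math.lgamma(k + 1)) for k in range(n))

normal_tail = 1 - NormalDist().cdf(2.0)
for n in (1, 10, 30, 100, 400):
    print(n, round(exact_upper_tail(n), 4), "vs normal", round(normal_tail, 4))
```

At $n = 1$ the exact tail is $e^{-3} \approx 0.050$, more than twice the normal value of about $0.023$. The gap narrows as $n$ grows, but it closes slowly.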

Heavy tails are the silent killer

If the underlying distribution has a heavy tail — a tail that decays as a power law rather than exponentially — even very large $n$ may not be enough. The mean of 10,000 samples from a distribution like daily stock returns can still misbehave. When in doubt, plot the sampling distribution of your statistic via bootstrap or simulation rather than trusting the CLT blindly.
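A bootstrap check of this kind takes only a few lines. The sketch below uses hypothetical data (Pareto draws with shape 2.5, which have a power-law tail but finite variance) and resamples them with replacement; `bootstrap_means`, the seeds, and the sample size are assumptions for illustration, not a recommended recipe.

```python
import random
from statistics import fmean, quantiles

def bootstrap_means(data, reps=2000, seed=6):
    """Resample `data` with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [fmean(rng.choices(data, k=n)) for _ in range(reps)]

# Hypothetical heavy-tailed data: Pareto draws with shape 2.5.
data_rng = random.Random(7)
data = [data_rng.paretovariate(2.5) for _ in range(500)]

boot = bootstrap_means(data)
cuts = quantiles(boot, n=40)           # cut points at 2.5%, 5%, ..., 97.5%
lo, hi = cuts[0], cuts[-1]
sample_mean = fmean(data)
print("sample mean:", round(sample_mean, 3))
print("bootstrap 95% percentile interval:", (round(lo, 3), round(hi, 3)))
print("upper gap / lower gap:", round((hi - sample_mean) / (sample_mean - lo), 2))
```

If the last ratio is far from 1, the sampling distribution of the mean is visibly asymmetric and a symmetric normal interval would misstate the uncertainty.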

6. Playground: watch the CLT in action

Pick an underlying distribution, choose a sample size $n$, and watch the histogram of 2000 sample means take shape. The green curve is the normal distribution $\mathcal{N}(\mu,\, \sigma^2/n)$ the CLT predicts. Crank $n$ up and watch the bars converge onto the curve — even when the source is wildly non-normal.

[Interactive playground, default state: source Uniform on $[0, 1]$ with $\mu = 0.500$ and $\sigma = 0.289$; sample size $n = 1$; 2000 sample means; standard error $\sigma/\sqrt{n} = 0.289$. The plot shows the observed histogram of sample means (density) against the CLT prediction $\mathcal{N}(\mu,\ \sigma^2/n)$.]

Try this

Start with Bernoulli at $n = 1$ — you'll see exactly two bars at $0$ and $1$, nothing remotely bell-shaped. Slide $n$ up. Around $n = 5$ the shape starts to round; by $n = 20$ the discrete origin is invisible. Now try the Exponential: at $n = 1$ it's a sharp right-tailed slope; the convergence is slower because the source is skewed, but by $n = 30$ the green curve and the bars are nearly indistinguishable.
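If you'd rather run the experiment offline, a text-mode stand-in for the playground fits in a short script. This is a minimal sketch, not the page's implementation: the `SOURCES` table, the `playground` function, and the choice of a window of four standard errors either side of $\mu$ are all assumptions.

```python
import random
from statistics import fmean, pstdev

SOURCES = {
    # name: (sampler taking an rng, true mean, true standard deviation)
    "uniform": (lambda rng: rng.random(), 0.5, (1 / 12) ** 0.5),
    "exponential": (lambda rng: rng.expovariate(1.0), 1.0, 1.0),
    "bernoulli": (lambda rng: float(rng.random() < 0.5), 0.5, 0.5),
}

def playground(source, n, reps=2000, bins=15, seed=8):
    """Print a text histogram of `reps` sample means and return them."""
    draw, mu, sigma = SOURCES[source]
    rng = random.Random(seed)
    means = [fmean(draw(rng) for _ in range(n)) for _ in range(reps)]
    se = sigma / n ** 0.5
    lo, hi = mu - 4 * se, mu + 4 * se      # window of +/- 4 standard errors
    counts = [0] * bins
    for m in means:
        if lo <= m < hi:
            idx = min(int((m - lo) / (hi - lo) * bins), bins - 1)
            counts[idx] += 1
    for i, c in enumerate(counts):
        center = lo + (i + 0.5) * (hi - lo) / bins
        print(f"{center:7.3f} | {'#' * (c * 200 // reps)}")
    print(f"predicted SE {se:.3f}, observed SD {pstdev(means):.3f}")
    return means

means = playground("exponential", n=30)
```

Call `playground("bernoulli", n=1)` and then raise `n` to reproduce the experiment described above.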

7. Common pitfalls

The CLT is about sample means, not single observations

If individual incomes are skewed, individual incomes are still skewed — the CLT says nothing about them. What's approximately normal is the distribution of the sample mean across hypothetical repeated samples of size $n$. Conflating these two is the single most common CLT mistake.

Finite variance is non-negotiable

Before invoking the CLT, ask: does this distribution have a finite variance? For most well-behaved real-world data the answer is yes, but for financial returns, social-network metrics, and other heavy-tailed phenomena, the assumption can fail silently. A "Cauchy-like" data-generating process can produce sample means that never settle down.

Convergence speed varies wildly

How large $n$ has to be for the normal approximation to be "good" depends entirely on the source distribution. The further from symmetric and bell-shaped the source is, the larger $n$ must be — and the speed of convergence is not uniform across the distribution. The center of the bell converges first; the tails come along much later.

"$n > 30$" is a rule of thumb, not a theorem

You will see $n = 30$ quoted as a magic threshold. It isn't. It's a vague, pedagogically convenient number that works for moderately well-behaved distributions and fails badly for skewed or heavy-tailed ones. When sample sizes matter — confidence-interval coverage, p-value calibration — verify by simulation, don't take the rule on faith.

8. Worked examples

Try each before opening the solution. The point is to verify your setup matches the canonical one — pick out $\mu$, $\sigma$, $n$, standardize, look up the probability.

Example 1 · Probability the sample mean falls in a range

A population has $\mu = 100$ and $\sigma = 15$. You take a sample of size $n = 36$. What's the probability that $\bar{X}_{36}$ lies between $97$ and $103$?

Step 1. By the CLT, $\bar{X}_{36}$ is approximately $\mathcal{N}(100,\; 15^2/36) = \mathcal{N}(100,\; 6.25)$, so its standard deviation (the standard error) is

$$ \mathrm{SE} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{36}} = 2.5. $$

Step 2. Standardize the endpoints:

$$ z_{\text{low}} = \frac{97 - 100}{2.5} = -1.2, \qquad z_{\text{high}} = \frac{103 - 100}{2.5} = 1.2. $$

Step 3. From a standard normal table, $P(-1.2 < Z < 1.2) \approx 0.7699$.

So there's roughly a 77% chance the sample mean falls in $[97, 103]$.
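The table lookup can be replaced by a few lines of Python. The sketch uses `statistics.NormalDist` from the standard library; the variable names are just labels for the quantities in the example.

```python
from statistics import NormalDist

mu, sigma, n = 100, 15, 36
se = sigma / n ** 0.5                         # 15 / 6 = 2.5
sampling_dist = NormalDist(mu=mu, sigma=se)   # CLT approximation for the sample mean
p = sampling_dist.cdf(103) - sampling_dist.cdf(97)
print(round(p, 4))                            # about 0.7699
```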

Example 2 · Converting to the standard normal

You're told the sample mean of a sample of size $n = 64$ from a population with $\mu = 50$, $\sigma = 8$ is $\bar{x} = 52$. How extreme is this — express it as a $z$-score.

Step 1. Standard error of the mean:

$$ \mathrm{SE} = \frac{8}{\sqrt{64}} = 1. $$

Step 2. $z$-score:

$$ z = \frac{\bar{x} - \mu}{\mathrm{SE}} = \frac{52 - 50}{1} = 2. $$

A sample mean of 52 is 2 standard errors above the population mean. Under the CLT-justified normal approximation, $P(\bar{X}_{64} \geq 52) \approx P(Z \geq 2) \approx 0.0228$: if the population mean really is 50, a sample mean this high turns up only about 2.3% of the time.
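The same two steps in code, as a quick check using the standard library's `NormalDist`:

```python
from statistics import NormalDist

mu, sigma, n, x_bar = 50, 8, 64, 52
se = sigma / n ** 0.5              # 8 / 8 = 1
z = (x_bar - mu) / se              # 2.0
p_upper = 1 - NormalDist().cdf(z)  # upper-tail probability beyond z
print(z, round(p_upper, 4))        # 2.0 and about 0.0228
```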

Example 3 · Normal approximation to a binomial

You flip a fair coin $n = 100$ times. What's the approximate probability of getting at least 60 heads?

Step 1. Let $X = $ number of heads $\sim \mathrm{Binomial}(100, 0.5)$. Recognize $X = \sum_{i=1}^{100} Y_i$ where each $Y_i \in \{0,1\}$ is one flip — a sum of i.i.d. Bernoullis with $\mu = 0.5$, $\sigma^2 = 0.25$.

Step 2. By the CLT (applied to the sum, not the average):

$$ X \;\overset{\text{approx}}{\sim}\; \mathcal{N}(np,\; np(1-p)) = \mathcal{N}(50,\; 25). $$

So $\mathrm{SD}(X) = 5$.

Step 3. Standardize, with a half-step continuity correction (since $X$ is discrete):

$$ z = \frac{59.5 - 50}{5} = 1.9. $$

Step 4. $P(X \geq 60) \approx P(Z \geq 1.9) \approx 0.0287$, or about 2.9%.

The exact binomial answer is $\approx 0.0284$ — the CLT-based approximation is off by about 0.0003. Not bad for an approximation.
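Both numbers can be reproduced directly. The exact value sums binomial coefficients with `math.comb`, and the approximation applies the continuity correction at 59.5. Variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 100, 0.5, 60
# Fair coin: every sequence of n flips has probability 2**-n.
exact = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
mean, sd = n * p, sqrt(n * p * (1 - p))            # 50 and 5
approx = 1 - NormalDist(mean, sd).cdf(k - 0.5)     # continuity correction at 59.5
print(round(exact, 4), round(approx, 4))           # about 0.0284 and 0.0287
```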

Example 4 · Computing standard error

A study reports the standard deviation of individual measurements is $\sigma = 12$. The researcher takes $n = 144$ measurements. What is the standard error of the resulting sample mean?

Step 1. Standard error is the standard deviation of the sample mean:

$$ \mathrm{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{144}} = \frac{12}{12} = 1. $$

Step 2. Interpret. Individual measurements vary with $\mathrm{SD} = 12$. The sample mean of 144 of them varies, across hypothetical repeated experiments, with $\mathrm{SD} = 1$. Twelvefold reduction — because $\sqrt{144} = 12$.

To halve the standard error (from 1 to 0.5) you'd need $n = 576$ measurements, four times as many. $\sqrt{n}$ scaling is unforgiving.
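Both directions of the calculation, from $n$ to the standard error and from a target standard error back to $n$, fit in two small helpers. The function names are hypothetical.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean of n independent draws with standard deviation sigma."""
    return sigma / math.sqrt(n)

def samples_needed(sigma, target_se):
    """Smallest n whose standard error is at most target_se."""
    return math.ceil((sigma / target_se) ** 2)

print(standard_error(12, 144))    # 1.0
print(samples_needed(12, 0.5))    # 576
```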

Example 5 · Why sample means are less variable than individual observations

Suppose adult heights have mean $\mu = 170$ cm and standard deviation $\sigma = 7$ cm. Which is more likely:

  1. A single randomly chosen adult is taller than 175 cm.
  2. The mean height of a random group of 49 adults is greater than 175 cm.

Case 1. Standardize the individual (treating heights themselves as roughly normal, an extra assumption the CLT does not supply):

$$ z_1 = \frac{175 - 170}{7} \approx 0.71, \quad P(Z > 0.71) \approx 0.239. $$

About a 24% chance.

Case 2. By the CLT, $\bar{X}_{49} \overset{\text{approx}}{\sim} \mathcal{N}(170,\; 7^2/49) = \mathcal{N}(170, 1)$. Standardize:

$$ z_2 = \frac{175 - 170}{1} = 5, \quad P(Z > 5) \approx 3 \times 10^{-7}. $$

About one in three million.

Lesson. A single tall person is unremarkable; a group of 49 whose average height is 175 cm would be astonishing. Sample means concentrate around $\mu$ much more tightly than individual observations do — by exactly a factor of $\sqrt{n}$. This is the whole reason statistical inference works.
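The two probabilities can be compared side by side in code. The sketch treats individual heights as normal for Case 1, as the table lookup does, and uses the unrounded $z = 5/7$, so its first probability differs slightly in the third decimal from the figure obtained with $z \approx 0.71$.

```python
from statistics import NormalDist

mu, sigma, n, threshold = 170, 7, 49, 175
# Case 1: one adult, assuming heights are normally distributed.
p_individual = 1 - NormalDist(mu, sigma).cdf(threshold)
# Case 2: mean of 49 adults, with standard error sigma / sqrt(n) = 1.
p_group_mean = 1 - NormalDist(mu, sigma / n ** 0.5).cdf(threshold)
print(p_individual, p_group_mean, p_individual / p_group_mean)
```

The last printed number is how many times more likely the single tall individual is than the tall group average.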
