1. Spread, not center
Imagine two basketball players, Alice and Bob. Both average 20 points per game over a season. On paper they look identical — but watch them play and the truth is obvious. Alice scores 19, 21, 20, 20, 20. Bob scores 5, 35, 10, 40, 10. Same mean, completely different stories.
The mean throws away a huge amount of information. To distinguish Alice from Bob, you need a number that captures how far the data wanders from the center. That number is what variance and standard deviation are built to measure.
Two datasets can share a mean and still describe completely different realities. Spread is what separates a reliable, predictable process from a wildly variable one. Without it, "the average customer waits 5 minutes" tells you almost nothing useful.
The naive idea is to average the signed deviations of each point from the mean. But there's a problem: the positive and negative deviations always cancel exactly, as a direct consequence of how the mean is defined. Their sum is zero, every time:
$$ \sum_{i=1}^{n} (x_i - \bar{x}) = 0 $$
So we have to do something to the deviations before summing them. There are two natural choices: take absolute values, or square them. Squaring wins for reasons we'll see in a moment.
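The cancellation is easy to check numerically. The sketch below uses the Alice and Bob scores from the opening example; the helper name `deviations` is illustrative, not part of any library.

```python
alice = [19, 21, 20, 20, 20]
bob = [5, 35, 10, 40, 10]

def deviations(values):
    """Return each value's signed distance from the mean."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

for name, scores in [("Alice", alice), ("Bob", bob)]:
    devs = deviations(scores)
    signed_total = sum(devs)                    # cancels to zero
    absolute_total = sum(abs(d) for d in devs)  # does not cancel
    squared_total = sum(d * d for d in devs)    # does not cancel
    print(name, signed_total, absolute_total, squared_total)
```

The signed total is zero for both players, while the absolute and squared totals are far larger for Bob. With data whose mean is not exactly representable in floating point, the signed total may come out as a tiny rounding residue rather than an exact zero.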
2. Variance
Variance is the mean of the squared deviations from the mean. For a population of $n$ values $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$:
$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 $$
The recipe is four steps, and you should be able to do it in your sleep:
- Compute the mean $\bar{x}$.
- For each value, find its deviation from the mean: $x_i - \bar{x}$.
- Square each deviation.
- Average the squared deviations.
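The four steps translate almost line for line into code. This is a minimal sketch applied to the Alice and Bob scores; `population_variance` is a hypothetical helper written out for clarity.

```python
def population_variance(values):
    """Population variance: the mean of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n                     # step 1: compute the mean
    deviations = [x - mean for x in values]    # step 2: deviation of each value
    squared = [d ** 2 for d in deviations]     # step 3: square each deviation
    return sum(squared) / n                    # step 4: average the squares

alice = [19, 21, 20, 20, 20]
bob = [5, 35, 10, 40, 10]

print(population_variance(alice))  # 0.4
print(population_variance(bob))    # 210.0
```

Both players average 20 points, but Bob's variance is more than 500 times Alice's.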
Why squared, not absolute?
Both $|x_i - \bar{x}|$ and $(x_i - \bar{x})^2$ get rid of the sign problem. So why does every introductory textbook reach for the square? Three reasons, from least to most important:
- Squaring is smooth. The function $f(d) = d^2$ is differentiable everywhere; $|d|$ has a kink at zero. Calculus on smooth things is enormously easier, and a lot of statistics is calculus on variance.
- Squaring punishes outliers more. A value twice as far from the mean contributes four times as much to the variance. This makes variance very sensitive to extreme observations — sometimes a feature, sometimes a bug.
- Squared distances add cleanly. Variance of a sum of independent random variables equals the sum of their variances. The analogous statement for absolute deviation is simply false. This single property is what makes variance the right tool for most theoretical work.
The "average absolute deviation," $\tfrac{1}{n}\sum|x_i - \bar{x}|$, is a perfectly valid measure of spread — sometimes more robust to outliers than variance. It just doesn't behave as nicely under the algebraic operations that statisticians need to do constantly, so it isn't the default.
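The additivity claim can be checked exactly on a small case. The sketch below assumes two independent fair dice (an illustrative choice, not taken from the text above) and enumerates all 36 equally likely outcomes, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

def mean(values):
    return Fraction(sum(values), len(values))

def variance(values):
    m = mean(values)
    return mean([(x - m) ** 2 for x in values])

def mean_abs_dev(values):
    m = mean(values)
    return mean([abs(x - m) for x in values])

one_die = list(range(1, 7))
# Sum of two independent dice, over all 36 equally likely outcomes.
two_dice = [a + b for a, b in product(one_die, repeat=2)]

print(variance(one_die), variance(two_dice))          # 35/12 35/6
print(mean_abs_dev(one_die), mean_abs_dev(two_dice))  # 3/2 35/18
```

The variance of the sum is exactly twice the variance of one die. The average absolute deviation of the sum is 35/18, not the 3 you would get if absolute deviations added.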
3. Standard deviation
Variance has one annoying property: its units are the square of the data's units. If your numbers are heights in centimeters, the variance is in cm² — a quantity nobody has any intuition for. The fix is just to take the square root.
The standard deviation is the square root of the variance. For a population:
$$ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} $$
Standard deviation is in the same units as the original data, which makes it the right number to quote when communicating with humans.
You should think of variance as the right thing to work with algebraically, and standard deviation as the right thing to report. They carry exactly the same information — knowing one tells you the other — but they live in different units, and that matters when you have to interpret a number.
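To see the difference in units on the basketball example, here is a short sketch using Python's standard `statistics` module. The data are the Alice and Bob scores from the first section, treated as a full population.

```python
import math
import statistics

alice = [19, 21, 20, 20, 20]
bob = [5, 35, 10, 40, 10]

for name, scores in [("Alice", alice), ("Bob", bob)]:
    var = statistics.pvariance(scores)  # population variance, in points squared
    sd = math.sqrt(var)                 # population SD, back in points
    print(f"{name}: variance = {var:.2f} points^2, SD = {sd:.2f} points")
# Alice: variance = 0.40 points^2, SD = 0.63 points
# Bob: variance = 210.00 points^2, SD = 14.49 points
```

"Bob's scoring varies by about 14 points around his average" is a sentence a coach can use; "210 points squared" is not.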
If you want to do math, use variance. If you want to talk to a person, use standard deviation.
4. Sample versus population
So far we've assumed you know every value in the dataset — that the $n$ numbers are the whole population. In practice you almost never do; you have a sample, and you're using it to estimate properties of an unseen population.
When estimating from a sample, the formula changes slightly. The sample variance divides by $n - 1$ instead of $n$:
$$ s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2 $$
This adjustment is called Bessel's correction. The reason is subtle but worth understanding once.
When you compute the sample mean $\bar{x}$ from the data and then measure deviations from $\bar{x}$, those deviations are systematically smaller than the deviations from the true (unknown) population mean would have been. The sample mean is, by construction, the value that minimizes the sum of squared deviations from the sample. So using $\bar{x}$ instead of the true mean understates the spread.
Formally, you've spent one "degree of freedom" estimating the mean. You only have $n - 1$ independent pieces of information left for estimating the variance. Dividing by $n - 1$ exactly cancels the downward bias and produces an estimator whose expected value equals the true population variance.
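The bias can be verified exhaustively on a toy case instead of by simulation. The sketch below makes two assumptions for illustration: the population is the six faces of a fair die, and samples of size 3 are drawn with replacement, so the draws are independent. It averages both estimators over every possible sample, using exact fractions.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # hypothetical population: faces of a fair die
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # true variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

n = 3
# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)  # 35/12 35/18 35/12
```

Dividing by $n$ averages out to $\tfrac{n-1}{n}\sigma^2$, which is too small; dividing by $n - 1$ averages out to exactly $\sigma^2$.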
Almost always: divide by $n - 1$. Calculators and spreadsheets default to it for a reason. Only divide by $n$ when you genuinely have the entire population — every employee, every patient, every member of the set — which in practice is rare.
| Quantity | Symbol | Divide by | When |
|---|---|---|---|
| Population variance | $\sigma^2$ | $n$ | You have every value in the population |
| Sample variance | $s^2$ | $n - 1$ | You have a sample and want to estimate $\sigma^2$ |
| Population SD | $\sigma$ | — | $\sqrt{\sigma^2}$ |
| Sample SD | $s$ | — | $\sqrt{s^2}$ |
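In Python's standard `statistics` module the four quantities in the table map onto four functions. A brief sketch, reusing Bob's scores as illustrative data:

```python
import statistics

bob = [5, 35, 10, 40, 10]

pop_var = statistics.pvariance(bob)   # population variance, divides by n
samp_var = statistics.variance(bob)   # sample variance, divides by n - 1
pop_sd = statistics.pstdev(bob)       # square root of the population variance
samp_sd = statistics.stdev(bob)       # square root of the sample variance

print(f"{pop_var:.1f} {samp_var:.1f} {pop_sd:.2f} {samp_sd:.2f}")
# 210.0 262.5 14.49 16.20
```

Defaults differ between tools, so it is worth checking which divisor you are getting. NumPy's `var` and `std`, for instance, divide by $n$ unless you pass `ddof=1`.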
5. The empirical rule (68-95-99.7)
For data that follows a roughly bell-shaped (normal) distribution, the standard deviation comes with a remarkable interpretation. Almost the entire dataset lives within just a few standard deviations of the mean:
- About 68% of values fall within $\pm 1\sigma$ of the mean.
- About 95% fall within $\pm 2\sigma$.
- About 99.7% fall within $\pm 3\sigma$.
This is the single most useful rule of thumb in elementary statistics. Given a mean and a standard deviation, you can sketch the shape of an entire bell-curve dataset on a napkin.
An IQ test, for example, is calibrated so the mean is 100 and the standard deviation is 15. The empirical rule then tells you immediately: roughly two-thirds of people score between 85 and 115, about 19 in 20 score between 70 and 130, and a score outside 55–145 happens to about three people in a thousand.
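The IQ figures can be reproduced from the normal distribution's cumulative distribution function. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later) and assumes IQ scores are exactly normal, which is an idealization.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # the calibration described above

for k in (1, 2, 3):
    lo = iq.mean - k * iq.stdev
    hi = iq.mean + k * iq.stdev
    inside = iq.cdf(hi) - iq.cdf(lo)  # fraction of scores between lo and hi
    print(f"{lo:.0f} to {hi:.0f}: {inside:.4f}")
# 85 to 115: 0.6827
# 70 to 130: 0.9545
# 55 to 145: 0.9973
```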
The 68-95-99.7 numbers are properties of the bell curve specifically. For income data, response times, or anything else with a heavy tail, the percentages can be wildly different: sometimes well over 68% of the data sits within one standard deviation, sometimes far less. Don't apply the rule until you've checked the shape.
6. Chebyshev's inequality
The empirical rule is sharp but only works for normal data. When you have no idea what shape your distribution takes, there's a weaker but universal result that still gives you a guarantee: Chebyshev's inequality.
For any distribution with finite mean $\mu$ and finite standard deviation $\sigma$, and for any $k > 1$:
$$ P\bigl(|X - \mu| \geq k\sigma\bigr) \leq \frac{1}{k^2} $$
Equivalently, at least $1 - 1/k^2$ of the data lies within $k$ standard deviations of the mean.
Plug in $k = 2$: at least $1 - 1/4 = 75\%$ of any dataset sits within $\pm 2\sigma$. Plug in $k = 3$: at least $1 - 1/9 \approx 88.9\%$ sits within $\pm 3\sigma$. The bounds are loose — for normal data the true numbers (95%, 99.7%) are much higher — but they hold for every distribution, no matter how exotic.
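To see how loose the bound is, the sketch below compares it with the exact fractions for two distributions. The normal case comes from the text; the exponential distribution with rate 1 (mean 1, standard deviation 1) is an assumed example of a skewed shape, chosen because its fractions have a simple closed form.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

def chebyshev_lower_bound(k):
    """Guaranteed minimum fraction within k SDs, for any distribution."""
    return 1 - 1 / k**2

def normal_within(k):
    """Exact fraction of a normal distribution within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def exponential_within(k):
    """Exact fraction of a rate-1 exponential within k SDs of its mean.

    Valid for k >= 1, where the lower limit 1 - k is at or below zero.
    """
    return 1 - math.exp(-(1 + k))

for k in (2, 3):
    print(k, f"{chebyshev_lower_bound(k):.3f}",
          f"{normal_within(k):.3f}", f"{exponential_within(k):.3f}")
# 2 0.750 0.954 0.950
# 3 0.889 0.997 0.982
```

Both distributions clear the Chebyshev floor comfortably, and the skewed one falls short of the 99.7% that the empirical rule would promise at three standard deviations.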
Use the empirical rule when you know (or can credibly assume) the data is roughly bell-shaped. Use Chebyshev when you don't know, or when the data is clearly skewed or heavy-tailed. Chebyshev gives a guarantee; the empirical rule gives a tight approximation.
7. Z-scores
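Once you know a distribution's mean and standard deviation, you can re-express every value in a universal currency: how many standard deviations is this above or below the mean? That number is the z-score. For a value $x$ from a distribution with mean $\mu$ and standard deviation $\sigma$:
$$ z = \frac{x - \mu}{\sigma} $$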
A z-score of $+1.5$ means "1.5 standard deviations above the mean." A z-score of $-0.5$ means "half a standard deviation below the mean."
Z-scores let you compare apples and oranges. A score of $85$ on a test with mean $70$ and SD $10$ has $z = 1.5$. A height of $190$ cm in a population with mean $175$ cm and SD $7$ cm has $z \approx 2.14$. The height is more "extreme" relative to its distribution than the test score is to its — and the z-score makes that visible without any further work.
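The two comparisons above take one line each in code. `z_score` is a hypothetical helper that subtracts the mean and divides by the standard deviation.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

test_z = z_score(85, mean=70, sd=10)
height_z = z_score(190, mean=175, sd=7)

print(f"test score: z = {test_z:.2f}")    # z = 1.50
print(f"height:     z = {height_z:.2f}")  # z = 2.14
```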
Combined with the empirical rule, z-scores give an instant sanity check. A z of $\pm 1$ is unremarkable. A z beyond $\pm 2$ is uncommon, roughly one value in 22. A z beyond $\pm 3$ is genuinely unusual for normal data: about one value in 370, counting both tails together.
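How rare a given z-score is can be read off the standard normal CDF. The sketch below assumes exactly normal data and reports both the two-tailed frequency (a value at least that far from the mean in either direction) and the one-tailed frequency (that far in one particular direction).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

for z in (1, 2, 3):
    one_tail = 1 - std_normal.cdf(z)  # P(Z > z)
    two_tail = 2 * one_tail           # P(|Z| > z), by symmetry
    print(f"|z| > {z}: about 1 in {1 / two_tail:.0f} "
          f"(one tail only: 1 in {1 / one_tail:.0f})")
# |z| > 1: about 1 in 3 (one tail only: 1 in 6)
# |z| > 2: about 1 in 22 (one tail only: 1 in 44)
# |z| > 3: about 1 in 370 (one tail only: 1 in 741)
```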
8. Other measures of spread: range and IQR
Standard deviation isn't the only way to summarize spread. Two simpler alternatives sit at opposite ends of the trade-off curve.
Range
The range is just the maximum minus the minimum:
$$ \text{range} = \max(x) - \min(x) $$
It's trivial to compute and easy to explain. It's also brittle: a single outlier can blow it up arbitrarily, and it ignores everything between the two extremes. Use it for a quick glance, not for serious analysis.
Interquartile range (IQR)
The interquartile range is the spread of the middle 50% of the data:
$$ \text{IQR} = Q_3 - Q_1 $$
where $Q_1$ and $Q_3$ are the 25th and 75th percentiles. By construction, the IQR ignores the most extreme quarter of values on each side, which makes it robust to outliers. A single absurd value can shift it only slightly, if at all.
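A minimal sketch of the computation with the standard library's `statistics.quantiles`. The eleven data values are made up for illustration.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 13]  # illustrative values

# With n=4, quantiles() returns the three cut points Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 4.0 7.0 10.0 6.0
```

Quartiles on small datasets depend on the interpolation convention. `statistics.quantiles` defaults to `method="exclusive"`; passing `method="inclusive"`, or using a different tool, can give slightly different cut points for the same data.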
| Measure | Formula | Strength | Weakness |
|---|---|---|---|
| Range | $\max - \min$ | Trivial to compute | Wrecked by a single outlier |
| IQR | $Q_3 - Q_1$ | Robust to outliers | Throws away half the data |
| Standard deviation | $\sqrt{\tfrac{1}{n}\sum(x_i - \bar{x})^2}$ | Uses every point; algebra-friendly | Sensitive to outliers |
Reach for the IQR when your data is skewed or contaminated. Reach for the standard deviation when the distribution is roughly symmetric and you want to do any downstream math.
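The trade-offs in the table show up clearly when one value in a dataset is corrupted. The sketch below uses made-up data and a hypothetical helper, `spread_summary`, and compares the three measures before and after replacing the largest value with an absurd one.

```python
import statistics

def spread_summary(values):
    """Return (range, IQR, population SD) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return max(values) - min(values), q3 - q1, statistics.pstdev(values)

clean = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 13]  # illustrative values
contaminated = clean[:-1] + [1300]             # one absurd value replaces the largest

for label, values in [("clean", clean), ("contaminated", contaminated)]:
    value_range, iqr, sd = spread_summary(values)
    print(f"{label:>12}: range = {value_range}, IQR = {iqr}, SD = {sd:.1f}")
```

The range jumps from 11 to 1298 and the standard deviation grows by roughly two orders of magnitude, while the IQR stays at 6.0 because neither quartile is affected in this example.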