Inferential Statistics

The move from "what does this sample look like" to "what can I claim about the population it came from." Inferential statistics is the machinery — sampling distributions, standard errors, confidence intervals — that turns a single finite sample into a calibrated statement about a world we never get to see in full.

What you'll leave with

  • A sharp distinction between populations and samples, and between parameters and statistics.
  • The single big idea: a sampling distribution is the distribution of a statistic (like $\bar X$) across hypothetical repeated samples.
  • Why the standard error of the mean is $\sigma/\sqrt n$ — and why doubling precision requires quadrupling the sample.
  • What a 95% confidence interval actually means (a procedure with 95% long-run coverage) — and the famous misinterpretation to never make again.
  • Bias vs variance for estimators, why the sample mean is unbiased, and when the $t$-distribution replaces the normal.

1. Why inference exists

Descriptive statistics summarizes data you already have — the mean, median, standard deviation, and shape of a sample sit in front of you, computable in a single pass. Inferential statistics takes the harder step: it treats that sample as a window onto a population you cannot fully observe and asks what you may responsibly claim about the world behind it.

The catch is that a sample is a finite, randomly selected slice. Pick a different 100 people and the mean shifts a little. Pick a third sample and it shifts again. Any single number you compute is contaminated by the luck of which observations happened to fall into your hands. Inferential statistics is the discipline of turning that contamination into a quantified, honest statement: not "the population mean is $50.3$," but "an interval built by a method that captures the truth 95% of the time puts the population mean between $48$ and $53$."

Descriptive statistics tells you what the sample is. Inferential statistics tells you what the sample is evidence for.

2. Population vs sample, parameter vs statistic

The whole framework depends on holding four words apart. Mistaking one for another is the source of most undergraduate confusion.

Population

The complete set of units you'd like to know about — every voter in the country, every widget the factory will ever produce, every patient who could in principle receive the drug. Usually too large, too future, or too hidden to enumerate.

Sample

A finite subset actually drawn from the population — the $n$ rows in your dataset. Random sampling is the assumption that makes the rest of the math work; non-random samples can produce confident-looking numbers about nothing in particular.

A parameter is a fixed (but unknown) number that describes the population: the true mean $\mu$, the true standard deviation $\sigma$, the true proportion $p$. Parameters are what you want to know. A statistic is a number computed from the sample: the sample mean $\bar X$, the sample standard deviation $S$, the sample proportion $\hat p$. Statistics are what you actually have.

| Quantity | Population (parameter) | Sample (statistic) |
| --- | --- | --- |
| Mean | $\mu$ | $\bar X = \tfrac{1}{n}\sum X_i$ |
| Variance | $\sigma^2$ | $S^2 = \tfrac{1}{n-1}\sum (X_i - \bar X)^2$ |
| Standard deviation | $\sigma$ | $S$ |
| Proportion | $p$ | $\hat p = \tfrac{\text{successes}}{n}$ |

Greek letters for the population, Latin for the sample. That convention isn't decorative — it is the visible reminder that the two quantities live in different worlds. $\mu$ is a fixed real number you can never see. $\bar X$ is a random variable: it depends on which sample you happened to draw.

Convention

Capital letters like $X_1, \ldots, X_n$ refer to the sample as random variables (before you look). Lowercase $x_1, \ldots, x_n$ are the realized numbers (after you look). $\bar X$ is the random sample mean; $\bar x$ is its observed value.
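
As a small illustration of the "statistic" column, the sketch below computes $\bar x$, $s^2$, $s$, and $\hat p$ from realized data using Python's standard `statistics` module. The data values are made up purely for illustration; nothing here comes from a real study.

```python
import statistics

# Hypothetical sample of n = 5 measurements (illustrative values only).
sample = [4.0, 8.0, 6.0, 5.0, 7.0]

x_bar = statistics.mean(sample)        # sample mean, estimates mu
s2 = statistics.variance(sample)       # sample variance, divides by n - 1
s = statistics.stdev(sample)           # sample standard deviation, sqrt of s2

# Hypothetical yes/no outcomes (1 = success) for a sample proportion.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1]
p_hat = sum(outcomes) / len(outcomes)  # sample proportion, estimates p

print(x_bar, s2, s, p_hat)
```

Every number printed here is a statistic: rerun the study with a different sample and each one would change, while $\mu$, $\sigma$, and $p$ would not.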

3. The sampling distribution

Here is the central concept. Imagine you could draw not one but many independent samples of size $n$ from the same population, computing the sample mean $\bar X$ from each. Each $\bar X$ comes out a little different. The collection of all those $\bar X$ values — across every possible sample of size $n$ — has its own distribution.

Sampling distribution

The probability distribution of a statistic across all possible samples of a given size $n$ drawn from a population. It tells you how the statistic itself varies from sample to sample.

Three facts about the sampling distribution of $\bar X$ make the rest of inferential statistics work:

  1. Its mean equals the population mean. $E[\bar X] = \mu$. On average, the sample mean is right.
  2. Its standard deviation is smaller than the population's — and by a precise amount: $\operatorname{SD}(\bar X) = \sigma/\sqrt n$. Averaging shrinks noise.
  3. For large $n$, its shape is approximately normal, whatever the population's shape, provided the population has finite variance. This is the central limit theorem, and it is the engine that makes inference work even when the population itself is wildly non-normal.

[Figure: Sampling distribution of the sample mean. Top panel: the population (wide), individual observations $X_i$ with spread $\sigma$, centered at $\mu$. Bottom panel: the sampling distribution of $\bar x$ (narrow), built from many samples of size $n$, with spread $\sigma/\sqrt n$, centered at $\mu$.]

A wide population (top) gives rise to a much narrower sampling distribution of the sample mean (bottom). Both are centered at $\mu$; the bottom is tighter by a factor of $\sqrt n$.

Mental model

The sampling distribution is not something you ever construct from data — you usually have just one sample. It is a thought experiment about what the statistic would look like if you could repeat the whole study, again and again. That thought experiment is the basis for every probabilistic claim about the population.
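
A computer can run that thought experiment when the population is known. The sketch below is a minimal simulation under assumed settings: the population is Exponential(1), chosen only because it is strongly skewed, and the sample size and repetition count are arbitrary. It checks that the simulated sample means center on $\mu$ and spread out by about $\sigma/\sqrt n$.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, rate=1.0, seed=0):
    """Draw `reps` samples of size `n` from Exponential(rate); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(reps)
    ]

# Exponential(1) is strongly right-skewed, with mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
n = 50
means = simulate_sample_means(n=n, reps=5000)

center = statistics.fmean(means)   # should be close to mu
spread = statistics.stdev(means)   # should be close to sigma / sqrt(n)
theory = sigma / math.sqrt(n)

print(f"mean of sample means: {center:.3f} (mu = {mu})")
print(f"SD of sample means:   {spread:.3f} (sigma/sqrt(n) = {theory:.3f})")
```

A histogram of `means` would also look roughly bell-shaped despite the skewed population, which is the third fact in action. In a real study you get only one entry of that list.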

4. The standard error $\sigma/\sqrt n$

The standard error of a statistic is the standard deviation of its sampling distribution. For the sample mean drawn from a population with standard deviation $\sigma$,

$$ \operatorname{SE}(\bar X) \;=\; \frac{\sigma}{\sqrt n}. $$

Two things deserve attention. First, the standard error is not the standard deviation of the population, and it is not the standard deviation of your sample. It is the standard deviation of $\bar X$ across hypothetical repeated samples. Giving it a separate name keeps that distinction visible.

Second, the $\sqrt n$ in the denominator is one of the most quietly important features of statistics. Precision improves with $n$, but only at a square-root rate:

| Sample size $n$ | $\operatorname{SE}(\bar X)$ relative to $\sigma$ | Improvement |
| --- | --- | --- |
| $1$ | $\sigma$ | baseline |
| $4$ | $\sigma/2$ | $2\times$ tighter |
| $25$ | $\sigma/5$ | $5\times$ tighter |
| $100$ | $\sigma/10$ | $10\times$ tighter |
| $10{,}000$ | $\sigma/100$ | $100\times$ tighter |

To cut your standard error in half you need four times as much data. To cut it by a factor of ten you need a hundred times as much. This is why huge sample sizes can still leave plenty of uncertainty, and why the marginal value of every extra observation is always shrinking.
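
The square-root rule is easy to turn around: given a target standard error, solve for $n$. The sketch below does both directions. The function names and the value $\sigma = 10$ are illustrative assumptions, not part of any library.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def required_n(sigma, target_se):
    """Smallest n whose standard error is at most target_se."""
    return math.ceil((sigma / target_se) ** 2)

sigma = 10.0  # hypothetical population SD
for n in (1, 4, 25, 100, 10_000):
    print(n, standard_error(sigma, n))

# Halving the target SE quadruples the required sample size.
print(required_n(sigma, 1.0), required_n(sigma, 0.5))
```

Because $n = (\sigma/\text{SE})^2$, the cost of precision grows quadratically: every halving of the target multiplies the bill by four.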

Standard error vs standard deviation

"Standard deviation" describes the spread of data; "standard error" describes the spread of an estimator. They have the same units and the same formal definition (an SD), but they're measuring different distributions. Reporting an SD when you meant an SE — or the reverse — silently exaggerates or understates precision by a factor of $\sqrt n$.

In practice you almost never know $\sigma$. The honest move is to plug in $S$, the sample standard deviation, and call the result the estimated standard error $S/\sqrt n$. That substitution is harmless for large $n$ — and exactly what triggers the $t$-distribution for small $n$, which we'll see in §7.

5. Point estimation, bias, and variance

A point estimate is a single best guess of a parameter, computed from the sample. The underlying recipe — the function that turns data into a number — is called the estimator. For the population mean, the canonical estimator is the sample mean: $\hat\mu = \bar X$.

Two qualities tell you whether an estimator is any good.

Bias

$\operatorname{Bias}(\hat\theta) = E[\hat\theta] - \theta$. An estimator is unbiased if, on average across repeated samples, it equals the true parameter — i.e. if its sampling distribution is centered on $\theta$.

Variance

$\operatorname{Var}(\hat\theta)$ — the spread of the estimator's sampling distribution. Low-variance estimators give similar answers across repeated samples; high-variance ones swing wildly.

Total error is captured by the mean squared error:

$$ \operatorname{MSE}(\hat\theta) \;=\; E\!\left[(\hat\theta - \theta)^2\right] \;=\; \operatorname{Var}(\hat\theta) \;+\; \operatorname{Bias}(\hat\theta)^2. $$

That decomposition is the bias–variance tradeoff in one line. You can shave variance at the cost of bias, or the reverse; the best estimator is the one that minimizes the sum.
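
To see the decomposition with numbers, the sketch below simulates two estimators of a normal mean: the plain sample mean, and a deliberately shrunken version $0.9\,\bar X$. The shrinkage factor, the population values, and the function name are all assumptions made for illustration.

```python
import random
import statistics

def mse_decomposition(estimates, theta):
    """Return (mse, variance, bias) computed from simulated estimates."""
    mse = statistics.fmean((e - theta) ** 2 for e in estimates)
    variance = statistics.pvariance(estimates)  # spread around their own mean
    bias = statistics.fmean(estimates) - theta
    return mse, variance, bias

rng = random.Random(1)
mu, sigma, n, reps = 5.0, 2.0, 10, 20_000

plain, shrunk = [], []
for _ in range(reps):
    x_bar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    plain.append(x_bar)          # unbiased sample mean
    shrunk.append(0.9 * x_bar)   # hypothetical shrinkage estimator: biased, lower variance

for name, est in (("sample mean", plain), ("0.9 * sample mean", shrunk)):
    mse, var, bias = mse_decomposition(est, mu)
    print(f"{name}: MSE={mse:.4f}  Var={var:.4f}  Bias^2={bias**2:.4f}")
```

In both rows the printed MSE equals variance plus squared bias. With these particular settings the shrunken estimator buys a smaller variance but pays more than that in squared bias, so its MSE is worse. Shrinkage only helps when the bias it introduces is small relative to the variance it removes.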

The sample mean is unbiased

$\bar X = \tfrac{1}{n}\sum X_i$ is the textbook example of an unbiased estimator:

$$ E[\bar X] \;=\; \frac{1}{n}\sum_{i=1}^{n} E[X_i] \;=\; \frac{1}{n} \cdot n\mu \;=\; \mu. $$

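Because the $X_i$ are independent, their variances add, so $\operatorname{Var}(\bar X) = \tfrac{1}{n^2} \cdot n\sigma^2 = \sigma^2/n$, which is why its standard error is $\sigma/\sqrt n$. Both pieces of the sampling distribution, center and spread, fall out of equally short arguments: linearity of expectation for the center, independence for the spread.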

Why $n - 1$ for sample variance

The sample variance is defined as $S^2 = \tfrac{1}{n-1}\sum (X_i - \bar X)^2$ — not divided by $n$. The $n - 1$ (called Bessel's correction) is exactly what makes $S^2$ unbiased for $\sigma^2$. Dividing by $n$ would systematically underestimate the population variance, because the deviations are measured against $\bar X$ rather than the unknown $\mu$.
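
The underestimate is easy to observe by simulation. The sketch below averages both versions of the variance estimate over many small samples from a normal population. The population SD of 3, the sample size of 5, and the function name are illustrative assumptions.

```python
import random
import statistics

def average_variance_estimates(n, reps, sigma=3.0, seed=2):
    """Average the divide-by-n and divide-by-(n-1) variance estimates over many samples."""
    rng = random.Random(seed)
    total_n, total_n_minus_1 = 0.0, 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_n += statistics.pvariance(sample)          # divides by n
        total_n_minus_1 += statistics.variance(sample)   # divides by n - 1
    return total_n / reps, total_n_minus_1 / reps

biased, unbiased = average_variance_estimates(n=5, reps=10_000)
print(f"true variance:      {3.0 ** 2:.2f}")
print(f"divide by n:        {biased:.2f}")    # expected near (n-1)/n * 9 = 7.2
print(f"divide by n - 1:    {unbiased:.2f}")  # expected near 9
```

The divide-by-$n$ average lands near $\tfrac{n-1}{n}\sigma^2$, which is exactly the shortfall that Bessel's correction undoes.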

6. Confidence intervals: what they do and don't mean

A point estimate alone is misleading — it suggests false precision. A confidence interval reports a range, calibrated so that the procedure that produced it captures the true parameter a known fraction of the time.

For a population mean $\mu$ with known $\sigma$, the standard $z$-based interval at confidence level $1 - \alpha$ is:

$$ \bar X \;\pm\; z_{\alpha/2} \cdot \frac{\sigma}{\sqrt n}. $$

For 95% confidence, $z_{0.025} \approx 1.96$, so the interval is $\bar X \pm 1.96 \cdot \sigma/\sqrt n$. The two pieces are doing two jobs: $\bar X$ is the point estimate; $1.96 \cdot \sigma/\sqrt n$ is the margin of error, set by how far $\bar X$ is likely to stray from $\mu$.
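
The formula translates almost directly into code. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library to look up $z_{\alpha/2}$. The function name `z_interval` and the input numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """z-based confidence interval for a mean when sigma is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical inputs: x_bar = 50.3, sigma = 12, n = 100.
low, high = z_interval(50.3, 12.0, 100)
print(f"({low:.2f}, {high:.2f})")
```

Passing a different `confidence` changes only the critical value; the standard error is untouched.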

The famous misinterpretation

Confidence intervals are the most-misinterpreted concept in introductory statistics. Almost everyone (students, journalists, even some textbooks) wants to say:

"There's a 95% chance the true mean is in this interval."

That sentence is wrong. The true mean $\mu$ is a fixed number; it is either in the interval or it is not. There is no random event there to assign probability to. What the 95% refers to is the procedure, not the interval you happen to have.

What a 95% CI does mean

"If we repeated the procedure many times — draw a fresh sample, compute the same interval — 95% of the resulting intervals would contain the true $\mu$. The procedure is calibrated to have 95% long-run coverage."

The randomness lives in the interval endpoints, which change with every sample. The parameter is fixed.
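
Long-run coverage can be checked by brute force when the true $\mu$ is known to the simulator. The sketch below repeatedly draws a sample from a normal population, builds the 95% $z$-interval, and counts how often the fixed $\mu$ falls inside. The population values, repetition count, and function name are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, reps, confidence=0.95, seed=3):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / sqrt(n)   # same width every time; only the center moves
    hits = 0
    for _ in range(reps):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / reps

rate = coverage(mu=50.0, sigma=10.0, n=25, reps=4000)
print(f"fraction of intervals containing mu: {rate:.3f}")  # expected near 0.95
```

Note what varies inside the loop: `x_bar`, and therefore the endpoints. The value `mu` never changes, which is the whole point of the frequentist reading.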

What it does NOT mean

  • "There's a 95% chance $\mu$ is in this particular interval." (No: $\mu$ is fixed; this interval either contains it or doesn't.)
  • "95% of the population is in this interval." (No: a CI is about a parameter, not the data.)
  • "95% of future sample means will land in this interval." (No: this is about $\mu$, not future $\bar X$.)

Pitfall

The Bayesian framework does let you say "there's a 95% probability $\mu$ is in this range" — but only via a credible interval, computed from a posterior distribution with a prior. Frequentist confidence intervals (the ones in this section) are a different object, and conflating the two is the single most common error in statistical reporting.

How the width responds to inputs

| If you change… | The CI width… | Because |
| --- | --- | --- |
| Sample size $n$ up | shrinks (as $1/\sqrt n$) | $\operatorname{SE}$ shrinks |
| Population SD $\sigma$ down | shrinks | $\operatorname{SE}$ shrinks |
| Confidence up (e.g. 95% → 99%) | grows | bigger $z_{\alpha/2}$ needed for higher coverage |

You can't get a tighter interval and stronger confidence and the same sample size — something has to give. The three knobs trade off against each other, and that tradeoff is the entire economics of study design.
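
The three knobs can be turned one at a time in a few lines. This is a minimal sketch of the $z$-interval width $2\,z_{\alpha/2}\,\sigma/\sqrt n$; the baseline values and the function name `ci_width` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(sigma, n, confidence):
    """Full width of the z-interval: 2 * z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return 2.0 * z * sigma / sqrt(n)

base = ci_width(sigma=10.0, n=100, confidence=0.95)
print(f"baseline:        {base:.2f}")
print(f"n = 400:         {ci_width(10.0, 400, 0.95):.2f}")   # half the baseline
print(f"sigma = 5:       {ci_width(5.0, 100, 0.95):.2f}")    # half the baseline
print(f"99% confidence:  {ci_width(10.0, 100, 0.99):.2f}")   # wider than baseline
```

Of the three, $n$ is usually the only knob a study designer controls directly, which is why sample-size planning dominates the design stage.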

7. When $\sigma$ is unknown: the $t$-distribution

The clean formula $\bar X \pm z_{\alpha/2} \cdot \sigma/\sqrt n$ assumes you know the population standard deviation $\sigma$. You almost never do. The natural fix — substitute the sample standard deviation $S$ — works fine when $n$ is large, but introduces real extra uncertainty when $n$ is small: now you're estimating both the mean and the spread from the same handful of data, and the standardized quantity

$$ T \;=\; \frac{\bar X - \mu}{S/\sqrt n} $$

is no longer standard normal. It follows Student's $t$-distribution with $n - 1$ degrees of freedom.

The $t$-distribution is bell-shaped and symmetric, like the normal, but has heavier tails — extreme values are more likely than the normal would predict, which is exactly the kind of cushion you want when you don't really know $\sigma$. As $n \to \infty$, the $t$-distribution converges to the normal, and the two intervals become indistinguishable.

The $t$-based confidence interval simply swaps $z$ for the corresponding $t$ critical value:

$$ \bar X \;\pm\; t_{\alpha/2,\, n-1} \cdot \frac{S}{\sqrt n}. $$

| Situation | Critical value | Use |
| --- | --- | --- |
| $\sigma$ known (rare) | $z_{\alpha/2}$ | $z$-interval |
| $\sigma$ unknown, small $n$ | $t_{\alpha/2,\, n-1}$ | $t$-interval (the everyday case) |
| $\sigma$ unknown, large $n$ ($\gtrsim 30$) | $z_{\alpha/2}$ (close enough) | either works in practice |

Where the $t$ came from

William Sealy Gosset derived the $t$-distribution in 1908 while working as a chemist at Guinness, where small-batch beer testing meant he routinely had $n$ in the single digits. Guinness's confidentiality policy forced him to publish under the pseudonym "Student," and the name stuck.
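
The $t$-based interval from this section can be sketched in a few lines. Python's standard library has no $t$ quantile function, so this version takes the critical value as an argument, read from a $t$-table; a library routine such as `scipy.stats.t.ppf` could supply it instead if SciPy is available. The function name is an illustrative assumption, and the inputs reuse the recovery-time numbers from the worked examples.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """t-based confidence interval; t_crit is the table value for n - 1 degrees of freedom."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Small sample: n = 10, x_bar = 8.4, s = 1.8.
# t_crit = 2.262 is the 95% two-sided critical value for 9 degrees of freedom.
low, high = t_interval(x_bar=8.4, s=1.8, n=10, t_crit=2.262)
print(f"t-interval: ({low:.2f}, {high:.2f})")

# Same data with the normal critical value 1.96: a narrower interval.
z_low, z_high = t_interval(x_bar=8.4, s=1.8, n=10, t_crit=1.96)
print(f"z-interval: ({z_low:.2f}, {z_high:.2f})")
```

The only difference between the two calls is the critical value, and the gap between them closes as the degrees of freedom grow.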

9. Common pitfalls

"95% chance the parameter is in the interval"

The single most common mistake. Repeat: $\mu$ is fixed, the interval is random. 95% refers to the long-run behavior of the procedure across hypothetical repeated samples, not the probability assigned to any particular interval you've already computed.

Standard error vs standard deviation

Reporting "$\bar x = 50$, SD = 10" describes the data. Reporting "$\bar x = 50$, SE = 1" describes the precision of $\bar x$ as an estimate of $\mu$. Mixing them up — putting SD where SE belongs, or vice versa — silently changes your error bars by a factor of $\sqrt n$.

Using $z$ when you should use $t$

If $\sigma$ is unknown and $n$ is small (say, $n < 30$), the $z$-interval is too narrow — its actual coverage falls below the nominal level. Default to $t$ unless you genuinely know $\sigma$ or $n$ is large.

$n$ vs $n - 1$ in the sample variance

Dividing by $n$ gives a biased estimate of $\sigma^2$ (the maximum-likelihood estimator for a normal). Dividing by $n - 1$ gives the unbiased sample variance $S^2$. Most introductory contexts want the unbiased version; software defaults vary, so check.

Non-random samples break everything

The whole machinery assumes the sample is a random draw from the population. A convenience sample — voluntary survey, easy-to-reach subjects — can give a beautifully tight confidence interval around the wrong number. Random sampling is not a technicality; it is the assumption that makes the math mean what it says.

Multiple intervals are not jointly calibrated

Each 95% CI is individually calibrated to 95% coverage. Compute twenty independent ones and the probability that all contain their parameters is $0.95^{20} \approx 0.36$. Multiple comparisons need joint adjustments (Bonferroni, etc.); a single threshold doesn't survive being applied many times.
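
The arithmetic, and the simplest repair, can be sketched as follows. The joint-coverage function assumes the intervals are independent; the Bonferroni adjustment does not need that assumption. Both function names are illustrative.

```python
def joint_coverage(level, k):
    """Probability that k independent intervals, each with the given coverage, all succeed."""
    return level ** k

def bonferroni_level(family_level, k):
    """Per-interval confidence level so that joint coverage is at least family_level."""
    alpha = 1.0 - family_level
    return 1.0 - alpha / k

print(f"{joint_coverage(0.95, 20):.2f}")   # 0.36
per_interval = bonferroni_level(0.95, 20)   # 0.9975
print(f"{per_interval:.4f}", f"{joint_coverage(per_interval, 20):.3f}")
```

The price of the adjustment is width: each of the twenty intervals must now be built at 99.75% confidence, so each one is wider than its unadjusted 95% counterpart.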

10. Worked examples

Example 1 · A 95% CI for the mean, $\sigma$ known

A sample of $n = 100$ household incomes gives $\bar x = 50{,}000$. From a national database, $\sigma$ is taken as known: $\sigma = 10{,}000$.

Step 1. Standard error:

$$ \operatorname{SE}(\bar X) \;=\; \frac{\sigma}{\sqrt n} \;=\; \frac{10{,}000}{\sqrt{100}} \;=\; 1{,}000. $$

Step 2. For 95% confidence, $z_{0.025} = 1.96$. Margin of error:

$$ 1.96 \times 1{,}000 \;=\; 1{,}960. $$

Step 3. Interval:

$$ 50{,}000 \;\pm\; 1{,}960 \;=\; (48{,}040,\; 51{,}960). $$

Interpretation. The procedure that produced this interval has 95% long-run coverage — across many repeated samples, 95% of intervals built this way would contain the true mean. It is not right to say "there's a 95% chance the true mean is between $48{,}040$ and $51{,}960$."

Example 2 · A $t$-interval when $\sigma$ is unknown

A small clinical study measures recovery time on $n = 10$ patients: $\bar x = 8.4$ days, $s = 1.8$ days.

Step 1. Estimated standard error:

$$ \operatorname{SE} \;=\; \frac{s}{\sqrt n} \;=\; \frac{1.8}{\sqrt{10}} \;\approx\; 0.569. $$

Step 2. Degrees of freedom: $n - 1 = 9$. From a $t$-table, $t_{0.025,\,9} \approx 2.262$.

Step 3. Margin of error:

$$ 2.262 \times 0.569 \;\approx\; 1.287. $$

Step 4. Interval:

$$ 8.4 \;\pm\; 1.287 \;=\; (7.11,\; 9.69)\;\text{days}. $$

Compare to what you'd have gotten with $z = 1.96$: a margin of $1.96 \times 0.569 \approx 1.12$, giving the misleadingly tighter $(7.28, 9.52)$. The $t$ interval is wider because it honestly accounts for not knowing $\sigma$.

Example 3 · How $n$ shrinks the standard error

Suppose $\sigma = 20$. The standard error of $\bar X$ at several sample sizes:

$$ \begin{aligned} n &= 25 :\quad \operatorname{SE} = 20/\sqrt{25} = 4 \\ n &= 100 :\quad \operatorname{SE} = 20/\sqrt{100} = 2 \\ n &= 400 :\quad \operatorname{SE} = 20/\sqrt{400} = 1 \\ n &= 1600 :\quad \operatorname{SE} = 20/\sqrt{1600} = 0.5 \end{aligned} $$

Each time you cut SE in half you need $4\times$ more data. To go from SE $= 4$ to SE $= 0.5$ — an $8\times$ tightening — you needed $64\times$ the sample size. Square-root scaling is the entire reason "big enough" sample sizes for high-precision work get expensive fast.

Example 4 · Confidence interval for a proportion

A survey asks 400 voters; 240 support a measure. Sample proportion $\hat p = 240/400 = 0.6$.

Step 1. Standard error (Wald approximation):

$$ \operatorname{SE}(\hat p) \;=\; \sqrt{\frac{\hat p (1-\hat p)}{n}} \;=\; \sqrt{\frac{0.6 \cdot 0.4}{400}} \;=\; \sqrt{0.0006} \;\approx\; 0.0245. $$

Step 2. 95% CI:

$$ 0.6 \;\pm\; 1.96 \times 0.0245 \;\approx\; 0.6 \;\pm\; 0.048 \;=\; (0.552,\; 0.648). $$

The CLT lets us use a $z$-interval here because $n\hat p = 240$ and $n(1-\hat p) = 160$ are both comfortably large. Newspapers would typically report this as "60% support, margin of error $\pm 5$ percentage points," rounding $0.048$ up.
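
The same calculation as a reusable sketch, with the critical value taken from `statistics.NormalDist` rather than a table. The function name `wald_interval` is an illustrative assumption, and the Wald approximation is only trustworthy under the large-count condition noted above.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1.0 - p_hat) / n)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return p_hat - z * se, p_hat + z * se

low, high = wald_interval(240, 400)
print(f"({low:.3f}, {high:.3f})")   # (0.552, 0.648)
```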

Example 5 · Why the sample mean is unbiased

Let $X_1, \ldots, X_n$ be iid draws from a population with mean $\mu$. Then:

$$ E[\bar X] \;=\; E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] \;=\; \frac{1}{n}\sum_{i=1}^{n} E[X_i] \;=\; \frac{1}{n} \cdot n\mu \;=\; \mu. $$

The estimator's sampling distribution is centered exactly on the target parameter — that is what "unbiased" formally means. (Linearity of expectation does all the work; the argument doesn't require normality, just iid sampling and a finite mean.)

