1. Why inference exists
Descriptive statistics summarizes data you already have — the mean, median, standard deviation, and shape of a sample sit in front of you, computable in a single pass. Inferential statistics takes the harder step: it treats that sample as a window onto a population you cannot fully observe and asks what you may responsibly claim about the world behind it.
The catch is that a sample is a finite, randomly selected slice. Pick a different 100 people and the mean shifts a little. Pick a third sample and it shifts again. Any single number you compute is contaminated by the luck of which observations happened to fall into your hands. Inferential statistics is the discipline of turning that contamination into a quantified, honest statement: not "the population mean is $50.3$," but "given what we saw, the population mean almost certainly sits between $48$ and $53$."
Descriptive statistics tells you what the sample is. Inferential statistics tells you what the sample is evidence for.
2. Population vs sample, parameter vs statistic
The whole framework depends on holding four words apart. Slipping one for another is the source of most undergraduate confusion.
Population. The complete set of units you'd like to know about: every voter in the country, every widget the factory will ever produce, every patient who could in principle receive the drug. Usually too large, too far in the future, or too hidden to enumerate.
Sample. A finite subset actually drawn from the population: the $n$ rows in your dataset. Random sampling is the assumption that makes the rest of the math work; non-random samples can produce confident-looking numbers about nothing in particular.
A parameter is a fixed (but unknown) number that describes the population: the true mean $\mu$, the true standard deviation $\sigma$, the true proportion $p$. Parameters are what you want to know. A statistic is a number computed from the sample: the sample mean $\bar X$, the sample standard deviation $S$, the sample proportion $\hat p$. Statistics are what you actually have.
| | Population (parameter) | Sample (statistic) |
|---|---|---|
| Mean | $\mu$ | $\bar X = \tfrac{1}{n}\sum X_i$ |
| Variance | $\sigma^2$ | $S^2 = \tfrac{1}{n-1}\sum (X_i - \bar X)^2$ |
| Standard deviation | $\sigma$ | $S$ |
| Proportion | $p$ | $\hat p = \tfrac{\text{successes}}{n}$ |
Greek letters for the population, Latin for the sample. That convention isn't decorative — it is the visible reminder that the two quantities live in different worlds. $\mu$ is a fixed real number you can never see. $\bar X$ is a random variable: it depends on which sample you happened to draw.
Capital letters like $X_1, \ldots, X_n$ refer to the sample as random variables (before you look). Lowercase $x_1, \ldots, x_n$ are the realized numbers (after you look). $\bar X$ is the random sample mean; $\bar x$ is its observed value.
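As a small illustration of the statistic column of the table, the sketch below computes $\bar x$, $s^2$, $s$, and $\hat p$ with Python's standard `statistics` module. The data values and the definition of a "success" (a value of at least 5) are made up for the example.

```python
import statistics

# Hypothetical sample of n = 8 measurements (illustrative values only).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

x_bar = statistics.mean(sample)          # sample mean, estimates mu
s_squared = statistics.variance(sample)  # divides by n - 1, estimates sigma^2
s = statistics.stdev(sample)             # square root of s_squared, estimates sigma

# Sample proportion: share of observations counted as a "success".
# Here a success is arbitrarily defined as a value of at least 5.
successes = sum(1 for x in sample if x >= 5)
p_hat = successes / n

print(x_bar, s_squared, s, p_hat)
```

None of these numbers is a parameter: rerunning the study with a different sample would change every one of them, while $\mu$, $\sigma$, and $p$ would stay where they are.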
3. The sampling distribution
Here is the central concept. Imagine you could draw not one but many independent samples of size $n$ from the same population, computing the sample mean $\bar X$ from each. Each $\bar X$ comes out a little different. The collection of all those $\bar X$ values — across every possible sample of size $n$ — has its own distribution.
Sampling distribution. The probability distribution of a statistic across all possible samples of a given size $n$ drawn from a population. It tells you how the statistic itself varies from sample to sample.
Three facts about the sampling distribution of $\bar X$ make the rest of inferential statistics work:
- Its mean equals the population mean. $E[\bar X] = \mu$. On average, the sample mean is right.
- Its standard deviation is smaller than the population's — and by a precise amount: $\operatorname{SD}(\bar X) = \sigma/\sqrt n$. Averaging shrinks noise.
- For large $n$, its shape is approximately normal — regardless of the population's shape. This is the central limit theorem, and it is the engine that makes inference work even when the population itself is wildly non-normal.
The sampling distribution is not something you ever construct from data — you usually have just one sample. It is a thought experiment about what the statistic would look like if you could repeat the whole study, again and again. That thought experiment is the basis for every probabilistic claim about the population.
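On a computer the thought experiment can be approximated directly. The sketch below is a simulation, not a proof: it assumes an Exponential(1) population (mean 1, SD 1, strongly right-skewed), a sample size of 25, and 20,000 repetitions, all chosen purely for illustration.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size n from an Exponential(1) population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

mu, sigma, n = 1.0, 1.0, 25          # population facts for Exponential(1)
means = simulate_sample_means(n, reps=20_000)

center = statistics.fmean(means)     # fact 1: should be close to mu
spread = statistics.pstdev(means)    # fact 2: should be close to sigma / sqrt(n)
se = sigma / n ** 0.5

# Fact 3: if the shape is roughly normal, about 95% of the sample means
# should land within 1.96 standard errors of mu.
within = sum(abs(m - mu) < 1.96 * se for m in means) / len(means)

print(center, spread, se, within)
```

The population here is far from normal, yet the simulated means should cluster around $\mu$ with spread near $\sigma/\sqrt n$, which is the behavior the three facts describe.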
4. The standard error $\sigma/\sqrt n$
The standard error of a statistic is the standard deviation of its sampling distribution. For the sample mean drawn from a population with standard deviation $\sigma$,
$$ \operatorname{SE}(\bar X) \;=\; \frac{\sigma}{\sqrt n}. $$
Two things deserve attention. First, the standard error is not the standard deviation of the population, and it is not the standard deviation of your sample; it is the standard deviation of $\bar X$ across hypothetical repeated samples. Giving it a separate name keeps that distinction visible.
Second, the $\sqrt n$ in the denominator is one of the most quietly important features of statistics. Precision improves with $n$, but only at a square-root rate:
| Sample size $n$ | $\operatorname{SE}(\bar X)$ relative to $\sigma$ | Improvement |
|---|---|---|
| $1$ | $\sigma$ | baseline |
| $4$ | $\sigma/2$ | $2\times$ tighter |
| $25$ | $\sigma/5$ | $5\times$ tighter |
| $100$ | $\sigma/10$ | $10\times$ tighter |
| $10{,}000$ | $\sigma/100$ | $100\times$ tighter |
To cut your standard error in half you need four times as much data. To cut it by a factor of ten you need a hundred times as much. This is why huge sample sizes can still leave plenty of uncertainty, and why the marginal value of every extra observation is always shrinking.
"Standard deviation" describes the spread of data; "standard error" describes the spread of an estimator. They have the same units and the same formal definition (an SD), but they're measuring different distributions. Reporting an SD when you meant an SE — or the reverse — silently exaggerates or understates precision by a factor of $\sqrt n$.
In practice you almost never know $\sigma$. The honest move is to plug in $S$, the sample standard deviation, and call the result the estimated standard error $S/\sqrt n$. That substitution is harmless for large $n$ — and exactly what triggers the $t$-distribution for small $n$, which we'll see in §7.
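The two versions of the formula can be sketched in a few lines. The function names below are illustrative, not taken from any particular library; the loop reproduces the square-root scaling from the table above with $\sigma = 1$.

```python
import math
import statistics

def standard_error(sigma, n):
    """Standard error of the sample mean when sigma is known."""
    return sigma / math.sqrt(n)

def estimated_standard_error(sample):
    """Plug-in version: replace sigma with the sample SD (n - 1 divisor)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# With sigma = 1 the output is the SE as a fraction of sigma.
for n in (1, 4, 25, 100, 10_000):
    print(n, standard_error(1.0, n))
```

Passing a list of observations to `estimated_standard_error` gives $s/\sqrt n$, the quantity that is actually available in practice.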
5. Point estimation, bias, and variance
A point estimate is a single best guess of a parameter, computed from the sample. The underlying recipe — the function that turns data into a number — is called the estimator. For the population mean, the canonical estimator is the sample mean: $\hat\mu = \bar X$.
Two qualities tell you whether an estimator is any good.
Bias. $\operatorname{Bias}(\hat\theta) = E[\hat\theta] - \theta$. An estimator is unbiased if, on average across repeated samples, it equals the true parameter, i.e. if its sampling distribution is centered on $\theta$.
Variance. $\operatorname{Var}(\hat\theta)$ is the spread of the estimator's sampling distribution. Low-variance estimators give similar answers across repeated samples; high-variance ones swing wildly.
Total error is captured by the mean squared error:
$$ \operatorname{MSE}(\hat\theta) \;=\; E\!\left[(\hat\theta - \theta)^2\right] \;=\; \operatorname{Var}(\hat\theta) \;+\; \operatorname{Bias}(\hat\theta)^2. $$
That decomposition is the bias–variance tradeoff in one line. You can shave variance at the cost of bias, or the reverse; the best estimator is the one that minimizes the sum.
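The decomposition follows from adding and subtracting $E[\hat\theta]$ inside the square. A sketch of the steps, using only the definitions of bias and variance given above (the cross term vanishes because $E\!\left[\hat\theta - E[\hat\theta]\right] = 0$, and $E[\hat\theta] - \theta$ is a constant):
$$ \begin{aligned} \operatorname{MSE}(\hat\theta) &= E\!\left[\big(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta\big)^2\right] \\ &= E\!\left[(\hat\theta - E[\hat\theta])^2\right] \;+\; 2\,\big(E[\hat\theta] - \theta\big)\,E\!\left[\hat\theta - E[\hat\theta]\right] \;+\; \big(E[\hat\theta] - \theta\big)^2 \\ &= \operatorname{Var}(\hat\theta) \;+\; \operatorname{Bias}(\hat\theta)^2. \end{aligned} $$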
The sample mean is unbiased
$\bar X = \tfrac{1}{n}\sum X_i$ is the textbook example of an unbiased estimator:
$$ E[\bar X] \;=\; \frac{1}{n}\sum_{i=1}^{n} E[X_i] \;=\; \frac{1}{n} \cdot n\mu \;=\; \mu. $$
Its variance is $\operatorname{Var}(\bar X) = \sigma^2/n$ (for independent draws the variances add, giving $\tfrac{1}{n^2} \cdot n\sigma^2$), which is why its standard error is $\sigma/\sqrt n$. Both pieces of the sampling distribution, center and spread, fall out of one-line arguments.
The sample variance is defined as $S^2 = \tfrac{1}{n-1}\sum (X_i - \bar X)^2$ — not divided by $n$. The $n - 1$ (called Bessel's correction) is exactly what makes $S^2$ unbiased for $\sigma^2$. Dividing by $n$ would systematically underestimate the population variance, because the deviations are measured against $\bar X$ rather than the unknown $\mu$.
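The underestimate is easy to see in a simulation. The sketch below assumes a normal population with $\sigma = 2$ (so $\sigma^2 = 4$), samples of size 5, and 20,000 repetitions; these values are arbitrary choices for illustration. `statistics.pvariance` divides by $n$ and `statistics.variance` divides by $n - 1$.

```python
import random
import statistics

def average_variance_estimates(n, reps, sigma=2.0, seed=0):
    """Average both variance estimators over many simulated samples.

    Returns (average of the divide-by-n estimate,
             average of the divide-by-(n - 1) estimate)."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_div_n += statistics.pvariance(sample)         # divides by n
        total_div_n_minus_1 += statistics.variance(sample)  # divides by n - 1
    return total_div_n / reps, total_div_n_minus_1 / reps

avg_div_n, avg_div_n_minus_1 = average_variance_estimates(n=5, reps=20_000)
print(avg_div_n, avg_div_n_minus_1)
```

With $n = 5$ the divide-by-$n$ average should sit near $\tfrac{n-1}{n}\sigma^2 = 3.2$, while the Bessel-corrected average should sit near the true value of 4.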
6. Confidence intervals: what they do and don't mean
A point estimate alone is misleading — it suggests false precision. A confidence interval reports a range, calibrated so that the procedure that produced it captures the true parameter a known fraction of the time.
For a population mean $\mu$ with known $\sigma$, the standard $z$-based interval at confidence level $1 - \alpha$ is:
$$ \bar X \;\pm\; z_{\alpha/2} \cdot \frac{\sigma}{\sqrt n}. $$
For 95% confidence, $z_{0.025} \approx 1.96$, so the interval is $\bar X \pm 1.96 \cdot \sigma/\sqrt n$. The two pieces are doing two jobs: $\bar X$ is the point estimate; $1.96 \cdot \sigma/\sqrt n$ is the margin of error, set by how far $\bar X$ is likely to stray from $\mu$.
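A minimal sketch of the formula, assuming $\sigma$ is known. The function name `z_interval` and the input values ($\bar x = 50$, $\sigma = 10$, $n = 100$) are hypothetical; the critical value comes from `statistics.NormalDist.inv_cdf` in the Python standard library.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """z-based confidence interval for a mean with known sigma."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)            # margin of error
    return x_bar - margin, x_bar + margin

low, high = z_interval(x_bar=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))  # prints 48.04 51.96
```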
The famous misinterpretation
Confidence intervals are the most-misinterpreted concept in introductory statistics. Almost everyone (students, journalists, even some textbooks) wants to say:
"There's a 95% chance the true mean is in this interval."
That sentence is wrong. The true mean $\mu$ is a fixed number; it is either in the interval or it is not. There is no random event there to assign probability to. What the 95% refers to is the procedure, not the interval you happen to have.
"If we repeated the procedure many times — draw a fresh sample, compute the same interval — 95% of the resulting intervals would contain the true $\mu$. The procedure is calibrated to have 95% long-run coverage."
The randomness lives in the interval endpoints, which change with every sample. The parameter is fixed.
- "There's a 95% chance $\mu$ is in this particular interval." (No: $\mu$ is fixed; this interval either contains it or doesn't.)
- "95% of the population is in this interval." (No: a CI is about a parameter, not the data.)
- "95% of future sample means will land in this interval." (No: this is about $\mu$, not future $\bar X$.)
The Bayesian framework does let you say "there's a 95% probability $\mu$ is in this range" — but only via a credible interval, computed from a posterior distribution with a prior. Frequentist confidence intervals (the ones in this section) are a different object, and conflating the two is the single most common error in statistical reporting.
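The long-run reading of "95%" can be checked by simulation. The sketch below assumes a normal population with a known mean and SD (values chosen arbitrarily), builds a fresh $z$-interval from each simulated sample, and counts how often the interval contains $\mu$. In a simulation $\mu$ is known, which is exactly what real data never gives you.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage_rate(mu, sigma, n, reps, confidence=0.95, seed=0):
    """Fraction of simulated z-intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / math.sqrt(n)  # same width every time; only the center moves
    hits = 0
    for _ in range(reps):
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / reps

print(coverage_rate(mu=50.0, sigma=10.0, n=25, reps=10_000))
```

Each individual interval either contains $\mu$ or misses it; the figure near 0.95 describes the collection of intervals, not any one of them.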
How the width responds to inputs
| If you change… | The CI width… | Because |
|---|---|---|
| Sample size $n$ up | shrinks (as $1/\sqrt n$) | $\operatorname{SE}$ shrinks |
| Population SD $\sigma$ down | shrinks | $\operatorname{SE}$ shrinks |
| Confidence up (e.g. 95% → 99%) | grows | bigger $z_{\alpha/2}$ needed for higher coverage |
You can't get a tighter interval and stronger confidence and the same sample size — something has to give. The three knobs trade off against each other, and that tradeoff is the entire economics of study design.
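The tradeoff can be made concrete by solving the margin-of-error formula for $n$. The sketch below assumes the known-$\sigma$ $z$-interval; the function names are illustrative, and `required_n` is simply $n = (z_{\alpha/2}\,\sigma / m)^2$ rounded up, where $m$ is the target margin.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Half-width of the z-interval for a mean with known sigma."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return z * sigma / math.sqrt(n)

def required_n(sigma, target_margin, confidence=0.95):
    """Smallest n whose margin of error is at most target_margin."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((z * sigma / target_margin) ** 2)

# Turning each knob in turn, with sigma = 10 as an arbitrary example:
print(margin_of_error(10.0, 100))                   # baseline
print(margin_of_error(10.0, 400))                   # 4x the data, half the margin
print(margin_of_error(10.0, 100, confidence=0.99))  # more confidence, wider
print(required_n(10.0, target_margin=0.98))         # n needed to halve the baseline margin
```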
7. When $\sigma$ is unknown: the $t$-distribution
The clean formula $\bar X \pm z_{\alpha/2} \cdot \sigma/\sqrt n$ assumes you know the population standard deviation $\sigma$. You almost never do. The natural fix — substitute the sample standard deviation $S$ — works fine when $n$ is large, but introduces real extra uncertainty when $n$ is small: now you're estimating both the mean and the spread from the same handful of data, and the standardized quantity
$$ T \;=\; \frac{\bar X - \mu}{S/\sqrt n} $$
is no longer standard normal. For a normal population it follows Student's $t$-distribution with $n - 1$ degrees of freedom.
The $t$-distribution is bell-shaped and symmetric, like the normal, but has heavier tails — extreme values are more likely than the normal would predict, which is exactly the kind of cushion you want when you don't really know $\sigma$. As $n \to \infty$, the $t$-distribution converges to the normal, and the two intervals become indistinguishable.
The $t$-based confidence interval simply swaps $z$ for the corresponding $t$ critical value:
$$ \bar X \;\pm\; t_{\alpha/2,\, n-1} \cdot \frac{S}{\sqrt n}. $$
| Situation | Critical value | Use |
|---|---|---|
| $\sigma$ known (rare) | $z_{\alpha/2}$ | $z$-interval |
| $\sigma$ unknown, small $n$ | $t_{\alpha/2,\, n-1}$ | $t$-interval (the everyday case) |
| $\sigma$ unknown, large $n$ ($\gtrsim 30$) | $z_{\alpha/2}$ (close enough) | either works in practice |
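The Python standard library has no $t$ quantile function, so the sketch below computes one numerically: it integrates the $t$ density with Simpson's rule and inverts the CDF by bisection. All function names are illustrative, and the approach assumes moderate degrees of freedom (`math.gamma` overflows for very large ones); a statistics library would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Invert t_cdf by bisection; assumes 0.5 < p < 1."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # expand until the quantile is bracketed
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(x_bar, s, n, confidence=0.95):
    """t-based confidence interval for a mean with unknown sigma."""
    t_crit = t_quantile(1 - (1 - confidence) / 2, n - 1)
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

for df in (4, 9, 29, 100):
    print(df, round(t_quantile(0.975, df), 3))
```

The printed critical values shrink toward 1.96 as the degrees of freedom grow, which is the convergence to the normal described above.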
William Sealy Gosset derived the $t$-distribution in 1908 while working as a chemist at Guinness, where small-batch beer testing meant he routinely had $n$ in the single digits. Guinness's confidentiality policy forced him to publish under the pseudonym "Student," and the name stuck.
8. The flip side: hypothesis testing
Confidence intervals say "given the data, here is a calibrated range of parameter values consistent with it." Hypothesis testing asks the dual question: "given a specific proposed parameter value, is the data consistent with it?" The two are flip sides of the same coin.
The link is exact for two-sided tests: at significance level $\alpha$, you would reject a null hypothesis $H_0: \mu = \mu_0$ if and only if $\mu_0$ falls outside the $(1-\alpha)$ confidence interval for $\mu$. A 95% CI is exactly the set of null values you would fail to reject at the 5% level.
That equivalence is worth holding onto, because it dissolves a lot of the apparent strangeness of $p$-values: every $p$-value below a threshold corresponds to a specific null value lying outside a specific CI. Same inference, different presentation. The dedicated topic on hypothesis testing picks up the formalism — null and alternative hypotheses, Type I and II errors, test statistics, $p$-values — from this starting point.
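The equivalence can be checked numerically for the known-$\sigma$ case. The sketch below reuses the numbers from Example 1 ($\bar x = 50{,}000$, $\sigma = 10{,}000$, $n = 100$); the function names and the grid of candidate null values are illustrative assumptions.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """z-based confidence interval for a mean with known sigma."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

def two_sided_p_value(x_bar, mu0, sigma, n):
    """p-value of the two-sided z-test of H0: mu = mu0."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

x_bar, sigma, n, alpha = 50_000, 10_000, 100, 0.05
low, high = z_interval(x_bar, sigma, n, confidence=1 - alpha)

# For each candidate null value, "rejected by the test" should match
# "outside the interval".
for mu0 in range(47_000, 53_001, 500):
    rejected = two_sided_p_value(x_bar, mu0, sigma, n) < alpha
    outside = not (low <= mu0 <= high)
    print(mu0, rejected, outside)
```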
9. Common pitfalls
Misreading the 95%. The single most common mistake. Repeat: $\mu$ is fixed, the interval is random. 95% refers to the long-run behavior of the procedure across hypothetical repeated samples, not the probability assigned to any particular interval you've already computed.
Confusing SD with SE. Reporting "$\bar x = 50$, SD = 10" describes the data. Reporting "$\bar x = 50$, SE = 1" describes the precision of $\bar x$ as an estimate of $\mu$. Mixing them up, putting SD where SE belongs or vice versa, silently changes your error bars by a factor of $\sqrt n$.
Using $z$ when you should use $t$. If $\sigma$ is unknown and $n$ is small (say, $n < 30$), the $z$-interval is too narrow: its actual coverage falls below the nominal level. Default to $t$ unless you genuinely know $\sigma$ or $n$ is large.
Dividing by $n$ instead of $n - 1$. Dividing by $n$ gives a biased estimate of $\sigma^2$ (the maximum-likelihood estimator for a normal). Dividing by $n - 1$ gives the unbiased sample variance $S^2$. Most introductory contexts want the unbiased version; software defaults vary, so check.
Non-random samples. The whole machinery assumes the sample is a random draw from the population. A convenience sample (a voluntary survey, easy-to-reach subjects) can give a beautifully tight confidence interval around the wrong number. Random sampling is not a technicality; it is the assumption that makes the math mean what it says.
Multiple intervals at once. Each 95% CI is individually calibrated to 95% coverage. Compute twenty independent ones and the probability that all contain their parameters is roughly $0.95^{20} \approx 0.36$. Multiple comparisons need joint adjustments (Bonferroni, etc.); a single threshold doesn't survive being applied many times.
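The arithmetic behind the last pitfall, together with the Bonferroni adjustment it mentions, can be sketched as follows. The joint-coverage formula assumes the intervals are independent, and the function names are illustrative.

```python
def joint_coverage_if_independent(per_interval_confidence, k):
    """Probability that k independent intervals all cover their parameters."""
    return per_interval_confidence ** k

def bonferroni_confidence(family_confidence, k):
    """Per-interval confidence level under the Bonferroni adjustment."""
    alpha = 1.0 - family_confidence
    return 1.0 - alpha / k

k = 20
naive = joint_coverage_if_independent(0.95, k)  # about 0.36
adjusted_level = bonferroni_confidence(0.95, k)  # 0.9975 per interval
adjusted_joint = joint_coverage_if_independent(adjusted_level, k)  # at least 0.95

print(naive, adjusted_level, adjusted_joint)
```

The price of the adjustment is wider individual intervals, since each one now has to be built at the 99.75% level rather than 95%.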
10. Worked examples
Example 1 · A 95% CI for the mean, $\sigma$ known
A sample of $n = 100$ household incomes gives $\bar x = 50{,}000$. From a national database, $\sigma$ is taken as known: $\sigma = 10{,}000$.
Step 1. Standard error:
$$ \operatorname{SE}(\bar X) \;=\; \frac{\sigma}{\sqrt n} \;=\; \frac{10{,}000}{\sqrt{100}} \;=\; 1{,}000. $$
Step 2. For 95% confidence, $z_{0.025} \approx 1.96$. Margin of error:
$$ 1.96 \times 1{,}000 \;=\; 1{,}960. $$
Step 3. Interval:
$$ 50{,}000 \;\pm\; 1{,}960 \;=\; (48{,}040,\; 51{,}960). $$
Interpretation. The procedure that produced this interval has 95% long-run coverage: across many repeated samples, 95% of intervals built this way would contain the true mean. It is not right to say "there's a 95% chance the true mean is between $48{,}040$ and $51{,}960$."
Example 2 · A $t$-interval when $\sigma$ is unknown
A small clinical study measures recovery time on $n = 10$ patients: $\bar x = 8.4$ days, $s = 1.8$ days.
Step 1. Estimated standard error:
$$ \operatorname{SE} \;=\; \frac{s}{\sqrt n} \;=\; \frac{1.8}{\sqrt{10}} \;\approx\; 0.569. $$
Step 2. Degrees of freedom: $n - 1 = 9$. From a $t$-table, $t_{0.025,\,9} \approx 2.262$.
Step 3. Margin of error:
$$ 2.262 \times 0.569 \;\approx\; 1.287. $$
Step 4. Interval:
$$ 8.4 \;\pm\; 1.287 \;=\; (7.11,\; 9.69)\;\text{days}. $$
Compare to what you'd have gotten with $z = 1.96$: a margin of $1.96 \times 0.569 \approx 1.12$, giving the misleadingly tighter $(7.28, 9.52)$. The $t$-interval is wider because it honestly accounts for not knowing $\sigma$.
Example 3 · How $n$ shrinks the standard error
Suppose $\sigma = 20$. The standard error of $\bar X$ at several sample sizes:
$$ \begin{aligned} n &= 25 :\quad \operatorname{SE} = 20/\sqrt{25} = 4 \\ n &= 100 :\quad \operatorname{SE} = 20/\sqrt{100} = 2 \\ n &= 400 :\quad \operatorname{SE} = 20/\sqrt{400} = 1 \\ n &= 1600 :\quad \operatorname{SE} = 20/\sqrt{1600} = 0.5 \end{aligned} $$
Each time you cut SE in half you need $4\times$ more data. To go from SE $= 4$ to SE $= 0.5$, an $8\times$ tightening, you needed $64\times$ the sample size. Square-root scaling is the entire reason "big enough" sample sizes for high-precision work get expensive fast.
Example 4 · Confidence interval for a proportion
A survey asks 400 voters; 240 support a measure. Sample proportion $\hat p = 240/400 = 0.6$.
Step 1. Standard error (Wald approximation):
$$ \operatorname{SE}(\hat p) \;=\; \sqrt{\frac{\hat p (1-\hat p)}{n}} \;=\; \sqrt{\frac{0.6 \cdot 0.4}{400}} \;=\; \sqrt{0.0006} \;\approx\; 0.0245. $$
Step 2. 95% CI:
$$ 0.6 \;\pm\; 1.96 \times 0.0245 \;\approx\; 0.6 \;\pm\; 0.048 \;=\; (0.552,\; 0.648). $$
The CLT lets us use a $z$-interval here because $n\hat p$ and $n(1-\hat p)$ are both comfortably large. Newspapers report this as "60% support, margin of error $\pm 5\%$."
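The same calculation as a short function, using the survey numbers from this example. The name `wald_interval` is illustrative; the sketch applies the Wald approximation described above and makes no attempt to handle small samples or proportions near 0 or 1, where that approximation is known to behave poorly.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = wald_interval(240, 400)
print(round(low, 3), round(high, 3))  # prints 0.552 0.648
```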
Example 5 · Why the sample mean is unbiased
Let $X_1, \ldots, X_n$ be iid draws from a population with mean $\mu$. Then:
$$ E[\bar X] \;=\; E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] \;=\; \frac{1}{n}\sum_{i=1}^{n} E[X_i] \;=\; \frac{1}{n} \cdot n\mu \;=\; \mu. $$
The estimator's sampling distribution is centered exactly on the target parameter; that is what "unbiased" formally means. (Linearity of expectation does all the work; the argument doesn't require normality, just iid sampling and a finite mean.)