Inferential Statistics — Statistics & Probability

What you'll leave with

A sharp distinction between populations and samples, and between parameters and statistics.
The single big idea: a sampling distribution is the distribution of a statistic (like $\bar X$) across hypothetical repeated samples.
Why the standard error of the mean is $\sigma/\sqrt n$ — and why doubling precision requires quadrupling the sample.
What a 95% confidence interval actually means (a procedure with 95% long-run coverage) — and the famous misinterpretation to never make again.
Bias vs variance for estimators, why the sample mean is unbiased, and when the $t$-distribution replaces the normal.

1. Why inference exists

Descriptive statistics summarizes data you already have — the mean, median, standard deviation, and shape of a sample sit in front of you, computable in a single pass. Inferential statistics takes the harder step: it treats that sample as a window onto a population you cannot fully observe and asks what you may responsibly claim about the world behind it.

The catch is that a sample is a finite, randomly selected slice. Pick a different 100 people and the mean shifts a little. Pick a third sample and it shifts again. Any single number you compute is contaminated by the luck of which observations happened to fall into your hands. Inferential statistics is the discipline of turning that contamination into a quantified, honest statement: not "the population mean is $50.3$," but "given what we saw, the population mean almost certainly sits between $48$ and $53$."

Descriptive statistics tells you what the sample is. Inferential statistics tells you what the sample is evidence for.

2. Population vs sample, parameter vs statistic

The whole framework depends on holding four words apart. Slipping one for another is the source of most undergraduate confusion.

Population

The complete set of units you'd like to know about — every voter in the country, every widget the factory will ever produce, every patient who could in principle receive the drug. Usually too large, too future, or too hidden to enumerate.

Sample

A finite subset actually drawn from the population — the $n$ rows in your dataset. Random sampling is the assumption that makes the rest of the math work; non-random samples can produce confident-looking numbers about nothing in particular.

A parameter is a fixed (but unknown) number that describes the population: the true mean $\mu$, the true standard deviation $\sigma$, the true proportion $p$. Parameters are what you want to know. A statistic is a number computed from the sample: the sample mean $\bar X$, the sample standard deviation $S$, the sample proportion $\hat p$. Statistics are what you actually have.

	Population (parameter)	Sample (statistic)
Mean	$\mu$	$\bar X = \tfrac{1}{n}\sum X_i$
Variance	$\sigma^2$	$S^2 = \tfrac{1}{n-1}\sum (X_i - \bar X)^2$
Standard deviation	$\sigma$	$S$
Proportion	$p$	$\hat p = \tfrac{\text{successes}}{n}$

Greek letters for the population, Latin for the sample. That convention isn't decorative — it is the visible reminder that the two quantities live in different worlds. $\mu$ is a fixed real number you can never see. $\bar X$ is a random variable: it depends on which sample you happened to draw.

Convention

Capital letters like $X_1, \ldots, X_n$ refer to the sample as random variables (before you look). Lowercase $x_1, \ldots, x_n$ are the realized numbers (after you look). $\bar X$ is the random sample mean; $\bar x$ is its observed value.
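As a small illustration of the right-hand column of the table, the sketch below computes each statistic with Python's standard `statistics` module. The numbers in `sample` and `outcomes` are made up for the example and are not from any real study.

```python
import statistics

# Hypothetical sample of n = 8 measurements (illustrative numbers only).
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.4, 5.9, 5.0]

x_bar = statistics.mean(sample)          # sample mean, estimates mu
s_squared = statistics.variance(sample)  # divides by n - 1, estimates sigma^2
s = statistics.stdev(sample)             # square root of s_squared, estimates sigma

# Hypothetical yes/no outcomes for a proportion (1 = success).
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
p_hat = sum(outcomes) / len(outcomes)    # sample proportion, estimates p

print(x_bar, s_squared, s, p_hat)
```

Every value printed here is a statistic: rerun the study with a different sample and each one would change, while the parameters they estimate would not.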

3. The sampling distribution

Here is the central concept. Imagine you could draw not one but many independent samples of size $n$ from the same population, computing the sample mean $\bar X$ from each. Each $\bar X$ comes out a little different. The collection of all those $\bar X$ values — across every possible sample of size $n$ — has its own distribution.

Sampling distribution

The probability distribution of a statistic across all possible samples of a given size $n$ drawn from a population. It tells you how the statistic itself varies from sample to sample.

Three facts about the sampling distribution of $\bar X$ make the rest of inferential statistics work:

Its mean equals the population mean. $E[\bar X] = \mu$. On average, the sample mean is right.
Its standard deviation is smaller than the population's — and by a precise amount: $\operatorname{SD}(\bar X) = \sigma/\sqrt n$. Averaging shrinks noise.
For large $n$, its shape is approximately normal — regardless of the population's shape. This is the central limit theorem, and it is the engine that makes inference work even when the population itself is wildly non-normal.

Figure: A wide population (top) gives rise to a much narrower sampling distribution of the sample mean (bottom). Both are centered at $\mu$; the bottom is tighter by a factor of $\sqrt n$.

Mental model

The sampling distribution is not something you normally construct from data, because you usually have just one sample. It is a thought experiment about what the statistic would look like if you could repeat the whole study, again and again. That thought experiment is the basis for every probabilistic claim about the population.
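Although a real study is rarely repeated, a simulation can play out the thought experiment. The sketch below assumes an exponential population with rate 1 (so $\mu = 1$ and $\sigma = 1$, and the shape is strongly skewed); the function name `simulate_sample_means`, the sample size, and the seed are choices made for this example only.

```python
import math
import random
import statistics

def simulate_sample_means(n, repeats, rate=1.0, seed=0):
    """Draw `repeats` samples of size n from an exponential population
    and return the list of their sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(rate) for _ in range(n)])
        for _ in range(repeats)
    ]

n = 50
means = simulate_sample_means(n=n, repeats=5000)

# An exponential population with rate 1 has mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
print("mean of sample means:", statistics.fmean(means))  # should be close to mu
print("SD of sample means:  ", statistics.stdev(means))  # should be close to sigma / sqrt(n)
print("sigma / sqrt(n):     ", sigma / math.sqrt(n))
```

The first two printed values should land near $\mu$ and $\sigma/\sqrt n$ respectively, and a histogram of `means` would look roughly bell-shaped even though the population is not.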

4. The standard error $\sigma/\sqrt n$

The standard error of a statistic is the standard deviation of its sampling distribution. For the sample mean drawn from a population with standard deviation $\sigma$,

$$ \operatorname{SE}(\bar X) \;=\; \frac{\sigma}{\sqrt n}. $$

Two things deserve attention. First, the standard error is not the standard deviation of the population, and it is not the standard deviation of your sample — it is the standard deviation of $\bar X$ across hypothetical repeated samples. Calling it a separate name keeps that distinction visible.

Second, the $\sqrt n$ in the denominator is one of the most quietly important features of statistics. Precision improves with $n$, but only at a square-root rate:

Sample size $n$	$\operatorname{SE}(\bar X)$ relative to $\sigma$	Improvement
$1$	$\sigma$	baseline
$4$	$\sigma/2$	$2\times$ tighter
$25$	$\sigma/5$	$5\times$ tighter
$100$	$\sigma/10$	$10\times$ tighter
$10{,}000$	$\sigma/100$	$100\times$ tighter

To cut your standard error in half you need four times as much data. To cut it by a factor of ten you need a hundred times as much. This is why huge sample sizes can still leave plenty of uncertainty, and why the marginal value of every extra observation is always shrinking.
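Turning the formula around gives a planning rule: to reach a target standard error you need $n \ge (\sigma/\text{SE})^2$. A minimal sketch, with hypothetical helper names `standard_error` and `required_n` and an assumed $\sigma = 12$:

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean for a population SD sigma."""
    return sigma / math.sqrt(n)

def required_n(sigma, target_se):
    """Smallest sample size n with sigma / sqrt(n) <= target_se."""
    return math.ceil((sigma / target_se) ** 2)

sigma = 12.0  # assumed population SD, for illustration only
print(required_n(sigma, 3.0))   # 16
print(required_n(sigma, 1.5))   # 64: halving the target SE quadruples n
print(standard_error(sigma, 64))  # 1.5
```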

Standard error vs standard deviation

"Standard deviation" describes the spread of data; "standard error" describes the spread of an estimator. They have the same units and the same formal definition (an SD), but they're measuring different distributions. Reporting an SD when you meant an SE — or the reverse — silently exaggerates or understates precision by a factor of $\sqrt n$.

In practice you almost never know $\sigma$. The honest move is to plug in $S$, the sample standard deviation, and call the result the estimated standard error $S/\sqrt n$. That substitution is harmless for large $n$ — and exactly what triggers the $t$-distribution for small $n$, which we'll see in §7.

5. Point estimation, bias, and variance

A point estimate is a single best guess of a parameter, computed from the sample. The underlying recipe — the function that turns data into a number — is called the estimator. For the population mean, the canonical estimator is the sample mean: $\hat\mu = \bar X$.

Two qualities tell you whether an estimator is any good.

Bias

$\operatorname{Bias}(\hat\theta) = E[\hat\theta] - \theta$. An estimator is unbiased if, on average across repeated samples, it equals the true parameter — i.e. if its sampling distribution is centered on $\theta$.

Variance

$\operatorname{Var}(\hat\theta)$ — the spread of the estimator's sampling distribution. Low-variance estimators give similar answers across repeated samples; high-variance ones swing wildly.

Total error is captured by the mean squared error:

$$ \operatorname{MSE}(\hat\theta) \;=\; E\!\left[(\hat\theta - \theta)^2\right] \;=\; \operatorname{Var}(\hat\theta) \;+\; \operatorname{Bias}(\hat\theta)^2. $$

That decomposition is the bias–variance tradeoff in one line. You can shave variance at the cost of bias, or the reverse; the best estimator is the one that minimizes the sum.
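The identity itself takes only a few lines to verify. Add and subtract $E[\hat\theta]$ inside the square, then use the fact that $\theta$ and $E[\hat\theta]$ are constants, so the cross term vanishes because $E\!\left[\hat\theta - E[\hat\theta]\right] = 0$:

$$ \begin{aligned} E\!\left[(\hat\theta - \theta)^2\right] &= E\!\left[\left(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta\right)^2\right] \\ &= E\!\left[(\hat\theta - E[\hat\theta])^2\right] + 2\,(E[\hat\theta] - \theta)\,E\!\left[\hat\theta - E[\hat\theta]\right] + (E[\hat\theta] - \theta)^2 \\ &= \operatorname{Var}(\hat\theta) + 0 + \operatorname{Bias}(\hat\theta)^2. \end{aligned} $$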

The sample mean is unbiased

$\bar X = \tfrac{1}{n}\sum X_i$ is the textbook example of an unbiased estimator:

$$ E[\bar X] \;=\; \frac{1}{n}\sum_{i=1}^{n} E[X_i] \;=\; \frac{1}{n} \cdot n\mu \;=\; \mu. $$

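Its variance follows from a similar short argument, this time using the independence of the $X_i$: $\operatorname{Var}(\bar X) = \tfrac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \tfrac{1}{n^2}\cdot n\sigma^2 = \sigma^2/n$, which is why its standard error is $\sigma/\sqrt n$. Both pieces of the sampling distribution, center and spread, fall out of a few lines of algebra.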

Why $n - 1$ for sample variance

The sample variance is defined as $S^2 = \tfrac{1}{n-1}\sum (X_i - \bar X)^2$ — not divided by $n$. The $n - 1$ (called Bessel's correction) is exactly what makes $S^2$ unbiased for $\sigma^2$. Dividing by $n$ would systematically underestimate the population variance, because the deviations are measured against $\bar X$ rather than the unknown $\mu$.
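The underestimate is easy to see in a simulation. The sketch below assumes a normal population with $\sigma = 2$ (so $\sigma^2 = 4$) and a small sample size $n = 5$; the function name, seed, and repeat count are arbitrary choices for the example.

```python
import random
import statistics

def average_variance_estimates(n, repeats, sigma=2.0, seed=1):
    """Average the divide-by-n and divide-by-(n-1) variance estimates
    over many simulated samples from a normal population."""
    rng = random.Random(seed)
    biased, unbiased = [], []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        biased.append(statistics.pvariance(sample))   # divides by n
        unbiased.append(statistics.variance(sample))  # divides by n - 1
    return statistics.fmean(biased), statistics.fmean(unbiased)

avg_biased, avg_unbiased = average_variance_estimates(n=5, repeats=10000)
print(avg_biased)    # should be near sigma^2 * (n - 1) / n = 3.2
print(avg_unbiased)  # should be near sigma^2 = 4.0
```

With $n = 5$ the divide-by-$n$ version comes out about 20% too small on average, which matches the factor $(n-1)/n$.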

6. Confidence intervals: what they do and don't mean

A point estimate alone is misleading — it suggests false precision. A confidence interval reports a range, calibrated so that the procedure that produced it captures the true parameter a known fraction of the time.

For a population mean $\mu$ with known $\sigma$, the standard $z$-based interval at confidence level $1 - \alpha$ is:

$$ \bar X \;\pm\; z_{\alpha/2} \cdot \frac{\sigma}{\sqrt n}. $$

For 95% confidence, $z_{0.025} \approx 1.96$, so the interval is $\bar X \pm 1.96 \cdot \sigma/\sqrt n$. The two pieces are doing two jobs: $\bar X$ is the point estimate; $1.96 \cdot \sigma/\sqrt n$ is the margin of error, set by how far the sampling distribution is likely to stray from $\mu$.
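As a sketch, the formula translates directly into a few lines of Python. `z_interval` is a hypothetical helper name, the critical value comes from the standard library's `NormalDist`, and the inputs ($\bar x = 100$, $\sigma = 15$, $n = 36$) are illustrative numbers rather than data from the text.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided z-based confidence interval for a mean with known sigma."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)            # margin of error
    return x_bar - margin, x_bar + margin

low, high = z_interval(x_bar=100.0, sigma=15.0, n=36)
print(round(low, 2), round(high, 2))  # 95.1 104.9
```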

The famous misinterpretation

Confidence intervals are among the most misinterpreted concepts in introductory statistics. Almost everyone (students, journalists, even some textbooks) wants to say:

"There's a 95% chance the true mean is in this interval."

That sentence is wrong. The true mean $\mu$ is a fixed number; it is either in the interval or it is not. There is no random event there to assign probability to. What the 95% refers to is the procedure, not the interval you happen to have.

What a 95% CI does mean

"If we repeated the procedure many times — draw a fresh sample, compute the same interval — 95% of the resulting intervals would contain the true $\mu$. The procedure is calibrated to have 95% long-run coverage."

The randomness lives in the interval endpoints, which change with every sample. The parameter is fixed.
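Long-run coverage can be checked by simulation, because in a simulation the true $\mu$ is known. The sketch below assumes a normal population with $\mu = 50$, $\sigma = 10$, and samples of size 25; the function name `coverage`, the seed, and the repeat count are choices made for this example.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, repeats, confidence=0.95, seed=2):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(repeats):
        # Fresh sample, fresh interval; mu itself never changes.
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / repeats

observed = coverage(mu=50.0, sigma=10.0, n=25, repeats=10000)
print(observed)  # should be close to 0.95
```

Each individual interval in the loop either contains $\mu$ or does not; the 95% only shows up as the fraction of hits across the whole run.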

What it does NOT mean

"There's a 95% chance $\mu$ is in this particular interval." (No: $\mu$ is fixed; this interval either contains it or doesn't.)
"95% of the population is in this interval." (No: a CI is about a parameter, not the data.)
"95% of future sample means will land in this interval." (No: this is about $\mu$, not future $\bar X$.)

Pitfall

The Bayesian framework does let you say "there's a 95% probability $\mu$ is in this range", but only via a credible interval, computed from a posterior distribution with a prior. Frequentist confidence intervals (the ones in this section) are a different object, and conflating the two is one of the most common errors in statistical reporting.

How the width responds to inputs

If you change…	The CI width…	Because
Sample size $n$ up	shrinks (as $1/\sqrt n$)	$\operatorname{SE}$ shrinks
Population SD $\sigma$ down	shrinks	$\operatorname{SE}$ shrinks
Confidence up (e.g. 95% → 99%)	grows	bigger $z_{\alpha/2}$ needed for higher coverage

You can't get a tighter interval and stronger confidence and the same sample size — something has to give. The three knobs trade off against each other, and that tradeoff is the entire economics of study design.

7. When $\sigma$ is unknown: the $t$-distribution

The clean formula $\bar X \pm z_{\alpha/2} \cdot \sigma/\sqrt n$ assumes you know the population standard deviation $\sigma$. You almost never do. The natural fix — substitute the sample standard deviation $S$ — works fine when $n$ is large, but introduces real extra uncertainty when $n$ is small: now you're estimating both the mean and the spread from the same handful of data, and the standardized quantity

$$ T \;=\; \frac{\bar X - \mu}{S/\sqrt n} $$

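is no longer standard normal. When the population itself is normal, $T$ follows Student's $t$-distribution with $n - 1$ degrees of freedom exactly; for populations that are only roughly normal, the $t$-distribution is still the standard approximation.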

The $t$-distribution is bell-shaped and symmetric, like the normal, but has heavier tails — extreme values are more likely than the normal would predict, which is exactly the kind of cushion you want when you don't really know $\sigma$. As $n \to \infty$, the $t$-distribution converges to the normal, and the two intervals become indistinguishable.

The $t$-based confidence interval simply swaps $z$ for the corresponding $t$ critical value:

$$ \bar X \;\pm\; t_{\alpha/2,\, n-1} \cdot \frac{S}{\sqrt n}. $$

Situation	Critical value	Use
$\sigma$ known (rare)	$z_{\alpha/2}$	$z$-interval
$\sigma$ unknown, small $n$	$t_{\alpha/2,\, n-1}$	$t$-interval (the everyday case)
$\sigma$ unknown, large $n$ ($\gtrsim 30$)	$z_{\alpha/2}$ (close enough)	either works in practice
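A simulation shows what is at stake in the small-$n$ row. The sketch below assumes a standard normal population and $n = 5$, builds intervals from $S$, and compares the critical value $1.960$ with the table value $t_{0.025,\,4} \approx 2.776$. The function name, seed, and repeat count are illustrative choices.

```python
import math
import random
import statistics

def coverage_with_s(critical_value, n, repeats, mu=0.0, sigma=1.0, seed=3):
    """Coverage of x_bar +/- critical_value * S / sqrt(n) when sigma is
    estimated by the sample standard deviation S."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        margin = critical_value * statistics.stdev(sample) / math.sqrt(n)
        if abs(x_bar - mu) <= margin:
            hits += 1
    return hits / repeats

n = 5
z_cov = coverage_with_s(1.960, n, repeats=10000)  # z critical value
t_cov = coverage_with_s(2.776, n, repeats=10000)  # t critical value, 4 df
print(z_cov)  # noticeably below 0.95
print(t_cov)  # close to 0.95
```

Under these assumptions the $z$-based interval covers well under 95% of the time, while the $t$-based interval restores the nominal level.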

Where the $t$ came from

William Sealy Gosset derived the $t$-distribution in 1908 while working as a chemist at Guinness, where small-batch beer testing meant he routinely had $n$ in the single digits. Guinness's confidentiality policy forced him to publish under the pseudonym "Student," and the name stuck.

8. The flip side: hypothesis testing

Confidence intervals say "given the data, here is a calibrated range of parameter values consistent with it." Hypothesis testing asks the dual question: "given a specific proposed parameter value, is the data consistent with it?" The two are flip sides of the same coin.

The link is exact for the two-sided test: at significance level $\alpha$, you would reject a null hypothesis $H_0: \mu = \mu_0$ if and only if $\mu_0$ falls outside the $(1-\alpha)$ confidence interval for $\mu$. A 95% CI is exactly the set of null values you would fail to reject at the 5% level.

That equivalence is worth holding onto, because it dissolves a lot of the apparent strangeness of $p$-values: every $p$-value below a threshold corresponds to a specific null value lying outside a specific CI. Same inference, different presentation. The dedicated topic on hypothesis testing picks up the formalism — null and alternative hypotheses, Type I and II errors, test statistics, $p$-values — from this starting point.
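The equivalence can be checked numerically for the known-$\sigma$ case. The sketch below pairs a two-sided $z$-test $p$-value with the matching $z$-interval; both helper names and the inputs ($\bar x = 100$, $\sigma = 15$, $n = 36$) are assumptions for illustration.

```python
import math
from statistics import NormalDist

def z_test_p_value(x_bar, mu_0, sigma, n):
    """Two-sided p-value for H0: mu = mu_0 with known sigma."""
    z = (x_bar - mu_0) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided z-based confidence interval for the mean."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

alpha = 0.05
x_bar, sigma, n = 100.0, 15.0, 36
low, high = z_interval(x_bar, sigma, n, confidence=1.0 - alpha)

results = []
for mu_0 in (94.0, 96.0, 100.0, 104.0, 106.0):
    rejected = z_test_p_value(x_bar, mu_0, sigma, n) < alpha
    outside = not (low <= mu_0 <= high)
    results.append((mu_0, rejected, outside))
    print(mu_0, "rejected:", rejected, "outside CI:", outside)
```

For every candidate $\mu_0$ the two columns agree: the test rejects exactly the values the interval excludes.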

9. Common pitfalls

"95% chance the parameter is in the interval"

The single most common mistake. Repeat: $\mu$ is fixed, the interval is random. 95% refers to the long-run behavior of the procedure across hypothetical repeated samples, not the probability assigned to any particular interval you've already computed.

Standard error vs standard deviation

Reporting "$\bar x = 50$, SD = 10" describes the data. Reporting "$\bar x = 50$, SE = 1" describes the precision of $\bar x$ as an estimate of $\mu$. Mixing them up — putting SD where SE belongs, or vice versa — silently changes your error bars by a factor of $\sqrt n$.

Using $z$ when you should use $t$

If $\sigma$ is unknown and $n$ is small (say, $n < 30$), the $z$-interval is too narrow — its actual coverage falls below the nominal level. Default to $t$ unless you genuinely know $\sigma$ or $n$ is large.

$n$ vs $n - 1$ in the sample variance

Dividing by $n$ gives a biased estimate of $\sigma^2$ (the maximum-likelihood estimator for a normal). Dividing by $n - 1$ gives the unbiased sample variance $S^2$. Most introductory contexts want the unbiased version; software defaults vary, so check.

Non-random samples break everything

The whole machinery assumes the sample is a random draw from the population. A convenience sample — voluntary survey, easy-to-reach subjects — can give a beautifully tight confidence interval around the wrong number. Random sampling is not a technicality; it is the assumption that makes the math mean what it says.

Multiple intervals are not jointly calibrated

Each 95% CI is individually calibrated to 95% coverage. Compute twenty of them and, if the intervals are independent, the probability that all contain their parameters is only $0.95^{20} \approx 0.36$. Multiple comparisons need joint adjustments (Bonferroni, etc.); a single threshold doesn't survive being applied many times.
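The arithmetic, and the simplest fix, can be sketched in a few lines. The Bonferroni adjustment splits the total error budget $\alpha$ evenly across the $k$ intervals; the function names here are hypothetical, and the joint-coverage calculation assumes the intervals are independent.

```python
def joint_coverage_if_independent(confidence, k):
    """Probability that k independent intervals all cover their parameters."""
    return confidence ** k

def bonferroni_confidence(family_confidence, k):
    """Per-interval confidence level so that k intervals jointly have
    at least family_confidence coverage (Bonferroni adjustment)."""
    alpha = 1.0 - family_confidence
    return 1.0 - alpha / k

print(round(joint_coverage_if_independent(0.95, 20), 3))  # 0.358

per_interval = bonferroni_confidence(0.95, 20)
print(round(per_interval, 4))                                      # 0.9975
print(round(joint_coverage_if_independent(per_interval, 20), 3))   # 0.951
```

The price of the adjustment is width: each interval must be built at the 99.75% level rather than the 95% level.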

10. Worked examples

Example 1 · A 95% CI for the mean, $\sigma$ known

A sample of $n = 100$ household incomes gives $\bar x = 50{,}000$. From a national database, $\sigma$ is taken as known: $\sigma = 10{,}000$.

Step 1. Standard error:

$$ \operatorname{SE}(\bar X) \;=\; \frac{\sigma}{\sqrt n} \;=\; \frac{10{,}000}{\sqrt{100}} \;=\; 1{,}000. $$

Step 2. For 95% confidence, $z_{0.025} \approx 1.96$. Margin of error:

$$ 1.96 \times 1{,}000 \;=\; 1{,}960. $$

Step 3. Interval:

$$ 50{,}000 \;\pm\; 1{,}960 \;=\; (48{,}040,\; 51{,}960). $$

Interpretation. The procedure that produced this interval has 95% long-run coverage — across many repeated samples, 95% of intervals built this way would contain the true mean. It is not right to say "there's a 95% chance the true mean is between $48{,}040$ and $51{,}960$."

Example 2 · A $t$-interval when $\sigma$ is unknown

A small clinical study measures recovery time on $n = 10$ patients: $\bar x = 8.4$ days, $s = 1.8$ days.

Step 1. Estimated standard error:

$$ \operatorname{SE} \;=\; \frac{s}{\sqrt n} \;=\; \frac{1.8}{\sqrt{10}} \;\approx\; 0.569. $$

Step 2. Degrees of freedom: $n - 1 = 9$. From a $t$-table, $t_{0.025,\,9} \approx 2.262$.

Step 3. Margin of error:

$$ 2.262 \times 0.569 \;\approx\; 1.287. $$

Step 4. Interval:

$$ 8.4 \;\pm\; 1.287 \;=\; (7.11,\; 9.69)\;\text{days}. $$

Compare to what you'd have gotten with $z = 1.96$: a margin of $1.96 \times 0.569 \approx 1.12$, giving the misleadingly tighter $(7.28, 9.52)$. The $t$ interval is wider because it honestly accounts for not knowing $\sigma$.
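The same calculation as a short sketch. Python's standard library has no $t$-quantile function, so the critical value is passed in from a table, exactly as in Step 2; `interval_from_critical_value` is a hypothetical helper name.

```python
import math

def interval_from_critical_value(x_bar, s, n, critical_value):
    """Interval x_bar +/- critical_value * s / sqrt(n).
    The critical value is supplied by the caller (e.g. from a t-table)."""
    margin = critical_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Example 2 inputs: n = 10, x_bar = 8.4 days, s = 1.8 days.
t_low, t_high = interval_from_critical_value(8.4, 1.8, 10, 2.262)  # t, 9 df
z_low, z_high = interval_from_critical_value(8.4, 1.8, 10, 1.96)   # z, for comparison

print(round(t_low, 2), round(t_high, 2))  # 7.11 9.69
print(round(z_low, 2), round(z_high, 2))  # 7.28 9.52
```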

Example 3 · How $n$ shrinks the standard error

Suppose $\sigma = 20$. The standard error of $\bar X$ at several sample sizes:

$$ \begin{aligned} n &= 25 :\quad \operatorname{SE} = 20/\sqrt{25} = 4 \\ n &= 100 :\quad \operatorname{SE} = 20/\sqrt{100} = 2 \\ n &= 400 :\quad \operatorname{SE} = 20/\sqrt{400} = 1 \\ n &= 1600 :\quad \operatorname{SE} = 20/\sqrt{1600} = 0.5 \end{aligned} $$

Each time you cut SE in half you need $4\times$ more data. To go from SE $= 4$ to SE $= 0.5$ — an $8\times$ tightening — you needed $64\times$ the sample size. Square-root scaling is the entire reason "big enough" sample sizes for high-precision work get expensive fast.

Example 4 · Confidence interval for a proportion

A survey asks 400 voters; 240 support a measure. Sample proportion $\hat p = 240/400 = 0.6$.

Step 1. Standard error (Wald approximation):

$$ \operatorname{SE}(\hat p) \;=\; \sqrt{\frac{\hat p (1-\hat p)}{n}} \;=\; \sqrt{\frac{0.6 \cdot 0.4}{400}} \;=\; \sqrt{0.0006} \;\approx\; 0.0245. $$

Step 2. 95% CI:

$$ 0.6 \;\pm\; 1.96 \times 0.0245 \;\approx\; 0.6 \;\pm\; 0.048 \;=\; (0.552,\; 0.648). $$

The CLT lets us use a $z$-interval here because $n\hat p$ and $n(1-\hat p)$ are both comfortably large. Newspapers would round the $4.8$-point margin and report this as "60% support, margin of error $\pm 5$ percentage points."
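A minimal sketch of the Wald interval used above, with the survey's counts as input. `wald_interval` is a hypothetical helper name; the approximation is only reasonable under the large-count condition just stated.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = wald_interval(successes=240, n=400)
print(round(low, 3), round(high, 3))  # 0.552 0.648
```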

Example 5 · Why the sample mean is unbiased

Let $X_1, \ldots, X_n$ be iid draws from a population with mean $\mu$. Then:

$$ E[\bar X] \;=\; E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] \;=\; \frac{1}{n}\sum_{i=1}^{n} E[X_i] \;=\; \frac{1}{n} \cdot n\mu \;=\; \mu. $$

The estimator's sampling distribution is centered exactly on the target parameter — that is what "unbiased" formally means. (Linearity of expectation does all the work; the argument doesn't require normality, just iid sampling and a finite mean.)

Sources & further reading

The treatment above synthesizes standard introductory-statistics material; the sources below are where to go for fuller derivations, more examples, and the interactive visualizations that make sampling distributions click.

Confidence Intervals (Textbook). OpenStax · Introductory Statistics 2e, Ch. 8

Peer-reviewed, openly licensed chapter covering $z$- and $t$-based CIs and proportion intervals with worked examples and exercises. The closest thing to a canonical undergraduate reference for the material on this page.

OpenIntro Statistics (Textbook). Diez, Çetinkaya-Rundel, Barr · free PDF

Free, modern, end-to-end statistics textbook. Excellent companion if you want to see point estimation, CIs, and hypothesis testing developed in a single coherent narrative with real datasets.

Seeing Theory · Frequentist Inference (Tutorial). Brown University

Browser-based interactive visualizations of sampling distributions, point estimates, and confidence intervals. The fastest way to internalize "the procedure has 95% coverage" by watching repeated CIs slide past the true parameter.

STAT 415 · Introduction to Mathematical Statistics (Course). Penn State Eberly College

Full lecture notes for a mathematical-statistics course at the level just above this page. Best for the formal derivations (unbiasedness, MSE, MLE, the distribution of $T$) once you want the proofs.

Confidence Interval (Reference). Wolfram MathWorld

Short, dense, precise definition. Use this when you want the formal frequentist statement of coverage stated in the language professional statisticians actually use.

Statistical inference (Encyclopedia). Wikipedia

Wide overview of the entire field, including the contrast between frequentist and Bayesian inference that this page mentions only in passing. Useful for placing inferential statistics in the broader mathematical landscape.