1. The framework
You ran a study. You measured something. The number you got isn't quite what the boring, no-effect story would predict — but it isn't wildly off either. How do you decide whether to take the result seriously?
Hypothesis testing answers that question with a deliberately conservative procedure: assume the boring story is true, then ask how often a universe running on the boring story would, by chance alone, produce data at least as strange as yours.
- Start with a default assumption — the null hypothesis $H_0$. Often a statement of "no effect" or "no difference." For a drug trial: "this drug does nothing."
- Collect data — and reduce it to a single number called the test statistic (a sample mean, a difference of means, a count of heads, etc.).
- Compute how unusual the data is, under $H_0$ — the probability of observing data at least as extreme as yours, assuming $H_0$ is true. This is the p-value.
- Decide — if the p-value is small enough (below some threshold $\alpha$, usually $0.05$), reject $H_0$ in favor of the alternative $H_1$. Otherwise, fail to reject $H_0$.
The whole machine is a falsification engine. It can never prove $H_0$ — it can only tell you whether your data was strange enough, conditional on $H_0$, that holding onto $H_0$ becomes uncomfortable.
The procedure is biased toward $H_0$ on purpose. Science accumulates by being slow to accept new effects — if we rejected the boring story whenever data was suggestive, the literature would fill with phantoms. The null gets the benefit of the doubt; the alternative must earn the upgrade.
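The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a general-purpose test: the coin example, the numbers (100 flips, 60 heads), and the helper name `binomial_two_sided_p` are all assumptions made for the sake of the example.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for a coin under H0: P(heads) = p_null.

    "At least as extreme" is taken to mean "at least as far from the
    expected count as the observed count", which is one common convention
    and is symmetric for a fair-coin null.
    """
    expected = flips * p_null
    observed_gap = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_gap:
            total += comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
    return total

# Step 1: H0 says the coin is fair (p = 0.5).
# Step 2: the test statistic is the number of heads (hypothetical data).
heads, flips = 60, 100
# Step 3: how unusual is that count if H0 is true?
p_value = binomial_two_sided_p(heads, flips)
# Step 4: compare with a threshold chosen in advance.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

With these made-up numbers the p-value lands just above $0.05$, so the procedure declines to reject: 60 heads in 100 flips is suggestive, but not strange enough to abandon the fair-coin story.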
A picture of the p-value
If $H_0$ is true, the test statistic has some sampling distribution — for sample means, the Central Limit Theorem says it's approximately normal. The p-value is the area in the tail past your observed value.
The shaded sliver is everything that's at least as extreme as what you saw. Big sliver — your data is unremarkable. Tiny sliver — your data is hard to explain by chance alone, and $H_0$ starts to creak.
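To put numbers on the tail area, here is a small sketch that assumes the test statistic has already been standardized to a z-score, so the sampling distribution under $H_0$ is the standard normal. The helper name `upper_tail_area` and the z-values are illustrative choices.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def upper_tail_area(z: float) -> float:
    """Area under the standard normal curve to the right of z."""
    return 1.0 - standard_normal.cdf(z)

# The further out the observed value, the smaller the area past it.
for z in (0.5, 1.0, 2.0, 3.0):
    print(f"z = {z:.1f}: area past z = {upper_tail_area(z):.4f}")
```

The area shrinks quickly as z grows: a little under a third of the curve lies past $z = 0.5$, but only about $2.3\%$ lies past $z = 2$.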
2. Null and alternative hypotheses
Every test has exactly two hypotheses: a null and an alternative. Between them they should cover every possibility you are prepared to consider.
Null hypothesis $H_0$. The conservative claim: what you'd believe by default. Usually a statement of "no effect," "no difference," or "the parameter equals some reference value." Formally, $H_0$ specifies the distribution of the data precisely enough to compute probabilities under it.
Alternative hypothesis $H_1$. The claim you would adopt only if the evidence pushes you. It's the change you're looking for: "the drug does something," "the coin is biased," "the mean is bigger than $50$."
Examples
| Setting | $H_0$ | $H_1$ |
|---|---|---|
| Testing a coin for fairness | $p = 0.5$ | $p \neq 0.5$ |
| Drug vs placebo (recovery rate) | $\mu_\text{drug} = \mu_\text{placebo}$ | $\mu_\text{drug} > \mu_\text{placebo}$ |
| Is the average IQ in a population $100$? | $\mu = 100$ | $\mu \neq 100$ |
One-sided or two-sided?
A two-sided alternative (like $p \neq 0.5$) cares about deviation in either direction. A one-sided alternative (like $\mu_\text{drug} > \mu_\text{placebo}$) cares only about one direction. The choice should be made before looking at the data, based on what question you're actually asking. If you only care whether the drug helps — not whether it might hurt — a one-sided test is right.
Choosing the direction of $H_1$ after peeking at the data is a flavor of p-hacking. A two-sided $p = 0.06$ that becomes a one-sided $p = 0.03$ only after you noticed which way the effect went is not a real $p = 0.03$.
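The arithmetic behind that warning is easy to check. The sketch below assumes a z-test and a hypothetical statistic of $z = 1.88$, chosen only because it gives a two-sided p-value close to $0.06$. The function name `z_test_p_values` is made up for the example.

```python
from statistics import NormalDist

def z_test_p_values(z: float) -> tuple[float, float]:
    """Return (two_sided, one_sided_upper) p-values for a z-statistic."""
    std_normal = NormalDist()
    # Two-sided: area past |z| in both tails.
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    # One-sided (H1: parameter is larger): area in the upper tail only.
    one_sided_upper = 1.0 - std_normal.cdf(z)
    return two_sided, one_sided_upper

z = 1.88  # hypothetical z-statistic
two_sided, one_sided = z_test_p_values(z)
print(f"two-sided p = {two_sided:.3f}, one-sided p = {one_sided:.3f}")
```

When the effect points in the direction $H_1$ names, the one-sided p-value is exactly half the two-sided one. That halving is only legitimate if the direction was fixed before the data came in.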
3. p-values and significance
The p-value is the single number that summarizes how surprising your data is under the null. Write it precisely:
$$ p = P(\text{data at least as extreme as observed} \mid H_0) $$
Every word matters. "Data at least as extreme as observed" means including all the more-extreme outcomes you didn't see: it's a tail area, not a single point. "Given $H_0$" means we're computing this probability assuming the null is true. The p-value describes a hypothetical world in which $H_0$ holds; on its own it does not say how likely that world is to be the real one.
The decision rule
Choose a significance level $\alpha$ before running the test. Conventional values are $0.05$, $0.01$, or $0.001$, but these are conventions, not laws of nature. Then:
$$ \begin{aligned} p < \alpha &\quad\Longrightarrow\quad \text{reject } H_0 \\ p \geq \alpha &\quad\Longrightarrow\quad \text{fail to reject } H_0 \end{aligned} $$
Notice the asymmetry: you reject or you fail to reject. You never accept $H_0$. The test simply hasn't given you enough reason to abandon it.
A worked computation
Suppose a population has mean $\mu_0 = 100$ and standard deviation $\sigma = 15$ under $H_0$. You sample $n = 36$ people and observe a sample mean $\bar{x} = 105$. Test $H_0: \mu = 100$ against $H_1: \mu \neq 100$.
By the Central Limit Theorem, $\bar{X}$ is approximately $\mathcal{N}(\mu_0, \sigma^2/n)$ under $H_0$. The standard error is
$$ \text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{36}} = 2.5 $$
The z-statistic is
$$ z = \frac{\bar{x} - \mu_0}{\text{SE}} = \frac{105 - 100}{2.5} = 2.0 $$
For a two-sided test, the p-value is the area in both tails past $|z| = 2.0$:
$$ p = 2 \cdot P(Z > 2.0) \approx 2 \cdot 0.0228 = 0.0456 $$
Since $p \approx 0.046 < 0.05$, reject $H_0$ at the $\alpha = 0.05$ level. The sample mean is surprising enough, under the null, that we stop giving the null the benefit of the doubt.
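The same computation can be reproduced with Python's standard library. This is a sketch of the worked example only, with the known-$\sigma$ z-test assumed as in the text. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Numbers from the worked example.
mu_0, sigma, n, x_bar = 100.0, 15.0, 36, 105.0

standard_error = sigma / sqrt(n)            # 15 / 6
z = (x_bar - mu_0) / standard_error         # distance from the null mean, in SE units
# Two-sided: double the area in the upper tail past |z|.
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))

alpha = 0.05
reject = p_value < alpha
print(f"SE = {standard_error}, z = {z}, p = {p_value:.4f}, reject H0: {reject}")
```

Carried to more digits, the upper-tail area is about $0.02275$, so the p-value comes out near $0.0455$. The $0.0456$ in the text is the result of rounding the tail area to $0.0228$ before doubling. The decision is the same either way.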
"Extreme" is defined by $H_1$. For a two-sided test, extreme means "far from the null value in either direction" — so the tail area is doubled. For a one-sided test, it means "far in the direction $H_1$ specifies" — one tail only.
4. Type I and Type II errors
The test can be wrong in two ways, and they're not symmetric. Real-world consequences usually depend on which error matters more.
| | $H_0$ is true | $H_1$ is true |
|---|---|---|
| Reject $H_0$ | Type I error (false positive) · rate $\alpha$ | Correct decision · power $1 - \beta$ |
| Fail to reject $H_0$ | Correct decision · $1 - \alpha$ | Type II error (false negative) · rate $\beta$ |
Type I error (false positive):
- Reject $H_0$ when it's actually true.
- "We found an effect," but there is none.
- Rate: $\alpha$ (the significance level you chose).
- You control this directly by your choice of $\alpha$.
Type II error (false negative):
- Fail to reject $H_0$ when $H_1$ is actually true.
- "No effect found," but there is one.
- Rate: $\beta$ (depends on the true effect size, $n$, and $\alpha$).
- You control this indirectly via sample size.
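Both error rates can be estimated by simulation. The sketch below assumes normally distributed data with known $\sigma$ and a two-sided z-test. The specific numbers (null mean $100$, $\sigma = 15$, $n = 36$, a true mean of $105$ under the alternative) and the helper names `rejects_null` and `rejection_rate` are choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejects_null(sample, mu_0, sigma, alpha):
    """Two-sided z-test with known sigma; True if H0: mu = mu_0 is rejected."""
    z = (fmean(sample) - mu_0) / (sigma / sqrt(len(sample)))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def rejection_rate(true_mu, mu_0=100.0, sigma=15.0, n=36, alpha=0.05,
                   trials=4000, seed=0):
    """Fraction of simulated studies that reject H0 when the real mean is true_mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        rejections += rejects_null(sample, mu_0, sigma, alpha)
    return rejections / trials

# H0 true: every rejection is a Type I error, so this estimates alpha.
type_i_rate = rejection_rate(true_mu=100.0)
# H1 true (mean really is 105): rejections are correct, misses are Type II errors.
power_at_105 = rejection_rate(true_mu=105.0)
type_ii_rate = 1.0 - power_at_105

print(f"Type I rate  ~ {type_i_rate:.3f}")
print(f"Type II rate ~ {type_ii_rate:.3f}")
```

The simulated Type I rate should hover near the chosen $\alpha$, while the Type II rate depends on how far the true mean sits from the null and on the sample size.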
Power
The probability of correctly rejecting $H_0$ when $H_1$ is true:
$$ \text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ true}) $$
A test with $80\%$ power detects a real effect $80\%$ of the time. Power rises with effect size, sample size, and $\alpha$; it falls with noisier data.
Power is the unsung hero of study design. A study with $20\%$ power that finds nothing tells you almost nothing — the effect could easily be real and you just missed it. Before running a test, you should know what effect size you're hunting and how much data you need to have a reasonable chance of seeing it.
Shrinking $\alpha$ (fewer false positives) inflates $\beta$ (more false negatives) unless you also grow your sample size. For a fixed effect size and noise level, the only way to drive both error rates down simultaneously is to collect more data.
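For a two-sided z-test with known $\sigma$, power has a closed form, which makes the trade-off easy to tabulate. The function `z_test_power` below is an illustrative helper, and the effect size ($5$ points), $\sigma = 15$, and the sample sizes are assumed values.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided z-test when the true mean is `effect` away from the null mean."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)   # rejection cutoff for |z|
    shift = effect / (sigma / sqrt(n))                # true effect in standard-error units
    # Under the alternative the z-statistic is N(shift, 1); reject when |z| > z_crit.
    upper = 1.0 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower

for n in (9, 36, 100):
    for alpha in (0.05, 0.01):
        power = z_test_power(effect=5.0, sigma=15.0, n=n, alpha=alpha)
        print(f"n = {n:3d}, alpha = {alpha:.2f}: power = {power:.3f}, beta = {1 - power:.3f}")
```

At any fixed $n$, tightening $\alpha$ from $0.05$ to $0.01$ lowers power. Raising $n$ is what buys it back. With $n = 36$ and $\alpha = 0.05$, power against a $5$-point effect is only about one half.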
5. What p-values do and don't say
Almost every introductory misuse of hypothesis testing comes from confusing the conditional probability $P(\text{data} \mid H_0)$ with the reverse $P(H_0 \mid \text{data})$. They are not the same number. They are not even close in general.
The p-value is a statement about the data, given a hypothesis. It is not a statement about the hypothesis, given the data.
A p-value of $0.05$ means: if $H_0$ were true, you'd see data this extreme about $5\%$ of the time. It does not mean that $H_0$ has a $5\%$ probability of being true — that's a different question (a Bayesian one), and you need a prior to answer it.
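A short Bayes' rule calculation shows how far apart the two quantities can be. It is a sketch under loud assumptions: the prior (here, $90\%$ of tested hypotheses are true nulls) is invented for illustration, and the calculation conditions on "the test came out significant" rather than on an exact p-value. The function name `prob_null_given_significant` is hypothetical.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(H0 is true | test rejected H0), by Bayes' rule.

    prior_null: assumed fraction of tested hypotheses where H0 is true
    alpha:      P(reject | H0 true)
    power:      P(reject | H1 true)
    """
    false_positives = prior_null * alpha
    true_positives = (1.0 - prior_null) * power
    return false_positives / (false_positives + true_positives)

# Assumed scenario: 90% of tested effects are not real.
well_powered = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
under_powered = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.2)

print(f"power 0.8: P(H0 | significant) = {well_powered:.2f}")
print(f"power 0.2: P(H0 | significant) = {under_powered:.2f}")
```

Under these assumptions, $36\%$ of significant results come from true nulls even with $80\%$ power, and about $69\%$ do when power is $20\%$. Neither number is anywhere near $5\%$, and both move when the prior moves.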
Why this matters: the replication crisis
Across psychology, biomedicine, and parts of economics, large-scale efforts to re-run famous "$p < 0.05$" findings have failed at startling rates — often only $30$–$50\%$ of headline results replicate. Several forces conspire here:
- Publication bias. Journals print significant results; null results sit in file drawers. The literature is enriched for false positives.
- The garden of forking paths. Each study makes many small choices (which outliers to drop, which subgroup to analyze, when to stop collecting). Each choice is an implicit test. Run twenty independent tests at $\alpha = 0.05$ on true nulls and you expect about one significant result by chance alone.
- Low power. Many fields chronically under-power their studies. Significant results from low-power studies are more likely to be false positives and to overstate the effect size.
- The $0.05$ fetish. Treating $0.049$ as a triumph and $0.051$ as a failure is a category error. The difference between those two p-values is not, itself, statistically significant.
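The "twenty tests" arithmetic from the forking-paths point can be made explicit. The sketch assumes every null is true and the tests are independent, which real analysis choices rarely are. The function names are illustrative.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of significant results when every null is true."""
    return num_tests * alpha

def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Chance that at least one of several independent true-null tests is significant."""
    return 1.0 - (1.0 - alpha) ** num_tests

num_tests, alpha = 20, 0.05
expected = expected_false_positives(num_tests, alpha)
at_least_one = prob_at_least_one_false_positive(num_tests, alpha)

print(f"expected false positives: {expected:.1f}")
print(f"P(at least one): {at_least_one:.3f}")
```

One false positive is expected on average, and the chance of seeing at least one is roughly $64\%$. A single "significant" result picked out of twenty looks is therefore weak evidence.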
"$p = 0.05$ means there's a $5\%$ chance $H_0$ is true" — wrong. It means: if $H_0$ were true, data this extreme would occur about $5\%$ of the time. The probability that $H_0$ is true given the data depends on how plausible $H_0$ was to start with — which the p-value doesn't know.
Used carefully, hypothesis testing is a precise, useful tool. Used as a rubber stamp — "I got $p < 0.05$, therefore I have discovered a thing" — it's how careers get built on noise.