Hypothesis Testing

What you'll leave with

The four-step framework: null hypothesis, data, p-value, decision.
How to set up $H_0$ and $H_1$ — and why the burden of proof sits on $H_1$.
What the p-value actually is — a conditional probability about data, not about hypotheses.
Type I vs Type II errors, their rates $\alpha$ and $\beta$, and the meaning of power $1 - \beta$.
The most common misreadings of $p < 0.05$ and how they fed the replication crisis.

1. The framework

You ran a study. You measured something. The number you got isn't quite what the boring, no-effect story would predict — but it isn't wildly off either. How do you decide whether to take the result seriously?

Hypothesis testing answers that question with a deliberately conservative procedure: assume the boring story is true, then ask how often a universe running on the boring story would, by chance alone, produce data at least as strange as yours.

The four moves

Start with a default assumption — the null hypothesis $H_0$. Often a statement of "no effect" or "no difference." For a drug trial: "this drug does nothing."
Collect data — and reduce it to a single number called the test statistic (a sample mean, a difference of means, a count of heads, etc.).
Compute how unusual the data is, under $H_0$ — the probability of observing data at least as extreme as yours, assuming $H_0$ is true. This is the p-value.
Decide — if the p-value is small enough (below some threshold $\alpha$, usually $0.05$), reject $H_0$ in favor of the alternative $H_1$. Otherwise, fail to reject $H_0$.
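The four moves can be sketched in a few lines of Python. This is a minimal illustration rather than a general-purpose test: the coin data (60 heads in 100 flips) and the function name `binomial_two_sided_p` are made up for the example, and "at least as extreme" is taken to mean "at least as far from the expected count," which is a reasonable reading here because the null distribution of a fair coin is symmetric.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for a count of heads under H0: P(heads) = p_null.

    "At least as extreme" means a count at least as far from the expected
    count as the observed one, in either direction.
    """
    expected = flips * p_null
    observed_gap = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_gap:
            total += comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
    return total

# Move 1: H0 says the coin is fair (p_null = 0.5).
# Move 2: collect data and reduce it to a test statistic (hypothetical numbers).
heads, flips = 60, 100
# Move 3: compute the p-value under H0.
p_value = binomial_two_sided_p(heads, flips)
# Move 4: compare against a threshold chosen in advance.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p = {p_value:.4f} -> {decision}")
```

With these made-up numbers the p-value lands a little above $0.05$, so the procedure fails to reject $H_0$ even though 60 heads looks suggestive.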

The whole machine is a falsification engine. It can never prove $H_0$ — it can only tell you whether your data was strange enough, conditional on $H_0$, that holding onto $H_0$ becomes uncomfortable.

Why this asymmetry?

The procedure is biased toward $H_0$ on purpose. Science accumulates by being slow to accept new effects — if we rejected the boring story whenever data was suggestive, the literature would fill with phantoms. The null gets the benefit of the doubt; the alternative must earn the upgrade.

A picture of the p-value

If $H_0$ is true, the test statistic has some sampling distribution — for sample means, the Central Limit Theorem says it's approximately normal. The p-value is the area in the tail past your observed value.

Picture the tail of that distribution beyond your observed value as a shaded sliver: it is everything that's at least as extreme as what you saw. A big sliver means your data is unremarkable. A tiny sliver means your data is hard to explain by chance alone, and $H_0$ starts to creak.
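As a rough numerical companion to the picture, the tail area can be computed directly once the test statistic has been standardized so that its null distribution is standard normal (an assumption of this sketch). The helper name `upper_tail_area` and the three example values of `z` are illustrative only.

```python
from statistics import NormalDist

def upper_tail_area(z: float) -> float:
    """Area under the standard normal curve to the right of z."""
    return 1.0 - NormalDist().cdf(z)

# Larger z means the observed value sits further out in the tail.
for z in (0.5, 2.0, 3.0):
    print(f"z = {z:.1f}: upper-tail area = {upper_tail_area(z):.4f}")
```

The area shrinks quickly as `z` grows, which is the "big sliver versus tiny sliver" contrast in numbers.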

2. Null and alternative hypotheses

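Every test pits exactly two hypotheses against each other. In a two-sided test they cover the full space of possibilities between them; in a one-sided test the alternative covers only the direction you care about.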

Null hypothesis ($H_0$)

The conservative claim — what you'd believe by default. Usually a statement of "no effect," "no difference," or "the parameter equals some reference value." Formally, $H_0$ specifies the distribution of the data precisely enough to compute probabilities under it.

Alternative hypothesis ($H_1$ or $H_a$)

The claim you would adopt only if the evidence pushes you. It's the change you're looking for: "the drug does something," "the coin is biased," "the mean is bigger than $50$."

Examples

Setting	$H_0$	$H_1$
Testing a coin for fairness	$p = 0.5$	$p \neq 0.5$
Drug vs placebo (recovery rate)	$\mu_\text{drug} = \mu_\text{placebo}$	$\mu_\text{drug} > \mu_\text{placebo}$
Is the average IQ in a population $100$?	$\mu = 100$	$\mu \neq 100$

One-sided or two-sided?

A two-sided alternative (like $p \neq 0.5$) cares about deviation in either direction. A one-sided alternative (like $\mu_\text{drug} > \mu_\text{placebo}$) cares only about one direction. The choice should be made before looking at the data, based on what question you're actually asking. If you only care whether the drug helps — not whether it might hurt — a one-sided test is right.

Pick the alternative before seeing the data

Choosing the direction of $H_1$ after peeking at the data is a flavor of p-hacking. A two-sided $p = 0.06$ that becomes a one-sided $p = 0.03$ only after you noticed which way the effect went is not a real $p = 0.03$.
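To see why the direction has to be fixed in advance, here is a small sketch that computes the p-value for one standardized test statistic `z` (how many standard errors the estimate lies from the null value) under each choice of alternative. It assumes a standard normal null distribution; the function name `p_value` and the labels `"two-sided"`, `"greater"`, and `"less"` are illustrative choices, and `z = 1.88` is picked so the two-sided p-value is close to $0.06$.

```python
from statistics import NormalDist

def p_value(z: float, alternative: str = "two-sided") -> float:
    """p-value for a standard-normal test statistic z.

    alternative: "two-sided", "greater" (H1 says the parameter is larger),
    or "less" (H1 says it is smaller).
    """
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        return 2.0 * (1.0 - cdf(abs(z)))
    if alternative == "greater":
        return 1.0 - cdf(z)
    if alternative == "less":
        return cdf(z)
    raise ValueError(f"unknown alternative: {alternative!r}")

z = 1.88
for alt in ("two-sided", "greater", "less"):
    print(f"{alt:>9}: p = {p_value(z, alt):.4f}")
```

The same data gives roughly $0.06$, $0.03$, or $0.97$ depending on the alternative. Picking "greater" only after seeing that `z` came out positive halves the p-value without adding any evidence.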

3. p-values and significance

The p-value is the single number that summarizes how surprising your data is under the null. Write it precisely:

$$ p = P(\text{data at least as extreme as observed} \mid H_0) $$

Every word matters. "Data at least as extreme as observed" includes all the more-extreme outcomes you didn't see: it's a tail area, not a single point. "Given $H_0$" means the probability is computed assuming the null is true, so the p-value describes a world in which $H_0$ holds and says nothing directly about whether that world is the real one.

The decision rule

Choose a significance level $\alpha$ before running the test. Conventional values are $0.05$, $0.01$, or $0.001$, but these are conventions, not laws of nature. Then:

$$ \begin{aligned} p < \alpha &\quad\Longrightarrow\quad \text{reject } H_0 \\ p \geq \alpha &\quad\Longrightarrow\quad \text{fail to reject } H_0 \end{aligned} $$

Notice the asymmetry: you reject or you fail to reject. You never accept $H_0$. The test simply hasn't given you enough reason to abandon it.

A worked computation

Suppose a population has mean $\mu_0 = 100$ and standard deviation $\sigma = 15$ under $H_0$. You sample $n = 36$ people and observe a sample mean $\bar{x} = 105$. Test $H_0: \mu = 100$ against $H_1: \mu \neq 100$.

By the Central Limit Theorem, $\bar{X}$ is approximately $\mathcal{N}(\mu_0, \sigma^2/n)$ under $H_0$. The standard error is

$$ \text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{36}} = 2.5 $$

The z-statistic is

$$ z = \frac{\bar{x} - \mu_0}{\text{SE}} = \frac{105 - 100}{2.5} = 2.0 $$

For a two-sided test, the p-value is the area in both tails past $|z| = 2.0$:

$$ p = 2 \cdot P(Z > 2.0) \approx 2 \cdot 0.0228 = 0.0456 $$

Since $p \approx 0.046 < 0.05$, reject $H_0$ at the $\alpha = 0.05$ level. The sample mean is surprising enough — under the null — that we don't believe the null anymore.
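The same computation can be reproduced in code as a check on the arithmetic. This is a minimal sketch of a z-test with known $\sigma$, using only the Python standard library; the function name `z_test_two_sided` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean, null_mean, sigma, n):
    """Return (standard error, z-statistic, two-sided p-value) for a z-test
    with known population standard deviation sigma."""
    se = sigma / sqrt(n)
    z = (sample_mean - null_mean) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return se, z, p

se, z, p = z_test_two_sided(sample_mean=105, null_mean=100, sigma=15, n=36)
print(f"SE = {se:.2f}, z = {z:.2f}, p = {p:.4f}")
```

The unrounded p-value is about $0.0455$; the $0.0456$ above comes from rounding the tail area to $0.0228$ before doubling. Either way the result falls just under $0.05$.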

Reading "extreme"

"Extreme" is defined by $H_1$. For a two-sided test, extreme means "far from the null value in either direction" — so the tail area is doubled. For a one-sided test, it means "far in the direction $H_1$ specifies" — one tail only.

4. Type I and Type II errors

The test can be wrong in two ways, and they're not symmetric. Real-world consequences usually depend on which error matters more.

	$H_0$ is true	$H_1$ is true
Reject $H_0$	Type I error (false positive) · rate $\alpha$	Correct decision · power $1 - \beta$
Fail to reject $H_0$	Correct decision · $1 - \alpha$	Type II error (false negative) · rate $\beta$

Type I — false positive

Reject $H_0$ when it's actually true.
"We found an effect" — but there is none.
Rate: $\alpha$ (the significance level you chose).
You control this directly by your choice of $\alpha$.

Type II — false negative

Fail to reject $H_0$ when $H_1$ is actually true.
"No effect found" — but there is one.
Rate: $\beta$ (depends on the true effect size, $n$, and $\alpha$).
You control this indirectly via sample size.

Power

The probability of correctly rejecting $H_0$ when $H_1$ is true:

$$ \text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ true}) $$

A test with $80\%$ power detects a real effect $80\%$ of the time. Power rises with effect size, sample size, and $\alpha$; it falls with noisier data.

Power is the unsung hero of study design. A study with $20\%$ power that finds nothing tells you almost nothing — the effect could easily be real and you just missed it. Before running a test, you should know what effect size you're hunting and how much data you need to have a reasonable chance of seeing it.
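Power can be computed before any data is collected. The sketch below does this for a two-sided z-test with known $\sigma$, reusing $\mu_0 = 100$ and $\sigma = 15$ from the worked computation and assuming, purely for illustration, that the true mean is $105$. The function name `power_two_sided_z` and the sample sizes tried are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided_z(true_mean, null_mean, sigma, n, alpha=0.05):
    """Power of a two-sided z-test: P(reject H0) when the mean is true_mean."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)       # cutoff for |z|
    shift = (true_mean - null_mean) / (sigma / sqrt(n))  # mean of z under H1
    # Reject when z > z_crit or z < -z_crit; under H1, z ~ N(shift, 1).
    upper = 1.0 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower

for n in (9, 36, 100):
    print(f"n = {n:3d}: power = {power_two_sided_z(105, 100, 15, n):.3f}")
```

Under these assumptions a sample of 36 gives only about a coin-flip chance of detecting a true 5-point shift, while a sample of 100 raises that chance above $90\%$.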

The tradeoff

Shrinking $\alpha$ (fewer false positives) inflates $\beta$ (more false negatives) unless you also grow your sample size. For a given effect size and noise level, the only way to drive both error rates down at once is to collect more data.
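The tradeoff can be made concrete by asking how large a sample is needed to keep power at $80\%$ as $\alpha$ shrinks. The sketch below uses a common approximation for a two-sided z-test with known $\sigma$; the effect size of 5 and $\sigma = 15$ are assumed values, and `required_n` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def required_n(effect, sigma, alpha=0.05, power=0.80):
    """Approximate sample size for a two-sided z-test to reach the target power.

    Uses the common approximation that ignores the negligible chance of
    rejecting in the wrong tail.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma / effect) ** 2)

for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: n >= {required_n(effect=5, sigma=15, alpha=alpha)}")
```

Each tightening of $\alpha$ raises the required sample size, which is the price of holding $\beta$ fixed.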

5. What p-values do and don't say

Almost every introductory misuse of hypothesis testing comes from confusing the conditional probability $P(\text{data} \mid H_0)$ with the reverse $P(H_0 \mid \text{data})$. They are not the same number. They are not even close in general.

The p-value is a statement about the data, given a hypothesis. It is not a statement about the hypothesis, given the data.

A p-value of $0.05$ means: if $H_0$ were true, you'd see data this extreme about $5\%$ of the time. It does not mean that $H_0$ has a $5\%$ probability of being true — that's a different question (a Bayesian one), and you need a prior to answer it.
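A short Bayes' rule calculation shows how far apart the two quantities can be. This sketch conditions on the coarse event "the test came out significant at level $\alpha$" rather than on an exact p-value, and the prior (10% of tested effects are real) and the power values are assumptions chosen for illustration. The function name is hypothetical.

```python
def prob_null_given_significant(prior_effect_real, power, alpha=0.05):
    """P(H0 is true | test came out significant), by Bayes' rule.

    prior_effect_real: assumed share of tested hypotheses where H1 is true.
    """
    false_pos = alpha * (1.0 - prior_effect_real)  # H0 true and rejected
    true_pos = power * prior_effect_real           # H1 true and rejected
    return false_pos / (false_pos + true_pos)

for power in (0.8, 0.2):
    share = prob_null_given_significant(prior_effect_real=0.10, power=power)
    print(f"power = {power}: P(H0 | significant) = {share:.2f}")
```

Under these assumed inputs, more than a third of significant results come from true nulls at $80\%$ power, and over two thirds at $20\%$ power. Neither figure resembles $5\%$.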

Why this matters: the replication crisis

Across psychology, biomedicine, and parts of economics, large-scale efforts to re-run famous "$p < 0.05$" findings have failed at startling rates — often only $30$–$50\%$ of headline results replicate. Several forces conspire here:

Publication bias. Journals print significant results; null results sit in file drawers. The literature is enriched for false positives.
The garden of forking paths. Each study makes many small choices (which outliers to drop, which subgroup to analyze, when to stop collecting). Each choice is an implicit test. Run twenty tests at $\alpha = 0.05$ on effects that do not exist and you expect about one significant result by chance alone.
Low power. Many fields chronically under-power their studies. Significant results from low-power studies are more likely to be false positives and to overstate the effect size.
The $0.05$ fetish. Treating $0.049$ as a triumph and $0.051$ as a failure is a category error. The difference between those two p-values is not, itself, statistically significant.
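The "twenty tests" arithmetic can be checked with a small simulation. The sketch below assumes every tested effect is truly absent and that the twenty tests are independent, which real forking paths usually are not; it relies on the fact that a p-value from a continuous test statistic is uniform on $[0, 1]$ when $H_0$ is true. The function name and the number of simulated studies are arbitrary choices.

```python
import random

def simulate_null_studies(n_studies=10_000, tests_per_study=20, alpha=0.05, seed=0):
    """Simulate studies in which every tested effect is truly absent.

    Under a true H0 with a continuous test statistic, the p-value is uniform
    on [0, 1], so each test is "significant" with probability alpha.
    """
    rng = random.Random(seed)
    total_hits = 0
    studies_with_hit = 0
    for _ in range(n_studies):
        hits = sum(rng.random() < alpha for _ in range(tests_per_study))
        total_hits += hits
        studies_with_hit += hits > 0
    return total_hits / n_studies, studies_with_hit / n_studies

mean_hits, share_with_hit = simulate_null_studies()
print(f"average false positives per study: {mean_hits:.2f}")
print(f"share of studies with at least one: {share_with_hit:.2f}")
```

The average should come out near one false positive per study, with a clear majority of studies producing at least one.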

The cardinal misreading

"$p = 0.05$ means there's a $5\%$ chance $H_0$ is true" — wrong. It means: if $H_0$ were true, data this extreme would occur about $5\%$ of the time. The probability that $H_0$ is true given the data depends on how plausible $H_0$ was to start with — which the p-value doesn't know.

Used carefully, hypothesis testing is a precise, useful tool. Used as a rubber stamp — "I got $p < 0.05$, therefore I have discovered a thing" — it's how careers get built on noise.

7. Common pitfalls

The p-value is not $P(H_0 \mid \text{data})$

This is the single most common misreading. The p-value is computed assuming $H_0$; it cannot tell you the probability that $H_0$ is true. Inverting the conditional requires a prior — which is the Bayesian path, not the frequentist one.

"Statistically significant" is not "practically important"

With a large enough sample, even microscopic effects become statistically significant. A drug that lowers blood pressure by $0.1$ mmHg can be $p < 0.001$ if you tested it on a million people — and clinically useless. Always report effect sizes and confidence intervals alongside p-values.
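The blood-pressure example can be put into numbers. The $0.1$ mmHg effect and the sample of one million come from the text; the standard deviation of 15 mmHg and the one-sample design are assumptions added here so the calculation can run.

```python
from math import sqrt
from statistics import NormalDist

# The 0.1 mmHg effect and n come from the text; sigma is an ASSUMED value.
mean_drop = 0.1   # observed mean reduction, mmHg
sigma = 15.0      # assumed standard deviation of individual changes, mmHg
n = 1_000_000

se = sigma / sqrt(n)
z = mean_drop / se
p_value = 2.0 * NormalDist().cdf(-abs(z))   # two-sided tail area
z_95 = NormalDist().inv_cdf(0.975)
ci_low, ci_high = mean_drop - z_95 * se, mean_drop + z_95 * se

print(f"z = {z:.2f}, p = {p_value:.1e}")
print(f"95% CI for the mean drop: ({ci_low:.3f}, {ci_high:.3f}) mmHg")
```

The p-value is far below $0.001$, yet the confidence interval shows the whole plausible range of the effect sits well under $0.2$ mmHg. The interval, not the p-value, is what reveals the effect is tiny.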

$\alpha = 0.05$ is a convention, not a law

Ronald Fisher proposed $0.05$ as a rough rule of thumb in the 1920s, not a sacred threshold. A finding with $p = 0.06$ is not meaningfully different from one with $p = 0.04$. The number is continuous; the decision is binary; the gap between them is artificial.

Multiple testing inflates Type I rate

If you run $20$ independent tests at $\alpha = 0.05$ and every null is true, the probability of at least one false positive is $1 - (1 - 0.05)^{20} \approx 0.64$. Without correction (Bonferroni, Benjamini–Hochberg, or similar), running many tests makes it very likely that you'll find "significant" results that are pure chance. Pre-register your hypotheses, or correct the threshold.
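A minimal sketch of the two corrections named above, written from their textbook definitions rather than taken from any particular library. The function names and the list of p-values are made up for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each H0 whose p-value is below alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure, controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

print(f"FWER for 20 tests: {familywise_error_rate(20):.2f}")
p_values = [0.001, 0.008, 0.020, 0.041, 0.300]  # made-up p-values
print("Bonferroni:        ", bonferroni_reject(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg_reject(p_values))
```

On this made-up list Bonferroni keeps two findings and Benjamini–Hochberg keeps three. The first controls the chance of any false positive, the second the expected share of false positives among the rejections, so the second is less strict.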

8. Worked examples

Each example targets one piece of the machinery. Try to work the problem before reading the answer: the goal is to feel the steps, not to memorize them.

Example 1 · Set up $H_0$ and $H_1$ for a coin-fairness test

You suspect a coin is biased and want to test it. Let $p$ be the true probability of heads.

Null hypothesis. The conservative claim — the coin is fair:

$$ H_0: \; p = 0.5 $$

Alternative hypothesis. You suspect bias but don't know in which direction, so a two-sided alternative:

$$ H_1: \; p \neq 0.5 $$

If you had a specific suspicion — "this coin lands heads too often" — the alternative would instead be one-sided: $H_1: p > 0.5$. Decide which before flipping.

Example 2 · Compute a p-value for a sample-mean test

A factory claims its widgets have mean weight $\mu_0 = 50$ g with standard deviation $\sigma = 4$ g. You sample $n = 64$ widgets and measure $\bar{x} = 51.2$ g. Test $H_0: \mu = 50$ against $H_1: \mu \neq 50$ at $\alpha = 0.05$.

Step 1. Standard error of the sample mean (Central Limit Theorem):

$$ \text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{64}} = 0.5 $$

Step 2. Compute the z-statistic:

$$ z = \frac{\bar{x} - \mu_0}{\text{SE}} = \frac{51.2 - 50}{0.5} = 2.4 $$

Step 3. Two-sided p-value — area in both tails past $|z| = 2.4$:

$$ p = 2 \cdot P(Z > 2.4) \approx 2 \cdot 0.0082 = 0.0164 $$

Step 4. Compare to $\alpha$: $p \approx 0.016 < 0.05$, so reject $H_0$. The mean weight is significantly different from $50$ g.

But also report the effect size. The point estimate is $1.2$ g above target — whether that matters depends on context (food labeling? aerospace tolerances?). Significance is not importance.
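One way to report the estimate together with its uncertainty is a 95% confidence interval built from the same standard error, using the standard normal quantile $z_{0.975} \approx 1.96$ (a value not stated in the example itself):

$$ \bar{x} \pm z_{0.975} \cdot \text{SE} = 51.2 \pm 1.96 \cdot 0.5 = 51.2 \pm 0.98 \quad\Longrightarrow\quad (50.22,\ 52.18) \text{ g} $$

The interval excludes $50$ g, in agreement with rejecting $H_0$ at $\alpha = 0.05$, and it shows that the plausible excess runs from roughly $0.2$ g to $2.2$ g.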

Example 3 · Identify Type I vs Type II errors

Consider a medical screening test for a rare disease. Let $H_0$ be "this patient does not have the disease."

Scenario A. The test flags a healthy patient as having the disease. They go through unnecessary follow-up procedures.

→ Reject $H_0$ when $H_0$ is true. Type I error (false positive).

Scenario B. The test misses an actually sick patient. They go home untreated.

→ Fail to reject $H_0$ when $H_1$ is true. Type II error (false negative).

Which matters more? For screening, a Type II error (missing a real case) is usually far worse than a Type I error (a false alarm that gets cleared up by follow-up testing). The test should be tuned for low $\beta$ — high sensitivity — even at the cost of higher $\alpha$. Different contexts call for different tradeoffs.
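The tradeoff can be made tangible by counting expected errors under two tunings of the same test. Every number below (100,000 patients, 1% prevalence, and both pairs of error rates) is hypothetical, chosen only to show the direction of the effect.

```python
def expected_errors(n_patients, prevalence, alpha, beta):
    """Expected Type I and Type II error counts for a screening test.

    alpha: P(flag | healthy), the false positive rate.
    beta:  P(miss | sick), the false negative rate.
    """
    sick = n_patients * prevalence
    healthy = n_patients - sick
    return healthy * alpha, sick * beta

# Two hypothetical tunings of the same test on 100,000 patients, 1% prevalence.
for label, alpha, beta in (("strict", 0.01, 0.20), ("sensitive", 0.05, 0.02)):
    false_alarms, missed = expected_errors(100_000, 0.01, alpha, beta)
    print(f"{label:>9}: {false_alarms:.0f} false alarms, {missed:.0f} missed cases")
```

Under these assumed rates the sensitive tuning creates several thousand extra false alarms in exchange for far fewer missed cases. Whether that exchange is worth it depends on what follow-up costs and what a missed case costs.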

Example 4 · Interpret a p-value correctly

A study reports: "Patients on the new drug had lower blood pressure than controls ($p = 0.03$)."

What this does mean: If the drug truly had no effect (i.e. $H_0$ were true), you'd see a difference at least this large in about $3\%$ of replicated studies — purely by sampling variability. That's surprising enough, at $\alpha = 0.05$, to reject $H_0$ and conclude the drug had some effect.

What this does not mean:

It does not mean there is a $3\%$ chance the drug has no effect.
It does not mean there is a $97\%$ chance the drug works.
It does not tell you how much the drug lowers blood pressure — for that you need an effect size and a confidence interval.
It does not guarantee a replication will reach $p < 0.05$ again — especially if the original study was underpowered.
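The last point can be quantified with a back-of-the-envelope sketch. It assumes a z-test, an identical replication (same design and sample size), and, optimistically, that the true effect is exactly as large as the one originally observed. The function name `replication_probability` is hypothetical.

```python
from statistics import NormalDist

def replication_probability(p_original, alpha=0.05):
    """Chance that an identical replication reaches two-sided p < alpha in the
    same direction, ASSUMING the true effect equals the originally observed one.
    """
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1.0 - p_original / 2.0)  # |z| behind the original p
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Replication z ~ N(z_observed, 1); it succeeds when z > z_crit.
    return 1.0 - std_normal.cdf(z_crit - z_observed)

print(f"{replication_probability(0.03):.2f}")
```

Even under this generous assumption, a result with $p = 0.03$ has well under a two-in-three chance of reaching significance again. If the original estimate overstated the effect, the chance is lower still.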

Example 5 · Recognize a common misinterpretation

A press release reads: "Our new fertilizer increases crop yield ($p = 0.04$). This means there's only a $4\%$ chance the result is due to chance."

What's wrong? The second sentence is the cardinal misreading. The p-value is not the probability that the result is due to chance, nor the probability that $H_0$ is true.

A correct restatement: "If the fertilizer truly had no effect, we'd expect to see a yield difference this large or larger about $4\%$ of the time, by sampling variability alone. That's small enough that we don't believe the no-effect explanation."

Why the distinction matters. The original phrasing implies a probability about reality ("the result is due to chance"). The correct phrasing makes clear it's a probability about data, conditional on a hypothesis. If the no-effect explanation was a priori very plausible — say, fertilizer effects on this crop have failed every previous trial — even a $p = 0.04$ shouldn't move you very much. The base rate matters, and the p-value doesn't know about it.
