Two-Sample Hypothesis Testing — Statistics & Probability

What you'll leave with

How to recognize a two-sample situation and pose its null and alternative hypotheses.
The crucial difference between independent and paired designs — and why getting it wrong throws away most of your power.
The test-statistic formulas for two means (pooled and unpooled) and two proportions, with the standard-error logic intact.
Why "is there a difference?" is the easy question and "how big a difference?" is the one that actually matters.

1. The two-sample setup

You have two groups. You measured something on each. You want to know whether the two underlying populations differ — not just whether the two sample means happen to differ (they always do, a little, just by chance), but whether the difference is bigger than chance can comfortably explain.

Examples that all fit the pattern:

A clinical trial: blood-pressure change for patients on drug A vs. drug B.
A factory: defect rate from line 1 vs. line 2.
A website: conversion rate for the old checkout vs. the new one (A/B test).
A study: test scores for students taught by method X vs. method Y.

The unknowns are two population parameters: typically the two means $\mu_1, \mu_2$ or the two proportions $p_1, p_2$. The data give you the two corresponding sample statistics: $\bar{x}_1, \bar{x}_2$ or $\hat{p}_1, \hat{p}_2$. The test is built around the difference:

$$ \text{null:} \quad H_0: \mu_1 - \mu_2 = 0 \qquad \text{alt:} \quad H_a: \mu_1 - \mu_2 \neq 0 $$

Equivalently: $H_0$ says the two populations have the same mean; $H_a$ says they don't. The alternative can also be one-sided ($\mu_1 - \mu_2 > 0$ or $< 0$) if the question genuinely runs in one direction — e.g., "does the new drug lower blood pressure?"

The pivot

Every two-sample test reduces to the same recipe: pick an estimator of the difference (almost always $\bar{x}_1 - \bar{x}_2$ or $\hat{p}_1 - \hat{p}_2$), divide by its standard error, and compare to a reference distribution. The only thing that changes across tests is how the standard error is computed.

2. Independent vs. paired — pick the right design

Before any formula, ask: are the two samples independent, or are they paired?

Independent

50 patients randomly given drug A; a different 50 given drug B.
Sales from 30 stores in region 1 vs. 30 different stores in region 2.
Test scores of students in class X vs. unrelated students in class Y.

Paired

The same 50 patients measured before and after a treatment.
Twins, one in each condition.
The same 30 stores' sales last year vs. this year.

The signature of a paired design is a natural matching: every observation in group 1 has exactly one partner in group 2, and the pairing carries information. The most common case is a "before/after" measurement on the same unit, but pairs can also be matched twins, left-eye/right-eye comparisons, two judges rating the same items, and so on.

Why pairing matters

If you measure the same 30 patients before and after a drug, much of the variation between patients (their starting baselines) cancels out when you look at changes. When the before and after values are strongly positively correlated, as they usually are for the same person, the standard error of the average change is much smaller than the standard error of the difference of two independent group means. Treating a paired design as if it were independent throws this gain away: the standard error can come out several times larger than it needs to be, and real effects go undetected.
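To make the size of the gain concrete, here is a minimal Python sketch. It relies on the identity $s_d^2 = s_1^2 + s_2^2 - 2 r s_1 s_2$ for the variance of within-pair differences, where $r$ is the correlation between the paired measurements. The function names and the inputs (an SD of 8 in both columns, $r = 0.95$, $n = 30$) are illustrative assumptions, not data from a real study.

```python
import math

def se_independent(s1, s2, n):
    """SE of the difference of two independent sample means (n per group)."""
    return math.sqrt(s1**2 / n + s2**2 / n)

def se_paired(s1, s2, r, n):
    """SE of the mean within-pair difference, given correlation r between pairs."""
    var_d = s1**2 + s2**2 - 2 * r * s1 * s2
    return math.sqrt(var_d / n)

# Illustrative values only: same spread before and after, strongly correlated.
s1, s2, r, n = 8.0, 8.0, 0.95, 30
ratio = se_independent(s1, s2, n) / se_paired(s1, s2, r, n)
print(f"independent SE = {se_independent(s1, s2, n):.3f}")
print(f"paired SE      = {se_paired(s1, s2, r, n):.3f}")
print(f"ratio          = {ratio:.2f}")
```

With equal SDs the ratio works out to $1/\sqrt{1 - r}$, so the advantage shrinks to nothing as $r$ approaches 0. Pairing pays off only when the matched measurements really move together.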

The decision rule

If you can answer "which observation in group 1 corresponds to this observation in group 2?" — and the correspondence carries information — the design is paired. If the only thing the two groups have in common is the variable you're measuring, the design is independent.

3. Two independent means

The setup: two independent SRSs of sizes $n_1$ and $n_2$, sample means $\bar{x}_1$ and $\bar{x}_2$, sample standard deviations $s_1$ and $s_2$. You want to test $H_0: \mu_1 - \mu_2 = 0$.

The estimator of the difference is $\bar{x}_1 - \bar{x}_2$. Because the two samples are independent, the variance of the difference is the sum of the variances:

$$ \text{Var}(\bar{x}_1 - \bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} $$

That single fact — variances add when the things are independent — is the engine of the whole test. Take the square root, swap in the sample SDs for the unknown $\sigma$s, and you have your standard error.

Unpooled (Welch's) t-test

Test statistic (Welch)

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

Compare to a $t$ distribution with the Welch–Satterthwaite degrees of freedom:

$$ \nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} $$

The Welch formula doesn't assume the two populations have the same variance. The cost is the ugly degrees-of-freedom expression — but you don't compute it by hand; any statistics package returns it. The benefit is that the test is robust to unequal variances and unequal sample sizes, which is the rule rather than the exception in real data.
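As a rough illustration of the two formulas above, the following Python sketch computes Welch's $t$ and the Welch–Satterthwaite degrees of freedom from summary statistics. The function name `welch_t` and the input numbers are made up for the example; they do not come from any dataset in this guide.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    v1 = sd1**2 / n1   # estimated variance of the first sample mean
    v2 = sd2**2 / n2   # estimated variance of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Made-up summary statistics, chosen only to exercise the function.
t, df = welch_t(mean1=50.0, sd1=10.0, n1=25, mean2=44.0, sd2=4.0, n2=16)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The degrees of freedom always land between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$. The sketch stops short of a p-value because the Python standard library has no $t$ distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs Welch's test on raw data and returns the p-value as well.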

Pooled t-test

If you are willing to assume $\sigma_1 = \sigma_2 = \sigma$ — sometimes plausible by design, e.g., two parallel arms of a tightly controlled trial — you can pool the two sample variances into a single, more precise estimate:

$$ s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2} $$

Test statistic (pooled)

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \qquad \text{df} = n_1 + n_2 - 2 $$

The pooled test gives slightly tighter confidence intervals when the equal-variance assumption holds. When it doesn't, the pooled test can have actual Type I error rates that differ substantially from the nominal $\alpha$.
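The warning about unequal variances is easier to see with numbers. The sketch below compares the pooled and unpooled standard errors in a deliberately lopsided, hypothetical case: a small, noisy group against a large, quiet one. The function names and inputs are assumptions made for illustration.

```python
import math

def se_unpooled(sd1, n1, sd2, n2):
    """Welch-style standard error: each group keeps its own variance."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

def se_pooled(sd1, n1, sd2, n2):
    """Pooled standard error: assumes both groups share one variance."""
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical case: the small group is the noisy one.
unpooled = se_unpooled(sd1=10.0, n1=10, sd2=2.0, n2=100)
pooled = se_pooled(sd1=10.0, n1=10, sd2=2.0, n2=100)
print(f"unpooled SE = {unpooled:.2f}, pooled SE = {pooled:.2f}")
```

Here the pooled variance is dominated by the large, quiet group, so the pooled standard error comes out far too small and the $t$-statistic far too large. That is the mechanism behind the distorted Type I error rate. With equal SDs and equal sample sizes the two functions return the same value.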

Default to Welch

Modern statistical practice is: use Welch unless you have a strong, design-based reason to believe the variances are equal. The robustness gain costs almost nothing in power when variances really are equal, and it saves you when they aren't.

4. Paired data: a one-sample test in disguise

When the data are paired, the right move is to collapse the two columns into one. For each pair $i$, compute the difference:

$$ d_i = x_{1i} - x_{2i} $$

Now you have a single column of $n$ differences. The hypothesis $H_0: \mu_1 - \mu_2 = 0$ becomes $H_0: \mu_d = 0$, and the entire problem reduces to a one-sample t-test on the differences.

Paired t-test

$$ t = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad \text{df} = n - 1 $$

where $\bar{d}$ is the mean of the differences and $s_d$ is their standard deviation.

That's it — no new machinery. The reason this works so well is that the pairing has already done the heavy lifting: between-pair variation (the baseline differences from patient to patient) is gone, leaving only within-pair variation (each patient's change). $s_d$ is typically much smaller than either $s_1$ or $s_2$ individually.

The picture: when you look at the two columns independently, the spread is wide and the means look similar. When you draw lines between the matched pairs, almost every line slopes the same direction by the same amount — a strong, easy-to-detect effect that the independent test would have buried in noise.

5. Two proportions

For categorical outcomes — converted vs. didn't, voted vs. didn't, defective vs. not — the parameter is a proportion $p$ rather than a mean. Two independent samples of sizes $n_1$ and $n_2$, with sample proportions $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$. The estimator is the difference $\hat{p}_1 - \hat{p}_2$ and the test is again $z = (\text{estimate} - 0) / \text{SE}$.

One subtle move: under $H_0: p_1 = p_2$, both samples are drawn from a population with a single, unknown common proportion. The most efficient estimate of that common $p$ pools both samples:

$$ \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} $$

and we use this pooled $\hat{p}$ — not $\hat{p}_1$ or $\hat{p}_2$ separately — in the standard error:

Two-proportion z-test

$$ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p}) \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} $$

Compare to a standard normal distribution. (There are no degrees of freedom here. For proportions, the $z$-test is the standard tool when the expected counts $n_1 \hat{p}$, $n_1(1-\hat{p})$, $n_2 \hat{p}$, and $n_2(1-\hat{p})$ are all at least 10; some textbooks relax the threshold to 5.)

Why pool here but not for the means test? Because the null hypothesis specifically equates the two proportions, the most efficient estimator of the common $p$ uses every data point. For means, $H_0$ equates the two means but says nothing about the two variances — so pooling the variances is an extra assumption you might not want to make.
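The pooled two-proportion test can be sketched in a few lines of Python using only the standard library. The function name `two_proportion_z` and the counts are hypothetical; `statistics.NormalDist` supplies the standard normal CDF for the two-sided p-value.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 = p2 using the pooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # common proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z(x1=150, n1=1000, x2=120, n2=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Swapping the two groups flips the sign of $z$ but leaves the two-sided p-value unchanged, which is a quick sanity check on any implementation.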

For confidence intervals, don't pool

When you're computing a CI for $p_1 - p_2$ (rather than testing if it's zero), use the unpooled standard error $\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$. Pooling only makes sense under the null; once you're estimating the difference, you no longer assume the two proportions are equal, so each sample contributes its own variance estimate.

6. Effect size — beyond "significant"

A small p-value means: assuming the null, a difference at least as large as the one observed would be rare. It does not mean the difference is big. With a large enough $n$, even a trivially small non-zero gap between the populations will eventually clear any threshold of significance. The first thing to report alongside a p-value is the magnitude of the effect.

For two means, the canonical effect size is Cohen's $d$:

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} $$

It's the difference in means expressed in standard-deviation units — invariant to the scale of the measurement. Rough verbal benchmarks (Cohen's own, not laws of nature): $d \approx 0.2$ small, $0.5$ medium, $0.8$ large. For proportions, the analog is the difference $\hat{p}_1 - \hat{p}_2$ itself or the odds ratio $\frac{\hat{p}_1/(1-\hat{p}_1)}{\hat{p}_2/(1-\hat{p}_2)}$.
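A short sketch of both effect-size measures, under the assumption that only summary statistics are available. The function names and inputs are hypothetical; the first example rescales the same data by a factor of 10 to show that $d$ does not depend on the unit of measurement.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in means divided by the pooled standard deviation."""
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(sp2)

def odds_ratio(p1, p2):
    """Odds of the outcome in group 1 relative to group 2."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Hypothetical inputs: the same gap expressed in two different units.
d_points = cohens_d(53.0, 6.0, 40, 50.0, 6.0, 40)
d_scaled = cohens_d(530.0, 60.0, 40, 500.0, 60.0, 40)
print(f"d = {d_points:.2f} in either unit: {math.isclose(d_points, d_scaled)}")
print(f"odds ratio for 0.10 vs 0.05 = {odds_ratio(0.10, 0.05):.2f}")
```

Note that the odds ratio for 0.10 vs. 0.05 is a little above 2, not exactly 2: odds and proportions only agree closely when both proportions are small.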

Report both, always

A p-value alone answers "could this be chance?" An effect size alone answers "is the difference big enough to care about?" Either by itself is half the story. The two together — plus a confidence interval for the difference — are what a competent analysis looks like.

7. When assumptions fail

The t- and z-tests above lean on two assumptions: (a) the samples are SRSs from their populations, and (b) the sampling distribution of the relevant statistic is approximately normal. Thanks to the Central Limit Theorem, (b) becomes very forgiving for large samples — but for small $n$ with skewed or heavy-tailed data, normality matters.

When the normality assumption is suspect, there are alternatives that drop it. Most of them work on the ranks of the observations rather than the values themselves and so are far less sensitive to outliers and skew.

Parametric test	Non-parametric alternative	What the alternative tests
Independent two-sample t-test	Mann–Whitney U (Wilcoxon rank-sum)	Whether one distribution tends to give larger values than the other
Paired t-test	Wilcoxon signed-rank	Whether the distribution of differences is symmetric around zero
Two-proportion z-test	Fisher's exact test	Same hypothesis, exact computation — handy when expected cell counts are small

These tests trade a small loss of power (when normality really did hold) for robustness against violations. For small samples with messy data, the non-parametric versions are usually the safer call. The details of each belong in their own discussion; here it's enough to know they exist and when to reach for them.
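To see why rank-based methods shrug off outliers, here is a minimal sketch of the Mann–Whitney U statistic computed by direct counting. The samples are hypothetical, and the helper `mann_whitney_u` is written for illustration only: it returns the statistic, not a p-value.

```python
from statistics import mean

def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties counted as 1/2."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

# Hypothetical small samples; the second version adds one extreme outlier.
group_a = [12, 15, 14, 18, 16]
group_b = [11, 13, 10, 14, 12]
group_a_outlier = [12, 15, 14, 180, 16]

u_clean = mann_whitney_u(group_a, group_b)
u_outlier = mann_whitney_u(group_a_outlier, group_b)
gap_clean = mean(group_a) - mean(group_b)
gap_outlier = mean(group_a_outlier) - mean(group_b)
print(f"U = {u_clean} vs. {u_outlier}")
print(f"mean gap = {gap_clean} vs. {gap_outlier}")
```

Replacing 18 with 180 leaves U untouched, because the value was already larger than everything in the other group, while the difference in means changes dramatically. For an actual test, a statistics package such as SciPy (`scipy.stats.mannwhitneyu`) handles the p-value.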

8. Common pitfalls

Treating paired data as independent

This is the most damaging mistake in two-sample testing. Before-and-after measurements on the same subject are paired by design; analyzing them as two independent groups inflates the standard error, lowers the t-statistic, and hides real effects. Always check: does each row in group 1 correspond to a specific row in group 2?

Pooling variance without thinking

The pooled t-test assumes $\sigma_1 = \sigma_2$. When that's false and sample sizes are very unequal, the test's Type I error rate can be off by a factor of 2 or more. Default to Welch's test; use the pooled version only when you have a strong reason to believe the variances are equal.

Comparing variances when you meant means

"Group A is more variable than group B" is a different hypothesis from "Group A has a different mean than group B." Don't run a t-test and then quote the SDs as if you'd tested for equal variances. If the spread is the question, use an F-test or Levene's test instead.

Small $n$ + heavy tails

The t-test is fairly robust to mild non-normality at moderate $n$, but it can fail badly for small samples ($n < 15$) drawn from skewed or heavy-tailed populations. Plot the data first. If it's clearly non-normal and $n$ is small, switch to the rank-based alternative.

Equating "no significant difference" with "no difference"

Failing to reject $H_0$ does not prove the two populations are the same. It just means your data didn't contain enough evidence to rule out equality. Always pair the conclusion with a confidence interval for $\mu_1 - \mu_2$: if the interval is $(-0.1, +0.1)$ on a scale where a gap of $0.1$ is negligible, that's strong evidence of "essentially no difference"; if it's $(-5, +5)$, you don't really know.

9. Worked examples

Work through each one before reading the solution. The number at the end matters less than the moves you made to get there.

Example 1 · Drug A vs. drug B (independent means)

A trial randomly assigns 40 patients to drug A and 35 to drug B. The outcome (systolic blood pressure reduction, in mmHg) is summarized:

Drug A: $n_1 = 40$, $\bar{x}_1 = 12.4$, $s_1 = 5.1$
Drug B: $n_2 = 35$, $\bar{x}_2 = 9.8$, $s_2 = 4.3$

Test $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$ at $\alpha = 0.05$, using Welch's t.

Step 1 · Standard error.

$$ \text{SE} = \sqrt{\frac{5.1^2}{40} + \frac{4.3^2}{35}} = \sqrt{0.650 + 0.528} = \sqrt{1.178} \approx 1.085 $$

Step 2 · Test statistic.

$$ t = \frac{12.4 - 9.8}{1.085} = \frac{2.6}{1.085} \approx 2.40 $$

Step 3 · Degrees of freedom (Welch–Satterthwaite).

$$ \nu \approx \frac{(0.650 + 0.528)^2}{(0.650)^2/39 + (0.528)^2/34} \approx \frac{1.387}{0.0108 + 0.0082} \approx 73 $$

Step 4 · Decision. A two-tailed t-test with $\nu = 73$ gives a p-value of about $0.019$, less than $\alpha = 0.05$. Reject $H_0$.

Step 5 · Effect size. Pooled SD $\approx 4.75$, so $d \approx 2.6 / 4.75 \approx 0.55$, a "medium" effect by Cohen's benchmarks. Drug A reduced blood pressure by about 2.6 mmHg more than drug B on average. The result is statistically significant; whether a 2.6 mmHg gap is clinically meaningful is a judgment for clinicians, not something the test can decide.

Example 2 · Before/after weight loss (paired)

Twelve subjects are weighed at the start and end of a 12-week program. The change for each subject (in kg) is: $-3.2, -1.8, -4.1, +0.5, -2.7, -3.9, -1.5, -2.0, +1.2, -4.4, -3.1, -2.5$.

Step 1 · Reduce to one sample of differences. Compute the mean and SD of the 12 changes directly.

$$ \bar{d} \approx -2.29 \quad,\quad s_d \approx 1.73 $$

Step 2 · One-sample t-statistic.

$$ t = \frac{-2.29}{1.73 / \sqrt{12}} = \frac{-2.29}{0.499} \approx -4.59 $$

Step 3 · Compare to $t_{11}$. The two-tailed p-value is below $0.001$. Reject $H_0: \mu_d = 0$. The program produced a real, substantial reduction.

What would go wrong if you treated this as independent? You would compute the two sample means (start ≈ 80, end ≈ 78) and their SDs (≈ 8 kg each — most of the variation is between people). The independent t-statistic would be small and the difference would look like noise. Pairing rescued a clear effect from a misleading-looking dataset.
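The arithmetic in Steps 1 and 2 can be recomputed from the twelve listed changes with the Python standard library. This is a plain sketch: it produces the summary statistics and the $t$-statistic, and leaves the p-value to a $t$ table or a statistics package.

```python
import math
from statistics import mean, stdev

# The twelve weight changes (kg) listed in Example 2.
changes = [-3.2, -1.8, -4.1, 0.5, -2.7, -3.9, -1.5, -2.0, 1.2, -4.4, -3.1, -2.5]

n = len(changes)
d_bar = mean(changes)          # mean difference
s_d = stdev(changes)           # sample SD (n - 1 in the denominator)
se = s_d / math.sqrt(n)        # standard error of the mean difference
t = d_bar / se                 # paired t-statistic, df = n - 1

print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, SE = {se:.3f}, t = {t:.2f}, df = {n - 1}")
```

Because the paired test is just a one-sample test on the differences, nothing in the code needs to know that the numbers came from two measurements per subject.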

Example 3 · A/B test on conversion rate (two proportions)

An e-commerce site randomly routes visitors to one of two checkout layouts. Old layout: $n_1 = 5{,}000$ visitors, $x_1 = 410$ conversions. New layout: $n_2 = 5{,}200$ visitors, $x_2 = 471$ conversions.

Step 1 · Sample proportions.

$$ \hat{p}_1 = 410 / 5000 = 0.0820 \quad,\quad \hat{p}_2 = 471 / 5200 \approx 0.0906 $$

Step 2 · Pooled estimate under $H_0$.

$$ \hat{p} = \frac{410 + 471}{5000 + 5200} = \frac{881}{10200} \approx 0.0864 $$

Step 3 · Standard error and test statistic.

$$ \text{SE} = \sqrt{0.0864 \cdot 0.9136 \cdot \left(\frac{1}{5000} + \frac{1}{5200}\right)} \approx \sqrt{3.10 \times 10^{-5}} \approx 0.00557 $$ $$ z = \frac{0.0906 - 0.0820}{0.00557} \approx 1.54 $$

Step 4 · Decision. Two-tailed p-value ≈ $0.123$. At $\alpha = 0.05$, do not reject $H_0$ — the observed lift is not large enough to be confident the new layout is genuinely better.

Step 5 · The honest interpretation. A 95% CI for $p_2 - p_1$ (using the unpooled SE for an interval) is roughly $(-0.0023, +0.0195)$, a range from "very slightly worse" to "noticeably better." The data are consistent with a meaningful improvement, just not enough to clear the significance bar. Most A/B-testing teams would run more traffic before shipping the change.
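The interval in Step 5 can be reproduced with a short Python sketch. It assumes the simple normal-approximation (Wald) interval with the unpooled standard error; the function name `diff_proportion_ci` is made up for this example, and the counts are the ones given in Example 3.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(x1, n1, x2, n2, level=0.95):
    """Normal-approximation (Wald) CI for p2 - p1 using the unpooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    diff = p2 - p1
    return diff - z_star * se, diff + z_star * se

# Counts from Example 3: old layout 410/5000, new layout 471/5200.
low, high = diff_proportion_ci(410, 5000, 471, 5200)
print(f"95% CI for p2 - p1: ({low:+.4f}, {high:+.4f})")
```

The interval straddles zero, which matches the decision in Step 4: a two-sided test at $\alpha = 0.05$ and a 95% interval usually agree, although here the test uses the pooled SE and the interval the unpooled one, so borderline cases can differ slightly.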

Example 4 · Choosing the right test (decision flow)

For each scenario, name the right test.

(a) Test scores for 60 students taught by method A vs. 55 students taught by method B. → Welch's two-sample t-test. Independent groups, continuous outcome, possibly unequal variances.

(b) Each of 40 tasters rates two coffees, A and B, on a 100-point scale. → Paired t-test on the within-taster differences. The same person rated both — that's a pair.

(c) Click-through rate for 12,000 users shown ad design X vs. 12,500 shown ad design Y. → Two-proportion z-test. Independent groups, binary outcome, large $n$.

(d) Pain scores from 8 patients before and after surgery, with the distributions clearly skewed. → Wilcoxon signed-rank. Paired, but $n$ is small and the data aren't normal — the rank-based version is the safe call.
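The reasoning behind these four answers can be written down as a small decision function. This is only a sketch of the flow used in this guide, not a general-purpose test selector; the function name, the parameter names, and the returned labels are all assumptions made for illustration.

```python
def choose_test(paired, outcome, approx_normal_ok=True):
    """Suggest a two-sample test following the flow in this guide.

    paired:           True if each observation has a matched partner in the other group
    outcome:          "continuous" or "binary"
    approx_normal_ok: False when n is small and the data are clearly skewed or
                      heavy-tailed (or, for binary outcomes, expected counts are small)
    """
    if outcome == "binary":
        if paired:
            raise ValueError("paired binary outcomes are not covered in this guide")
        return "two-proportion z-test" if approx_normal_ok else "Fisher's exact test"
    if outcome != "continuous":
        raise ValueError("outcome must be 'continuous' or 'binary'")
    if paired:
        return "paired t-test" if approx_normal_ok else "Wilcoxon signed-rank"
    return "Welch's t-test" if approx_normal_ok else "Mann-Whitney U"

# The four scenarios from Example 4.
scenarios = {
    "a": choose_test(paired=False, outcome="continuous"),
    "b": choose_test(paired=True, outcome="continuous"),
    "c": choose_test(paired=False, outcome="binary"),
    "d": choose_test(paired=True, outcome="continuous", approx_normal_ok=False),
}
for label, test in scenarios.items():
    print(f"({label}) {test}")
```

The order of the questions mirrors the text: first the type of outcome, then paired vs. independent, and only then whether the normal approximation is trustworthy.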

Sources & further reading

The formulas and conventions above are standard; the primary sources are the right place to go when a specific claim needs a citation, and the deeper links are where to turn when this page has done its job.

Hypothesis Testing with Two Samples (Textbook). OpenStax · Introductory Statistics 2e, Chapter 10
Peer-reviewed, openly licensed chapter that covers exactly the territory here: independent vs. matched-pairs, two means with both pooled and unpooled variance, and two proportions. The canonical first source.

Hypothesis Testing for Two Means and Two Proportions (Textbook). OpenStax · Introductory Statistics 2e, §10.5
The specific section with side-by-side worked examples for both kinds of two-sample tests. Useful if you want more numerical practice in the same notation used above.

Significance tests and confidence intervals (two samples) (Tutorial). Khan Academy · AP Statistics
Short videos and practice problems organized by test type. Best if you want to drill the mechanics: choosing the right test, computing the statistic, reading off the p-value.

Two-Sample t-Test for Equal Means (Reference). NIST/SEMATECH e-Handbook of Statistical Methods
Formal reference treatment with explicit formulas for both the pooled and Welch versions, plus the assumptions each one relies on. The go-to source when you want a definition stated in the careful language of metrology.

Welch's t-test (Encyclopedia). Wikipedia
Concise account of the unequal-variance test, its degrees-of-freedom approximation, and the history of why it's preferred over the pooled version in modern practice.

Mann–Whitney U test (Encyclopedia). Wikipedia
The non-parametric alternative to the independent two-sample t-test. Useful when the normality assumption is dubious, and a frequent choice for small, skewed samples.