
ANOVA — Analysis of Variance

Comparing three or more group means at once. You could run a t-test between every pair, but you'd inflate the false-positive rate. ANOVA does the comparison in one shot, by asking a clever indirect question: is the variation between groups bigger than the variation within them?

What you'll leave with

  • Why running many t-tests is a trap — and exactly how the false-positive rate inflates.
  • What the F-distribution is and why ANOVA's test statistic lives on it.
  • The variance-partition identity: $SST = SSB + SSW$, and what each term means.
  • How to assemble an ANOVA table from raw group data, end to end.
  • What a significant F tells you — and, importantly, what it does not.

1. Why not just do many t-tests

Say you have $k = 4$ groups and want to know whether any of their means differ. The naive plan is to run a two-sample t-test for every pair. With four groups, that's $\binom{4}{2} = 6$ tests. With six groups, it jumps to $\binom{6}{2} = 15$.

Each test carries its own $\alpha = 0.05$ chance of a false positive. If the tests were independent (they aren't quite, but the issue is real), the probability of at least one false positive across $m$ tests is

$$ 1 - (1 - 0.05)^m $$

For $m = 6$: about $26\%$. For $m = 15$: about $54\%$. By the time you have six groups, you're more likely than not to find at least one "significant" difference just by chance — even if every group has the same mean.
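
To make the inflation concrete, here is a minimal Python sketch that evaluates the formula above for a few group counts. The function name `familywise_error` is illustrative, and the calculation keeps the independence assumption from the text, so treat the outputs as approximations.

```python
from math import comb

def familywise_error(k_groups: int, alpha: float = 0.05) -> tuple[int, float]:
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes the tests are independent. Pairwise t-tests on shared groups
    are not quite independent, so this is an approximation.
    """
    m = comb(k_groups, 2)
    return m, 1 - (1 - alpha) ** m

for k in (3, 4, 5, 6):
    m, fwer = familywise_error(k)
    print(f"k={k}: {m} pairwise tests, family-wise error ~ {fwer:.3f}")
```

The number of tests grows quadratically in the number of groups, which is why the error rate climbs so quickly.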

Family-wise error rate

The probability of making at least one Type I error across a family of tests is the family-wise error rate. Running many t-tests inflates it dramatically. ANOVA controls it by giving you one omnibus test at level $\alpha$, regardless of how many groups you have.

ANOVA replaces the cluster of pairwise tests with a single question: is there any difference among the group means at all? That's a yes/no answer at one significance level, paid for with one test.

2. The F-distribution

ANOVA's test statistic is a ratio of two estimated variances. To know how surprising any particular ratio is, you need the distribution of such ratios under the null hypothesis — which is the F-distribution.

F-distribution

If $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ are independent chi-square random variables, then

$$ F = \frac{U / d_1}{V / d_2} $$

has an F-distribution with $d_1$ numerator and $d_2$ denominator degrees of freedom. It is right-skewed, supported on $[0, \infty)$, and has mean $d_2 / (d_2 - 2)$ for $d_2 > 2$, which is close to 1 when $d_2$ is large.

What the curve looks like

[Figure: density curves for F(2, 20), F(5, 20), and F(10, 20). Horizontal axis: F value, 0 to 5. Vertical axis ticks: 0 to 0.75.]

Three features that matter. The distribution is non-negative (it's a ratio of non-negative quantities). It's right-skewed with a long upper tail. And it's centered near 1 — under the null, the two variances should be roughly equal, so their ratio should hover around 1. A big F-value (well into the right tail) is the signal that something is up.
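
These features can be checked by simulation. The sketch below builds F-values directly from the definition, as a ratio of two scaled chi-square draws, using only Python's standard library. The helper name `sample_f`, the seed, and the choice of $F(5, 20)$ are assumptions made for illustration; it relies on the fact that a chi-square variable with $d$ degrees of freedom is a gamma variable with shape $d/2$ and scale 2.

```python
import random

def sample_f(d1: int, d2: int, rng: random.Random) -> float:
    """Draw one F(d1, d2) value as a ratio of scaled chi-square draws.

    A chi-square with d degrees of freedom is Gamma(shape=d/2, scale=2).
    """
    u = rng.gammavariate(d1 / 2, 2.0)
    v = rng.gammavariate(d2 / 2, 2.0)
    return (u / d1) / (v / d2)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_f(5, 20, rng) for _ in range(20_000)]

mean_f = sum(draws) / len(draws)
share_above_3 = sum(x > 3 for x in draws) / len(draws)
print(f"smallest draw = {min(draws):.3f}, mean = {mean_f:.3f}")
print(f"share of draws above 3: {share_above_3:.3f}")
```

The draws should all be non-negative, average a little above 1, and only rarely exceed 3, which is the right-tail behavior the test relies on.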

3. Variance partitioning: SST = SSB + SSW

Suppose you have $k$ groups, with $n_i$ observations in group $i$, and total sample size $N = \sum n_i$. Let $\bar x_i$ denote the group $i$ sample mean and $\bar x$ the overall (grand) mean.

The total sum of squares measures how much each observation deviates from the grand mean:

$$ SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar x)^2 $$

It can be split exactly into two pieces. The between-groups sum of squares measures how much the group means deviate from the grand mean, weighted by group size:

$$ SSB = \sum_{i=1}^{k} n_i (\bar x_i - \bar x)^2 $$

The within-groups sum of squares measures how much each observation deviates from its own group's mean:

$$ SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2 $$

The remarkable identity is that these add up to the total, with no remainder:

$$ \boxed{\; SST = SSB + SSW \;} $$

All the variation in the data either lives between the group means (signal — different groups behave differently) or within the groups (noise — random fluctuation around each group's center). ANOVA's whole insight is to look at the ratio.
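
One way to see why the split is exact, using only the definitions above: write each deviation from the grand mean as a within-group part plus a between-group part,

$$ x_{ij} - \bar x = (x_{ij} - \bar x_i) + (\bar x_i - \bar x) $$

then square both sides and sum over all observations:

$$ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar x)^2 = \underbrace{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2}_{SSW} + \underbrace{\sum_{i=1}^{k} n_i (\bar x_i - \bar x)^2}_{SSB} + 2 \sum_{i=1}^{k} (\bar x_i - \bar x) \sum_{j=1}^{n_i} (x_{ij} - \bar x_i) $$

The cross term vanishes because deviations from a group's own mean sum to zero, $\sum_{j=1}^{n_i} (x_{ij} - \bar x_i) = 0$ for every group $i$. What remains is $SST = SSW + SSB$.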

What it looks like

[Figure: observations for Group 1, Group 2, and Group 3, with each group mean and the grand mean marked. Deviations are labeled as between (SSB) and within (SSW).]

4. Mean squares and the F-statistic

Sums of squares depend on how many things you summed. To compare them, scale each by its degrees of freedom — turning sums into average squared deviations, called mean squares:

$$ MSB = \frac{SSB}{k - 1}, \qquad MSW = \frac{SSW}{N - k} $$

The numerator df is $k - 1$ because there are $k$ group means but they're constrained to be consistent with the grand mean (so $k - 1$ are free). The denominator df is $N - k$ because each observation contributes one piece, minus one constraint per group (each group mean).

The F-statistic is just the ratio:

$$ F = \frac{MSB}{MSW} $$

Under $H_0$ (all group means equal), $MSB$ and $MSW$ are both unbiased estimators of the same within-group variance $\sigma^2$, so their ratio sits near 1. Under $H_a$ (group means differ), $MSB$ is inflated by the genuine differences between groups while $MSW$ stays put — pushing the ratio into the right tail of $F_{k-1,\,N-k}$.

Mental model

$MSW$ is the "noise floor": how much variation you'd expect from random fluctuation alone, estimated from each group's own scatter. $MSB$ is "signal + noise": how much variation appears between the group means. When the ratio is large (well above 1), you have evidence of a real effect; how large counts as "large" is what the F-distribution tells you.
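
The pipeline from sums of squares to $F$ is short enough to sketch in plain Python. The function name `one_way_anova` and the small dataset are hypothetical, chosen only so the numbers are easy to check by hand; the formulas are the ones defined above.

```python
def one_way_anova(groups: list[list[float]]) -> dict[str, float]:
    """Compute sums of squares, mean squares, and F for a one-way layout."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups: squared gap of each group mean from the grand mean,
    # weighted by group size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within groups: squared gap of each observation from its own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return {"SSB": ssb, "SSW": ssw, "MSB": msb, "MSW": msw, "F": msb / msw}

# Hypothetical data: three groups of unequal size.
data = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0, 10.0], [1.0, 2.0, 3.0]]
result = one_way_anova(data)
print({name: round(value, 3) for name, value in result.items()})
```

Because the group sizes differ here, the grand mean is computed from all observations rather than by averaging the group means.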

5. The ANOVA table

Every one-way ANOVA result is conventionally laid out in the same five-column table. Memorize the structure once and you can read any ANOVA output.

| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between groups | $SSB$ | $k - 1$ | $MSB = SSB/(k-1)$ | $F = MSB/MSW$ |
| Within groups | $SSW$ | $N - k$ | $MSW = SSW/(N-k)$ | |
| Total | $SST$ | $N - 1$ | | |

Check: the SS column adds to $SST$ (that is, $SSB + SSW = SST$), and the df column adds to $N - 1$ (since $(k-1) + (N-k) = N - 1$). Both checks must pass for the arithmetic to be right.
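
As a rough sketch of how the table fills in from just two sums of squares and the two counts, the Python below builds the rows and prints them. The function name `anova_table` is an assumption for illustration, and the inputs reuse the summary numbers from Example 2 further down.

```python
def anova_table(ssb: float, ssw: float, k: int, n_total: int) -> list[tuple]:
    """Return table rows as (source, SS, df, MS, F); None marks an empty cell."""
    df_between, df_within = k - 1, n_total - k
    msb, msw = ssb / df_between, ssw / df_within
    return [
        ("Between groups", ssb, df_between, msb, msb / msw),
        ("Within groups", ssw, df_within, msw, None),
        ("Total", ssb + ssw, df_between + df_within, None, None),
    ]

rows = anova_table(ssb=240.0, ssw=320.0, k=4, n_total=20)
print(f"{'Source':<16}{'SS':>8}{'df':>5}{'MS':>8}{'F':>7}")
for source, ss, df, ms, f in rows:
    ms_text = "" if ms is None else f"{ms:.2f}"
    f_text = "" if f is None else f"{f:.2f}"
    print(f"{source:<16}{ss:>8.2f}{df:>5}{ms_text:>8}{f_text:>7}")
```

The Total row is built by adding the two rows above it, so the two checks hold by construction when the table is generated this way; they earn their keep when you are auditing a table someone else produced.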

6. Assumptions

ANOVA's distribution theory rests on three assumptions. Violations don't always invalidate the test, but knowing when you've stretched the assumptions tells you when to trust the p-value.

  1. Independence. Observations within and across groups are independent. Repeated measurements on the same subject violate this — use repeated-measures ANOVA instead.
  2. Normality within groups. The data in each group is approximately normally distributed. ANOVA is fairly robust to moderate skew, especially with large groups, but heavily skewed data calls for a non-parametric alternative like the Kruskal–Wallis test.
  3. Equal variances (homogeneity of variance, or homoscedasticity). All groups share roughly the same variance. Check with a Levene test or Bartlett test; if it fails, use Welch's ANOVA instead, which doesn't require equal variances.

7. After a significant F: post-hoc tests

A significant F tells you "at least one pair of group means differs." It doesn't tell you which pair. To find that, you run a post-hoc (after-the-fact) test that compares pairs while controlling the family-wise error rate.

  • Tukey's HSD (Honestly Significant Difference). The standard choice for all pairwise comparisons when group sizes are roughly equal. Controls family-wise error at exactly $\alpha$ when group sizes are equal; with unequal sizes (the Tukey–Kramer variant) it is slightly conservative.
  • Bonferroni correction. The simplest approach: divide your significance level by the number of comparisons. Easy to defend, but conservative, and it loses power when there are many groups.
  • Scheffé's method. The most flexible, since it allows any linear contrast among means rather than only pairwise comparisons, at the cost of being the most conservative when you only need pairwise results.

Order of operations

The convention is omnibus first, then post-hoc. If the overall F is not significant, you stop — running pairwise tests after a non-significant F amounts to fishing through a noise distribution.
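
Of the post-hoc options listed above, the Bonferroni correction is the one simple enough to sketch in a few lines. The p-values below are hypothetical, invented only to show the mechanics, and `bonferroni_alpha` is an illustrative name.

```python
from math import comb

def bonferroni_alpha(k_groups: int, alpha: float = 0.05) -> float:
    """Per-comparison significance level for all pairwise comparisons."""
    return alpha / comb(k_groups, 2)

# Hypothetical p-values for the three pairwise comparisons among 3 groups.
p_values = {("A", "B"): 0.004, ("A", "C"): 0.030, ("B", "C"): 0.0001}

threshold = bonferroni_alpha(3)
significant = [pair for pair, p in p_values.items() if p < threshold]
print(f"per-comparison alpha = {threshold:.4f}")
print("pairs flagged:", significant)
```

Note that the A versus C comparison would pass an uncorrected $0.05$ threshold but not the corrected one, which is the conservatism the list above describes.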

8. One-way vs two-way ANOVA

Everything above is one-way ANOVA: a single categorical factor (e.g., fertilizer type) with several levels (A, B, C, D), and a continuous response.

Two-way ANOVA adds a second categorical factor. For example: fertilizer type × soil type. Now you can ask three questions in the same analysis:

  • Does fertilizer type affect yield? (main effect of factor A)
  • Does soil type affect yield? (main effect of factor B)
  • Does the effect of fertilizer depend on soil type? (interaction effect)

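The variance partition extends: $SST = SSA + SSB + SS_{AB} + SSW$. Note that here $SSA$ and $SSB$ are the sums of squares for factors A and B, so $SSB$ no longer means the between-groups sum of squares from the one-way case. The mechanics are the same (compute mean squares, take ratios, get F-statistics), but the bookkeeping is heavier. For the rest of this topic we stay with one-way.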

9. Common pitfalls

Treating ANOVA as locating differences

A significant F means "the means aren't all equal" and nothing more. It does not identify which groups differ. Post-hoc tests do. Quoting a significant F and then claiming "group A is different from group C" without a post-hoc test is common reviewer bait.

Skipping assumption checks

Unequal variances or strong non-normality can wreck the test. ANOVA's robustness is real but limited — at minimum, look at the group standard deviations side-by-side. If the largest is more than twice the smallest, equal-variance ANOVA is questionable.
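
The side-by-side look at standard deviations takes only a few lines. This is a minimal sketch of the rule of thumb above, not a substitute for a formal Levene or Bartlett test; the function name `sd_ratio` and the data are hypothetical.

```python
from statistics import stdev

def sd_ratio(groups: list[list[float]]) -> float:
    """Ratio of the largest to the smallest group sample standard deviation."""
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds)

# Hypothetical data: the third group is far more spread out than the others.
groups = [
    [10.0, 12.0, 11.0, 13.0],
    [20.0, 21.0, 19.0, 22.0],
    [5.0, 15.0, 25.0, 35.0],
]
ratio = sd_ratio(groups)
print(f"largest SD / smallest SD = {ratio:.2f}")
if ratio > 2:
    print("Spread differs a lot; equal-variance ANOVA is questionable here.")
```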

Confusing one-way with two-way

If your experiment crosses two factors (say drug × dose), running separate one-way ANOVAs on each factor misses the interaction — the most interesting effect in many designs. Use the design that matches the data collection.

Pitfall: a non-significant F doesn't prove equality

"Failed to reject $H_0$" is not the same as "the groups have equal means." It's "we couldn't find enough evidence to claim they differ." With small samples or noisy data, real differences hide easily. Don't write "ANOVA showed no difference" — write "ANOVA did not detect a difference."

Pitfall: forgetting independence

Repeated measurements on the same subject (the same patient at three time points, say) violate the independence assumption. Standard one-way ANOVA treats them as separate, independent observations, so its error term no longer measures the right variability and the p-value can't be trusted. Use repeated-measures ANOVA or a mixed model.

10. Worked examples

The first example builds the full ANOVA table from raw data; the rest practice reading a finished table, judging $F$ from summary statistics, and spotting assumption problems. The SS → MS → F pipeline repeats on purpose, so that it becomes muscle memory.

Example 1 · Three teaching methods (a complete one-way ANOVA)

Three classrooms try three different teaching methods. Test scores from each class ($n = 4$ per class):

  • Method A: 80, 85, 78, 81 (mean $\bar x_A = 81$)
  • Method B: 70, 72, 68, 74 (mean $\bar x_B = 71$)
  • Method C: 90, 88, 92, 86 (mean $\bar x_C = 89$)

Grand mean: $\bar x = (81 + 71 + 89)/3 = 80.33$. Total $N = 12$, $k = 3$.

SSB (between groups): $4 \cdot [(81 - 80.33)^2 + (71 - 80.33)^2 + (89 - 80.33)^2]$

$$ = 4 \cdot [0.45 + 87.05 + 75.17] \approx 650.7 $$

SSW (within groups): for each group, sum of squared deviations from its own mean.

  • A: $(80-81)^2 + (85-81)^2 + (78-81)^2 + (81-81)^2 = 1 + 16 + 9 + 0 = 26$
  • B: $(70-71)^2 + (72-71)^2 + (68-71)^2 + (74-71)^2 = 1 + 1 + 9 + 9 = 20$
  • C: $(90-89)^2 + (88-89)^2 + (92-89)^2 + (86-89)^2 = 1 + 1 + 9 + 9 = 20$

So $SSW = 26 + 20 + 20 = 66$.

Mean squares:

$$ MSB = \frac{650.7}{3 - 1} = 325.3, \qquad MSW = \frac{66}{12 - 3} = 7.33 $$

F-statistic:

$$ F = \frac{325.3}{7.33} \approx 44.4 $$

Degrees of freedom: $(2, 9)$. Critical value at $\alpha = 0.05$ is $F_{0.05, 2, 9} \approx 4.26$. Our F of $44.4$ is enormously bigger.

Conclusion. Reject $H_0$ — the teaching methods do not all produce the same mean score. A post-hoc Tukey test would identify which pairs differ (almost certainly all of them, given how separated the means are).
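
The hand arithmetic in this example can be cross-checked with Python's `statistics` module. This sketch takes a slightly different route from the text, computing $SSW$ as $(n_i - 1)$ times each group's sample variance, and should land on the same values up to rounding. The variable names are illustrative.

```python
from statistics import mean, variance

scores = {
    "A": [80, 85, 78, 81],
    "B": [70, 72, 68, 74],
    "C": [90, 88, 92, 86],
}
k = len(scores)
n_total = sum(len(v) for v in scores.values())
grand_mean = mean(x for v in scores.values() for x in v)

# SSB from the group means; SSW as (n_i - 1) times each sample variance.
ssb = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in scores.values())
ssw = sum((len(v) - 1) * variance(v) for v in scores.values())
f_stat = (ssb / (k - 1)) / (ssw / (n_total - k))

print(f"SSB = {ssb:.1f}, SSW = {ssw:.1f}, F = {f_stat:.1f}")
```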

Example 2 · Fertilizer effects across 4 fields

Four fertilizer types are applied to plots, with yields (in bushels per acre) measured. Suppose the ANOVA table comes out:

| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between fertilizers | 240 | 3 | 80 | 4.0 |
| Within | 320 | 16 | 20 | |
| Total | 560 | 19 | | |

Here $k = 4$ groups, $N = 20$ observations. Critical value $F_{0.05, 3, 16} \approx 3.24$. Our $F = 4.0$ exceeds it, so we reject the null at $\alpha = 0.05$: fertilizers do not all produce the same mean yield.

Check the arithmetic. $SS$: $240 + 320 = 560$ ✓. $df$: $3 + 16 = 19$ ✓.

Example 3 · Reaction time across three drug doses

Three doses of a stimulant (low, medium, high) are tested for effect on reaction time. Suppose group means are $310, 305, 308$ ms with within-group SDs around $25$ ms and $n = 10$ per group.

The group means barely differ (range 5 ms) while within-group spread is much larger (SD $\approx 25$). Without computing precise numbers: $MSB$ will be small (means almost equal) and $MSW$ will reflect the within-group variability (about $25^2 = 625$), so $F$ will be well below 1.

Conclusion. Almost certainly fail to reject $H_0$. The drug doses don't produce detectable differences in mean reaction time at this sample size and noise level. Note this is not proof of equivalence — a much larger sample might reveal a real but small effect.
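
The back-of-envelope claim in this example can be checked from the summary statistics alone. The sketch below assumes equal group sizes and takes every SD to be exactly 25, which the example only states approximately; `f_from_summary` is an illustrative name.

```python
def f_from_summary(means: list[float], sds: list[float], n: int) -> float:
    """F-statistic from group means and SDs, assuming equal group sizes n."""
    k = len(means)
    grand_mean = sum(means) / k  # valid because every group has the same n
    msb = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    msw = sum(s ** 2 for s in sds) / k  # pooled variance when n is equal
    return msb / msw

# The SDs are "around 25" in the example; exactly 25 each is an assumption.
f_stat = f_from_summary([310.0, 305.0, 308.0], [25.0, 25.0, 25.0], n=10)
print(f"F = {f_stat:.2f}")
```

With equal group sizes, $MSW$ reduces to the average of the group variances, which is why no raw data is needed.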

Example 4 · Identifying when an assumption is violated

You compare income across three job categories. Group sizes are equal ($n = 50$ each). Group SDs come out as $\$12{,}000$, $\$28{,}000$, and $\$65{,}000$. The income distributions are heavily right-skewed.

Two problems. First, the standard deviations differ by a factor of more than $5$ ($65{,}000 / 12{,}000 \approx 5.4$), so equal-variance ANOVA's assumption is badly violated. Second, the distributions are skewed, not normal, though with $n = 50$ per group the CLT helps the sampling distribution of the mean.

Better moves. (a) Log-transform income before running ANOVA (often normalizes income data). (b) Use Welch's ANOVA, which doesn't assume equal variances. (c) Use Kruskal–Wallis, the rank-based non-parametric alternative (it compares medians only when the groups' distributions share the same shape).

Example 5 · The danger of running too many t-tests

You have 5 marketing campaigns and want to know if any pair differs in conversion rate. Pairwise: $\binom{5}{2} = 10$ tests. Even if every campaign has the same true conversion rate, the chance of at least one false positive at $\alpha = 0.05$ each is:

$$ 1 - 0.95^{10} \approx 0.40 $$

A 40% chance of declaring a "winner" that doesn't exist. Run a one-way ANOVA first at level $\alpha = 0.05$; if it's significant, then use Tukey's HSD to identify which campaign pairs differ. Family-wise error stays at $0.05$.

Sources & further reading

The content above synthesizes standard treatments. For more detail on assumptions, post-hoc tests, or extensions like two-way and repeated-measures designs, reach for the primary sources.
