Chi-Square Tests

Pearson's invention for asking "do these counts fit my model?" When your data is categorical — survey responses, dice rolls, marketing-channel breakdowns — the t-test doesn't apply. The chi-square test does.

What you'll leave with

  • What the chi-square distribution is — and where the squared-normals come from.
  • Three flavors of test (goodness-of-fit, independence, homogeneity) and which question each one answers.
  • How to compute expected counts in any contingency table from the row, column, and grand totals.
  • The degrees-of-freedom formula for each flavor — and why "5" keeps showing up.
  • Why a significant chi-square never tells you direction, only that the model is wrong somewhere.

1. The chi-square distribution

The chi-square distribution shows up whenever you square standard normal random variables and add them up. If $Z_1, Z_2, \ldots, Z_k$ are independent standard normals, then

$$ \chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2 $$

has a chi-square distribution with $k$ degrees of freedom. That's the whole definition. Everything else — the right-skewed shape, the mean of $k$, the variance of $2k$, the way it sneaks up on a normal curve as $k$ grows — falls out of this one identity.

Chi-square distribution (χ²)

A continuous distribution supported on $[0, \infty)$, parameterized by a positive integer $k$ (the degrees of freedom). Mean $= k$. Variance $= 2k$. Right-skewed for small $k$, approaching a normal shape as $k$ grows.
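A quick simulation can make the definition concrete. The sketch below is only an illustration: it draws independent standard normals with Python's `random.gauss`, squares and sums them, and compares the sample mean and variance with $k$ and $2k$. The function name `simulate_chi_square`, the seed, and the number of draws are arbitrary choices made for this example.

```python
import random
import statistics


def simulate_chi_square(k, n_draws, seed=0):
    """Draw n_draws values of Z_1^2 + ... + Z_k^2 with independent standard normal Z_i."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        for _ in range(n_draws)
    ]


k = 4
draws = simulate_chi_square(k, n_draws=20_000)

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

# Theory says mean = k and variance = 2k; the simulated values should land close by.
print(f"mean     ~ {sample_mean:.2f} (theory {k})")
print(f"variance ~ {sample_var:.2f} (theory {2 * k})")
```

Every simulated value is non-negative, and rerunning with a larger `k` should show the histogram of `draws` becoming more symmetric.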

What the curves look like

[Figure: chi-square density curves for df = 2, df = 4, and df = 8; horizontal axis is the χ² value (0 to 20), vertical axis is density (0 to 0.5).]

Three things to notice. It's non-negative — there's no way to get a negative chi-square because you're summing squares. It's right-skewed for small df — most of the mass piles up near zero, with a long thin tail. It depends only on one parameter, the degrees of freedom, which tells you both the mean and the spread.

2. The test statistic, in plain words

Every chi-square test you'll meet in introductory statistics computes the same quantity. You have a set of observed counts $O_i$ in $k$ categories, and a corresponding set of expected counts $E_i$ that some null hypothesis predicts. The statistic is

$$ \chi^2 \;=\; \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$

Each term measures how far the observed count is from what we expected, scaled by what we expected. A given discrepancy contributes more in a cell with a small expected count than in a cell where a large count was predicted. Sum the terms; that's your test statistic.
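The scaling by $E_i$ is easy to see in code. This is a minimal sketch, with `chi_square_statistic` as a hypothetical helper name and made-up single-cell inputs chosen only to show the effect:

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over matching categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# The same miss of 5 counts weighs more where little was expected.
small_cell = chi_square_statistic([15], [10])    # 25 / 10
large_cell = chi_square_statistic([105], [100])  # 25 / 100
print(small_cell, large_cell)  # 2.5 0.25
```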

Under the null hypothesis (the model is correct), this statistic is approximately distributed as $\chi^2$ with some number of degrees of freedom that depends on which flavor of test you're running. A large value lands in the right tail and produces a small p-value — evidence against the model.
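Turning a statistic into a p-value means evaluating the right tail of the chi-square distribution. The sketch below is one way to do that with only the standard library, assuming integer degrees of freedom. The name `chi2_sf` is chosen for this illustration, and the recurrence it relies on is a standard identity for the regularized upper incomplete gamma function, not something specific to this article. In practice a statistics library would normally supply this function.

```python
import math


def chi2_sf(x, df):
    """Right-tail probability P(X >= x) for X ~ chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    else:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    # Step a up to df / 2 using Q(a + 1, y) = Q(a, y) + y^a e^{-y} / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Tabulated 5% critical values should give tail probabilities close to 0.05.
for df, critical in [(1, 3.84), (5, 11.07), (6, 12.59)]:
    print(df, round(chi2_sf(critical, df), 4))
```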

One statistic, three questions

The formula never changes. What changes between the three flavors of chi-square test is how you compute the expected counts $E_i$ and how you count degrees of freedom. Get those two things right and the rest is arithmetic.

3. Goodness-of-fit test

The simplest flavor. You have one categorical variable, observed in some sample, and a hypothesized distribution for it. The question: are the observed counts consistent with the hypothesized distribution?

Hypotheses:

  • $H_0$: the variable follows the hypothesized distribution.
  • $H_a$: it does not.

To compute the expected count in category $i$, multiply the hypothesized probability $p_i$ by the total sample size $n$:

$$ E_i = n \cdot p_i $$

The dice example

Roll a six-sided die 120 times and tally the results. If the die is fair, each face has probability $\tfrac{1}{6}$ and the expected count per face is $E_i = 120 \cdot \tfrac{1}{6} = 20$. Suppose you observe:

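| Face | 1 | 2 | 3 | 4 | 5 | 6 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Observed $O_i$ | 16 | 22 | 19 | 24 | 15 | 24 | 120 |
| Expected $E_i$ | 20 | 20 | 20 | 20 | 20 | 20 | 120 |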

Compute the statistic term by term:

$$ \chi^2 = \tfrac{(16-20)^2}{20} + \tfrac{(22-20)^2}{20} + \tfrac{(19-20)^2}{20} + \tfrac{(24-20)^2}{20} + \tfrac{(15-20)^2}{20} + \tfrac{(24-20)^2}{20} $$ $$ = \tfrac{16 + 4 + 1 + 16 + 25 + 16}{20} = \tfrac{78}{20} = 3.9 $$

Degrees of freedom: $\text{categories} - 1 = 6 - 1 = 5$. The critical value for $\chi^2_5$ at $\alpha = 0.05$ is $11.07$. Our statistic of $3.9$ is well below that, so we fail to reject $H_0$. The data give no evidence that the die is unfair.
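The same calculation can be written as a short script. This is a sketch of the arithmetic above, not a general-purpose test routine; the variable names are illustrative, and the critical value is simply the tabulated figure quoted in the text.

```python
observed = [16, 22, 19, 24, 15, 24]  # tallies for faces 1..6 from the table above
n = sum(observed)                     # 120 rolls
probabilities = [1 / 6] * 6           # fair-die hypothesis

expected = [n * p for p in probabilities]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CRITICAL_5PCT_DF5 = 11.07  # tabulated chi-square critical value quoted in the text

print(f"chi-square = {statistic:.2f}, df = {df}")
print("reject H0" if statistic > CRITICAL_5PCT_DF5 else "fail to reject H0")
```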

4. Test of independence

Two categorical variables, one sample. Question: are they independent, or does knowing one tell you something about the other?

Data is laid out in a contingency table with $r$ rows (one per category of the first variable) and $c$ columns (one per category of the second). The cell at row $i$, column $j$ holds the count $O_{ij}$ of observations in that combination.

Hypotheses:

  • $H_0$: the two variables are independent.
  • $H_a$: they are associated (not independent).

The trick is computing expected counts. Under independence, the probability of being in cell $(i,j)$ factors into row probability times column probability. Estimate each from the marginals:

$$ \hat P(\text{row } i) = \frac{R_i}{N}, \qquad \hat P(\text{col } j) = \frac{C_j}{N} $$

where $R_i$ is the $i$-th row total, $C_j$ is the $j$-th column total, and $N$ is the grand total. Multiply, then scale by $N$ to get an expected count:

$$ E_{ij} = N \cdot \frac{R_i}{N} \cdot \frac{C_j}{N} = \frac{R_i \, C_j}{N} $$

That's the formula worth memorizing: row total times column total, divided by grand total.

Once you have the $E_{ij}$, plug into the same statistic, summing over every cell:

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

Degrees of freedom: $(r-1)(c-1)$. The intuition: once the row and column totals are known, you can fill in the cells of the first $r-1$ rows and first $c-1$ columns freely, an $(r-1) \times (c-1)$ block; every remaining cell is then forced by the marginals.
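The recipe of marginals, expected counts, statistic, and degrees of freedom can be sketched in a few lines. The function names `expected_counts` and `independence_test` and the 2×3 table are hypothetical, invented only to exercise the formulas.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def independence_test(table):
    """Return (statistic, df, expected) for an r x c table of counts."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected


# Hypothetical 2 x 3 table, invented purely to exercise the functions.
table = [
    [10, 20, 30],
    [20, 20, 20],
]
statistic, df, expected = independence_test(table)
print(expected)
print(f"chi-square = {statistic:.3f}, df = {df}")
```

Note that each row of `expected` sums to the observed row total, and each column to the observed column total: the expected table always reproduces the marginals.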

5. Test of homogeneity

Same arithmetic as independence, different question. You have several distinct populations, and you sample from each one separately. You want to know: do they share the same distribution across categories of one variable?

Hypotheses:

  • $H_0$: the categorical variable has the same distribution in every population.
  • $H_a$: the distributions differ.

The contingency table looks identical: rows for populations, columns for categories. Expected counts are the same $E_{ij} = R_i C_j / N$, the statistic is the same, and the degrees of freedom are the same $(r-1)(c-1)$.

Independence vs homogeneity

Independence and homogeneity differ only in how the data was collected. Independence: one sample, classified by two variables (row and column totals are both random). Homogeneity: several samples (one per population), classified by one variable (row totals are fixed by the sampling design, column totals are random). The mechanics are identical; the interpretation is what changes.

6. Expected counts and the 5-rule

The chi-square distribution is a continuous approximation to the sampling distribution of a statistic that is actually discrete, because it is built from counts. When expected counts get small, the approximation breaks down: the statistic's true distribution can differ noticeably from the chi-square curve, especially in the right tail, and your p-values become unreliable.

The 5-rule

Every expected count should be at least 5. A common, more permissive variant: no expected count below 1, and at least 80% of cells should have expected counts of 5 or more. If your table fails this check, the chi-square p-value is suspect.
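Both versions of the rule are mechanical enough to check in code before looking at any p-value. The helper below is a sketch; `check_expected_counts` and the example table of expected counts are hypothetical. It expects a list of rows, so a goodness-of-fit test would pass its expected counts as a single row.

```python
def check_expected_counts(expected):
    """Apply the strict 5-rule and the permissive variant to a table of expected counts."""
    cells = [e for row in expected for e in row]
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return {
        "strict": all(e >= 5 for e in cells),
        "permissive": min(cells) >= 1 and share_at_least_5 >= 0.8,
        "smallest": min(cells),
    }


# Hypothetical expected counts for a 2 x 5 table: one cell at 3, the other nine at 5 or more.
expected = [
    [12.0, 8.0, 7.0, 5.0, 3.0],
    [24.0, 16.0, 14.0, 10.0, 6.0],
]
print(check_expected_counts(expected))
```

Here the strict rule fails (one cell is below 5) while the permissive variant passes (no cell below 1, and 90% of cells at 5 or more).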

When the 5-rule fails, you have a few options:

  • Combine categories. If two rare categories are conceptually similar, merge them and re-do the test with a smaller table.
  • Collect more data. Expected counts grow with $N$. Sometimes the answer is just a bigger sample.
  • Use Fisher's exact test. For 2×2 tables especially, Fisher's test enumerates exact probabilities under the null without any large-sample approximation.

7. Degrees of freedom

The single most common source of computational mistakes. Each flavor of test has its own rule:

| Test | Degrees of freedom | Why |
| --- | --- | --- |
| Goodness-of-fit (fully specified distribution) | $k - 1$ | $k$ categories; once you fix $n$, the last count is forced. |
| Goodness-of-fit (parameters estimated from data) | $k - 1 - p$ | Each of the $p$ estimated parameters costs one extra df. |
| Independence (and homogeneity) | $(r-1)(c-1)$ | Given the marginals, the last row and the last column are forced. |

The "parameters estimated" subtlety bites people. If your goodness-of-fit test uses an estimated Poisson rate $\hat\lambda$ from the same data, you lose one extra df. If you also estimated the proportion in some category, you lose another.

8. Common pitfalls

Using chi-square on continuous data

Chi-square tests need counts in distinct categories. If your data is heights in inches or reaction times in milliseconds, you can't just hand it to chi-square. Either bin into intervals first (with care — bin choices change the answer) or use a test designed for continuous data like the Kolmogorov-Smirnov test.

Expected counts too small

The biggest invisible failure mode. A 5×5 table with $N = 50$ has an average of 2 per cell — guaranteed to violate the 5-rule, and your software will still happily print a p-value. Always check expected counts before trusting the result.

Reading direction into a significant result

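A significant chi-square says "the observed counts don't fit the model". It doesn't say where the discrepancy lives or which categories are over- or under-represented. To investigate, inspect the residuals $(O_{ij} - E_{ij}) / \sqrt{E_{ij}}$ cell by cell (these are usually called Pearson residuals; the fully standardized version divides by a further variance term). The sign shows whether a cell is over- or under-represented, and the largest absolute values point to the cells driving the result.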

Using percentages instead of counts

The chi-square statistic is computed on raw counts, not percentages or proportions. If your contingency table is filled with row percentages and you plug them in, the test is meaningless. Convert back to counts before computing anything.
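A small experiment shows why. The sketch below uses a hypothetical 2×2 table and, as a simplifying assumption, converts it to percentages of the grand total rather than row percentages. The proportions are identical, but the statistic changes by a factor of $N/100$, because it grows in proportion to the sample size.

```python
def chi_square_from_table(table):
    """Pearson chi-square for an r x c table, with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return sum(
        (cell - r * c / grand_total) ** 2 / (r * c / grand_total)
        for row, r in zip(table, row_totals)
        for cell, c in zip(row, col_totals)
    )


# Hypothetical counts (N = 400) and the same table rewritten as percentages of N.
counts = [[120, 40], [80, 160]]
n = sum(map(sum, counts))
percentages = [[100 * cell / n for cell in row] for row in counts]

stat_counts = chi_square_from_table(counts)
stat_percentages = chi_square_from_table(percentages)

# Same proportions, different implied sample size: the ratio is n / 100.
print(stat_counts, stat_percentages, stat_counts / stat_percentages)
```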

Non-independent observations

Every chi-square test assumes each observation contributes to exactly one cell, independently of the others. If you survey the same person twice, or count several members of one family as separate observations when their answers are likely to be related, the test is invalid no matter how clean the math looks.

9. Worked examples

Try each one before opening the solution. The mechanics are repetitive on purpose — the goal is for "expected = row × column / grand" and "df = whatever the rule says" to become automatic.

Example 1 · Goodness-of-fit: is the die fair?

Roll a die 60 times. Observed: 8, 12, 10, 9, 11, 10.

Step 1. Expected under fairness: each face $E_i = 60/6 = 10$.

Step 2. Compute the statistic:

$$ \chi^2 = \tfrac{4 + 4 + 0 + 1 + 1 + 0}{10} = \tfrac{10}{10} = 1.0 $$

Step 3. Degrees of freedom: $6 - 1 = 5$. The 5%-level critical value is $11.07$; our statistic is far below it.

Conclusion. No evidence against fairness.

Example 2 · Independence: smoking and lung disease

A sample of 200 adults is classified by smoking status and presence of lung disease:

| | Lung disease | No disease | Row total |
| --- | --- | --- | --- |
| Smoker | 30 | 50 | 80 |
| Non-smoker | 20 | 100 | 120 |
| Column total | 50 | 150 | 200 |

Step 1. Compute expected counts $E_{ij} = R_i C_j / N$:

  • $E_{11} = 80 \cdot 50 / 200 = 20$
  • $E_{12} = 80 \cdot 150 / 200 = 60$
  • $E_{21} = 120 \cdot 50 / 200 = 30$
  • $E_{22} = 120 \cdot 150 / 200 = 90$

Step 2. Each expected count exceeds 5 — the assumption holds.

Step 3. Compute $\chi^2$:

$$ \chi^2 = \tfrac{(30-20)^2}{20} + \tfrac{(50-60)^2}{60} + \tfrac{(20-30)^2}{30} + \tfrac{(100-90)^2}{90} $$ $$ = 5 + 1.667 + 3.333 + 1.111 \approx 11.11 $$

Step 4. Degrees of freedom: $(2-1)(2-1) = 1$. Critical value at $\alpha = 0.05$ is $3.84$.

Conclusion. Reject independence. Smoking status and lung disease are associated in this sample.
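For readers who want to check the arithmetic, here is the example as a short script. It is a sketch using only the standard library. The p-value line relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal, so this shortcut applies to df = 1 only.

```python
import math

# Rows: smoker, non-smoker. Columns: lung disease, no disease.
observed = [
    [30, 50],
    [20, 100],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# With df = 1 the statistic is the square of a standard normal under H0,
# so the right-tail probability is erfc(sqrt(statistic / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(expected)
print(f"chi-square = {statistic:.2f}, p = {p_value:.5f}")
```

The p-value comes out below 0.001, consistent with a statistic well beyond the 5% critical value of $3.84$.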

Example 3 · Homogeneity: preferred browser across age groups

Survey 100 people in each of three age groups about their preferred browser. The data:

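| Age group | Chrome | Firefox | Safari | Other | Row total |
| --- | --- | --- | --- | --- | --- |
| 18–29 | 55 | 15 | 20 | 10 | 100 |
| 30–49 | 45 | 20 | 25 | 10 | 100 |
| 50+ | 30 | 10 | 40 | 20 | 100 |
| Column total | 130 | 45 | 85 | 40 | 300 |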

Step 1. Expected counts. Each row total is 100, so $E_{ij} = 100 \cdot C_j / 300 = C_j / 3$. That gives expected counts of $43.3, 15, 28.3, 13.3$ for every row.

Step 2. All expected counts ≥ 5 (smallest is $13.3$).

Step 3. Sum $(O_{ij} - E_{ij})^2 / E_{ij}$ across all 12 cells. The column contributions are about $7.31$ (Chrome), $3.33$ (Firefox), $7.65$ (Safari), and $5.00$ (Other), giving $\chi^2 \approx 23.3$.

Step 4. Degrees of freedom: $(3-1)(4-1) = 6$. Critical value at $\alpha = 0.05$ is $12.59$.

Conclusion. Reject homogeneity. Browser preferences are not the same across age groups. (A glance at the table suggests older respondents lean toward Safari and Other, but the chi-square alone doesn't tell you that — residuals do.)
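To see where the discrepancy lives, the residuals $(O_{ij} - E_{ij}) / \sqrt{E_{ij}}$ can be computed cell by cell. The sketch below does that for the browser table; the dictionary layout and variable names are illustrative choices, not a prescribed format.

```python
import math

columns = ["Chrome", "Firefox", "Safari", "Other"]
observed = {
    "18-29": [55, 15, 20, 10],
    "30-49": [45, 20, 25, 10],
    "50+":   [30, 10, 40, 20],
}

col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

residuals = {}
for group, row in observed.items():
    row_total = sum(row)
    residuals[group] = [
        (o - row_total * c / n) / math.sqrt(row_total * c / n)
        for o, c in zip(row, col_totals)
    ]

# Positive residual: more respondents than H0 predicts; negative: fewer.
for group, row in residuals.items():
    cells = ", ".join(f"{name} {value:+.2f}" for name, value in zip(columns, row))
    print(f"{group:>5}: {cells}")

# The squared residuals add up to the chi-square statistic.
statistic = sum(value ** 2 for row in residuals.values() for value in row)
print(f"chi-square = {statistic:.2f}")
```

In the 50+ row the residuals for Safari and Other are positive and the one for Chrome is negative, which is the pattern the table hints at.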

Example 4 · Goodness-of-fit with estimated parameters

You count emails arriving in 60 one-minute windows and want to test whether the counts follow a Poisson distribution. To compute expected counts, you first estimate $\hat\lambda$ from the data — say $\hat\lambda = 2.3$. Bin into categories 0, 1, 2, 3, 4+ (five categories).

Degrees of freedom: $k - 1 - p = 5 - 1 - 1 = 3$.

The minus-1 is for the constraint that $\sum O_i = n$; the additional minus-1 is for the estimated $\hat\lambda$. Skipping this adjustment is the most common mistake in Poisson goodness-of-fit.
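The expected counts for this setup follow directly from the Poisson probabilities. The sketch below uses only the two numbers given in the example ($n = 60$ windows, $\hat\lambda = 2.3$); it does not invent observed counts, and the helper `poisson_pmf` is defined here for illustration.

```python
import math

n_windows = 60  # one-minute windows, from the example
lam_hat = 2.3   # estimated Poisson rate, from the example


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Bins 0, 1, 2, 3 get their own probability; "4+" takes whatever is left over.
probabilities = [poisson_pmf(k, lam_hat) for k in range(4)]
probabilities.append(1.0 - sum(probabilities))

labels = ["0", "1", "2", "3", "4+"]
expected = [n_windows * p for p in probabilities]

for label, e in zip(labels, expected):
    print(f"{label:>2}: {e:.2f}")

estimated_parameters = 1  # lambda was estimated from the same data
df = len(expected) - 1 - estimated_parameters
print("df =", df)
```

With this binning every expected count is above 5 (the smallest is about 6.0, in the zero bin), so the 5-rule is satisfied.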

Example 5 · Expected counts too small

You classify 40 small businesses across 5 industries and 4 size brackets — a 5×4 table. Total cells: 20. Average count per cell: $40/20 = 2$. Even before computing any statistic, you know the 5-rule will fail badly.

Options. Either merge industries into broader categories (say, manufacturing vs services vs other) to bring expected counts up, or, if you need to keep the categories as they are, switch to Fisher's exact test, which has an extension to tables larger than 2×2. Don't just report the chi-square p-value and hope the reviewer doesn't notice.

