1. The chi-square distribution
The chi-square distribution shows up whenever you square standard normal random variables and add them up. If $Z_1, Z_2, \ldots, Z_k$ are independent standard normals, then
$$ \chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2 $$
has a chi-square distribution with $k$ degrees of freedom. That's the whole definition. Everything else (the right-skewed shape, the mean of $k$, the variance of $2k$, the way it sneaks up on a normal curve as $k$ grows) falls out of this one identity.
In summary: the chi-square distribution is a continuous distribution supported on $[0, \infty)$ and parameterized by a positive integer $k$ (the degrees of freedom). Its mean is $k$ and its variance is $2k$. It is right-skewed for small $k$ and approaches a normal shape as $k$ grows.
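A quick simulation can make the definition concrete. The sketch below uses only Python's standard library to draw sums of $k$ squared standard normals and compare the sample mean and variance with $k$ and $2k$. The function name `simulate_chi_square`, the seed, and the number of draws are illustrative choices, not part of the definition.

```python
import random
import statistics

def simulate_chi_square(k, n_draws, seed=0):
    """Draw n_draws values of Z_1^2 + ... + Z_k^2 with independent standard normal Z_i."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        for _ in range(n_draws)
    ]

k = 5
draws = simulate_chi_square(k, n_draws=20_000)
sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

# Simulated values should land close to the theoretical ones, not exactly on them.
print(f"sample mean     = {sample_mean:.2f} (theory: {k})")
print(f"sample variance = {sample_var:.2f} (theory: {2 * k})")
```

Rerunning with a larger `k` and plotting a histogram of `draws` is an easy way to watch the skew fade.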
What the curves look like
Three things to notice. It's non-negative — there's no way to get a negative chi-square because you're summing squares. It's right-skewed for small df — most of the mass piles up near zero, with a long thin tail. It depends only on one parameter, the degrees of freedom, which tells you both the mean and the spread.
2. The test statistic, in plain words
Every chi-square test you'll meet in introductory statistics computes the same quantity. You have a set of observed counts $O_i$ in $k$ categories, and a corresponding set of expected counts $E_i$ that some null hypothesis predicts. The statistic is
$$ \chi^2 \;=\; \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$
Each term measures how far the observed count is from what we expected, scaled by what we expected. Big discrepancies in cells with small expected counts contribute more than the same discrepancy in cells where lots of action was predicted. Sum them up; that's your test statistic.
Under the null hypothesis (the model is correct), this statistic is approximately distributed as $\chi^2$ with some number of degrees of freedom that depends on which flavor of test you're running. A large value lands in the right tail and produces a small p-value — evidence against the model.
The formula never changes. What changes between the three flavors of chi-square test is how you compute the expected counts $E_i$ and how you count degrees of freedom. Get those two things right and the rest is arithmetic.
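Because the formula is the same everywhere, it fits in a few lines of code. Here is a minimal sketch in plain Python; the function name `chi_square_statistic` and the two single-cell inputs are made up for illustration. The two calls show how the same miss of 5 counts weighs more where little was expected.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Same discrepancy (5 counts), different expected counts.
small_cell = chi_square_statistic([15], [10])    # 25 / 10
large_cell = chi_square_statistic([105], [100])  # 25 / 100
print(small_cell, large_cell)  # prints 2.5 0.25
```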
3. Goodness-of-fit test
The simplest flavor. You have one categorical variable, observed in some sample, and a hypothesized distribution for it. The question: are the observed counts consistent with the hypothesized distribution?
Hypotheses:
- $H_0$: the variable follows the hypothesized distribution.
- $H_a$: it does not.
To compute the expected count in category $i$, multiply the hypothesized probability $p_i$ by the total sample size $n$:
$$ E_i = n \cdot p_i $$
The dice example
Roll a six-sided die 120 times and tally the results. If the die is fair, each face has probability $\tfrac{1}{6}$ and the expected count per face is $E_i = 120 \cdot \tfrac{1}{6} = 20$. Suppose you observe:
| Face | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| Observed $O_i$ | 16 | 22 | 19 | 24 | 15 | 24 | 120 |
| Expected $E_i$ | 20 | 20 | 20 | 20 | 20 | 20 | 120 |
Compute the statistic term by term:
$$ \chi^2 = \tfrac{(16-20)^2}{20} + \tfrac{(22-20)^2}{20} + \tfrac{(19-20)^2}{20} + \tfrac{(24-20)^2}{20} + \tfrac{(15-20)^2}{20} + \tfrac{(24-20)^2}{20} $$
$$ = \tfrac{16 + 4 + 1 + 16 + 25 + 16}{20} = \tfrac{78}{20} = 3.9 $$
Degrees of freedom: $\text{categories} - 1 = 6 - 1 = 5$. The critical value for $\chi^2_5$ at $\alpha = 0.05$ is $11.07$. Our statistic of $3.9$ is way below that; we fail to reject $H_0$. The die looks fair.
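The dice calculation can be checked in code, and a p-value can stand in for the table lookup. The sketch below assumes a hand-written helper, `chi2_sf` (a hypothetical name), that evaluates the upper-tail probability using the closed-form expressions available when the degrees of freedom are a positive integer. In practice a statistics library would normally supply this function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with positive integer df."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction sum.
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)   # builds x^j / (1 * 3 * ... * (2j + 1))
        total += term
    correction = math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + correction

observed = [16, 22, 19, 24, 15, 24]
n = sum(observed)
expected = [n / 6] * 6                       # E_i = n * p_i with p_i = 1/6
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(stat, df)

print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.3f}")
```

A p-value above $0.05$ says the same thing as a statistic below the critical value: no evidence against fairness at the 5% level.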
4. Test of independence
Two categorical variables, one sample. Question: are they independent, or does knowing one tell you something about the other?
Data is laid out in a contingency table with $r$ rows (one per category of the first variable) and $c$ columns (one per category of the second). The cell at row $i$, column $j$ holds the count $O_{ij}$ of observations in that combination.
Hypotheses:
- $H_0$: the two variables are independent.
- $H_a$: they are associated (not independent).
The trick is computing expected counts. Under independence, the probability of being in cell $(i,j)$ factors into row probability times column probability. Estimate each from the marginals:
$$ \hat P(\text{row } i) = \frac{R_i}{N}, \qquad \hat P(\text{col } j) = \frac{C_j}{N} $$
where $R_i$ is the $i$-th row total, $C_j$ is the $j$-th column total, and $N$ is the grand total. Multiply, then scale by $N$ to get an expected count:
$$ E_{ij} = N \cdot \frac{R_i}{N} \cdot \frac{C_j}{N} = \frac{R_i \, C_j}{N} $$
That's the formula worth memorizing: row total times column total, divided by grand total.
Once you have the $E_{ij}$, plug into the same statistic, summing over every cell:
$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
Degrees of freedom: $(r-1)(c-1)$. The intuition: once the row and column totals are fixed, you can fill in an $(r-1) \times (c-1)$ block of cells freely; the last row and the last column are then forced by the marginals.
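The row-times-column rule translates directly into code. Below is a minimal sketch in plain Python that takes a table as a list of rows; the names `expected_counts` and `chi_square_table` are hypothetical, and the 2×3 table is made up purely to exercise the functions.

```python
def expected_counts(table):
    """E_ij = (row total * column total) / grand total, for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_table(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Made-up 2 x 3 table, for illustration only.
illustrative = [[10, 20, 30],
                [20, 20, 20]]

expected = expected_counts(illustrative)
stat, df = chi_square_table(illustrative)
print(expected)
print(f"chi-square = {stat:.3f}, df = {df}")
```

Note that the expected table has the same row and column totals as the observed one, which is exactly why only $(r-1)(c-1)$ cells are free.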
5. Test of homogeneity
Same arithmetic as independence, different question. You have several distinct populations, and you sample from each one separately. You want to know: do they share the same distribution across categories of one variable?
Hypotheses:
- $H_0$: the categorical variable has the same distribution in every population.
- $H_a$: the distributions differ.
The contingency table looks identical: rows for populations, columns for categories. Expected counts are the same $E_{ij} = R_i C_j / N$, the statistic is the same, and the degrees of freedom are the same $(r-1)(c-1)$.
Independence and homogeneity differ only in how the data was collected. Independence: one sample, classified by two variables (row and column totals are both random). Homogeneity: several samples (one per population), classified by one variable (row totals are fixed by the sampling design, column totals are random). The mechanics are identical; the interpretation is what changes.
6. Expected counts and the 5-rule
The chi-square distribution is a continuous approximation to the sampling distribution of the statistic, which is actually discrete because it is built from counts. When expected counts get small, the approximation breaks down: the true distribution can differ noticeably from the chi-square curve, especially in the right tail, and your p-values get unreliable.
Every expected count should be at least 5. A common, more permissive variant: no expected count below 1, and at least 80% of cells should have expected counts of 5 or more. If your table fails this check, the chi-square p-value is suspect.
When the 5-rule fails, you have a few options:
- Combine categories. If two rare categories are conceptually similar, merge them and re-do the test with a smaller table.
- Collect more data. Expected counts grow with $N$. Sometimes the answer is just a bigger sample.
- Use Fisher's exact test. For 2×2 tables especially, Fisher's test enumerates exact probabilities under the null without any large-sample approximation.
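The check itself is easy to automate. Here is one possible sketch that tests a flat list of expected counts against both the strict rule and the more permissive variant described above. The function name `check_expected_counts` and both input lists are hypothetical.

```python
def check_expected_counts(expected):
    """Check a flat list of expected counts against the strict and permissive rules."""
    share_at_least_5 = sum(e >= 5 for e in expected) / len(expected)
    return {
        # Strict rule: every expected count is at least 5.
        "strict": min(expected) >= 5,
        # Permissive rule: none below 1, and at least 80% of cells at 5 or more.
        "permissive": min(expected) >= 1 and share_at_least_5 >= 0.8,
    }

# Hypothetical expected counts, for illustration only.
borderline = [12.0, 9.5, 6.0, 5.5, 3.0]   # one of five cells is below 5
sparse = [2.0] * 20                        # every cell is below 5

print(check_expected_counts(borderline))  # strict fails, permissive passes
print(check_expected_counts(sparse))      # both fail
```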
7. Degrees of freedom
The single most common source of computational mistakes. Each flavor of test has its own rule:
| Test | Degrees of freedom | Why |
|---|---|---|
| Goodness-of-fit (fully specified distribution) | $k - 1$ | $k$ categories; once you fix $n$, the last count is forced. |
| Goodness-of-fit (parameters estimated from data) | $k - 1 - p$ | $p$ is the number of estimated parameters; each one costs one extra df. |
| Independence (and homogeneity) | $(r-1)(c-1)$ | Marginals fix one row and one column. |
The "parameters estimated" subtlety bites people. If your goodness-of-fit test uses an estimated Poisson rate $\hat\lambda$ from the same data, you lose one extra df. If you also estimated the proportion in some category, you lose another.
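Since the rules are mechanical, it can help to write them down once as code. The two helpers below are a small sketch; `gof_df` and `table_df` are hypothetical names. The calls reuse table shapes and category counts that appear in this guide.

```python
def gof_df(n_categories, n_estimated_params=0):
    """Goodness-of-fit: k - 1 - p."""
    df = n_categories - 1 - n_estimated_params
    if df < 1:
        raise ValueError("too few categories for this many estimated parameters")
    return df

def table_df(n_rows, n_cols):
    """Independence or homogeneity: (r - 1)(c - 1)."""
    return (n_rows - 1) * (n_cols - 1)

print(gof_df(6))       # fair die, nothing estimated: 5
print(gof_df(5, 1))    # five Poisson bins, lambda estimated: 3
print(table_df(2, 2))  # 2 x 2 table: 1
print(table_df(3, 4))  # 3 x 4 table: 6
```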
8. Common pitfalls
Chi-square tests need counts in distinct categories. If your data is heights in inches or reaction times in milliseconds, you can't just hand it to chi-square. Either bin into intervals first (with care — bin choices change the answer) or use a test designed for continuous data like the Kolmogorov-Smirnov test.
Small expected counts are the biggest invisible failure mode. A 5×5 table with $N = 50$ has an average expected count of 2 per cell, which is guaranteed to violate the 5-rule, and your software will still happily print a p-value. Always check expected counts before trusting the result.
A significant chi-square says "the observed counts don't fit the model"; it doesn't say where the discrepancy lives or which categories are over- or under-represented. To investigate, inspect the residuals $(O_{ij} - E_{ij}) / \sqrt{E_{ij}}$ cell by cell. These are usually called Pearson residuals; some texts reserve "standardized residuals" for a version with an extra variance adjustment.
The chi-square statistic is computed on raw counts, not percentages or proportions. If your contingency table is filled with row percentages and you plug them in, the test is meaningless. Convert back to counts before computing anything.
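To see why percentages break the test, the sketch below feeds the dice data from the goodness-of-fit section into the statistic twice: once as raw counts and once as percentages. The helper name `chi_square_statistic` is an illustrative choice.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

counts = [16, 22, 19, 24, 15, 24]             # raw counts from the dice example
n = sum(counts)                                # 120 rolls
from_counts = chi_square_statistic(counts, [n / 6] * 6)

percentages = [100 * c / n for c in counts]    # same data as percentages
from_percentages = chi_square_statistic(percentages, [100 / 6] * 6)

print(f"from counts:      {from_counts:.2f}")
print(f"from percentages: {from_percentages:.2f}")
```

The statistic scales with the total, so percentages quietly pretend the sample size was 100. Here that shrinks the statistic; with a sample smaller than 100 it would inflate it instead.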
Every chi-square test assumes each observation contributes to exactly one cell, independently of the others. If you survey the same person twice, or count families instead of individuals when family members aren't independent, the test is invalid no matter how clean the math looks.
9. Worked examples
Try each one before reading the solution. The mechanics are repetitive on purpose: the goal is for "expected = row × column / grand" and "df = whatever the rule says" to become automatic.
Example 1 · Goodness-of-fit: is the die fair?
Roll a die 60 times. Observed: 8, 12, 10, 9, 11, 10.
Step 1. Expected under fairness: each face $E_i = 60/6 = 10$.
Step 2. Compute the statistic:
$$ \chi^2 = \tfrac{4 + 4 + 0 + 1 + 1 + 0}{10} = \tfrac{10}{10} = 1.0 $$
Step 3. Degrees of freedom: $6 - 1 = 5$. The 5%-level critical value is $11.07$; our statistic is far below it.
Conclusion. No evidence against fairness.
Example 2 · Independence: smoking and lung disease
A sample of 200 adults is classified by smoking status and presence of lung disease:
| | Lung disease | No disease | Row total |
|---|---|---|---|
| Smoker | 30 | 50 | 80 |
| Non-smoker | 20 | 100 | 120 |
| Column total | 50 | 150 | 200 |
Step 1. Compute expected counts $E_{ij} = R_i C_j / N$:
- $E_{11} = 80 \cdot 50 / 200 = 20$
- $E_{12} = 80 \cdot 150 / 200 = 60$
- $E_{21} = 120 \cdot 50 / 200 = 30$
- $E_{22} = 120 \cdot 150 / 200 = 90$
Step 2. Each expected count exceeds 5 — the assumption holds.
Step 3. Compute $\chi^2$:
$$ \chi^2 = \tfrac{(30-20)^2}{20} + \tfrac{(50-60)^2}{60} + \tfrac{(20-30)^2}{30} + \tfrac{(100-90)^2}{90} $$
$$ = 5 + 1.667 + 3.333 + 1.111 \approx 11.11 $$
Step 4. Degrees of freedom: $(2-1)(2-1) = 1$. Critical value at $\alpha = 0.05$ is $3.84$.
Conclusion. Reject independence. Smoking status and lung disease are associated in this sample.
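The whole example can be reproduced in a few lines of plain Python. This is a sketch rather than a general-purpose routine: it hard-codes the 2×2 table above, and it relies on the fact that with one degree of freedom the chi-square variable is the square of a standard normal, so the upper-tail probability reduces to `math.erfc`.

```python
import math

observed = [[30, 50],     # smokers: lung disease, no disease
            [20, 100]]    # non-smokers: lung disease, no disease

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# E_ij = row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(expected)
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.5f}")
```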
Example 3 · Homogeneity: preferred browser across age groups
Survey 100 people in each of three age groups about their preferred browser. The data:
| | Chrome | Firefox | Safari | Other | Row |
|---|---|---|---|---|---|
| 18–29 | 55 | 15 | 20 | 10 | 100 |
| 30–49 | 45 | 20 | 25 | 10 | 100 |
| 50+ | 30 | 10 | 40 | 20 | 100 |
| Col | 130 | 45 | 85 | 40 | 300 |
Step 1. Expected counts. Each row total is 100, so $E_{ij} = 100 \cdot C_j / 300 = C_j / 3$. That gives expected counts of $43.3, 15, 28.3, 13.3$ for every row.
Step 2. All expected counts ≥ 5 (smallest is $13.3$).
Step 3. Sum $(O_{ij} - E_{ij})^2 / E_{ij}$ across all 12 cells. By column, the contributions are about $7.31$ (Chrome), $3.33$ (Firefox), $7.65$ (Safari), and $5.00$ (Other), giving $\chi^2 \approx 23.3$.
Step 4. Degrees of freedom: $(3-1)(4-1) = 6$. Critical value at $\alpha = 0.05$ is $12.59$.
Conclusion. Reject homogeneity. Browser preferences are not the same across age groups. (A glance at the table suggests older respondents lean toward Safari and Other, but the chi-square alone doesn't tell you that — residuals do.)
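To find out where the discrepancy lives, the residuals $(O_{ij} - E_{ij}) / \sqrt{E_{ij}}$ can be computed cell by cell. The sketch below does this for the browser table in plain Python; the variable names are illustrative. As a rough convention (not a formal test), residuals beyond about $\pm 2$ mark the cells doing most of the work.

```python
import math

row_labels = ["18-29", "30-49", "50+"]
col_labels = ["Chrome", "Firefox", "Safari", "Other"]
observed = [[55, 15, 20, 10],
            [45, 20, 25, 10],
            [30, 10, 40, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Residual for each cell: (O - E) / sqrt(E).
residuals = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# The squared residuals add up to the chi-square statistic.
stat = sum(r ** 2 for row in residuals for r in row)

print(" " * 6, "  ".join(f"{name:>7}" for name in col_labels))
for label, row in zip(row_labels, residuals):
    print(f"{label:>6}", "  ".join(f"{r:+7.2f}" for r in row))
```

In the 50+ row, the Chrome residual is clearly negative and the Safari residual clearly positive, which backs up the impression from glancing at the table.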
Example 4 · Goodness-of-fit with estimated parameters
You count emails arriving in 60 one-minute windows and want to test whether the counts follow a Poisson distribution. To compute expected counts, you first estimate $\hat\lambda$ from the data — say $\hat\lambda = 2.3$. Bin into categories 0, 1, 2, 3, 4+ (five categories).
Degrees of freedom: $k - 1 - p = 5 - 1 - 1 = 3$.
The minus-1 is for the constraint that $\sum O_i = n$; the additional minus-1 is for the estimated $\hat\lambda$. Skipping this adjustment is the most common mistake in Poisson goodness-of-fit.
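The expected counts for this setup can be sketched as follows, using $\hat\lambda = 2.3$ and the 60 windows from the example. The observed counts are not given here, so the code stops at the expected counts and the degrees of freedom. The helper name `poisson_pmf` is an illustrative choice.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

n_windows = 60
lam_hat = 2.3            # estimated from the same data, so it costs one df

probs = [poisson_pmf(x, lam_hat) for x in range(4)]   # categories 0, 1, 2, 3
probs.append(1.0 - sum(probs))                         # category "4+"
expected = [n_windows * p for p in probs]

n_estimated = 1
df = len(expected) - 1 - n_estimated

for label, e in zip(["0", "1", "2", "3", "4+"], expected):
    print(f"{label:>2}: expected {e:.2f}")
print(f"df = {df}")
```

Lumping everything from 4 upward into one bin is what keeps the smallest expected count above 5 here; splitting the tail into finer bins would push some cells below the threshold.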
Example 5 · Expected counts too small
You classify 40 small businesses across 5 industries and 4 size brackets — a 5×4 table. Total cells: 20. Average count per cell: $40/20 = 2$. Even before computing any statistic, you know the 5-rule will fail badly.
Options. Either merge industries into broader categories (say, manufacturing vs services vs other) to bring expected counts up, or switch to Fisher's exact test if you need to keep the categories as they are. Don't just report the chi-square p-value and hope the reviewer doesn't notice.