1. Why sampling matters
You almost never get to measure everyone. The population of interest — every U.S. voter, every widget the factory will ever produce, every patient with a particular disease — is too large, too expensive, or too future to count. So instead you pick a subset, measure it, and reason from the part to the whole.
That reasoning only works if the subset is representative. The whole machinery of statistical inference assumes your data is a fair window into the population. Get the sampling wrong and no amount of clever analysis can rescue you — you're computing precise summaries of the wrong thing.
The population is every unit you want to draw conclusions about. The sample is the subset you actually observe. A parameter is a number that summarizes the population; a statistic is a number that summarizes the sample. Parameters are fixed but usually unknown; statistics vary from sample to sample and are what you compute.
The notation tracks the distinction. By convention:
| Quantity | Population (parameter) | Sample (statistic) |
|---|---|---|
| Mean | $\mu$ | $\bar{x}$ |
| Standard deviation | $\sigma$ | $s$ |
| Proportion | $p$ | $\hat{p}$ |
| Size | $N$ | $n$ |
Greek letters usually mark the unknown truth, and Roman letters (often with a hat or bar) mark the estimate; the proportion $p$ and the size $N$ are the conventional exceptions, Roman on both sides. Train your eye to read these symbols as "the real thing" versus "what we measured": every formula in the rest of the chapter is some version of "use the sample on the right to learn about the parameter on the left."
Inference is the art of saying something believable about $\mu$ when all you have is $\bar{x}$. Sampling design is what makes that leap honest.
2. Simple random sampling
The cleanest design — and the one all the math of inferential statistics is built on — is the simple random sample (SRS).
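A simple random sample is a sample of size $n$ drawn from a population of size $N$ such that every possible subset of size $n$ is equally likely to be chosen. As a consequence, each individual has the same probability $n / N$ of being included. The converse does not hold: equal inclusion probabilities alone do not make an SRS, because the definition also requires every combination of individuals to be equally likely.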
The operational definition is "draw names from a hat." In practice: assign every member of the population a unique ID, generate $n$ distinct random numbers in $[1, N]$, and include the matching units. A spreadsheet's random function and a sorted column will do it in a few seconds.
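The recipe above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the function name `simple_random_sample` and the population of 10,000 numbered units are assumptions made for the example, and the seed is there only so the draw can be reproduced.

```python
import random

def simple_random_sample(population_ids, n, seed=None):
    """Return n distinct IDs; every subset of size n is equally likely."""
    rng = random.Random(seed)      # the seed only makes the draw reproducible
    return rng.sample(population_ids, n)

# Hypothetical population: units numbered 1..10,000
ids = list(range(1, 10_001))
sample = simple_random_sample(ids, n=100, seed=42)
print(len(sample), len(set(sample)))   # prints: 100 100
```

The `sample` method draws without replacement, so no unit can be picked twice, which matches "generate $n$ distinct random numbers."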
SRS is the gold standard because the math is honest about it. When statisticians write $\bar{x} \pm 1.96 \,\sigma / \sqrt{n}$, the formula is derived under the assumption that the sample is random — and "random" means a real probabilistic mechanism, not just "haphazard" or "convenient."
"I grabbed a few cookies from the jar without looking" is not random sampling — your hand goes to the easy-to-reach cookies. Random means a mechanism (a die, a generator, a table) decides for you. The whole point is to take the choice out of human hands precisely because humans are systematically biased without realizing it.
The strength and the weakness
The strength: SRS guarantees, in expectation, that the sample looks like the population on every variable you do and don't care about. No subgroup is systematically over- or under-represented; any imbalance is the luck of the draw and is exactly what the standard-error formulas account for.
The weakness: an SRS can be impractical when the population is enormous and scattered, or when you specifically need to compare subgroups. If only 3% of your population is left-handed and you want to estimate the left-handed mean to any useful precision, an SRS of size 100 expects just 3 left-handers in the sample — not enough. The remedy is to structure the randomness, which is what the next three designs do.
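To see how thin the left-handed subsample gets, here is a rough calculation. It assumes the population is large enough that the number of left-handers in the sample is approximately binomial; the helper `binom_pmf` is a name invented for this sketch.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.03                 # sample size and left-handed share from the text
expected = n * p                 # expected number of left-handers
p_at_most_3 = sum(binom_pmf(k, n, p) for k in range(4))
print(expected, round(p_at_most_3, 3))
```

Under that assumption, the chance of ending up with three or fewer left-handers is roughly 65%, so the typical SRS of 100 gives you almost nothing to estimate the left-handed mean with.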
3. Stratified sampling
In a stratified sample, you partition the population into non-overlapping groups called strata, then draw a simple random sample inside each stratum and combine the results.
Two things make stratification powerful:
- Guaranteed representation of subgroups. If you want the left-handed mean estimated as precisely as the right-handed mean, draw enough left-handers on purpose. SRS leaves that to chance; stratification doesn't.
- Lower variance overall. When the strata are chosen so units inside a stratum are more similar than units across strata, the combined estimate is more precise than an SRS of the same total size $n$. The math: the variance of a stratified estimator depends on within-stratum variance, not the (typically larger) overall variance.
Strata are chosen based on a variable expected to be related to what you're measuring. For a national income survey: region, urban/rural, age group. For a quality-control sample: shift, machine, supplier. The rule of thumb is "split by the thing that matters." Strata you invent for cosmetic reasons buy you nothing.
The combined estimate is a weighted average of the stratum statistics, with weights proportional to the stratum sizes in the population:
$$ \bar{x}_{\text{strat}} = \sum_{h=1}^{H} \frac{N_h}{N} \, \bar{x}_h $$
where $H$ is the number of strata, $N_h$ is the population size of stratum $h$, and $\bar{x}_h$ is the sample mean inside that stratum. The weights matter: over-sampling a small stratum without adjusting the weights is a common mistake.
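A small sketch of the weighted estimate, together with a standard error built only from within-stratum spread. The standard-error formula used here, $\sqrt{\sum_h (N_h/N)^2 \, s_h^2 / n_h}$, is the usual textbook form with the finite-population correction ignored; that simplification, the dictionary keys, and all the numbers are assumptions for illustration.

```python
import math

def stratified_estimate(strata):
    """strata: list of dicts with keys N_h, n_h, mean, sd (one per stratum)."""
    N = sum(s["N_h"] for s in strata)
    # Weighted average of stratum means, weights N_h / N
    mean = sum(s["N_h"] / N * s["mean"] for s in strata)
    # Variance uses within-stratum SDs only (no finite-population correction)
    var = sum((s["N_h"] / N) ** 2 * s["sd"] ** 2 / s["n_h"] for s in strata)
    return mean, math.sqrt(var)

# Hypothetical strata
strata = [
    {"N_h": 6000, "n_h": 60, "mean": 50.0, "sd": 5.0},
    {"N_h": 3000, "n_h": 30, "mean": 60.0, "sd": 5.0},
    {"N_h": 1000, "n_h": 10, "mean": 80.0, "sd": 5.0},
]
mean, se = stratified_estimate(strata)
print(round(mean, 3), round(se, 3))   # about 56 and 0.5
```

Notice that the large differences between the stratum means (50, 60, 80) never enter the standard error; only the spread inside each stratum does.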
4. Cluster sampling
Cluster sampling partitions the population into groups too — but here you randomly pick whole groups and then measure everyone inside the chosen groups (or take an SRS within each).
The motivation is cost. If your population is "all U.S. households," sampling 1,000 households via SRS means 1,000 separate visits scattered across the country. Sampling 50 city blocks of about 20 households each and visiting every household on them is the same number of interviews but a tiny fraction of the travel.
| | Stratified | Cluster |
|---|---|---|
| What you partition into | Strata (chosen to be internally similar) | Clusters (typically internally varied) |
| What you sample | Some units from every stratum | All (or most) units from some clusters |
| Why | Precision and guaranteed subgroup coverage | Logistical convenience and cost |
| Effect on variance | Lower than SRS | Usually higher than SRS |
Beginners mix them up because both involve groups. The mnemonic: stratify by the thing that matters (so units within a group are similar) and cluster by convenience (so units within a group are as varied as the overall population). Stratification samples within every group; clustering samples across groups.
Cluster sampling is also vulnerable to a particular failure: if units within a cluster are more alike than randomly chosen units would be (kids in the same classroom, households on the same block), the effective sample size is smaller than $n$. Survey statisticians measure the resulting loss of precision with the design effect and inflate the standard errors accordingly.
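One common way to put a number on this is Kish's approximation for equal-sized clusters, $\text{deff} = 1 + (m - 1)\rho$, where $m$ is the cluster size and $\rho$ is the intraclass correlation. The formula is brought in here as an illustration, not taken from the text above, and the classroom numbers are assumed.

```python
def design_effect(cluster_size, icc):
    """Kish's approximation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Size of the SRS that would give about the same precision."""
    return n / design_effect(cluster_size, icc)

# Hypothetical survey: 40 classrooms of 25 pupils, assumed rho = 0.2
n = 40 * 25
deff = design_effect(25, 0.2)
n_eff = effective_sample_size(n, 25, 0.2)
print(round(deff, 2), round(n_eff))   # roughly 5.8 and 172
```

With these assumed values, 1,000 clustered pupils carry about as much information as 172 pupils drawn by SRS. When $\rho = 0$ the design effect is 1 and nothing is lost.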
5. Systematic sampling
Systematic sampling orders the population somehow, picks a random starting point, and then takes every $k$-th unit. To draw $n = 100$ from $N = 10{,}000$, set $k = 100$, pick a random integer in $[1, 100]$, and take that one plus every hundredth one after it.
Done well, this approximates an SRS and is easier to administer. It shines for assembly lines (every 50th widget off the belt) and exit polls (every 10th voter through the door).
Systematic sampling fails catastrophically when the spacing $k$ aligns with a hidden cycle in the data. Sample every 7th day in a hospital admissions log and you'll sample the same weekday every time (only Mondays, say); every 12th house on a block and you might sample only corner houses. Always check whether the ordering of your population could contain a period that matches, or evenly divides, your step size.
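Both the procedure and the failure are easy to reproduce. The sketch below is illustrative only: `systematic_sample` is a hypothetical helper, and the admissions log is a made-up list of weekday labels covering 52 weeks.

```python
import random

def systematic_sample(units, k, seed=None):
    """Take every k-th unit after a random start among the first k units."""
    start = random.Random(seed).randrange(k)   # random index in [0, k)
    return units[start::k]

# Hypothetical admissions log: 364 consecutive days labelled by weekday
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
log = [weekdays[d % 7] for d in range(364)]

print(set(systematic_sample(log, k=7, seed=1)))    # one weekday only
print(set(systematic_sample(log, k=10, seed=1)))   # all seven weekdays
```

A step of 7 locks onto the weekly cycle; a step of 10 shares no common factor with 7, so it rotates through every weekday.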
6. Convenience & voluntary response — the traps
The two designs you'll encounter most in the wild are also the two that almost never produce trustworthy results.
Convenience sampling (whoever is easiest to reach):
- The first 50 people who walk into the lobby
- Patients at the one clinic you have access to
- Students in the professor's intro class
Voluntary response (whoever chooses to answer):
- Online polls open to anyone who clicks
- Customer satisfaction cards left on the counter
- Call-in radio surveys
Both share the same flaw: who ends up in the sample is determined by something other than chance, and that "something" is correlated with the answer you care about. Convenience samples over-represent whoever is easy to reach; voluntary-response samples over-represent whoever cares enough — usually the people with strong opinions, especially negative ones.
You cannot fix a convenience or voluntary-response sample with a larger $n$. A million biased data points are still biased; you just compute the wrong answer more precisely. This is among the most expensive errors in applied statistics.
Sometimes the only data available is non-random — historical records, administrative data, observational studies. The honest move isn't to pretend the bias isn't there. It's to (a) describe the sampling mechanism explicitly, (b) reason about which directions it likely biases your conclusions, and (c) hedge your claims accordingly.
7. The four families of bias
"Bias" in statistics is technical: an estimator is biased when its expected value over many hypothetical samples differs from the parameter it's trying to estimate. The sources of bias have names because they come up over and over.
Selection bias
The sample is drawn in a way that systematically excludes or under-represents part of the population. The classic case: surveying voters by landline phone in 2024. That frame quietly excludes everyone without a landline, a group that skews younger and more urban than the average voter.
Non-response bias
You contacted a perfectly randomized sample, but the people who actually answered are not representative of those you contacted. If the response rate is 20% and non-responders differ from responders on the variable you care about, the sample you analyze is no longer random in any useful sense.
Response bias
Respondents systematically give inaccurate answers. They under-report drinking, over-report voting, give the answer the interviewer seems to want, or simply misremember. Question wording, the medium of the survey, and who is asking can all push answers in predictable directions.
Survivorship bias
You sample only the units that survived some filter, and forget that the filter exists. Studies of "successful startups" miss the much larger pool of failed ones. In WWII, the intuitive proposal was to add armor where returning bombers showed the most bullet holes; Abraham Wald pointed out that those planes had survived their hits, and that the planes hit elsewhere weren't returning at all.
Standard error shrinks like $1/\sqrt{n}$ (variance like $1/n$): a bigger sample is more precise. Bias does not shrink at all. A small, well-designed survey routinely beats a huge, badly designed one.
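A short simulation makes the contrast concrete. Everything here is assumed for illustration: a population split exactly 50/50, and a voluntary-response mechanism in which "yes" holders are twice as likely to respond as "no" holders (respondents are drawn with replacement for simplicity).

```python
import random

rng = random.Random(0)

# Hypothetical population: exactly half answer "yes" (1), half "no" (0)
population = [1] * 50_000 + [0] * 50_000
# Assumed response mechanism: "yes" holders respond twice as often
weights = [2 if x == 1 else 1 for x in population]

def srs_estimate(n):
    """Share of 'yes' in a simple random sample of size n."""
    return sum(rng.sample(population, n)) / n

def voluntary_estimate(n):
    """Share of 'yes' among n self-selected respondents."""
    return sum(rng.choices(population, weights=weights, k=n)) / n

for n in (500, 5_000, 50_000):
    print(n, round(voluntary_estimate(n), 3))
print("SRS of 500:", round(srs_estimate(500), 3))
```

Under these assumptions the voluntary-response estimate settles near 2/3 as $n$ grows, not near the true 0.5. More data tightens the estimate around the wrong number, while the SRS of 500 is noisier but centered on the truth.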
8. A note on sample size
How big a sample do you need? The detailed answer involves the precision you want, the variability of the population, and the confidence level you're targeting — all developed properly in the Inferential Statistics topic. For now, three intuitions to carry around:
- Precision scales with $\sqrt{n}$, not $n$. Cutting your margin of error in half requires quadrupling the sample size. The first few hundred observations buy you a lot; the next thousand buy you less.
- The population size barely matters once it's large. A poll of 1,000 from a population of 1 million is about as precise as one of 1,000 from 300 million. The standard-error formula involves $n$, not $N$, except via a finite-population correction that vanishes when $N \gg n$.
- Bias swamps sample size. A biased survey of 1,000,000 tells you less about the population than an honest random sample of 500. The arithmetic of standard errors only applies to samples that are actually random.
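The first two intuitions can be checked with the usual margin-of-error formula for a proportion. This is a sketch under stated assumptions: a simple random sample, the worst case $p = 0.5$, a 95% confidence level ($z = 1.96$), and the standard finite-population correction $\sqrt{(N - n)/(N - 1)}$. The function name is made up for the example.

```python
import math

def margin_of_error(n, N=None, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from an SRS."""
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))   # finite-population correction
    return z * se

print(round(margin_of_error(1_000), 4))                  # about 0.031
print(round(margin_of_error(4_000), 4))                  # about 0.0155
print(round(margin_of_error(1_000, N=1_000_000), 4))     # about 0.031
print(round(margin_of_error(1_000, N=300_000_000), 4))   # about 0.031
```

Quadrupling $n$ halves the margin, while multiplying $N$ by 300 leaves it essentially unchanged.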
9. Common pitfalls
The Greek letter is the unknown population parameter; the Roman letter is what you computed. They are different quantities and almost never coincide exactly, so an inference statement like "$\mu \approx \bar{x}$" only makes sense within a margin of error you can quantify. Use the symbols precisely and the rest of the chapter is much easier.
A sample of one million convenience-respondents is not better than a random sample of one thousand. Sample size buys precision, not honesty. Always ask about the mechanism first, the size second.
If you draw an SRS within every cluster and combine, that's stratified sampling, not cluster sampling. Cluster sampling means you randomly drop entire clusters from the sample. Mixing the two up changes how you compute standard errors.
If 80% of those you contacted didn't answer, the 20% you have is no longer a random sample — it's a self-selected subset. Reporting the result as if you sampled 1,000 when really 200 chose to respond hides the most important uncertainty in your study.
10. Worked examples
Four examples. Try to predict the failure mode before reading the solution; that's where the learning lives.
Example 1 · The Literary Digest fiasco of 1936
In 1936, Literary Digest magazine mailed straw-poll ballots to 10 million Americans to predict the U.S. presidential election. Roughly 2.4 million ballots came back. Based on this enormous sample, the magazine forecast Alf Landon would defeat Franklin Roosevelt in a landslide.
Roosevelt won in a landslide, carrying 46 of the 48 states with about 61% of the popular vote. The magazine folded shortly after.
What went wrong. Two compounding failures of sampling design:
- Selection bias. Ballots were mailed to people on lists of telephone subscribers, magazine subscribers, and registered automobile owners. In Depression-era America, this frame skewed dramatically wealthy — a population that broke for Landon. The frame was not the voting population.
- Non-response bias. Only about 24% returned the ballot. The kind of person who fills out and mails back an unsolicited political survey is not a random slice even of the people who received it.
The deeper lesson. $n = 2{,}400{,}000$ couldn't save the survey. A young pollster named George Gallup correctly predicted the election the same year with a sample of about 50,000, drawn to be representative of the actual electorate. A carefully drawn sample roughly fifty times smaller beat the huge one.
Example 2 · Stratified sampling for an opinion poll
You're polling 1,500 likely voters in a state that's 60% urban, 30% suburban, and 10% rural. Past elections show big differences between the three. You suspect a pure SRS of 1,500 will land roughly 150 rural respondents — enough overall, but a wide margin of error specifically for the rural estimate.
Stratified design. Sample 600 urban, 450 suburban, and 450 rural respondents. Relative to a proportional allocation (900 / 450 / 150), this under-samples the urban stratum, keeps the suburban one proportional, and over-samples the small rural stratum so its estimate has precision comparable to the others.
Combining the results. Compute $\bar{x}_h$ inside each stratum, then weight by the population proportions, not the sample proportions:
$$ \bar{x}_{\text{state}} = 0.60 \,\bar{x}_{\text{urban}} + 0.30 \,\bar{x}_{\text{sub}} + 0.10 \,\bar{x}_{\text{rural}} $$
If you forget the weights and just take $\bar{x} = (600\,\bar{x}_u + 450\,\bar{x}_s + 450\,\bar{x}_r) / 1500$, rural voters carry 30% of the weight instead of their 10% population share (a 3× over-count, or 4.5× relative to the urban stratum, which drops from 60% to 40%), and the estimate is biased toward whatever rural voters happen to think.
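To see how much the weights can matter, plug in some made-up numbers. The stratum sizes and population shares come from the example; the support levels (60% urban, 50% suburban, 30% rural) are purely hypothetical.

```python
# Population shares and sample sizes from the example
pop_share = {"urban": 0.60, "suburban": 0.30, "rural": 0.10}
n_sampled = {"urban": 600, "suburban": 450, "rural": 450}
# Hypothetical share supporting a candidate within each stratum
support = {"urban": 0.60, "suburban": 0.50, "rural": 0.30}

# Correct: weight each stratum by its share of the population
weighted = sum(pop_share[h] * support[h] for h in pop_share)

# Mistake: pool all respondents, which weights by sample size instead
naive = sum(n_sampled[h] * support[h] for h in n_sampled) / sum(n_sampled.values())

print(round(weighted, 3), round(naive, 3))   # about 0.54 versus 0.48
```

With these invented figures the pooled average turns a 54% majority into an apparent 48% minority, purely because rural respondents were over-sampled.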
Example 3 · Cluster vs. stratified for a school survey
A district has 200 schools. You want to estimate the mean math score for the district's 50,000 students. Two designs are on the table.
Cluster design. Randomly choose 20 schools (clusters), then test every student in those schools. Total students tested: ~5,000. Cheap, because you visit only 20 buildings.
Stratified design. Group schools by neighborhood income (strata), then SRS 25 students from every school. Total tested: 5,000. Expensive, because you visit every building.
Which is more precise? Students within a school tend to be similar (same teachers, same neighborhood). That makes a school a poor cluster: the 250 students from one school give you less information than 250 students from 250 different schools. So the stratified design has lower variance, often substantially. The cluster design has lower cost. The right choice depends on where your problem sits on the cost-versus-precision trade-off.
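A rough simulation shows the size of the gap. The district below is entirely synthetic: 200 schools of 250 students, with school-level effects and student-level noise both given a standard deviation of 10 around a score of 500 (assumed values). As a simplification, each school is treated as its own stratum, which matches "25 students from every school."

```python
import random
import statistics

rng = random.Random(0)

# Synthetic district: 200 schools x 250 students (all values are assumptions)
schools = []
for _ in range(200):
    school_effect = rng.gauss(0, 10)          # shared by everyone in a school
    schools.append([500 + school_effect + rng.gauss(0, 10) for _ in range(250)])

def cluster_estimate():
    """Pick 20 whole schools and average every student in them."""
    chosen = rng.sample(schools, 20)
    return statistics.fmean([x for s in chosen for x in s])

def stratified_estimate():
    """Draw 25 students from every school; equal sizes make it self-weighting."""
    return statistics.fmean([x for s in schools for x in rng.sample(s, 25)])

# Repeat each design 200 times and compare the spread of the estimates
cluster_sd = statistics.stdev([cluster_estimate() for _ in range(200)])
strat_sd = statistics.stdev([stratified_estimate() for _ in range(200)])
print(f"cluster SD: {cluster_sd:.2f}, stratified SD: {strat_sd:.2f}")
```

Both designs test 5,000 students, yet under these assumptions the cluster estimate varies far more from draw to draw, because each school contributes little independent information about the district mean.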
The takeaway. "Sampling design" is partly a precision question and partly a budget question. Cluster sampling buys budget by paying with precision; stratification buys precision by paying with logistics.
Example 4 · Spotting the bias
For each scenario, name the bias family and one concrete fix.
(a) A hospital studies its patients to estimate the survival rate for a disease. → Survivorship / selection bias. Patients who died before reaching the hospital aren't in the sample. Fix: define the population as "all diagnosed cases in the region" and use registry data, not in-hospital records.
(b) An online retailer emails a satisfaction survey; 4% respond. → Non-response bias. Respondents are typically the very satisfied and the very angry. Fix: shorter survey, follow-ups, and incentives to raise the response rate; report the rate alongside the results.
(c) An exit poll interviewer asks voters who they supported, in a country where that party is controversial. → Response bias. People shade their answers toward what's socially safe. Fix: anonymous paper ballot dropped into a box; or list-experiment techniques that disguise the sensitive question.
(d) A study of "famous CEOs" identifies common traits. → Survivorship bias. CEOs who failed aren't in the sample, so the "traits of success" might be common to everyone who tried. Fix: compare to a matched sample of failed founders, not just the survivors.