Topic · Statistics & Probability

Bayesian Statistics

You flip a coin ten times and get eight heads. Is the coin biased — or did a fair coin just have an unusual run? Classical statistics answers by imagining a long sequence of future flips of a coin whose true bias is fixed but unknown, and asking how surprising your data would be. Bayesian statistics flips that around: the data is what's fixed (you already flipped), and the coin's bias is what's uncertain — so you describe that uncertainty with a probability distribution, update it as more flips come in, and let Bayes' theorem handle the bookkeeping.

15 min read Prereqs: Conditional probability and Bayes · Distributions · Random variables Updated 2026·05·17

What you'll leave with

The philosophical shift: parameters as uncertain quantities, not fixed unknowns.
Bayes' theorem as a belief-update rule — and the four names: prior, likelihood, evidence, posterior.
The slogan posterior $\propto$ likelihood × prior, and what the missing constant does.
The Beta–Binomial conjugate pair: how a coin-flip posterior is just "add the data to the prior counts".
Why a 99% accurate test for a 1% disease still leaves you only 50% sure.
Credible intervals vs. confidence intervals — what each one actually claims.

1. Two views of a parameter

Suppose you flip a coin ten times and get eight heads. You want to know $\theta$, the coin's true probability of heads. There are two strikingly different ways to set up the question.

The frequentist says: $\theta$ is a fixed but unknown number. The data is the random thing — over many hypothetical repetitions of the experiment, you'd see different head counts. Inference is about procedures that have good long-run properties (low error rates, good coverage).

The Bayesian flips the script. The data, once observed, is fixed — it's the eight heads you actually saw. What's uncertain is $\theta$, and uncertainty about $\theta$ is described the only way uncertainty ever is: with a probability distribution. Before seeing data, you have a prior distribution over $\theta$. After seeing data, you have a posterior. Inference is one move, applied once: condition on what happened.

	Frequentist	Bayesian
Parameter $\theta$	Fixed, unknown number	Random variable with a distribution
Data	Random (one of many possible draws)	Fixed (the values you actually saw)
Probability of $\theta$ in $(a,b)$	Meaningless ($\theta$ isn't random)	A number you can compute from the posterior
Output of inference	Point estimate, CI, $p$-value	The full posterior distribution
Prior knowledge	Implicit (in model choice)	Explicit (in the prior distribution)

Two languages, one math

Both schools use the same probability axioms and often arrive at numerically similar answers, especially with a lot of data. The disagreement is about what probability means, and what counts as a sensible question to ask. "Is $\theta$ in this interval?" is a perfectly natural Bayesian question and a category error for the strict frequentist.

2. Bayes' theorem as a belief-update engine

Bayes' theorem itself — $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$, including the classic 99%-accurate-test-for-a-1%-disease puzzle that motivates it — is derived in Conditional Probability & Bayes. Here we use it as the engine for inference on parameters: instead of two events, we apply it to a parameter $\theta$ and observed data $D$.

$$ P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\,P(\theta)}{P(D)} $$

Four pieces, each with a name and a job:

Prior — $P(\theta)$

What you believed about $\theta$ before seeing the data. A full distribution over possible values, encoding everything you knew (or were willing to assume) up front.

Likelihood — $P(D \mid \theta)$

For each candidate value of $\theta$, how probable was the data you actually saw? This is read off the model — it's what makes a model a model. Note: as a function of $\theta$ with $D$ fixed, the likelihood is not a probability distribution over $\theta$.

Evidence (a.k.a. marginal likelihood) — $P(D)$

The probability of the data, averaged over all possible $\theta$: $P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta$. A single number that makes the right-hand side integrate to 1. Often the hardest piece to compute.

Posterior — $P(\theta \mid D)$

Your updated belief about $\theta$ after conditioning on $D$. This is the answer — everything else (point estimates, intervals, predictions) is summary statistics extracted from it.

Because the evidence $P(D)$ doesn't depend on $\theta$, it's just a normalising constant — it stretches or shrinks the right-hand side so the total area is 1, but it doesn't change the shape. So Bayesians constantly write the proportionality form:

$$ \underbrace{P(\theta \mid D)}_{\text{posterior}} \;\propto\; \underbrace{P(D \mid \theta)}_{\text{likelihood}} \;\times\; \underbrace{P(\theta)}_{\text{prior}} $$

If you can write down the prior and the likelihood, you know the posterior up to a constant — and very often that's enough. The constant gets recovered by demanding the result integrate to 1.
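As a worked instance, take the coin from the introduction with a flat prior on $\theta$ and eight heads in ten flips (a sketch that assumes the Binomial model for the flips). The unnormalised posterior is $\theta^{8}(1-\theta)^{2}$, and requiring unit area fixes the constant:

$$ \int_0^1 \theta^{8}(1-\theta)^{2}\,d\theta \;=\; \frac{8!\,2!}{11!} \;=\; \frac{1}{495} \qquad\Longrightarrow\qquad P(\theta \mid D) \;=\; 495\,\theta^{8}(1-\theta)^{2}. $$

The binomial coefficient $\binom{10}{8}$ in the likelihood never had to be tracked: it does not depend on $\theta$, so it cancels in the normalisation.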

The engine. Multiply prior and likelihood, divide by the evidence, get the posterior.
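To make the engine concrete, here is a minimal sketch of one update carried out on a discrete grid of candidate values for $\theta$, using the eight-heads-in-ten coin data. The flat prior, the Binomial likelihood, the grid size, and the function name `grid_posterior` are all illustrative assumptions, not part of any library.

```python
from math import comb


def grid_posterior(heads, flips, grid_size=1001):
    """Approximate the posterior over theta on an evenly spaced grid.

    Assumes a flat prior and a Binomial likelihood (illustrative choices).
    Returns (grid, posterior), where posterior sums to 1 over the grid.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 for _ in grid]  # flat prior weights
    likelihood = [
        comb(flips, heads) * t**heads * (1 - t) ** (flips - heads)
        for t in grid
    ]
    # Multiply prior and likelihood pointwise.
    unnormalised = [lik * pri for lik, pri in zip(likelihood, prior)]
    # The sum plays the role of the evidence P(D) on this grid.
    evidence = sum(unnormalised)
    posterior = [u / evidence for u in unnormalised]
    return grid, posterior


grid, posterior = grid_posterior(heads=8, flips=10)
posterior_mean = sum(t * w for t, w in zip(grid, posterior))
print(round(posterior_mean, 3))
```

The grid approach works for any prior you can evaluate pointwise, but it scales poorly beyond a few parameters, which is why closed-form and sampling methods matter.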

Sequential update

One nice property of this engine: yesterday's posterior is today's prior. Update on $D_1$ to get $P(\theta \mid D_1)$, then treat that as the prior, update on $D_2$, and you arrive at the same place you'd get by updating on $(D_1, D_2)$ at once. The math doesn't care when you saw the data, only what you saw.
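The equivalence of sequential and batch updating follows from a one-line factorisation, sketched here under the usual assumption that $D_1$ and $D_2$ are conditionally independent given $\theta$:

$$ P(\theta \mid D_1, D_2) \;\propto\; P(D_1, D_2 \mid \theta)\,P(\theta) \;=\; P(D_2 \mid \theta)\,\underbrace{P(D_1 \mid \theta)\,P(\theta)}_{\propto\; P(\theta \mid D_1)} $$

The braced term is the first posterior up to a constant, so using it as the prior for $D_2$ lands on the same distribution as the batch update.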

3. Conjugate priors: the Beta–Binomial

Computing posteriors over continuous parameters means computing integrals, and most integrals don't have nice closed forms. Conjugate priors are the rare, lucky cases where the prior and the posterior live in the same family, so the update is just a matter of bumping a few numbers.

Conjugate prior

For a given likelihood, a family of distributions such that if the prior is in the family, the posterior is too. The update reduces to arithmetic on the family's parameters.

The poster child is the Beta–Binomial pair. The Beta distribution lives on $[0, 1]$ and has two shape parameters $\alpha, \beta > 0$:

$$ p \sim \text{Beta}(\alpha, \beta), \qquad \pi(p) \;\propto\; p^{\alpha - 1}(1 - p)^{\beta - 1} $$

Suppose your data is $k$ successes in $n$ trials, modelled as Binomial$(n, p)$:

$$ P(k \mid p) \;\propto\; p^{k}(1 - p)^{n - k} $$

Multiply (posterior $\propto$ likelihood $\times$ prior):

$$ \pi(p \mid k) \;\propto\; p^{k}(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1} \;=\; p^{\alpha + k - 1}(1 - p)^{\beta + n - k - 1} $$

That's the kernel of another Beta distribution. The posterior is simply

$$ \boxed{\;\; p \mid k \;\sim\; \text{Beta}(\alpha + k,\; \beta + n - k) \;\;} $$

The update rule could not be more pleasant: add successes to $\alpha$, add failures to $\beta$. The hyperparameters of a Beta prior behave like "imaginary prior data" — $\alpha$ counts heads you'd seen, $\beta$ counts tails. Real data just adds to the tally.
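In code the whole update is two additions. The sketch below is illustrative only; the function names `beta_binomial_update` and `beta_mean` are made up for this example.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Return the Beta posterior parameters after observing the data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Uniform prior Beta(1, 1), then 8 heads and 2 tails.
post_alpha, post_beta = beta_binomial_update(1, 1, successes=8, failures=2)
print(post_alpha, post_beta)             # 9 3
print(beta_mean(post_alpha, post_beta))  # 0.75
```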

Likelihood	Conjugate prior	Posterior	What you're learning
Binomial$(n, p)$	Beta$(\alpha, \beta)$	Beta$(\alpha + k, \beta + n - k)$	A success probability
Poisson$(\lambda)$	Gamma$(\alpha, \beta)$	Gamma$(\alpha + \sum x_i, \beta + n)$	A rate
Normal$(\mu, \sigma^2)$, $\sigma^2$ known	Normal$(\mu_0, \sigma_0^2)$	Normal (precision-weighted)	A mean
Multinomial	Dirichlet	Dirichlet (add counts)	Category probabilities

Why "$\text{Beta}(1,1)$" is the uniform

Plug $\alpha = \beta = 1$ into $p^{\alpha-1}(1-p)^{\beta-1}$ and you get $1$ — flat. So $\text{Beta}(1,1)$ is the uniform distribution on $[0, 1]$: "I believe nothing in particular about $p$ beyond the fact that it's between $0$ and $1$." It's a natural starting prior for a parameter you have no prior information about.

4. Worked example: is this coin biased?

You suspect a coin is biased and want to learn its true heads-probability $\theta$. You start uniform: $\theta \sim \text{Beta}(1, 1)$. You flip the coin $10$ times and see $8$ heads, $2$ tails.

By the conjugate update rule, the posterior is

$$ \theta \mid \text{data} \;\sim\; \text{Beta}(1 + 8,\; 1 + 2) \;=\; \text{Beta}(9, 3). $$

The posterior mean of a $\text{Beta}(\alpha, \beta)$ is $\alpha / (\alpha + \beta)$, so

$$ E[\theta \mid \text{data}] \;=\; \frac{9}{9 + 3} \;=\; 0.75. $$

The maximum-likelihood estimate is $8/10 = 0.80$. The posterior mean is pulled a hair toward $0.5$ because the uniform prior contributes a touch of "ignorance ballast" — it's as if we'd seen one extra head and one extra tail before any real flips.

What's much more interesting than the point estimate is the shape of the posterior. The figure below shows three snapshots: the uniform prior, the posterior after $2$ heads in $3$ tosses, and the posterior after our $8$ heads in $10$. As data accumulates, the distribution sharpens around the true tendency of the coin.

Three snapshots of belief about $\theta$. Flat prior at top, broad bump after a few tosses, narrower peak near $0.8$ after ten.

Notice three things in that picture:

The peak moves with the data. With a flat prior the posterior peaks at the observed proportion: $2/3$ in step 1, $0.80$ in step 2 (the posterior mean, $0.75$, sits slightly closer to $0.5$).
The distribution sharpens. As observations accumulate the posterior narrows, because $\alpha + \beta$ grows and the variance of $\text{Beta}(\alpha, \beta)$ is $\alpha\beta / [(\alpha + \beta)^2(\alpha + \beta + 1)]$, which shrinks like $1/n$ for large $n$. (A single surprising observation can briefly widen it, but the long-run trend is down.)
The prior fades. With $n = 10$ the data already dominates a $\text{Beta}(1, 1)$ prior. A much stronger prior — say $\text{Beta}(50, 50)$, "I've seen 100 fair-ish flips" — would still leave a visible imprint at this sample size.

From the posterior $\text{Beta}(9, 3)$ you can read off whatever summary you like: the mean ($0.75$), the mode ($8/10 = 0.80$), or — coming up — a 95% credible interval.
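These summaries follow directly from the Beta formulas quoted above. A small sketch, with `beta_summaries` as a hypothetical helper name, and with the mode formula valid only when both parameters exceed 1:

```python
from math import sqrt


def beta_summaries(alpha, beta):
    """Mean, mode, and standard deviation of Beta(alpha, beta).

    The mode formula assumes alpha > 1 and beta > 1 (a single interior peak).
    """
    total = alpha + beta
    mean = alpha / total
    mode = (alpha - 1) / (total - 2)
    variance = alpha * beta / (total**2 * (total + 1))
    return {"mean": mean, "mode": mode, "sd": sqrt(variance)}


summary = beta_summaries(9, 3)
print({name: round(value, 3) for name, value in summary.items()})
# {'mean': 0.75, 'mode': 0.8, 'sd': 0.12}
```

A standard deviation of about $0.12$ after ten flips is a reminder of how much uncertainty the point estimates hide.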

5. Credible vs. confidence intervals

This is where the philosophy bites. Suppose for the coin we compute, from the $\text{Beta}(9, 3)$ posterior, the central interval that holds $95\%$ of the mass: roughly $(0.48,\ 0.94)$. What does that interval mean?

Bayesian — credible interval

Statement: $P(\theta \in (0.48,\ 0.94) \mid \text{data}) = 0.95$.
Read: given what we saw, the probability that the coin's true bias lies in this interval is $95\%$.
Random thing: $\theta$ (we're uncertain about it).
Fixed thing: the interval (it's a number we computed once).

Frequentist — confidence interval

Statement: the procedure that produced this interval, applied to many hypothetical repeats of the experiment, traps the true $\theta$ in $95\%$ of cases.
Does not say: "there's a 95% chance $\theta$ is in this particular interval".
Random thing: the interval (it would have come out differently with different data).
Fixed thing: $\theta$ (it doesn't have a distribution).

For a beginner, the difference can feel like hair-splitting. It isn't. Frequentist confidence intervals make a statement about the long-run behaviour of a procedure; credible intervals make a statement about your current belief about $\theta$. The second is what most users think they're getting from the first, and what they actually get only in a Bayesian framework.

The common misreading

You will see, in popular writing and even in some textbooks, "there's a 95% chance the parameter lies in the CI." Strictly speaking, that statement is incoherent under the frequentist interpretation — $\theta$ either is or isn't in the interval; it has no probability. The Bayesian statement is the one that means what the writer intended.

6. MAP, posterior mean, and the wider family

The posterior is the full answer. But you often want a single number — for plotting, for downstream use, for comparison. Three standard summaries:

Posterior mean: $E[\theta \mid D] = \int \theta\,\pi(\theta \mid D)\,d\theta$. Minimises expected squared error under the posterior.
Posterior median: the value with equal posterior mass on each side. Minimises expected absolute error and is less sensitive to skew than the mean.
Maximum a posteriori (MAP): $\hat\theta_{\text{MAP}} = \arg\max_\theta \pi(\theta \mid D)$. The peak of the posterior.

The MAP is the Bayesian cousin of the maximum likelihood estimate (MLE), with the prior thrown in:

$$ \hat\theta_{\text{MAP}} \;=\; \arg\max_\theta\;\bigl[\log P(D \mid \theta) + \log P(\theta)\bigr]. $$

If the prior is uniform, MAP equals MLE. If the prior is informative, MAP pulls the estimate toward the prior's preferred region — which is exactly what we want when data is scarce.
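For the coin model this can be worked out in closed form. A sketch, assuming a $\text{Beta}(\alpha, \beta)$ prior, $k$ heads in $n$ flips, and $\alpha + k > 1$, $\beta + n - k > 1$ so that the posterior has an interior peak: setting the derivative of the log-posterior to zero gives

$$ \frac{d}{d\theta}\Bigl[(\alpha + k - 1)\log\theta + (\beta + n - k - 1)\log(1 - \theta)\Bigr] = 0 \;\;\Longrightarrow\;\; \hat\theta_{\text{MAP}} \;=\; \frac{\alpha + k - 1}{\alpha + \beta + n - 2}. $$

With $\alpha = \beta = 1$ this reduces to $k/n$, the MLE. With a $\text{Beta}(20, 20)$ prior and $8$ heads in $10$ flips it gives $27/48 \approx 0.56$, well short of the MLE of $0.80$.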

A point estimate isn't the answer

MAP and posterior mean throw away nearly everything the posterior knows. They're useful, but they're a tiny summary statistic. When data is scarce, when the posterior is skewed, or when the downstream decision is sensitive to uncertainty, hand around the whole posterior — or at least a credible interval — instead of a single number.

Beyond Beta–Binomial

The same trick — likelihood from a particular family, prior that "matches", posterior in the same family with parameters incremented by data — runs across the standard models:

Poisson rate: $\lambda \sim \text{Gamma}(\alpha, \beta)$, observations sum to $s$ over $n$ intervals, posterior $\text{Gamma}(\alpha + s, \beta + n)$.
Normal mean (known variance): $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$, sample mean $\bar x$ from $n$ points, posterior mean is a precision-weighted average of $\mu_0$ and $\bar x$.
Multinomial counts: probabilities $\sim \text{Dirichlet}(\alpha_1, \dots, \alpha_K)$, posterior just adds observed category counts.
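The Normal case is the only one of the three whose update is not a plain "add the counts". Written out, with $\sigma^2$ the known observation variance and $\mu_n$, $\sigma_n^2$ as labels introduced here for the posterior mean and variance:

$$ \frac{1}{\sigma_n^2} \;=\; \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n \;=\; \sigma_n^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\,\bar x}{\sigma^2}\right). $$

Precisions (inverse variances) add, and the posterior mean weights $\mu_0$ and $\bar x$ by their respective precisions, so the data term takes over as $n$ grows.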

When no conjugate prior exists (the common case in real problems), the integral $\int L \cdot \pi$ has no closed form and Bayesians reach for Markov Chain Monte Carlo — Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo — to draw samples from the posterior. Tools like Stan, PyMC, and NumPyro do this for you. But conjugate updates remain the cleanest teaching example, and the entry point to every Bayesian's intuition.
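To give a feel for what such samplers do, here is a minimal random-walk Metropolis sketch aimed at the coin posterior (flat prior, 8 heads, 2 tails), where the exact answer is known and can serve as a check. It is a teaching toy under stated assumptions, not how Stan, PyMC, or NumPyro are implemented; the function names, step size, and burn-in length are arbitrary illustrative choices.

```python
import math
import random


def log_unnormalised_posterior(theta, heads=8, tails=2):
    """Log of likelihood x flat prior for the coin; -inf outside (0, 1)."""
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(log_target, start, steps, step_size, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected step repeats the current value
    return samples


samples = metropolis(log_unnormalised_posterior, start=0.5,
                     steps=50_000, step_size=0.2)
kept = samples[5_000:]  # discard an initial burn-in stretch
mean_estimate = sum(kept) / len(kept)
print(mean_estimate)  # should land close to the exact posterior mean, 0.75
```

Only the unnormalised posterior is ever evaluated: the evidence cancels in the acceptance ratio, which is what makes the method usable when the integral is intractable.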

7. Common pitfalls

Ignoring the prior

Skipping the prior — or quietly using one and not saying so — hides a load-bearing assumption. The disease-test puzzle is the canonical warning: a likelihood, no matter how strong, gets reweighted by the base rate. State your prior out loud.

Choosing a prior to get the result you want

If your "weakly informative" prior just happens to land on the answer you expected, ask whether the data is doing any work. A reasonable practice is to repeat the analysis under a few priors and report sensitivity.
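A sensitivity check can be very short. The sketch below reruns the 8-heads-in-10 coin analysis under three Beta priors; the particular priors and their labels are illustrative choices.

```python
def posterior_mean(alpha, beta, heads, tails):
    """Posterior mean of theta under a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + heads + tails)


priors = {
    "uniform Beta(1, 1)": (1, 1),
    "mild Beta(2, 2)": (2, 2),
    "skeptical Beta(20, 20)": (20, 20),
}

results = {
    label: round(posterior_mean(a, b, heads=8, tails=2), 3)
    for label, (a, b) in priors.items()
}
for label, mean in results.items():
    print(f"{label}: {mean}")
# uniform Beta(1, 1): 0.75
# mild Beta(2, 2): 0.714
# skeptical Beta(20, 20): 0.56
```

If the conclusion you care about survives across the plausible priors, the data is doing the work; if it flips, that is worth reporting.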

Credible $\neq$ confidence

They are not interchangeable. A $95\%$ credible interval claims a $95\%$ posterior probability that the parameter sits inside. A $95\%$ confidence interval claims that the method covers the true parameter $95\%$ of the time across hypothetical repeats. Different things.

Likelihood is not a distribution over $\theta$

As a function of $\theta$, $P(D \mid \theta)$ doesn't integrate to one — it isn't a probability density over $\theta$ at all. Calling its peak "the most likely value of $\theta$" is sloppy: it's the value under which the data is most likely. The two statements coincide only when the prior is uniform.

Improper priors and proper posteriors

"Flat" priors like uniform-on-the-real-line aren't valid probability distributions — they don't integrate to a finite number. They sometimes yield perfectly fine posteriors anyway, but sometimes don't. Don't assume your improper prior is harmless without checking.

8. Worked examples

Each problem is followed by its solution. Try the update yourself before reading it; the goal is to feel how mechanical conjugate updates become.

Example 1 · A skeptical prior on the coin

Re-do the coin problem with a strong "this coin is probably fair" prior: $\text{Beta}(20, 20)$. You see $8$ heads in $10$ tosses. What's the posterior?

Solution. Add successes to $\alpha$, failures to $\beta$:

$$ \theta \mid \text{data} \sim \text{Beta}(20 + 8,\; 20 + 2) = \text{Beta}(28, 22). $$

Posterior mean $= 28/50 = 0.56$. Even though the raw data shows $80\%$ heads, the strong prior pulls the estimate substantially toward $0.5$ — because $\text{Beta}(20, 20)$ behaves like having already seen $40$ fairly balanced flips.

Example 2 · Disease test with a sicker patient

Take the test from the disease puzzle ($99\%$ sensitivity, $99\%$ specificity), but now the patient already has symptoms, raising their personal prior to $P(D) = 0.30$, where $D$ here stands for "has the disease" rather than the data. They test positive. What's $P(D \mid +)$?

Solution.

$$ P(+) = (0.99)(0.30) + (0.01)(0.70) = 0.297 + 0.007 = 0.304 $$

$$ P(D \mid +) = \frac{(0.99)(0.30)}{0.304} \approx 0.977. $$

With a $30\%$ prior, a positive test pushes the posterior to $97.7\%$ — the same test result, very different conclusion, because the prior was very different. Priors aren't decoration.
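The dependence on the prior is easy to see by wrapping the calculation in a function and varying the base rate. A minimal sketch; `posterior_given_positive` is a hypothetical name, and the defaults mirror the $99\%$ sensitivity and specificity used here.

```python
def posterior_given_positive(prior, sensitivity=0.99, specificity=0.99):
    """P(disease | positive test) by Bayes' theorem."""
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


low_prior = posterior_given_positive(0.01)
high_prior = posterior_given_positive(0.30)
print(round(low_prior, 3))   # 0.5
print(round(high_prior, 3))  # 0.977
```

The $1\%$ base rate reproduces the "only 50% sure" result of the classic puzzle; the $30\%$ prior reproduces the answer above.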

Example 3 · Conjugate update for a Poisson rate

A call centre receives calls at rate $\lambda$ per hour. Prior: $\lambda \sim \text{Gamma}(2, 1)$ (mean $2$). Over $5$ hours you observe $14$ calls. Posterior?

Solution. The Gamma–Poisson update rule: $\text{Gamma}(\alpha + s,\; \beta + n)$ where $s$ is the total count and $n$ is the number of intervals.

$$ \lambda \mid \text{data} \sim \text{Gamma}(2 + 14,\; 1 + 5) = \text{Gamma}(16, 6). $$

Posterior mean $= 16/6 \approx 2.67$ calls/hour. The raw rate is $14/5 = 2.80$; the prior pulls it slightly toward $2$.
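The "pull toward the prior" can be made explicit: the posterior mean is a weighted average of the prior mean and the observed rate. The sketch below checks this numerically, assuming the shape-rate parametrisation of the Gamma (mean $\alpha/\beta$) used throughout; the function and variable names are illustrative.

```python
def gamma_poisson_update(alpha, beta, total_count, n_intervals):
    """Posterior (shape, rate) for a Poisson rate under a Gamma prior."""
    return alpha + total_count, beta + n_intervals


prior_alpha, prior_beta = 2, 1
total_calls, hours = 14, 5

post_alpha, post_beta = gamma_poisson_update(
    prior_alpha, prior_beta, total_calls, hours
)
posterior_mean = post_alpha / post_beta

# Same number, written as a blend of prior mean and raw rate.
prior_weight = prior_beta / (prior_beta + hours)  # 1/6 here
blended = (prior_weight * (prior_alpha / prior_beta)
           + (1 - prior_weight) * (total_calls / hours))

print(post_alpha, post_beta)     # 16 6
print(round(posterior_mean, 2))  # 2.67
print(round(blended, 2))         # 2.67
```

The prior carries weight $1/6$ because its rate parameter $\beta = 1$ acts like one hour of earlier observation set against five hours of data.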

Example 4 · Sequential update vs. batch update

Start with $\text{Beta}(1,1)$. First observe $3$ heads in $5$ flips. Then observe another $5$ heads in $5$ flips. Compute the posterior two ways: (a) update once on the full $(8, 2)$ data; (b) update on $(3, 2)$, then update again on $(5, 0)$.

Solution.

Batch: $\text{Beta}(1 + 8,\; 1 + 2) = \text{Beta}(9, 3)$.

Sequential, step 1: $\text{Beta}(1 + 3,\; 1 + 2) = \text{Beta}(4, 3)$. Step 2: $\text{Beta}(4 + 5,\; 3 + 0) = \text{Beta}(9, 3)$. Same answer. The order data arrives in doesn't matter.

Example 5 · Reading off a credible interval

The posterior is $\text{Beta}(9, 3)$. Estimate (without a calculator) where the central $95\%$ of the mass lies.

Solution sketch. The mode is at $(9-1)/(9+3-2) = 0.8$, the mean at $9/12 = 0.75$, and the distribution is left-skewed with most mass between $0.45$ and $0.95$. A numerical computation gives the $2.5\%$ and $97.5\%$ quantiles as roughly $0.48$ and $0.94$.

The Bayesian statement: given the data and the uniform prior, there is a $95\%$ probability that the coin's true heads-probability lies between $0.48$ and $0.94$. Note how wide that interval is — $10$ flips just isn't enough data to be precise.
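The "numerical computation" can be reproduced with the standard library alone. The sketch below relies on a known identity for integer parameters, $P(\text{Beta}(a, b) \le x) = P(\text{Binomial}(a + b - 1, x) \ge a)$, and inverts the CDF by bisection. The function names are hypothetical; in practice a library routine such as SciPy's `scipy.stats.beta.ppf` would be the usual choice.

```python
from math import comb


def beta_cdf_integer(x, alpha, beta):
    """CDF of Beta(alpha, beta) at x, for positive integer alpha and beta.

    Uses P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = alpha + beta - 1
    return sum(
        comb(n, j) * x**j * (1 - x) ** (n - j)
        for j in range(alpha, n + 1)
    )


def beta_quantile_integer(q, alpha, beta, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_integer(mid, alpha, beta) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


lower = beta_quantile_integer(0.025, 9, 3)
upper = beta_quantile_integer(0.975, 9, 3)
print(round(lower, 2), round(upper, 2))  # 0.48 0.94
```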
