Bayesian Statistics
You flip a coin ten times and get eight heads. Is the coin biased — or did a fair coin just have an unusual run? Classical statistics answers by imagining a long sequence of future flips of a coin whose true bias is fixed but unknown, and asking how surprising your data would be. Bayesian statistics flips that around: the data is what's fixed (you already flipped), and the coin's bias is what's uncertain — so you describe that uncertainty with a probability distribution, update it as more flips come in, and let Bayes' theorem handle the bookkeeping.
Prereqs: Conditional probability and Bayes · Distributions · Random variables
Updated 2026·05·17
1. Two views of a parameter
Suppose you flip a coin ten times and get eight heads. You want to know $\theta$, the coin's true probability of heads. There are two strikingly different ways to set up the question.
The frequentist says: $\theta$ is a fixed but unknown number. The data is the random thing — over many hypothetical repetitions of the experiment, you'd see different head counts. Inference is about procedures that have good long-run properties (low error rates, good coverage).
The Bayesian flips the script. The data, once observed, is fixed — it's the eight heads you actually saw. What's uncertain is $\theta$, and uncertainty about $\theta$ is described the only way uncertainty ever is: with a probability distribution. Before seeing data, you have a prior distribution over $\theta$. After seeing data, you have a posterior. Inference is one move, applied once: condition on what happened.
| | Frequentist | Bayesian |
| Parameter $\theta$ | Fixed, unknown number | Random variable with a distribution |
| Data | Random (one of many possible draws) | Fixed (the values you actually saw) |
| Probability of $\theta$ in $(a,b)$ | Meaningless ($\theta$ isn't random) | A number you can compute from the posterior |
| Output of inference | Point estimate, CI, $p$-value | The full posterior distribution |
| Prior knowledge | Implicit (in model choice) | Explicit (in the prior distribution) |
Two languages, one math
Both schools use the same probability axioms and often arrive at numerically similar answers, especially with a lot of data. The disagreement is about what probability means, and what counts as a sensible question to ask. "Is $\theta$ in this interval?" is a perfectly natural Bayesian question and a category error for the strict frequentist.
2. Bayes' theorem as a belief-update engine
Bayes' theorem itself — $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$, including the classic 99%-accurate-test-for-a-1%-disease puzzle that motivates it — is derived in Conditional Probability & Bayes. Here we use it as the engine for inference on parameters: instead of two events, we apply it to a parameter $\theta$ and observed data $D$.
$$ P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\,P(\theta)}{P(D)} $$
Four pieces, each with a name and a job:
Prior — $P(\theta)$
What you believed about $\theta$ before seeing the data. A full distribution over possible values, encoding everything you knew (or were willing to assume) up front.
Likelihood — $P(D \mid \theta)$
For each candidate value of $\theta$, how probable was the data you actually saw? This is read off the model — it's what makes a model a model. Note: as a function of $\theta$ with $D$ fixed, the likelihood is not a probability distribution over $\theta$.
Evidence (a.k.a. marginal likelihood) — $P(D)$
The probability of the data, averaged over all possible $\theta$: $P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta$. A single number that makes the right-hand side integrate to 1. Often the hardest piece to compute.
Posterior — $P(\theta \mid D)$
Your updated belief about $\theta$ after conditioning on $D$. This is the answer — everything else (point estimates, intervals, predictions) is summary statistics extracted from it.
Because the evidence $P(D)$ doesn't depend on $\theta$, it's just a normalising constant — it stretches or shrinks the right-hand side so the total area is 1, but it doesn't change the shape. So Bayesians constantly write the proportionality form:
$$ \underbrace{P(\theta \mid D)}_{\text{posterior}} \;\propto\; \underbrace{P(D \mid \theta)}_{\text{likelihood}} \;\times\; \underbrace{P(\theta)}_{\text{prior}} $$
If you can write down the prior and the likelihood, you know the posterior up to a constant — and very often that's enough. The constant gets recovered by demanding the result integrate to 1.
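To make the proportionality form concrete, here is a minimal sketch that evaluates likelihood times prior on a grid of $\theta$ values and recovers the constant numerically. The function name `grid_posterior`, the grid size, and the coin data (8 heads in 10 flips with a uniform prior) are illustrative assumptions, not part of the derivation above.

```python
from math import comb


def grid_posterior(k, n, prior, grid_size=10_001):
    """Approximate the posterior over theta on an evenly spaced grid.

    k, n  : observed successes and trials (binomial likelihood)
    prior : function theta -> prior density on [0, 1]
    Returns (thetas, posterior_density, evidence).
    """
    step = 1.0 / (grid_size - 1)
    thetas = [i / (grid_size - 1) for i in range(grid_size)]

    # Unnormalised posterior: likelihood times prior at each grid point.
    unnorm = [comb(n, k) * t**k * (1 - t) ** (n - k) * prior(t) for t in thetas]

    # Trapezoid rule approximates the evidence, the integral of likelihood * prior.
    evidence = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))

    posterior = [u / evidence for u in unnorm]
    return thetas, posterior, evidence


def uniform_prior(theta):
    return 1.0


thetas, posterior, evidence = grid_posterior(k=8, n=10, prior=uniform_prior)

# Posterior mean, computed with the same trapezoid rule.
step = thetas[1] - thetas[0]
weighted = [t * p for t, p in zip(thetas, posterior)]
posterior_mean = step * (sum(weighted) - 0.5 * (weighted[0] + weighted[-1]))

print(f"evidence P(D)  ~ {evidence:.5f}")
print(f"posterior mean ~ {posterior_mean:.4f}")
```

With a uniform prior and a binomial likelihood the evidence has the closed form $1/(n+1)$, so the printed value should sit close to $1/11$. Grid evaluation like this is only practical for one or two parameters, which is one reason closed-form and sampling methods matter.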
Sequential update
One nice property of this engine: yesterday's posterior is today's prior. Update on $D_1$ to get $P(\theta \mid D_1)$, then treat that as the prior, update on $D_2$, and you arrive at the same place you'd get by updating on $(D_1, D_2)$ at once, provided the two batches are independent given $\theta$. The math doesn't care when you saw the data.
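The sequential property can be checked exactly on a toy problem. The sketch below assumes a hypothetical coin that can only have one of three biases, and uses exact fractions so that the comparison between batch and sequential updating is not blurred by rounding. The three-point prior and the function name `update` are illustrative choices.

```python
from fractions import Fraction


def update(prior, heads, tails):
    """One Bayes update over a finite set of candidate theta values.

    prior : dict mapping theta -> prior probability
    Returns a dict mapping theta -> posterior probability.
    The binomial coefficient is omitted because it cancels on normalising.
    """
    unnorm = {
        theta: p * theta**heads * (1 - theta) ** tails
        for theta, p in prior.items()
    }
    evidence = sum(unnorm.values())
    return {theta: u / evidence for theta, u in unnorm.items()}


# Hypothetical three-point prior: the coin is one of three types.
prior = {
    Fraction(1, 4): Fraction(1, 3),
    Fraction(1, 2): Fraction(1, 3),
    Fraction(3, 4): Fraction(1, 3),
}

batch = update(prior, heads=8, tails=2)
sequential = update(update(prior, heads=3, tails=2), heads=5, tails=0)

print("batch == sequential:", batch == sequential)
for theta, p in batch.items():
    print(f"theta = {theta}: {float(p):.4f}")
```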
3. Conjugate priors: the Beta–Binomial
Computing posteriors over continuous parameters means computing integrals, and most integrals don't have nice closed forms. Conjugate priors are the rare, lucky cases where the prior and the posterior live in the same family, so the update is just a matter of bumping a few numbers.
Conjugate prior
For a given likelihood, a family of distributions such that if the prior is in the family, the posterior is too. The update reduces to arithmetic on the family's parameters.
The poster child is the Beta–Binomial pair. The Beta distribution lives on $[0, 1]$ and has two shape parameters $\alpha, \beta > 0$:
$$ p \sim \text{Beta}(\alpha, \beta), \qquad
\pi(p) \;\propto\; p^{\alpha - 1}(1 - p)^{\beta - 1} $$
Suppose your data is $k$ successes in $n$ trials, modelled as Binomial$(n, p)$:
$$ P(k \mid p) \;\propto\; p^{k}(1 - p)^{n - k} $$
Multiply (posterior $\propto$ likelihood $\times$ prior):
$$ \pi(p \mid k) \;\propto\; p^{k}(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1}
\;=\; p^{\alpha + k - 1}(1 - p)^{\beta + n - k - 1} $$
That's the kernel of another Beta distribution. The posterior is simply
$$ \boxed{\;\; p \mid k \;\sim\; \text{Beta}(\alpha + k,\; \beta + n - k) \;\;} $$
The update rule could not be more pleasant: add successes to $\alpha$, add failures to $\beta$. The hyperparameters of a Beta prior behave like "imaginary prior data" — $\alpha$ counts heads you'd seen, $\beta$ counts tails. Real data just adds to the tally.
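As a minimal sketch of how little work the conjugate update involves, the rule can be written as a two-line method. The class name `Beta` and its fields are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    alpha: float
    beta: float

    def update(self, successes, failures):
        """Conjugate update: add successes to alpha, failures to beta."""
        return Beta(self.alpha + successes, self.beta + failures)

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)


prior = Beta(1, 1)  # uniform prior
posterior = prior.update(successes=8, failures=2)
print(posterior, posterior.mean)
```

No integration appears anywhere: the normalising constant is implied by the family, so tracking two numbers is enough.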
| Likelihood | Conjugate prior | Posterior | What you're learning |
| Binomial$(n, p)$ | Beta$(\alpha, \beta)$ | Beta$(\alpha + k, \beta + n - k)$ | A success probability |
| Poisson$(\lambda)$ | Gamma$(\alpha, \beta)$ | Gamma$(\alpha + \sum x_i, \beta + n)$ | A rate |
| Normal$(\mu, \sigma^2)$, $\sigma^2$ known | Normal$(\mu_0, \sigma_0^2)$ | Normal (precision-weighted) | A mean |
| Multinomial | Dirichlet | Dirichlet (add counts) | Category probabilities |
Why "$\text{Beta}(1,1)$" is the uniform
Plug $\alpha = \beta = 1$ into $p^{\alpha-1}(1-p)^{\beta-1}$ and you get $1$ — flat. So $\text{Beta}(1,1)$ is the uniform distribution on $[0, 1]$: "I believe nothing in particular about $p$ beyond the fact that it's between $0$ and $1$." It's a natural starting prior for a parameter you have no prior information about.
4. Worked example: is this coin biased?
You suspect a coin is biased and want to learn its true heads-probability $\theta$. You start uniform: $\theta \sim \text{Beta}(1, 1)$. You flip the coin $10$ times and see $8$ heads, $2$ tails.
By the conjugate update rule, the posterior is
$$ \theta \mid \text{data} \;\sim\; \text{Beta}(1 + 8,\; 1 + 2) \;=\; \text{Beta}(9, 3). $$
The posterior mean of a $\text{Beta}(\alpha, \beta)$ is $\alpha / (\alpha + \beta)$, so
$$ E[\theta \mid \text{data}] \;=\; \frac{9}{9 + 3} \;=\; 0.75. $$
The maximum-likelihood estimate is $8/10 = 0.80$. The posterior mean is pulled a hair toward $0.5$ because the uniform prior contributes a touch of "ignorance ballast" — it's as if we'd seen one extra head and one extra tail before any real flips.
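The pull toward $0.5$ can be made explicit. As a short worked identity (the weight symbol $w$ is introduced here only for illustration), the posterior mean under a $\text{Beta}(\alpha, \beta)$ prior with $k$ successes in $n$ trials is a weighted average of the prior mean and the maximum-likelihood estimate:

$$ E[\theta \mid \text{data}] \;=\; \frac{\alpha + k}{\alpha + \beta + n} \;=\; w\,\frac{\alpha}{\alpha + \beta} \;+\; (1 - w)\,\frac{k}{n}, \qquad w \;=\; \frac{\alpha + \beta}{\alpha + \beta + n}. $$

For the coin, $\alpha = \beta = 1$ and $n = 10$ give $w = 2/12 = 1/6$, so

$$ E[\theta \mid \text{data}] \;=\; \tfrac{1}{6}(0.5) + \tfrac{5}{6}(0.8) \;=\; 0.75. $$

The weight on the prior shrinks as $n$ grows, which is the sense in which data eventually dominates.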
What's much more interesting than the point estimate is the shape of the posterior. The figure below shows three snapshots: the uniform prior, the posterior after $2$ heads in $3$ tosses, and the posterior after our $8$ heads in $10$. As data accumulates, the distribution sharpens around the true tendency of the coin.
Notice three things in that picture:
- The peak moves with the data. With a uniform prior the posterior peaks at the observed proportion: $2/3$ in step 1, $0.80$ in step 2.
- The distribution sharpens. As observations accumulate the posterior narrows, because $\alpha + \beta$ grows and the variance of $\text{Beta}(\alpha, \beta)$ is $\alpha\beta / [(\alpha + \beta)^2(\alpha + \beta + 1)]$, which shrinks like $1/n$ for large $n$. A single surprising flip can widen it slightly, but the trend is toward concentration.
- The prior fades. With $n = 10$ the data already dominates a $\text{Beta}(1, 1)$ prior. A much stronger prior — say $\text{Beta}(50, 50)$, "I've seen 100 fair-ish flips" — would still leave a visible imprint at this sample size.
From the posterior $\text{Beta}(9, 3)$ you can read off whatever summary you like: the mean ($0.75$), the mode ($8/10 = 0.80$), or — coming up — a 95% credible interval.
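The three snapshots discussed above can also be summarised numerically. The sketch below uses the standard closed-form mean, mode, and variance of a Beta distribution; the function name `beta_summary` is a hypothetical helper for this illustration.

```python
from math import sqrt


def beta_summary(alpha, beta):
    """Mean, mode, and standard deviation of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    # The mode formula holds only when both parameters exceed 1;
    # Beta(1, 1) is flat and has no single peak.
    mode = (alpha - 1) / (total - 2) if alpha > 1 and beta > 1 else None
    sd = sqrt(alpha * beta / (total**2 * (total + 1)))
    return mean, mode, sd


snapshots = {
    "prior, Beta(1, 1)": (1, 1),
    "after 2 heads in 3, Beta(3, 2)": (3, 2),
    "after 8 heads in 10, Beta(9, 3)": (9, 3),
}

summaries = {label: beta_summary(a, b) for label, (a, b) in snapshots.items()}
for label, (mean, mode, sd) in summaries.items():
    mode_text = "flat" if mode is None else f"{mode:.3f}"
    print(f"{label:32s} mean={mean:.3f} mode={mode_text} sd={sd:.3f}")
```

The standard deviation column is the quickest way to see the sharpening: it falls at each step as the counts grow.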
5. Credible vs. confidence intervals
This is where the philosophy bites. Suppose for the coin we compute, from the $\text{Beta}(9, 3)$ posterior, the central interval that holds $95\%$ of the mass: roughly $(0.48,\ 0.94)$. What does that interval mean?
Bayesian — credible interval
- Statement: $P(\theta \in (0.48,\ 0.94) \mid \text{data}) = 0.95$.
- Read: given what we saw, the probability that the coin's true bias lies in this interval is $95\%$.
- Random thing: $\theta$ (we're uncertain about it).
- Fixed thing: the interval (it's a number we computed once).
Frequentist — confidence interval
- Statement: the procedure that produced this interval, applied to many hypothetical repeats of the experiment, traps the true $\theta$ in $95\%$ of cases.
- Does not say: "there's a 95% chance $\theta$ is in this particular interval".
- Random thing: the interval (it would have come out differently with different data).
- Fixed thing: $\theta$ (it doesn't have a distribution).
For a beginner, the difference can feel like hair-splitting. It isn't. Frequentist confidence intervals make a statement about the long-run behaviour of a procedure; credible intervals make a statement about your current belief about $\theta$. The second is what most users think they're getting from the first, and what they actually get only in a Bayesian framework.
The common misreading
You will see, in popular writing and even in some textbooks, "there's a 95% chance the parameter lies in the CI." Strictly speaking, that statement is incoherent under the frequentist interpretation — $\theta$ either is or isn't in the interval; it has no probability. The Bayesian statement is the one that means what the writer intended.
6. MAP, posterior mean, and the wider family
The posterior is the full answer. But you often want a single number — for plotting, for downstream use, for comparison. Three standard summaries:
- Posterior mean: $E[\theta \mid D] = \int \theta\,\pi(\theta \mid D)\,d\theta$. Minimises squared error.
- Posterior median: the value with equal posterior mass on each side. Minimises absolute error and is robust to skew.
- Maximum a posteriori (MAP): $\hat\theta_{\text{MAP}} = \arg\max_\theta \pi(\theta \mid D)$. The peak of the posterior.
The MAP is the Bayesian cousin of the maximum likelihood estimate (MLE), with the prior thrown in:
$$ \hat\theta_{\text{MAP}} \;=\; \arg\max_\theta\;\bigl[\log P(D \mid \theta) + \log P(\theta)\bigr]. $$
If the prior is uniform, MAP equals MLE. If the prior is informative, MAP pulls the estimate toward the prior's preferred region — which is exactly what we want when data is scarce.
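For the Beta–Binomial model the MAP has a closed form, which makes both claims easy to check. As a brief worked derivation (valid when both posterior parameters exceed $1$), set the derivative of the log posterior to zero:

$$ \frac{d}{d\theta}\Bigl[(\alpha + k - 1)\log\theta + (\beta + n - k - 1)\log(1 - \theta)\Bigr] = 0 \quad\Longrightarrow\quad \hat\theta_{\text{MAP}} \;=\; \frac{\alpha + k - 1}{\alpha + \beta + n - 2}. $$

With the uniform prior $\alpha = \beta = 1$ this reduces to $k/n$, the MLE. With a $\text{Beta}(20, 20)$ prior and $8$ heads in $10$ flips it gives $27/48 \approx 0.56$, pulled well toward $0.5$.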
A point estimate isn't the answer
MAP and posterior mean throw away nearly everything the posterior knows. They're useful, but they're a tiny summary statistic. When data is scarce, when the posterior is skewed, or when the downstream decision is sensitive to uncertainty, hand around the whole posterior — or at least a credible interval — instead of a single number.
Beyond Beta–Binomial
The same trick — likelihood from a particular family, prior that "matches", posterior in the same family with parameters incremented by data — runs across the standard models:
- Poisson rate: $\lambda \sim \text{Gamma}(\alpha, \beta)$, observations sum to $s$ over $n$ intervals, posterior $\text{Gamma}(\alpha + s, \beta + n)$.
- Normal mean (known variance): $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$, sample mean $\bar x$ from $n$ points, posterior mean is a precision-weighted average of $\mu_0$ and $\bar x$.
- Multinomial counts: probabilities $\sim \text{Dirichlet}(\alpha_1, \dots, \alpha_K)$, posterior just adds observed category counts.
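The three updates listed above can be sketched as small functions. This is a minimal illustration: the function names are hypothetical, the Gamma prior is assumed to use the rate parameterisation (mean $\alpha/\beta$), and the numbers in the Normal and Dirichlet calls are made up for the example.

```python
def gamma_poisson_update(alpha, beta, total_count, n_intervals):
    """Gamma(alpha, beta) prior (rate parameterisation) with Poisson counts."""
    return alpha + total_count, beta + n_intervals


def normal_known_variance_update(mu0, var0, xbar, n, var):
    """Normal prior on the mean, Normal data with known variance `var`."""
    prior_precision = 1.0 / var0
    data_precision = n / var
    post_var = 1.0 / (prior_precision + data_precision)
    # Posterior mean is the precision-weighted average of mu0 and xbar.
    post_mean = post_var * (prior_precision * mu0 + data_precision * xbar)
    return post_mean, post_var


def dirichlet_update(alphas, counts):
    """Dirichlet prior with multinomial counts: add counts per category."""
    return [a + c for a, c in zip(alphas, counts)]


print(gamma_poisson_update(2, 1, total_count=14, n_intervals=5))
print(normal_known_variance_update(mu0=0.0, var0=4.0, xbar=1.0, n=4, var=4.0))
print(dirichlet_update([1, 1, 1], [3, 0, 5]))
```

In the Normal case, precisions (inverse variances) add, so the posterior is always at least as concentrated as the prior.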
When no conjugate prior exists (the common case in real problems), the evidence integral $\int P(D \mid \theta)\,P(\theta)\,d\theta$ has no closed form, and Bayesians reach for Markov chain Monte Carlo (Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo) to draw samples from the posterior. Tools like Stan, PyMC, and NumPyro do this for you. But conjugate updates remain the cleanest teaching example, and the entry point to every Bayesian's intuition.
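To give a feel for what sampling from a posterior involves, here is a minimal random-walk Metropolis sketch in plain Python. It targets the coin posterior from Section 4, where the exact answer is known, so the result can be sanity-checked. The step size, burn-in length, seed, and function names are illustrative assumptions; this is a teaching sketch, not how Stan, PyMC, or NumPyro work internally.

```python
import math
import random


def log_unnormalised_posterior(theta, heads=8, tails=2):
    """log(likelihood * prior) for the coin with a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior mass outside the unit interval
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(log_target, n_samples, start=0.5, step=0.2, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


draws = metropolis(log_unnormalised_posterior, n_samples=20_000)
kept = draws[2_000:]  # discard an initial burn-in stretch
mean_estimate = sum(kept) / len(kept)
print(f"posterior mean estimate: {mean_estimate:.3f}")
```

Only the unnormalised posterior is needed: the evidence cancels in the acceptance ratio, which is why sampling sidesteps the hard integral. The estimate should land near the exact posterior mean of $0.75$, up to Monte Carlo error.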
7. Common pitfalls
Ignoring the prior
Skipping the prior — or quietly using one and not saying so — hides a load-bearing assumption. The disease-test puzzle is the canonical warning: a likelihood, no matter how strong, gets reweighted by the base rate. State your prior out loud.
Choosing a prior to get the result you want
If your "weakly informative" prior just happens to land on the answer you expected, ask whether the data is doing any work. A reasonable practice is to repeat the analysis under a few priors and report sensitivity.
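A sensitivity check of this kind can be very short in the conjugate case. The sketch below reruns the coin analysis (8 heads, 2 tails) under a handful of priors; the particular priors and their labels are hypothetical choices for illustration.

```python
def posterior_mean(alpha, beta, heads, tails):
    """Posterior mean of theta under a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + heads + tails)


# Hypothetical set of candidate priors, from vague to strongly opinionated.
priors = {
    "Beta(1, 1)   uniform": (1, 1),
    "Beta(2, 2)   mildly fair": (2, 2),
    "Beta(20, 20) strongly fair": (20, 20),
    "Beta(8, 2)   expects heads": (8, 2),
}

results = {
    name: posterior_mean(a, b, heads=8, tails=2) for name, (a, b) in priors.items()
}
for name, mean in results.items():
    print(f"{name:28s} posterior mean = {mean:.3f}")

spread = max(results.values()) - min(results.values())
print(f"spread across priors = {spread:.3f}")
```

A large spread signals that the conclusion rests on the prior more than on the data, which is worth reporting alongside the headline estimate.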
Credible $\neq$ confidence
They are not interchangeable. A $95\%$ credible interval claims a $95\%$ posterior probability that the parameter sits inside. A $95\%$ confidence interval claims that the method covers the true parameter $95\%$ of the time across hypothetical repeats. Different things.
Likelihood is not a distribution over $\theta$
As a function of $\theta$, $P(D \mid \theta)$ doesn't integrate to one — it isn't a probability density over $\theta$ at all. Calling its peak "the most likely value of $\theta$" is sloppy: it's the value under which the data is most likely. The two statements coincide only when the prior is uniform.
Improper priors and proper posteriors
"Flat" priors like uniform-on-the-real-line aren't valid probability distributions — they don't integrate to a finite number. They sometimes yield perfectly fine posteriors anyway, but sometimes don't. Don't assume your improper prior is harmless without checking.
8. Worked examples
Each problem is followed by its solution. Try the update yourself before reading it; the goal is to feel how mechanical conjugate updates become.
Example 1 · A skeptical prior on the coin
Re-do the coin problem with a strong "this coin is probably fair" prior: $\text{Beta}(20, 20)$. You see $8$ heads in $10$ tosses. What's the posterior?
Solution. Add successes to $\alpha$, failures to $\beta$:
$$ \theta \mid \text{data} \sim \text{Beta}(20 + 8,\; 20 + 2) = \text{Beta}(28, 22). $$
Posterior mean $= 28/50 = 0.56$. Even though the raw data shows $80\%$ heads, the strong prior pulls the estimate substantially toward $0.5$ — because $\text{Beta}(20, 20)$ behaves like having already seen $40$ fairly balanced flips.
Example 2 · Disease test with a sicker patient
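Take the test from the disease puzzle mentioned in Section 2 ($99\%$ sensitivity, $99\%$ specificity), but now the patient already has symptoms, raising their personal prior to $P(D) = 0.30$, where $D$ here denotes "has the disease" rather than the data. They test positive. What's $P(D \mid +)$?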
Solution.
$$ P(+) = (0.99)(0.30) + (0.01)(0.70) = 0.297 + 0.007 = 0.304 $$
$$ P(D \mid +) = \frac{(0.99)(0.30)}{0.304} \approx 0.977. $$
With a $30\%$ prior, a positive test pushes the posterior to $97.7\%$ — the same test result, very different conclusion, because the prior was very different. Priors aren't decoration.
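The arithmetic above is easy to wrap in a function and rerun for different priors. This is a minimal sketch; the function name `posterior_given_positive` is a hypothetical helper, and the $1\%$ case is the base rate from the puzzle mentioned in Section 2.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


for prior in (0.01, 0.30):
    post = posterior_given_positive(prior, sensitivity=0.99, specificity=0.99)
    print(f"prior = {prior:.2f} -> posterior = {post:.3f}")
```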
Example 3 · Conjugate update for a Poisson rate
A call centre receives calls at rate $\lambda$ per hour. Prior: $\lambda \sim \text{Gamma}(2, 1)$ (mean $2$). Over $5$ hours you observe $14$ calls. Posterior?
Solution. The Gamma–Poisson update rule: $\text{Gamma}(\alpha + s,\; \beta + n)$ where $s$ is the total count and $n$ is the number of intervals.
$$ \lambda \mid \text{data} \sim \text{Gamma}(2 + 14,\; 1 + 5) = \text{Gamma}(16, 6). $$
Posterior mean $= 16/6 \approx 2.67$ calls/hour. The raw rate is $14/5 = 2.80$; the prior pulls it slightly toward $2$.
Example 4 · Sequential update vs. batch update
Start with $\text{Beta}(1,1)$. First observe $3$ heads in $5$ flips. Then observe another $5$ heads in $5$ flips. Compute the posterior two ways: (a) update once on the full $(8, 2)$ data; (b) update on $(3, 2)$, then update again on $(5, 0)$.
Solution.
Batch: $\text{Beta}(1 + 8,\; 1 + 2) = \text{Beta}(9, 3)$.
Sequential, step 1: $\text{Beta}(1 + 3,\; 1 + 2) = \text{Beta}(4, 3)$. Step 2: $\text{Beta}(4 + 5,\; 3 + 0) = \text{Beta}(9, 3)$. Same answer. The order data arrives in doesn't matter.
Example 5 · Reading off a credible interval
The posterior is $\text{Beta}(9, 3)$. Estimate (without a calculator) where the central $95\%$ of the mass lies.
Solution sketch. The mode is at $(9-1)/(9+3-2) = 0.8$, the mean at $9/12 = 0.75$, and the distribution is left-skewed with most mass between $0.45$ and $0.95$. A numerical computation gives the $2.5\%$ and $97.5\%$ quantiles as roughly $0.48$ and $0.94$.
The Bayesian statement: given the data and the uniform prior, there is a $95\%$ probability that the coin's true heads-probability lies between $0.48$ and $0.94$. Note how wide that interval is — $10$ flips just isn't enough data to be precise.
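The quantiles quoted above can be checked without a statistics library. The sketch below relies on a standard identity that holds for positive integer shape parameters, $P(\text{Beta}(a, b) \le x) = P(\text{Binomial}(a + b - 1,\, x) \ge a)$, and inverts the CDF by bisection. The function names and the tolerance are illustrative assumptions; for non-integer parameters a library routine would be needed instead.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x, for positive integer a and b.

    Uses P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile_int(q, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


lower = beta_quantile_int(0.025, 9, 3)
upper = beta_quantile_int(0.975, 9, 3)
print(f"95% central credible interval: ({lower:.3f}, {upper:.3f})")
```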