1. Conditioning
Probability without context is a number attached to an event. Probability with context — "given that something else happened" — is a different number, often dramatically different. The notation $P(A \mid B)$, read "the probability of $A$ given $B$," is how we name that shifted number.
For events $A$ and $B$ with $P(B) > 0$,
$$ P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)}. $$
Read it geometrically: restrict the sample space to $B$, then ask what fraction of that restricted space is also in $A$.
The whole formula is doing one thing: rescaling. The numerator $P(A \cap B)$ counts the worlds where both $A$ and $B$ happen. The denominator $P(B)$ is the new "total" — we've thrown away every world where $B$ didn't happen, so $B$ now plays the role that "everything" played before.
A picture
Imagine the sample space as a rectangle. $A$ is a blob inside it. $B$ is another blob. The overlap $A \cap B$ is where they intersect. Once you're told "$B$ has occurred," the entire outside of $B$ is gone — it's not part of any future you're entertaining. So the question becomes: within $B$, what slice is also $A$?
Conditioning is zooming in. You discard every outcome where $B$ didn't happen, treat what's left as the new "everything," and recompute $A$'s share of that.
One immediate consequence: $P(A \mid B)$ and $P(A)$ can differ, in either direction. Knowing $B$ might raise $A$'s probability, lower it, or leave it untouched. The case where it leaves it untouched, $P(A \mid B) = P(A)$, is exactly what it means for $A$ and $B$ to be independent.
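The rescaling can be checked by brute-force counting on a small sample space. The sketch below is illustrative only: the two-dice setup and the names `omega`, `prob`, `cond_prob`, `A`, `B`, and `C` are assumptions introduced here, not part of the text above.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely ordered rolls of two fair dice.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a subset of omega) under equal weights."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    if not b:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / prob(b)

A = {w for w in omega if w[0] + w[1] == 8}  # the sum is 8
B = {w for w in omega if w[0] == 6}         # the first die shows 6
C = {w for w in omega if w[1] % 2 == 0}     # the second die is even

print(prob(A), cond_prob(A, B))  # 5/36 1/6
print(prob(C), cond_prob(C, B))  # 1/2 1/2
```

Learning that the first die shows 6 raises the probability of a sum of 8 from 5/36 to 1/6, while the probability that the second die is even stays at 1/2, which is the independent case.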
2. The multiplication rule
The conditional-probability definition has another use: rearrange it. Multiplying both sides of
$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$
by $P(B)$ gives
$$ P(A \cap B) \;=\; P(A \mid B) \cdot P(B). $$
This is the multiplication rule, and it's the workhorse for computing joint probabilities one stage at a time. The same event can be sliced either way:
$$ P(A \cap B) \;=\; P(A \mid B) \cdot P(B) \;=\; P(B \mid A) \cdot P(A). $$
Why is this useful? Because real-world problems are usually structured as sequences. First something happens; then, conditional on that, something else happens. The multiplication rule lets you walk through those stages and multiply at each step.
Draw two cards from a deck without replacement. The probability of "first card red AND second card red" is
$$ P(R_1 \cap R_2) \;=\; P(R_1) \cdot P(R_2 \mid R_1) \;=\; \tfrac{26}{52} \cdot \tfrac{25}{51} \;=\; \tfrac{25}{102} \approx 0.245. $$
The second factor is conditional because removing a red card changes what's in the deck for draw two.
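As a sanity check, the stage-by-stage product can be compared with a direct count over every ordered pair of cards. This is a minimal sketch; the deck encoding (`"R"`/`"B"` labels) and the variable names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Stage-by-stage product, as in the two-card calculation above.
p_stagewise = Fraction(26, 52) * Fraction(25, 51)

# Brute-force check: a deck of 26 red ("R") and 26 black ("B") cards,
# enumerating every ordered pair of distinct card positions.
deck = ["R"] * 26 + ["B"] * 26
pairs = list(permutations(range(52), 2))
both_red = sum(1 for i, j in pairs if deck[i] == "R" and deck[j] == "R")
p_enumerated = Fraction(both_red, len(pairs))

print(p_stagewise, p_enumerated)  # 25/102 25/102
```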
Stacking the rule across more stages gives the chain rule:
$$ P(A_1 \cap A_2 \cap \cdots \cap A_n) \;=\; P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}). $$
Every joint probability you'll ever meet can, in principle, be unrolled this way.
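One way to see the chain rule at work is to extend the card example to $n$ red cards in a row, multiplying one conditional factor per draw. The function name `prob_first_n_red` and its parameters are hypothetical, introduced only for this sketch; the cross-check uses a standard counting argument.

```python
from fractions import Fraction
from math import comb

def prob_first_n_red(n, red=26, total=52):
    """Chain rule: multiply P(A_k | A_1, ..., A_{k-1}) for each draw."""
    p = Fraction(1)
    for k in range(n):
        # k red cards are already gone, so red - k remain among total - k.
        p *= Fraction(red - k, total - k)
    return p

# Cross-check against a direct count: ways to choose n of the 26 red
# cards, out of all ways to choose n cards from 52.
n = 3
print(prob_first_n_red(n))                 # 2/17
print(Fraction(comb(26, n), comb(52, n)))  # 2/17
```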
3. Bayes's theorem
Often the conditional probability you want isn't the one you can easily measure. A doctor wants $P(\text{disease} \mid \text{positive test})$ — but what's actually known about a test is its accuracy, which is $P(\text{positive test} \mid \text{disease})$. Same two events, the bar pointing the other way, completely different number.
Bayes's theorem is the bridge. Start from the two ways of writing $P(A \cap B)$ via the multiplication rule:
$$ P(A \mid B) \cdot P(B) \;=\; P(B \mid A) \cdot P(A). $$
Divide both sides by $P(B)$ (assuming it's positive):
$$ \boxed{\; P(A \mid B) \;=\; \frac{P(B \mid A) \cdot P(A)}{P(B)} \;} $$
That's it. Two lines from the definition. The theorem looks profound and is profound, but its derivation is one of the most modest in mathematics: it's just rearranging a definition.
The names
Each piece of the formula has a name, and the names matter because they describe the role each plays in the act of updating beliefs.
| Symbol | Name | Meaning |
|---|---|---|
| $P(A)$ | Prior | What you believed about $A$ before seeing $B$. The starting point. |
| $P(B \mid A)$ | Likelihood | How likely the evidence $B$ would be if $A$ were true. Often the easiest piece to measure. |
| $P(B)$ | Evidence (or marginal) | The total probability of observing $B$, across all possibilities for $A$. The normalizer. |
| $P(A \mid B)$ | Posterior | What you should believe about $A$ after seeing $B$. The updated answer. |
When the evidence $P(B)$ isn't directly handed to you, expand it with the law of total probability — split the sample space into $A$ and $\bar A$ (the complement):
$$ P(B) \;=\; P(B \mid A) \cdot P(A) \;+\; P(B \mid \bar A) \cdot P(\bar A). $$
Plugging this back in gives the form of Bayes you'll actually use in practice:
$$ P(A \mid B) \;=\; \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) \;+\; P(B \mid \bar A) \cdot P(\bar A)}. $$
The numerator is the probability of the world you care about (disease and positive test). The denominator is the probability of any world consistent with what you observed (positive test, however it arose). The ratio is the fraction of "saw $B$" worlds that are also "is $A$" worlds.
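The expanded form translates almost line for line into a small function. This is a sketch under stated assumptions: the name `posterior`, its argument names, and the example numbers (a prior of 3/10 with likelihoods 4/5 and 1/5) are made up for illustration.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) from P(A), P(B | A), and P(B | not A)."""
    # Law of total probability: P(B) split over A and its complement.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    if evidence == 0:
        raise ValueError("P(B) is zero, so P(A | B) is undefined")
    return likelihood * prior / evidence

# Hypothetical inputs: P(A) = 3/10, P(B | A) = 4/5, P(B | not A) = 1/5.
print(posterior(Fraction(3, 10), Fraction(4, 5), Fraction(1, 5)))  # 12/19
```

Using `Fraction` keeps the arithmetic exact; plain floats work with the same function.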
4. The classic medical-test example
This is the example that built Bayes's reputation as something every educated person ought to know. The numbers are small, the setup is realistic, and the answer is shocking on the first encounter.
A disease has prevalence $1\%$ in the population. A test for it is "$99\%$ accurate" — it returns a positive on $99\%$ of people who have the disease, and a negative on $99\%$ of people who don't. You test positive. What's the probability you actually have the disease?
The instinctive answer is "$99\%$, that's what 'accurate' means." The right answer is $50\%$, and the gap between those two numbers is the entire point.
Naming the pieces
Let $D$ be "has the disease" and $+$ be "tests positive." The problem hands us:
- Prior: $P(D) = 0.01$, so $P(\bar D) = 0.99$.
- Sensitivity (true-positive rate): $P(+ \mid D) = 0.99$.
- Specificity (true-negative rate): $P(- \mid \bar D) = 0.99$, so the false-positive rate is $P(+ \mid \bar D) = 0.01$.
We want the posterior $P(D \mid +)$.
Plugging in
Bayes:
$$ P(D \mid +) \;=\; \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid \bar D)\, P(\bar D)}. $$
Substitute the numbers:
$$ P(D \mid +) \;=\; \frac{0.99 \times 0.01}{0.99 \times 0.01 \;+\; 0.01 \times 0.99} \;=\; \frac{0.0099}{0.0099 + 0.0099} \;=\; \frac{1}{2}. $$
Exactly $50\%$. A positive result on a $99\%$-accurate test for a $1\%$-prevalence disease is a coin flip.
The intuition: count people
Imagine $10{,}000$ people get tested. The numbers fall out without algebra.
| | Have disease | No disease | Total |
|---|---|---|---|
| Test positive | $99$ | $99$ | $198$ |
| Test negative | $1$ | $9{,}801$ | $9{,}802$ |
| Total | $100$ | $9{,}900$ | $10{,}000$ |
Of the $100$ truly sick people, $99$ test positive (true positives). Of the $9{,}900$ healthy people, $1\%$ — that's $99$ of them — also test positive (false positives). The two groups are the same size. So among the $198$ positive results, exactly half come from sick people. The healthy population is so much larger than the sick population that even a tiny false-positive rate produces enough false alarms to drown out the real ones.
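The head count can be reproduced in a few lines, which also makes it easy to try other population sizes or rates. The variable names below are assumptions for this sketch; the inputs are the ones stated in the example.

```python
from fractions import Fraction

population = 10_000
prevalence = Fraction(1, 100)    # 1% have the disease
sensitivity = Fraction(99, 100)  # P(+ | D)
specificity = Fraction(99, 100)  # P(- | not D)

sick = population * prevalence
healthy = population - sick
true_pos = sick * sensitivity
false_neg = sick - true_pos
true_neg = healthy * specificity
false_pos = healthy - true_neg

# Share of positive results that come from sick people: P(D | +).
share_sick_among_positives = true_pos / (true_pos + false_pos)

print(true_pos, false_pos, false_neg, true_neg)  # 99 99 1 9801
print(share_sick_among_positives)                # 1/2
```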
The test isn't broken. "$99\%$ accurate" is a perfectly fine description. The trap is treating $P(+ \mid D)$ — the test's accuracy — as if it equalled $P(D \mid +)$ — the question the patient is actually asking. They are completely different numbers, related only by Bayes's theorem and the base rate.
Push the prevalence to $10\%$ and rerun: $P(D \mid +)$ jumps to about $92\%$. Push it to $0.1\%$ and it collapses to about $9\%$. The test is identical in all three cases; the base rate alone moves the answer from $9\%$ to $92\%$.
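The reruns are quick to reproduce. The sketch below sweeps the three prevalences mentioned above while holding the test fixed; the function name `posterior_given_positive` and its default arguments are assumptions for illustration.

```python
def posterior_given_positive(prevalence, sensitivity=0.99, specificity=0.99):
    """P(D | +) for a test with the given sensitivity and specificity."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prevalence in (0.001, 0.01, 0.10):
    result = posterior_given_positive(prevalence)
    print(f"prevalence {prevalence:.1%} -> P(D | +) = {result:.1%}")
# prevalence 0.1% -> P(D | +) = 9.0%
# prevalence 1.0% -> P(D | +) = 50.0%
# prevalence 10.0% -> P(D | +) = 91.7%
```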
5. Why Bayes matters
The medical-test example is a parlor trick. The deeper claim is that Bayes's theorem is the mathematical core of belief updating — of what it means to learn from evidence at all.
Read the formula aloud as a recipe:
Start with your prior belief in a hypothesis. Multiply by how well the hypothesis predicts the data you saw. Divide by the total probability of seeing that data under any hypothesis. The result is your new belief.
Run that loop over and over, with each posterior becoming the prior for the next round, and you have the engine that drives a remarkable swath of modern thinking:
- Science. Hypotheses don't get "proven." They get more (or less) probable as data accumulates. Bayes makes that update quantitative.
- Machine learning. Naive Bayes classifiers, Bayesian networks, Markov chain Monte Carlo, variational inference, and the entire field of Bayesian deep learning all start from the same two-line theorem.
- Medical diagnosis. Clinical decision rules are implicitly Bayesian: they combine prior risk with test results to update belief in a diagnosis.
- Forensics and the courtroom. Evidence updates a prior probability of guilt. Getting this wrong is the prosecutor's fallacy (see pitfalls below).
- Spam filters, search engines, recommendation systems. All variants of "given what we observed, what's the most probable underlying state?"
Beliefs aren't binary. They're probabilities, and rational learning is the act of nudging those probabilities in proportion to how well evidence matches each hypothesis. Bayes is the bookkeeping rule for that nudge.
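The loop of posterior-becomes-prior can be sketched with the medical-test numbers: start at the $1\%$ prior and feed in one positive result after another. This assumes, purely for illustration, that repeated tests are conditionally independent given the patient's true disease status, which real tests often are not; the function name `update` is also hypothetical.

```python
from fractions import Fraction

def update(prior, p_evidence_if_true, p_evidence_if_false):
    """One Bayesian update: prior in, posterior out."""
    numerator = p_evidence_if_true * prior
    return numerator / (numerator + p_evidence_if_false * (1 - prior))

sensitivity = Fraction(99, 100)     # P(+ | D)
false_positive = Fraction(1, 100)   # P(+ | not D)

belief = Fraction(1, 100)           # prior: 1% prevalence
history = [belief]
for _ in range(3):
    # Assumption: each positive test is independent given disease status.
    belief = update(belief, sensitivity, false_positive)
    history.append(belief)

print(" -> ".join(str(b) for b in history))
# 1/100 -> 1/2 -> 99/100 -> 9801/9802
```

Under that assumption, one positive moves the belief from $1\%$ to $50\%$, and a second moves it to $99\%$.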