Conditional Probability & Bayes

How probabilities change when you learn something. Conditional probability is the mathematical answer to "given that I just observed X, what should I now believe about Y?" — and Bayes's theorem is the rule that flips it around when you know one direction and want the other.

What you'll leave with

  • A working definition of $P(A \mid B)$ as the probability of $A$ once the sample space is restricted to $B$.
  • The multiplication rule and how it lets you compute the probability of a sequence of events.
  • Bayes's theorem derived in two lines, with names for every piece: prior, likelihood, evidence, posterior.
  • A worked example showing why a positive result from a "99% accurate" test for a rare disease is right only half the time.
  • The three cognitive traps that beginners — and a lot of experts — fall into: base-rate neglect, the prosecutor's fallacy, and confusing independence with disjointness.

1. Conditioning

Probability without context is a number attached to an event. Probability with context — "given that something else happened" — is a different number, often dramatically different. The notation $P(A \mid B)$, read "the probability of $A$ given $B$," is how we name that shifted number.

Conditional probability

For events $A$ and $B$ with $P(B) > 0$,

$$ P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)}. $$

Read it geometrically: restrict the sample space to $B$, then ask what fraction of that restricted space is also in $A$.

The whole formula is doing one thing: rescaling. The numerator $P(A \cap B)$ counts the worlds where both $A$ and $B$ happen. The denominator $P(B)$ is the new "total" — we've thrown away every world where $B$ didn't happen, so $B$ now plays the role that "everything" played before.

A picture

Imagine the sample space as a rectangle. $A$ is a blob inside it. $B$ is another blob. The overlap $A \cap B$ is where they intersect. Once you're told "$B$ has occurred," the entire outside of $B$ is gone — it's not part of any future you're entertaining. So the question becomes: within $B$, what slice is also $A$?

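[Figure: two panels. Before conditioning: the sample space $\Omega$ containing events $A$ and $B$, which overlap in $A \cap B$. Given $B$: the new universe is $B$, split into $A \cap B$ and $B \setminus A$, so that $P(A \mid B) = \frac{|A \cap B|}{|B|}$.]

Mental model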

Conditioning is zooming in. You discard every outcome where $B$ didn't happen, treat what's left as the new "everything," and recompute $A$'s share of that.

One immediate consequence: $P(A \mid B)$ and $P(A)$ are, in general, different numbers. Knowing $B$ might raise $A$'s probability, lower it, or leave it untouched. The last case, $P(A \mid B) = P(A)$, is exactly what it means for $A$ and $B$ to be independent.
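
A minimal sketch of conditioning as rescaling, done by brute-force counting. The two-dice sample space, the chosen events, and the helper names `prob` and `cond_prob` are illustrative assumptions, not part of the text above.

```python
from fractions import Fraction
from itertools import product

# Illustrative sample space: the 36 equally likely outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform distribution on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); undefined when P(b) = 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("cannot condition on a zero-probability event")
    return prob(lambda w: a(w) and b(w)) / p_b

sum_is_8 = lambda w: w[0] + w[1] == 8
sum_is_7 = lambda w: w[0] + w[1] == 7
first_is_6 = lambda w: w[0] == 6
first_is_1 = lambda w: w[0] == 1

print(prob(sum_is_8))                                   # 5/36
print(cond_prob(sum_is_8, first_is_6))                  # 1/6  (raised)
print(cond_prob(sum_is_8, first_is_1))                  # 0    (lowered)
print(prob(sum_is_7), cond_prob(sum_is_7, first_is_6))  # 1/6 1/6 (unchanged)
```

The last line is the independent case: learning that the first die shows 6 leaves the probability of a total of 7 where it was.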

2. The multiplication rule

The conditional-probability definition has another use: rearrange it. Multiplying both sides of

$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$

by $P(B)$ gives

$$ P(A \cap B) \;=\; P(A \mid B) \cdot P(B). $$

This is the multiplication rule, and it's the workhorse for computing joint probabilities one stage at a time. The same event can be sliced either way:

$$ P(A \cap B) \;=\; P(A \mid B) \cdot P(B) \;=\; P(B \mid A) \cdot P(A). $$

Why is this useful? Because real-world problems are usually structured as sequences. First something happens; then, conditional on that, something else happens. The multiplication rule lets you walk through those stages and multiply at each step.

Example shape

Draw two cards from a deck without replacement. The probability of "first card red AND second card red" is

$P(R_1 \cap R_2) = P(R_1) \cdot P(R_2 \mid R_1) = \tfrac{26}{52} \cdot \tfrac{25}{51}.$

The second factor is conditional because removing a red card changes what's in the deck for draw two.
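
As a sanity check, the product above can be compared with a direct count over every ordered pair of cards. This is only a sketch: the deck is modelled as 26 red and 26 black cards, and the variable names are made up for the example.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(R1 and R2) = P(R1) * P(R2 | R1).
p_rule = Fraction(26, 52) * Fraction(25, 51)

# Brute-force check over all ordered draws of two distinct cards.
deck = ["R"] * 26 + ["B"] * 26
pairs = list(permutations(range(52), 2))  # 52 * 51 ordered pairs
both_red = sum(1 for i, j in pairs if deck[i] == "R" and deck[j] == "R")
p_count = Fraction(both_red, len(pairs))

print(p_rule, p_count)  # 25/102 25/102
```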

Stacking the rule across more stages gives the chain rule:

$$ P(A_1 \cap A_2 \cap \cdots \cap A_n) \;=\; P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}). $$

Every joint probability you'll ever meet can, in principle, be unrolled this way.
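
One way to see the chain rule at work is to extend the card example to $n$ red cards in a row, multiplying one conditional factor per draw. The function name `prob_all_red` and its default arguments are assumptions made for this sketch.

```python
from fractions import Fraction

def prob_all_red(n, red=26, total=52):
    """P(first n cards drawn without replacement are all red)."""
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the deck size")
    p = Fraction(1)
    for k in range(n):
        # Factor k: P(card k+1 is red | the previous k cards were all red).
        p *= Fraction(red - k, total - k)
    return p

print(prob_all_red(2))  # 25/102
print(prob_all_red(3))  # 2/17
```

Each pass through the loop corresponds to one factor $P(A_{k+1} \mid A_1 \cap \cdots \cap A_k)$ in the formula.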

3. Bayes's theorem

Often the conditional probability you want isn't the one you can easily measure. A doctor wants $P(\text{disease} \mid \text{positive test})$ — but what's actually known about a test is its accuracy, which is $P(\text{positive test} \mid \text{disease})$. Same two events, the bar pointing the other way, completely different number.

Bayes's theorem is the bridge. Start from the two ways of writing $P(A \cap B)$ via the multiplication rule:

$$ P(A \mid B) \cdot P(B) \;=\; P(B \mid A) \cdot P(A). $$

Divide both sides by $P(B)$ (assuming it's positive):

$$ \boxed{\; P(A \mid B) \;=\; \frac{P(B \mid A) \cdot P(A)}{P(B)} \;} $$

That's it. Two lines from the definition. The theorem looks profound and is profound, but its derivation is one of the most modest in mathematics — it's just rearranging a definition.

The names

Each piece of the formula has a name, and the names matter because they describe the role each plays in the act of updating beliefs.

| Symbol | Name | Meaning |
| --- | --- | --- |
| $P(A)$ | Prior | What you believed about $A$ before seeing $B$. The starting point. |
| $P(B \mid A)$ | Likelihood | How likely the evidence $B$ would be if $A$ were true. Often the easiest piece to measure. |
| $P(B)$ | Evidence (or marginal) | The total probability of observing $B$, across all possibilities for $A$. The normalizer. |
| $P(A \mid B)$ | Posterior | What you should believe about $A$ after seeing $B$. The updated answer. |

When the evidence $P(B)$ isn't directly handed to you, expand it with the law of total probability — split the sample space into $A$ and $\bar A$ (the complement):

$$ P(B) \;=\; P(B \mid A) \cdot P(A) \;+\; P(B \mid \bar A) \cdot P(\bar A). $$

Plugging this back in gives the form of Bayes you'll actually use in practice:

$$ P(A \mid B) \;=\; \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) \;+\; P(B \mid \bar A) \cdot P(\bar A)}. $$

Reading the formula

The numerator is the probability of the world you care about (disease and positive test). The denominator is the probability of any world consistent with what you observed (positive test, however it arose). The ratio is the fraction of "saw $B$" worlds that are also "is $A$" worlds.
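
The expanded form translates almost directly into a small function. This is a sketch, not a library API: the name `posterior`, its parameter names, and the example numbers are all illustrative assumptions.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A | B) from P(A), P(B | A) and P(B | not A).

    The evidence P(B) is expanded with the law of total probability.
    """
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    if evidence == 0:
        raise ValueError("P(B) is zero, so P(A | B) is undefined")
    return likelihood * prior / evidence

# Made-up inputs: P(A) = 0.3, P(B | A) = 0.8, P(B | not A) = 0.2.
print(round(posterior(0.3, 0.8, 0.2), 4))  # 0.6316
```

Here the numerator is $0.8 \times 0.3 = 0.24$ and the evidence is $0.24 + 0.2 \times 0.7 = 0.38$, so the posterior is $0.24 / 0.38 \approx 0.63$.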

4. The classic medical-test example

This is the example that built Bayes's reputation as something every educated person ought to know. The numbers are small, the setup is realistic, and the answer is shocking on the first encounter.

A disease has prevalence $1\%$ in the population. A test for it is "$99\%$ accurate" — it returns a positive on $99\%$ of people who have the disease, and a negative on $99\%$ of people who don't. You test positive. What's the probability you actually have the disease?

The instinctive answer is "$99\%$, that's what 'accurate' means." The right answer is $50\%$, and the gap between those two numbers is the entire point.

Naming the pieces

Let $D$ be "has the disease" and $+$ be "tests positive." The problem hands us:

  • Prior: $P(D) = 0.01$, so $P(\bar D) = 0.99$.
  • Sensitivity (true-positive rate): $P(+ \mid D) = 0.99$.
  • Specificity (true-negative rate): $P(- \mid \bar D) = 0.99$, so the false-positive rate is $P(+ \mid \bar D) = 0.01$.

We want the posterior $P(D \mid +)$.

Plugging in

Bayes:

$$ P(D \mid +) \;=\; \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid \bar D)\, P(\bar D)}. $$

Substitute the numbers:

$$ P(D \mid +) \;=\; \frac{0.99 \times 0.01}{0.99 \times 0.01 \;+\; 0.01 \times 0.99} \;=\; \frac{0.0099}{0.0099 + 0.0099} \;=\; \frac{1}{2}. $$

Exactly $50\%$. A positive result on a $99\%$-accurate test for a $1\%$-prevalence disease is a coin flip.

The intuition: count people

Imagine $10{,}000$ people get tested. The numbers fall out without algebra.

| | Have disease | No disease | Total |
| --- | --- | --- | --- |
| Test positive | $99$ | $99$ | $198$ |
| Test negative | $1$ | $9{,}801$ | $9{,}802$ |
| Total | $100$ | $9{,}900$ | $10{,}000$ |

Of the $100$ truly sick people, $99$ test positive (true positives). Of the $9{,}900$ healthy people, $1\%$ — that's $99$ of them — also test positive (false positives). The two groups are the same size. So among the $198$ positive results, exactly half come from sick people. The healthy population is so much larger than the sick population that even a tiny false-positive rate produces enough false alarms to drown out the real ones.
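
The head count can be reproduced in a few lines. The sketch below assumes expected counts rounded to whole people; the function name `frequency_table` and the dictionary keys are hypothetical.

```python
def frequency_table(n_people, prevalence, sensitivity, false_positive_rate):
    """Expected counts in each cell of the 2x2 table, rounded to whole people."""
    sick = round(n_people * prevalence)
    healthy = n_people - sick
    true_pos = round(sick * sensitivity)
    false_pos = round(healthy * false_positive_rate)
    return {
        "true_pos": true_pos,
        "false_neg": sick - true_pos,
        "false_pos": false_pos,
        "true_neg": healthy - false_pos,
    }

t = frequency_table(10_000, 0.01, 0.99, 0.01)
print(t)  # {'true_pos': 99, 'false_neg': 1, 'false_pos': 99, 'true_neg': 9801}

# Fraction of positive results that come from sick people.
print(t["true_pos"] / (t["true_pos"] + t["false_pos"]))  # 0.5
```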

Why this matters

The test isn't broken. "$99\%$ accurate" is a perfectly fine description. The trap is treating $P(+ \mid D)$ — the test's accuracy — as if it equalled $P(D \mid +)$ — the question the patient is actually asking. They are completely different numbers, related only by Bayes's theorem and the base rate.

Push the prevalence to $10\%$ and rerun: $P(D \mid +)$ jumps to about $92\%$. Push it to $0.1\%$ and it collapses to about $9\%$. The base rate is doing most of the work; the test is just nudging.
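
To rerun the calculation at different base rates, a short loop is enough. This sketch holds the test fixed at $99\%$ sensitivity and a $1\%$ false-positive rate; the function name is an illustrative choice.

```python
def p_disease_given_positive(prevalence, sensitivity=0.99, false_positive_rate=0.01):
    """Posterior P(D | +) for a given prevalence, with the test held fixed."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prevalence in (0.001, 0.01, 0.10):
    post = p_disease_given_positive(prevalence)
    print(f"prevalence {prevalence:.1%} -> P(D | +) = {post:.1%}")
# prevalence 0.1% -> P(D | +) = 9.0%
# prevalence 1.0% -> P(D | +) = 50.0%
# prevalence 10.0% -> P(D | +) = 91.7%
```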

5. Why Bayes matters

The medical-test example is a parlor trick. The deeper claim is that Bayes's theorem is the mathematical core of belief updating — of what it means to learn from evidence at all.

Read the formula aloud as a recipe:

Start with your prior belief in a hypothesis. Multiply by how well the hypothesis predicts the data you saw. Divide by the total probability of seeing that data under any hypothesis. The result is your new belief.

Run that loop over and over, with each posterior becoming the prior for the next round, and you have the engine that drives a remarkable swath of modern thinking:

  • Science. Hypotheses don't get "proven." They get more (or less) probable as data accumulates. Bayes makes that update quantitative.
  • Machine learning. Naive Bayes classifiers, Bayesian networks, Markov chain Monte Carlo, variational inference, and the entire field of Bayesian deep learning all start from the same two-line theorem.
  • Medical diagnosis. Clinical reasoning is implicitly Bayesian: it combines prior risk with test results to update belief in a diagnosis.
  • Forensics and the courtroom. Evidence updates a prior probability of guilt. Getting this wrong is the prosecutor's fallacy (see pitfalls below).
  • Spam filters, search engines, recommendation systems. All variants of "given what we observed, what's the most probable underlying state?"

The big idea

Beliefs aren't binary. They're probabilities, and rational learning is the act of nudging those probabilities in proportion to how well evidence matches each hypothesis. Bayes is the bookkeeping rule for that nudge.
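
The "posterior becomes the next prior" loop can be sketched with the medical-test numbers. One assumption is added that the text does not state: repeated test results are treated as independent of each other given the patient's true disease status. The function name `update` is hypothetical.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """One Bayes step: the posterior after seeing one piece of evidence."""
    numerator = p_evidence_if_true * prior
    return numerator / (numerator + p_evidence_if_false * (1 - prior))

belief = 0.01  # prior: 1% prevalence, as in the medical-test example
for test_number in range(1, 4):
    # Assumption: each positive result is independent given disease status.
    belief = update(belief, 0.99, 0.01)
    print(f"after positive test {test_number}: {belief:.4f}")
# after positive test 1: 0.5000
# after positive test 2: 0.9900
# after positive test 3: 0.9999
```

Under that independence assumption, a second positive result moves the coin-flip posterior to $99\%$, because the prior going into the second test is no longer $1\%$ but $50\%$.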

6. Playground: the medical-test paradox

Dial in a disease prevalence, a test's sensitivity, and its false-positive rate. The posterior — the probability you actually have the disease given a positive test — updates instantly, alongside the "naive" answer most people give. The frequency tree underneath shows what's happening: out of $10{,}000$ people, count the true positives and the false positives, then read off the fraction.

Default setting: prevalence $1.0\%$, sensitivity $99\%$, false-positive rate $5\%$.

$P(D \mid +) = 16.7\%$. Without Bayes you'd guess: $99\%$.

Of $10{,}000$ people tested:

  • $100$ have the disease: $99$ test positive (true positives) and $1$ tests negative (false negative).
  • $9{,}900$ don't have the disease: $495$ test positive (false positives) and $9{,}405$ test negative (true negatives).

Of all $594$ positive tests, only $99$ actually have the disease, which is $16.7\%$.

Try this

Start from "Classic example" (the famous 1% / 99% / 5% trio that returns ~16.7%). Now push prevalence up to 10% — the posterior leaps past 65%. Drop the false-positive rate to 1% — it climbs further. The same test, the same person, three radically different conclusions. The prior is doing as much work as the test.

7. Common pitfalls

Base-rate fallacy

Forgetting that $P(D)$ shows up in the formula. People reach for "$99\%$ accurate" and ignore that the prior probability of having the disease is $1\%$. Without the base rate, no amount of test accuracy can pin down the posterior. Always ask: how common was this in the first place?

Prosecutor's fallacy — confusing $P(A \mid B)$ with $P(B \mid A)$

"The chance of this DNA match occurring at random is $1$ in a million, therefore the chance the defendant is innocent is $1$ in a million." This swaps the conditional. $P(\text{match} \mid \text{innocent})$ is one number; $P(\text{innocent} \mid \text{match})$ is another, and getting to the second from the first requires Bayes — including the prior probability of innocence, which depends on the size of the suspect pool. The two numbers can differ by orders of magnitude, and the difference has put innocent people in prison.

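To put rough numbers on the prosecutor's fallacy, the sketch below makes several simplifying assumptions that are not in the text: exactly one guilty person in the suspect pool, a uniform prior over everyone in the pool, and a guilty person who always matches. The function name and the pool sizes are hypothetical.

```python
def p_innocent_given_match(pool_size, p_match_if_innocent, p_match_if_guilty=1.0):
    """Posterior probability of innocence after a DNA match.

    Assumes one guilty person in the pool and a uniform prior over its members.
    """
    prior_guilty = 1 / pool_size
    guilty_and_match = p_match_if_guilty * prior_guilty
    innocent_and_match = p_match_if_innocent * (1 - prior_guilty)
    return innocent_and_match / (guilty_and_match + innocent_and_match)

for pool in (10, 10_000, 1_000_000):
    print(pool, p_innocent_given_match(pool, 1e-6))
```

With a one-in-a-million random match probability, the posterior probability of innocence is roughly $0.000009$ for a pool of $10$, about $0.01$ for a pool of $10{,}000$, and about $0.5$ for a pool of a million. Only in a tiny pool is it anywhere near the "$1$ in a million" figure.
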
Independence vs. disjointness

These are opposites, not synonyms. Disjoint events have $P(A \cap B) = 0$ — they can't both happen. Independent events have $P(A \cap B) = P(A) \cdot P(B)$ — knowing one tells you nothing about the other. Two disjoint events with positive probabilities are never independent: if $B$ happens, you know for certain $A$ didn't, which is the opposite of "no information."
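
A quick check on a single fair die makes the contrast concrete. The events chosen here and the helper names are illustrative assumptions.

```python
from fractions import Fraction

die = set(range(1, 7))  # one fair six-sided die (illustrative sample space)

def prob(event):
    return Fraction(len(event & die), len(die))

def disjoint(a, b):
    return prob(a & b) == 0

def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)

low, high = {1, 2}, {5, 6}              # cannot both happen
even, at_most_two = {2, 4, 6}, {1, 2}   # overlap only in the outcome 2

print(disjoint(low, high), independent(low, high))                  # True False
print(disjoint(even, at_most_two), independent(even, at_most_two))  # False True
```

For the second pair, $P(\text{even} \cap \{1, 2\}) = \tfrac{1}{6} = \tfrac{1}{2} \cdot \tfrac{1}{3}$, so the product rule holds even though the events overlap.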

Conditioning on a zero-probability event

The definition $P(A \mid B) = P(A \cap B) / P(B)$ requires $P(B) > 0$. If $B$ has probability zero, the formula is undefined — division by zero. There are technical extensions for continuous distributions where this comes up, but at the introductory level: if you've conditioned on something that can never happen, something earlier in your reasoning went wrong.

8. Worked examples

Try each one yourself before reading the solution. The point is to see whether your steps match the canonical recipe, not to check the final number.

Example 1 · Conditional from a two-way table

$200$ students were surveyed about whether they study mathematics and whether they study physics. The results:

| | Physics | No physics | Total |
| --- | --- | --- | --- |
| Math | $60$ | $40$ | $100$ |
| No math | $20$ | $80$ | $100$ |
| Total | $80$ | $120$ | $200$ |

Pick a student at random. What's $P(\text{math} \mid \text{physics})$?

Step 1. Restrict to physics students. There are $80$ of them.

Step 2. Of those, $60$ also do math.

Step 3. $P(\text{math} \mid \text{physics}) = \tfrac{60}{80} = 0.75$.

Compare with $P(\text{math}) = \tfrac{100}{200} = 0.5$. Knowing the student does physics raises the probability they also do math from $50\%$ to $75\%$ — the events are positively correlated.
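
The same steps can be read off the table in code. This is a sketch in which the dictionary layout `counts[(math, physics)]` is an arbitrary choice for the example.

```python
from fractions import Fraction

# Survey counts, keyed by (studies math, studies physics).
counts = {
    (True, True): 60, (True, False): 40,
    (False, True): 20, (False, False): 80,
}
total = sum(counts.values())

# Step 1: restrict to physics students. Step 2: count those who also do math.
physics = sum(n for (m, p), n in counts.items() if p)
math_and_physics = counts[(True, True)]

p_math_given_physics = Fraction(math_and_physics, physics)
p_math = Fraction(sum(n for (m, p), n in counts.items() if m), total)

print(p_math_given_physics, p_math)  # 3/4 1/2
```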

Example 2 · The medical test, redone slowly with Bayes

Prevalence $P(D) = 0.01$; sensitivity $P(+ \mid D) = 0.99$; false-positive rate $P(+ \mid \bar D) = 0.01$. Find $P(D \mid +)$.

Step 1. Write Bayes:

$$ P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)}. $$

Step 2. Compute the evidence $P(+)$ by total probability:

$$ P(+) = P(+ \mid D)\,P(D) + P(+ \mid \bar D)\,P(\bar D) = (0.99)(0.01) + (0.01)(0.99) = 0.0198. $$

Step 3. Compute the numerator:

$$ P(+ \mid D)\,P(D) = (0.99)(0.01) = 0.0099. $$

Step 4. Divide:

$$ P(D \mid +) = \frac{0.0099}{0.0198} = 0.5. $$

Check. The numerator counts the sick-and-positive worlds; the denominator counts all positive worlds. Half of all positives come from sick people because there are equally many sick true-positives ($99$) and healthy false-positives ($99$) per $10{,}000$ tested.

Example 3 · Monty Hall

Three doors. Behind one is a car; behind each of the others, a goat. You pick door 1. The host — who knows what's behind every door — opens door 3, revealing a goat, then offers you the chance to switch to door 2. Should you?

Let $C_i$ be "car is behind door $i$." Each has prior probability $\tfrac{1}{3}$. Let $H_3$ be "host opens door 3."

Step 1. Compute the likelihoods. If the car is behind door 1 (your pick), the host can open either 2 or 3 — call it $50/50$, so $P(H_3 \mid C_1) = \tfrac{1}{2}$. If the car is behind door 2, the host must open door 3 (he won't reveal the car), so $P(H_3 \mid C_2) = 1$. If the car is behind door 3, the host can't open it: $P(H_3 \mid C_3) = 0$.

Step 2. Evidence:

$$ P(H_3) = \tfrac{1}{2} \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3} + 0 \cdot \tfrac{1}{3} = \tfrac{1}{2}. $$

Step 3. Posteriors:

$$ P(C_1 \mid H_3) = \frac{(1/2)(1/3)}{1/2} = \tfrac{1}{3}, \qquad P(C_2 \mid H_3) = \frac{(1)(1/3)}{1/2} = \tfrac{2}{3}. $$

Conclusion. Switching doubles your probability of winning, from $\tfrac{1}{3}$ to $\tfrac{2}{3}$. The host's action wasn't independent of where the car was — it carried information, and Bayes extracts it.
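
The posteriors can be checked two ways: exactly, by plugging the likelihoods from Step 1 into Bayes, and approximately, by simulating many games. Note that the simulation estimates the overall win rate from always switching (not conditioned on the host opening door 3 specifically), which is also $\tfrac{2}{3}$. The seed, trial count, and function name are arbitrary choices for this sketch.

```python
import random
from fractions import Fraction

# Exact Bayes over the three hypotheses, using the likelihoods from Step 1.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
likelihood_h3 = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}
evidence = sum(likelihood_h3[d] * prior[d] for d in prior)
post = {d: likelihood_h3[d] * prior[d] / evidence for d in prior}
print(post[1], post[2], post[3])  # 1/3 2/3 0

def switch_wins(rng):
    """Play one game: pick door 1, host opens a goat door, player switches."""
    car = rng.randint(1, 3)
    pick = 1
    # Host opens a door that is neither the pick nor the car (random if two qualify).
    opened = rng.choice([d for d in (1, 2, 3) if d != pick and d != car])
    switched = next(d for d in (1, 2, 3) if d != pick and d != opened)
    return switched == car

rng = random.Random(0)  # fixed seed, chosen arbitrarily, for repeatability
trials = 100_000
win_rate = sum(switch_wins(rng) for _ in range(trials)) / trials
print(win_rate)  # close to 2/3
```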

Example 4 · Coin flips and the gambler's fallacy

A fair coin has been flipped five times and come up heads every time. What's the probability the next flip is heads?

Step 1. Let $H_i$ be heads on flip $i$. Independence of flips means $P(H_6 \mid H_1 \cap H_2 \cap H_3 \cap H_4 \cap H_5) = P(H_6) = \tfrac{1}{2}$.

Step 2. The coin has no memory. The streak of five heads is unusual ($\tfrac{1}{32}$ a priori), but once observed it tells you nothing about flip six, because the events are independent.

Watch out. The reasoning changes if you don't know the coin is fair. Then the streak is evidence the coin might be biased toward heads, and a Bayesian update on the coin's bias parameter would push $P(H_6 \mid \text{five heads})$ up, not down. The "gambler's fallacy" — believing tails is "due" — is the opposite mistake: pushing the probability down, in defiance of both independence and Bayesian inference.
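
To make the "unknown coin" case concrete, the sketch below assumes a uniform prior on the coin's heads probability, written as Beta(1, 1). That prior is an assumption of this example, not something the text specifies. With a Beta prior the posterior after the flips is again a Beta distribution, and its mean is the predicted probability of heads on the next flip.

```python
from fractions import Fraction

def predictive_heads(heads, tails, alpha=1, beta=1):
    """P(next flip is heads) under a Beta(alpha, beta) prior on the coin's bias.

    The posterior is Beta(alpha + heads, beta + tails); its mean is returned.
    """
    return Fraction(alpha + heads, alpha + beta + heads + tails)

print(predictive_heads(0, 0))  # 1/2  before any flips
print(predictive_heads(5, 0))  # 6/7  after five heads in a row
```

The estimate moves up from $\tfrac{1}{2}$ to $\tfrac{6}{7}$, the opposite direction from the gambler's fallacy. A prior more concentrated around fairness would move less.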

Example 5 · Why $P(A \mid B) \neq P(B \mid A)$ in general

Concrete demonstration. Let $A$ = "person is a U.S. senator" and $B$ = "person is a U.S. citizen." Roughly, in the U.S.:

  • $P(B \mid A) \approx 1$ — virtually every senator is a citizen (in fact, by law).
  • $P(A \mid B) \approx \tfrac{100}{330{,}000{,}000} \approx 3 \times 10^{-7}$ — picking a random citizen, the chance they're one of the 100 senators is tiny.

The two conditionals differ by more than six orders of magnitude. They're not the same number, they're not approximately the same number, and which one is relevant depends entirely on what you're asking.

Bayes check. Indeed,

$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \approx \frac{1 \cdot (100/330{,}000{,}000)}{1} = 3 \times 10^{-7}, $$

which matches the direct count.
