Random Variables

Probability gets dramatically more powerful the moment you stop asking "did this event happen?" and start asking "what number came out?" A random variable is the bridge — a way to attach a number to every outcome of a random experiment so you can add, average, and reason algebraically about chance.

What you'll leave with

  • The formal definition of a random variable as a function $X : \Omega \to \mathbb{R}$ — and why this abstraction matters.
  • How to describe a discrete RV with a PMF and a continuous RV with a PDF, and the constraints each must satisfy.
  • The CDF $F(x) = P(X \le x)$ — the single object that works for discrete, continuous, and mixed RVs alike.
  • Expectation $E[X]$ as the balance point of the distribution, and variance $\operatorname{Var}(X) = E[X^2] - (E[X])^2$ as its spread.
  • Linearity of expectation — the tool that makes hard problems easy whether or not the variables are independent.
  • LOTUS for computing $E[g(X)]$ without first finding the distribution of $g(X)$, plus a working definition of independence between two RVs.

1. What a random variable is

Up to this point, probability has been about events — subsets of a sample space, things like "the die shows an even number" or "at least one of two coins lands heads." Events are useful, but they're qualitative. As soon as you want to ask quantitative questions — "on average, how many heads do I get in 10 flips?", "what's the typical height of a randomly chosen adult?" — you need a number attached to each outcome.

That's all a random variable is. A rule that says: look at the outcome, and report this number.

Random variable

A random variable $X$ is a function from the sample space $\Omega$ to the real numbers:

$X : \Omega \to \mathbb{R}.$

For each outcome $\omega \in \Omega$, the value $X(\omega)$ is a real number. The randomness lives in which $\omega$ occurs — once it does, $X$ deterministically reports the corresponding number.

The convention is to use a capital letter ($X$, $Y$, $N$) for the random variable itself and a lowercase letter ($x$, $y$, $n$) for a specific value it might take. The expression $P(X = 3)$ is shorthand for $P(\{\omega \in \Omega : X(\omega) = 3\})$ — the probability of the set of outcomes that $X$ sends to $3$. You almost never write the function explicitly. You just say "let $X$ be the number of heads in 10 flips" and trust that the underlying $\Omega$ is there if anyone asks.

Why a function?

Calling $X$ a function — rather than "a random number" — buys you something important: it lets several random variables share the same underlying experiment. Flip a coin ten times; let $X$ be the number of heads and $Y$ be the index of the first head. Both are functions on the same $\Omega$. The functional view is what makes joint behavior, dependence, and covariance even definable.
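
To make the function view concrete, here is a minimal Python sketch. It uses three flips rather than ten to keep the sample space small, and the names `OMEGA`, `num_heads`, `first_head`, and `prob` are illustrative choices, not notation from the text. Returning 0 from `first_head` when no head appears is also a convention assumed here.

```python
from fractions import Fraction
from itertools import product

# Sample space: every sequence of 3 fair coin flips, all equally likely.
OMEGA = list(product("HT", repeat=3))
P_OUTCOME = Fraction(1, len(OMEGA))

def num_heads(omega):
    """X: the number of heads in the sequence."""
    return omega.count("H")

def first_head(omega):
    """Y: 1-based index of the first head, or 0 if there is none (assumed convention)."""
    return omega.index("H") + 1 if "H" in omega else 0

def prob(event):
    """P(event): total probability of the outcomes where the predicate holds."""
    return sum((P_OUTCOME for omega in OMEGA if event(omega)), Fraction(0))

# P(X = 2) is the probability of the set of outcomes that X sends to 2.
p_x_is_2 = prob(lambda w: num_heads(w) == 2)

# Because X and Y live on the same sample space, joint statements are well defined.
p_joint = prob(lambda w: num_heads(w) == 2 and first_head(w) == 1)

print(p_x_is_2, p_joint)
```

Both random variables are ordinary functions of the same outcome, which is what lets `p_joint` be computed at all.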

Two flavors

Random variables come in two essentially different kinds, and the machinery for each is different enough that it's worth naming them up front.

  • Discrete. $X$ takes values in a finite or countably infinite set — typically integers. Examples: a die roll, a count of successes, the number of customers in an hour.
  • Continuous. $X$ takes values in an interval (or a union of intervals) — uncountably many possible values. Examples: a height, a waiting time, a temperature.

The split matters because "the probability $X$ equals exactly $3.14159\ldots$" is sensible for a discrete RV (it's some number) but trivially zero for a continuous one (any single real number has zero probability among uncountably many). The next two sections handle each case on its own terms.

2. Discrete: the probability mass function

For a discrete random variable, you can describe everything there is to know by listing each value $X$ can take and the probability it takes it. That table is called the probability mass function.

Probability mass function (PMF)

For a discrete RV $X$, the PMF $p_X$ assigns each possible value $x$ its probability:

$p_X(x) = P(X = x).$

Two constraints:

  1. $p_X(x) \ge 0$ for every $x$ — probabilities aren't negative.
  2. $\displaystyle \sum_x p_X(x) = 1$ — the probabilities of all possible values add to one.

The second condition is the structural fingerprint of a probability distribution: something has to happen, and the PMF accounts for all of it. If you find a function that takes the right shape but doesn't sum to one, it isn't a PMF — at best it's something you need to normalize by dividing by its sum.

The PMF of a fair die is uniform: $p_X(k) = 1/6$ for $k \in \{1, 2, 3, 4, 5, 6\}$. The PMF of "number of heads in two fair flips" is non-uniform: $p_X(0) = 1/4$, $p_X(1) = 1/2$, $p_X(2) = 1/4$. In every case, the PMF is a complete description — give it to someone and they can compute any probability they like.

The picture: a stem plot

The natural visualization of a PMF is a stem plot (sometimes called a "lollipop chart"): for each value $x$ in the support, draw a vertical line of height $p_X(x)$.

[Figure: stem plot of the PMF of $X$, the sum of two fair dice. Horizontal axis: $x$ from 2 to 12; vertical axis: $p(x)$; the tallest stem is $p(7) = 6/36 \approx 0.167$.]

A PMF as a stem plot. The heights are probabilities; they sum to one across the support.

Each dot's height is a literal probability — readable straight off the y-axis. If you want $P(X \in A)$ for some set $A \subseteq \{2, \ldots, 12\}$, you just add the heights over $x \in A$.
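
As a sketch of the same idea in code, the block below builds the PMF of the sum of two fair dice exactly, checks the two PMF constraints, and adds heights over a set $A$. The helper name `prob_in` is hypothetical, chosen only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) outcomes give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

# The two PMF constraints: non-negativity and total mass one.
assert all(p >= 0 for p in pmf.values())
assert sum(pmf.values()) == 1

def prob_in(values):
    """P(X in A): add the stem heights over the values in A."""
    return sum((pmf.get(x, Fraction(0)) for x in values), Fraction(0))

p_seven = pmf[7]
p_at_least_10 = prob_in({10, 11, 12})
print(p_seven, p_at_least_10)
```

Using `Fraction` keeps every probability exact, so the sum-to-one check is an equality rather than a floating-point comparison.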

3. Continuous: the probability density function

Continuous random variables don't have a PMF, because in a continuum no individual value carries any probability. The probability that a uniformly chosen real number in $[0, 1]$ equals exactly $0.5$ is $0$ — same as $0.4$, same as $\pi/4$, same as any other specific value. The probability lives in intervals, not in points. The tool for handling that is the probability density function.

Probability density function (PDF)

A continuous RV $X$ has PDF $f_X$ if for every interval $[a, b]$,

$\displaystyle P(a \le X \le b) = \int_a^b f_X(x)\, dx.$

Two constraints:

  1. $f_X(x) \ge 0$ for every $x$ — densities aren't negative.
  2. $\displaystyle \int_{-\infty}^{\infty} f_X(x)\, dx = 1$ — the total probability is one.

The PDF is the continuous analog of the PMF, with one critical replacement: sums become integrals. The constraint $\sum p_i = 1$ becomes $\int f = 1$. The probability of a region becomes the area under the curve over that region.

A PDF is not a probability

This is the single most important thing to internalize about continuous RVs. A PDF value $f_X(x)$ is a density — probability per unit length. It can be larger than 1, sometimes much larger. The uniform distribution on $[0, 0.1]$ has density $f(x) = 10$ on that interval, and that's fine. What can't exceed 1 is the integral over any region — because that's an actual probability.
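
A quick numerical sketch of that point: the density of the uniform distribution on $[0, 0.1]$ is 10 at every point of the interval, yet it integrates to 1. The midpoint-rule helper `integrate` is a simple stand-in written for this example, not a library routine.

```python
def uniform_pdf(x, lo=0.0, hi=0.1):
    """Density of the uniform distribution on [lo, hi]."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

density_at_point = uniform_pdf(0.05)                  # a density value, larger than 1
total_probability = integrate(uniform_pdf, 0.0, 0.1)  # an actual probability
left_half = integrate(uniform_pdf, 0.0, 0.05)         # P(0 <= X <= 0.05)

print(density_at_point, total_probability, left_half)
```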

The picture: a curve with shaded area

A PDF is naturally drawn as a smooth curve. The probability that $X$ falls in $[a, b]$ is the area shaded under the curve between those two $x$-values.

[Figure: PDF of a continuous $X$. Horizontal axis: $x$; vertical axis: density $f(x)$; the shaded area between $a$ and $b$ is $P(a \le X \le b)$.]

A continuous PDF. The probability of any interval is the area under the curve over that interval.

One immediate consequence: $P(X = a) = \int_a^a f(x)\,dx = 0$ for any particular value $a$, regardless of how tall the density is there. Continuous probability has to be smeared over a region before it accumulates to anything positive. This is also why, for continuous $X$, the four expressions $P(a \le X \le b)$, $P(a < X \le b)$, $P(a \le X < b)$, $P(a < X < b)$ are all equal — the boundary contributes nothing.

4. The CDF — one object for both worlds

The PMF works for discrete RVs. The PDF works for continuous RVs. Neither works for both at once, and there are perfectly real distributions (mixed ones) that aren't either purely discrete or purely continuous. There is one description that handles every case uniformly, and you'll use it constantly: the cumulative distribution function.

Cumulative distribution function (CDF)

For any random variable $X$, the CDF is

$F_X(x) = P(X \le x).$

It is the probability that $X$ comes out at most $x$. The CDF has four universal properties, regardless of whether $X$ is discrete, continuous, or neither:

  1. $F_X$ is non-decreasing — as $x$ grows, more outcomes get included.
  2. $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to +\infty} F_X(x) = 1$.
  3. $F_X$ is right-continuous.
  4. For $a \le b$: $P(a < X \le b) = F_X(b) - F_X(a)$.

The CDF and the PMF/PDF are equivalent — given one, you can recover the other.

  • Discrete: $F_X(x) = \displaystyle\sum_{x_i \le x} p_X(x_i)$. The CDF is a staircase, jumping by $p_X(x_i)$ at each support point.
  • Continuous: $F_X(x) = \displaystyle\int_{-\infty}^{x} f_X(t)\, dt$, and conversely $f_X(x) = F_X'(x)$ wherever the derivative exists. The CDF is continuous, with no jumps.

Why the CDF is the great unifier

Any probability statement about $X$ can be written in terms of $F_X$ — for example, $P(X > a) = 1 - F_X(a)$ and $P(a < X \le b) = F_X(b) - F_X(a)$. The same algebra works for a die roll and for a normal distribution. That's why textbooks define a distribution by its CDF and treat PMFs and PDFs as derived shortcuts available when the underlying RV is well-behaved enough.
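
The sketch below illustrates that claim with one helper, `interval_prob`, applied unchanged to a staircase CDF (a fair die) and a continuous CDF (an exponential with rate 1, chosen arbitrarily). All function names are assumptions made for the example.

```python
import math
from fractions import Fraction

def die_cdf(x):
    """F(x) = P(X <= x) for a fair six-sided die: a right-continuous staircase."""
    k = min(max(math.floor(x), 0), 6)
    return Fraction(k, 6)

def exp_cdf(x, lam=1.0):
    """F(x) = 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def interval_prob(cdf, a, b):
    """P(a < X <= b) = F(b) - F(a), valid for any CDF."""
    return cdf(b) - cdf(a)

p_die = interval_prob(die_cdf, 2, 5)       # rolls 3, 4, or 5
p_die_tail = 1 - die_cdf(4)                # P(X > 4)
p_exp = interval_prob(exp_cdf, 1.0, 2.0)   # e^-1 - e^-2

print(p_die, p_die_tail, p_exp)
```

Note that `die_cdf(3)` already includes the jump at 3, which is what right-continuity means in practice.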

5. Expectation: the balance point

Once you have a distribution, the first question is usually: what's the typical value? The standard answer is the expectation, also called the mean or the expected value.

Expectation, $E[X]$

The probability-weighted average of $X$'s possible values:

$\displaystyle E[X] = \sum_x x \cdot p_X(x) \quad \text{(discrete)}$

$\displaystyle E[X] = \int_{-\infty}^{\infty} x \, f_X(x)\, dx \quad \text{(continuous)}$

The physical picture is irresistible and worth holding on to: imagine the PMF as a distribution of mass along the number line — point masses at each support value, weighted by $p_X$. The expectation is the center of mass, the single point at which the whole arrangement balances on a fulcrum. For a continuous distribution, replace the point masses with mass spread according to $f_X$; the balance point is again $E[X]$.

$E[X]$ is often denoted $\mu_X$, or just $\mu$ when $X$ is understood. The Greek letter signals "true population mean," distinct from a sample mean computed from data, a distinction the variance and standard deviation page makes precise.

Quick examples

Fair die. $E[X] = \sum_{k=1}^{6} k \cdot \tfrac{1}{6} = \tfrac{1+2+3+4+5+6}{6} = \tfrac{21}{6} = 3.5$. Notice $3.5$ isn't a value the die can actually take — expectations are averages, not predictions of a single roll.

Uniform on $[0, 1]$. $f(x) = 1$ on $[0,1]$, so $E[X] = \int_0^1 x \cdot 1\, dx = \tfrac{1}{2}$. The balance point of a flat slab on $[0,1]$ is its midpoint, exactly as you'd guess.

Exponential with rate $\lambda$. $f(x) = \lambda e^{-\lambda x}$ on $[0, \infty)$ gives $E[X] = 1/\lambda$. Higher rate, shorter average waiting time — the units even work out.
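
These three values can be checked with a short sketch. The die mean is computed exactly; the two continuous means use a hand-written midpoint rule (`integrate`, a helper assumed for this example). The rate $\lambda = 2$ and the truncation of the infinite integral at $40/\lambda$ are arbitrary choices; the neglected tail is on the order of $e^{-40}$.

```python
import math
from fractions import Fraction

def integrate(f, a, b, n=20_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Discrete: probability-weighted sum over the support.
die_mean = sum(k * Fraction(1, 6) for k in range(1, 7))

# Continuous: integrate x * f(x). Uniform on [0, 1] has f(x) = 1.
uniform_mean = integrate(lambda x: x * 1.0, 0.0, 1.0)

# Exponential with rate lam, truncated where the tail is negligible.
lam = 2.0
exp_mean = integrate(lambda x: x * lam * math.exp(-lam * x), 0.0, 40.0 / lam)

print(die_mean, uniform_mean, exp_mean)
```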

Existence

Not every distribution has an expectation. The Cauchy distribution, $f(x) = \tfrac{1}{\pi(1+x^2)}$, is symmetric about zero — but $\int_{-\infty}^\infty x\,f(x)\,dx$ doesn't converge absolutely, and so $E[X]$ is undefined. "Mean of the data" still makes sense empirically; it just doesn't have a population value to estimate.

6. Variance and standard deviation

$E[X]$ tells you where the distribution is centered. It says nothing about how tightly it's clustered around that center. A constant $X = 5$ and a wild $X$ that's $-1000$ with probability $0.5$ and $1010$ with probability $0.5$ both have $E[X] = 5$, but they could hardly be more different. Variance is the standard fix.

Variance, $\operatorname{Var}(X)$

The expected squared deviation from the mean:

$\operatorname{Var}(X) = E[(X - E[X])^2].$

Equivalently — and almost always easier to compute:

$\operatorname{Var}(X) = E[X^2] - (E[X])^2.$

The squaring serves two purposes: it kills the sign (positive and negative deviations both contribute), and it punishes large deviations harder than small ones (a deviation of $10$ contributes $100$ to the average, while ten deviations of $1$ contribute only $10$ in total).

The computational form $\operatorname{Var}(X) = E[X^2] - (E[X])^2$ deserves its own line of memory. It comes from expanding the square:

$$ E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu \, E[X] + \mu^2 = E[X^2] - \mu^2. $$

So computing variance reduces to computing two expectations — $E[X]$ and $E[X^2]$ — and subtracting. That's almost always easier than evaluating $E[(X - \mu)^2]$ directly, because you don't need to know $\mu$ in advance to start summing or integrating.
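
As a small check that the two forms agree, the sketch below evaluates both for a fair die using exact fractions. The helper `expect` is a hypothetical name for a probability-weighted sum over a PMF stored as a dictionary.

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # fair six-sided die

def expect(g, pmf):
    """E[g(X)] as a probability-weighted sum over the support."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(lambda x: x, pmf)

# Definition: expected squared deviation from the mean.
var_definition = expect(lambda x: (x - mu) ** 2, pmf)

# Computational form: E[X^2] - (E[X])^2.
var_shortcut = expect(lambda x: x * x, pmf) - mu ** 2

print(var_definition, var_shortcut)
```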

One downside: variance has the wrong units. If $X$ is in meters, $\operatorname{Var}(X)$ is in square meters. To get back to the original units, take the square root.

Standard deviation, $\sigma_X$

$\sigma_X = \sqrt{\operatorname{Var}(X)}.$

Same units as $X$. Intuitive scale for the typical deviation from the mean.

Variance under shift and scale

Two algebraic identities pay for themselves dozens of times over. For any constants $a$ and $b$:

  • $\operatorname{Var}(X + b) = \operatorname{Var}(X)$ — shifting the whole distribution leaves the spread alone.
  • $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$ — scaling by $a$ scales variance by $a^2$ (not $a$).

Combining them: $\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$. In particular, $\sigma_{aX + b} = |a| \sigma_X$. The $a^2$ in the variance and the $|a|$ in the standard deviation are the most common source of arithmetic mistakes in this whole topic — write the rule down and look at it twice.
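
A minimal sketch of the shift-and-scale rule, again on a fair die. The constants $a = -3$ and $b = 10$ are arbitrary, and `affine` and `variance` are helper names assumed for this example.

```python
import math
from collections import defaultdict
from fractions import Fraction

def variance(pmf):
    """Var(X) = E[(X - mean)^2] for a PMF given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def affine(pmf, a, b):
    """PMF of aX + b: move each probability mass to its new location."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[a * x + b] += p
    return dict(out)

die = {k: Fraction(1, 6) for k in range(1, 7)}
a, b = -3, 10

var_x = variance(die)
var_y = variance(affine(die, a, b))   # should equal a^2 * Var(X); b plays no role

sd_x = math.sqrt(var_x)
sd_y = math.sqrt(var_y)               # should equal |a| * sd_x

print(var_x, var_y, sd_x, sd_y)
```

A negative `a` is used deliberately: the variance picks up $a^2 = 9$, while the standard deviation picks up $|a| = 3$, not $-3$.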

7. Linearity of expectation

If there's a single result in this entire topic that you should write on the inside of your wrist, it's this one.

Linearity of expectation

For any random variables $X$ and $Y$ and any constants $a$, $b$, $c$:

$E[aX + bY + c] = a\,E[X] + b\,E[Y] + c.$

This holds even when $X$ and $Y$ are dependent. No independence assumption is required.

Why is this surprising? Most things in probability fail without independence. Joint probabilities don't multiply; variances don't add. But expectations do add, always. The proof is just rearrangement: expectation is a (probability-weighted) sum, and sums can be rearranged. There's no statistical magic — there's also no escape clause.

Why does it matter? Because it lets you split a hard random variable into pieces whose individual expectations are easy, even if the pieces interact in messy ways. Two classic uses:

  • Indicator decomposition. Let $X$ be the number of fixed points of a random permutation of $\{1, \ldots, n\}$ — values that the permutation leaves where they started. The marginal distribution of $X$ is genuinely messy. But write $X = I_1 + I_2 + \cdots + I_n$ where $I_k$ is $1$ if $k$ is a fixed point and $0$ otherwise. Each $I_k$ has $E[I_k] = 1/n$ — the probability that $k$ maps to itself. By linearity, $E[X] = n \cdot (1/n) = 1$. On average, a random permutation has one fixed point, no matter how large $n$ is. The indicators are not independent; linearity didn't care.
  • Sum of dice. Roll $n$ dice. Let $S$ be their sum. Then $S = X_1 + \cdots + X_n$ and $E[S] = n \cdot E[X_1] = 3.5 n$. The variables happen to be independent here, but linearity didn't need that fact either.

What linearity does not give you

Expectations are linear, but they are not multiplicative. In general $E[XY] \neq E[X] \, E[Y]$ — the product rule needs independence (or at least uncorrelatedness). And expectations don't commute with non-linear functions: $E[g(X)] \neq g(E[X])$ for nonlinear $g$, a fact captured by Jensen's inequality.
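
Here is a small exact check of both statements, under an assumed setup that is not from the text: roll one fair die, let $X$ be the face shown and $Y = 7 - X$ the face on the opposite side. The two variables are completely dependent, yet expectations still add, while the product rule and $E[g(X)] = g(E[X])$ both fail.

```python
from fractions import Fraction

outcomes = range(1, 7)   # one roll of a fair die
p = Fraction(1, 6)

def expect(g):
    """E[g(outcome)] over the six equally likely outcomes."""
    return sum(g(w) * p for w in outcomes)

def X(w):
    return w             # the face shown

def Y(w):
    return 7 - w         # the opposite face: fully determined by X

e_x, e_y = expect(X), expect(Y)
e_sum = expect(lambda w: X(w) + Y(w))     # equals e_x + e_y despite dependence
e_prod = expect(lambda w: X(w) * Y(w))    # differs from e_x * e_y
e_x_squared = expect(lambda w: X(w) ** 2) # differs from e_x ** 2

print(e_sum, e_x + e_y)
print(e_prod, e_x * e_y)
print(e_x_squared, e_x ** 2)
```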

8. Functions of a random variable (LOTUS)

You often want the expectation not of $X$ itself, but of some function of it — $E[X^2]$, $E[\sin X]$, $E[\log(1 + X)]$. The naive route is to first find the distribution of $Y = g(X)$, then compute $E[Y]$ from there. That's almost always more work than necessary. There's a shortcut.

Law of the unconscious statistician (LOTUS)

For any (well-behaved) function $g$ and any RV $X$:

$\displaystyle E[g(X)] = \sum_x g(x) \cdot p_X(x) \quad \text{(discrete)}$

$\displaystyle E[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x)\, dx \quad \text{(continuous)}$

The name is a long-running joke: you compute $E[g(X)]$ by integrating $g(x)$ against the density of $X$ — even though, strictly, the expectation should be defined as $\int y f_Y(y)\,dy$ where $Y = g(X)$. The two integrals are equal, and most working statisticians use the shortcut "unconsciously" without justifying it from first principles each time.

Worked example. Let $X$ be a fair die. Compute $E[X^2]$:

$$ E[X^2] = \sum_{k=1}^{6} k^2 \cdot \tfrac{1}{6} = \tfrac{1 + 4 + 9 + 16 + 25 + 36}{6} = \tfrac{91}{6}. $$

Then variance falls out:

$$ \operatorname{Var}(X) = E[X^2] - (E[X])^2 = \tfrac{91}{6} - (\tfrac{7}{2})^2 = \tfrac{91}{6} - \tfrac{49}{4} = \tfrac{182 - 147}{12} = \tfrac{35}{12}. $$

Notice you never had to figure out the distribution of $X^2$ — its support, its PMF, anything. LOTUS lets you stay in the world of $X$ and just multiply by $g$.
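
To see that the shortcut and the long route agree, the sketch below computes $E[X^2]$ both ways for an assumed example: $X$ uniform on $\{-2, -1, 0, 1, 2\}$, chosen because $g(x) = x^2$ is not one-to-one there, so the PMF of $Y = g(X)$ requires merging values.

```python
from collections import defaultdict
from fractions import Fraction

pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

def g(x):
    return x * x

# LOTUS: weight g(x) by the PMF of X directly.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Long route: first build the PMF of Y = g(X), merging x and -x, then average y.
pmf_y = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
long_route = sum(y * p for y, p in pmf_y.items())

print(dict(pmf_y))
print(lotus, long_route)
```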

9. Independence of two random variables

Two events are independent when one tells you nothing about the other. The same idea applies to random variables — extended over every possible value.

Independent random variables

Random variables $X$ and $Y$ are independent if for all $x$ and $y$,

$P(X \le x, Y \le y) = P(X \le x) \, P(Y \le y).$

Equivalently, the joint PMF/PDF factors into the marginals:

$p_{X,Y}(x, y) = p_X(x)\, p_Y(y) \quad \text{or} \quad f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$

When $X$ and $Y$ are independent, two further identities unlock that don't hold in general:

  • $E[XY] = E[X] \, E[Y]$. (For dependent variables this can fail in either direction.)
  • $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$. (Without independence there's a covariance correction term.)

Compare carefully against linearity: $E[X + Y] = E[X] + E[Y]$ always, but $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ only when independent. Mixing these up — assuming variances add for dependent variables — is one of the most expensive mistakes you can make in probabilistic modeling.

Mutual independence vs. pairwise

For three or more variables, "independent" usually means mutually independent: every subset factorizes. Pairwise independence (each pair independent) is strictly weaker, and you can construct examples where every pair is independent but the three together aren't. When in doubt, state which one you mean.
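
One standard construction of such an example, sketched here under the assumption of two fair bits $X$ and $Y$ with $Z = X \oplus Y$ (exclusive or): every pair factorizes, but the triple does not, because $Z$ is determined by the other two.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes (x, y, z) with z = x XOR y.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
p = Fraction(1, len(outcomes))

def prob(pred):
    """Probability of the set of outcomes satisfying the predicate."""
    return sum((p for w in outcomes if pred(w)), Fraction(0))

def pair_independent(i, j):
    """Check P(Wi = a, Wj = b) = P(Wi = a) P(Wj = b) for all bit values a, b."""
    return all(
        prob(lambda w: w[i] == a and w[j] == b)
        == prob(lambda w: w[i] == a) * prob(lambda w: w[j] == b)
        for a, b in product((0, 1), repeat=2)
    )

pairwise = all(pair_independent(i, j) for i, j in [(0, 1), (0, 2), (1, 2)])

# Mutual independence would require the triple to factorize as well.
triple_joint = prob(lambda w: w == (1, 1, 1))
triple_product = (
    prob(lambda w: w[0] == 1) * prob(lambda w: w[1] == 1) * prob(lambda w: w[2] == 1)
)

print(pairwise, triple_joint, triple_product)
```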

10. Preview: joint distributions and covariance

So far every random variable on this page has lived alone. The full theory describes several RVs at once — their joint distribution — and quantifies how they vary together. A future topic will go deep on this; here's a one-paragraph preview so the terminology doesn't catch you by surprise.

The joint behavior of $X$ and $Y$ is captured by the joint PMF $p_{X,Y}(x, y) = P(X = x, Y = y)$ or, in the continuous case, the joint PDF $f_{X,Y}(x, y)$. The summary of how they move together is the covariance:

$$ \operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]. $$

Positive covariance means $X$ and $Y$ tend to be above (or below) their means together; negative means one is above while the other is below; zero means linearly uncorrelated. The general identity for variance of a sum is

$$ \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y), $$

which collapses to the independence formula when $\operatorname{Cov}(X, Y) = 0$. The standardized version of covariance, dividing by $\sigma_X \sigma_Y$, is the famous correlation coefficient. We'll do all of this properly when we get to joint distributions; for now, just notice that the framework you've built here generalises smoothly to many variables.
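
The variance-of-a-sum identity can be checked exactly on a small dependent pair. The setup below is an assumption made for illustration: roll two fair dice, let $X$ be the first die and $Y$ the larger of the two. The helper names `expect` and `cov` are likewise hypothetical.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # (first die, second die)
p = Fraction(1, 36)

def expect(g):
    """E[g(outcome)] over the 36 equally likely outcomes."""
    return sum(g(w) * p for w in outcomes)

def X(w):
    return w[0]        # first die

def Y(w):
    return max(w)      # larger of the two dice, so it depends on X

def cov(U, V):
    """Cov(U, V) = E[UV] - E[U] E[V]; cov(U, U) is Var(U)."""
    return expect(lambda w: U(w) * V(w)) - expect(U) * expect(V)

def S(w):
    return X(w) + Y(w)

var_x, var_y, cov_xy = cov(X, X), cov(Y, Y), cov(X, Y)
var_sum = cov(S, S)

print(var_sum, var_x + var_y + 2 * cov_xy, cov_xy)
```

The covariance comes out positive, as expected: a large first die pushes the maximum up with it.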

11. Common pitfalls

Treating a PDF value as a probability

$f_X(x)$ is a probability density — units of probability per unit of $x$, not probability itself. It can exceed $1$. To get a probability, integrate $f_X$ over an interval. If you ever see the inequality $f_X(x) \le 1$ used as a "check," you've conflated PDF with PMF.

Assuming $E[g(X)] = g(E[X])$

This fails in general for nonlinear $g$. For instance, $E[X^2]$ is almost never equal to $(E[X])^2$: the difference between them is the variance, which is zero only when $X$ is (almost surely) constant. Jensen's inequality formalises the direction of the gap: for convex $g$, $E[g(X)] \ge g(E[X])$; for concave $g$, the inequality reverses.

$\operatorname{Var}(aX) = a\operatorname{Var}(X)$ — no

The correct identity is $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$. The square comes from squaring inside the variance definition. The standard deviation gets the absolute value, $\sigma_{aX} = |a|\sigma_X$. Mixing these up by a factor of $a$ versus $a^2$ is a hand-grenade-sized error.

Multiplying expectations without independence

$E[XY] = E[X]\,E[Y]$ only when $X$ and $Y$ are independent (or, more weakly, uncorrelated). For dependent variables, the equality typically fails — and the gap is exactly $\operatorname{Cov}(X, Y)$. Whenever you reach for the product rule, ask explicitly whether independence holds.

Forgetting that a continuous RV has $P(X = a) = 0$

This isn't a curiosity; it has real consequences. The endpoints in $P(a \le X \le b)$ vs $P(a < X < b)$ don't matter for continuous $X$ — they're all equal. But for discrete $X$ they very much do, because the endpoints can carry positive mass. Mixing the discrete and continuous conventions silently mid-problem is a classic source of off-by-something errors.

Variance coming out negative

Variance is $E[(X - \mu)^2]$, the expectation of a squared quantity. It must be $\ge 0$. If your arithmetic produces a negative variance, there's an error upstream. Most often, you used the computational form but subtracted in the wrong order, computing $(E[X])^2 - E[X^2]$, or made an arithmetic slip in one of the two expectations.

12. Worked examples

Try each one before reading the solution. The arithmetic is rarely the hard part; the skill is picking the right formula and applying it cleanly.

Example 1 · Mean and variance of a fair die

Setup. $X$ is the result of one roll of a fair six-sided die. PMF $p_X(k) = 1/6$ for $k \in \{1, 2, 3, 4, 5, 6\}$.

Mean.

$$ E[X] = \tfrac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \tfrac{21}{6} = \tfrac{7}{2}. $$

$E[X^2]$ via LOTUS.

$$ E[X^2] = \tfrac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \tfrac{91}{6}. $$

Variance.

$$ \operatorname{Var}(X) = E[X^2] - (E[X])^2 = \tfrac{91}{6} - \tfrac{49}{4} = \tfrac{182 - 147}{12} = \tfrac{35}{12} \approx 2.917. $$

Standard deviation. $\sigma_X = \sqrt{35/12} \approx 1.708$.

Example 2 · Uniform on $[0, 1]$

Setup. $X$ is uniformly distributed on $[0, 1]$: $f_X(x) = 1$ on that interval, $0$ elsewhere.

Mean.

$$ E[X] = \int_0^1 x \cdot 1\, dx = \tfrac{1}{2}. $$

$E[X^2]$.

$$ E[X^2] = \int_0^1 x^2\, dx = \tfrac{1}{3}. $$

Variance.

$$ \operatorname{Var}(X) = \tfrac{1}{3} - \tfrac{1}{4} = \tfrac{1}{12}. $$

Standard deviation $\sigma_X = 1/\sqrt{12} \approx 0.289$, a little more than half the distance ($0.5$) from the mean to either edge of the interval.

Example 3 · CDF of an exponential distribution

Setup. $X \sim \text{Exp}(\lambda)$ has $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$.

CDF. For $x \ge 0$,

$$ F_X(x) = \int_0^x \lambda e^{-\lambda t}\, dt = 1 - e^{-\lambda x}. $$

And $F_X(x) = 0$ for $x < 0$. From this you can read off, for example, the survival function $P(X > t) = 1 - F_X(t) = e^{-\lambda t}$. Memorylessness follows from it: $P(X > s + t \mid X > s) = e^{-\lambda(s+t)} / e^{-\lambda s} = e^{-\lambda t} = P(X > t)$.

Sanity check. Differentiating: $\dfrac{d}{dx}(1 - e^{-\lambda x}) = \lambda e^{-\lambda x} = f_X(x)$. The PDF and CDF agree, as they must.
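
The same sanity check can be run numerically. This sketch compares the closed-form CDF with a midpoint-rule integral of the PDF, and also checks the memoryless property of the survival function $P(X > t)$. The values $\lambda = 1.5$, $s = 0.7$, $t = 1.3$ are arbitrary, and `integrate` is a helper written for this example.

```python
import math

def exp_pdf(x, lam):
    """Density lam * exp(-lam * x) for x >= 0, and 0 otherwise."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exp_cdf(x, lam):
    """Closed-form CDF 1 - exp(-lam * x) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, n=20_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

lam = 1.5
numeric_cdf = integrate(lambda t: exp_pdf(t, lam), 0.0, 2.0)
closed_cdf = exp_cdf(2.0, lam)

def survival(t, lam):
    """P(X > t) = 1 - F(t)."""
    return 1.0 - exp_cdf(t, lam)

# Memorylessness: P(X > s + t | X > s) should equal P(X > t).
s, t = 0.7, 1.3
conditional = survival(s + t, lam) / survival(s, lam)

print(numeric_cdf, closed_cdf)
print(conditional, survival(t, lam))
```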

Example 4 · Expected number of fixed points of a random permutation

Setup. Take a uniformly random permutation $\pi$ of $\{1, 2, \ldots, n\}$. Let $X$ be the number of $k$ with $\pi(k) = k$ — the count of fixed points.

Decompose. Write $X = I_1 + I_2 + \cdots + I_n$, where $I_k = 1$ if $\pi(k) = k$ and $0$ otherwise.

Marginal of each indicator. By symmetry, each $k$ is equally likely to be mapped to any of the $n$ positions, so

$$ E[I_k] = P(\pi(k) = k) = \tfrac{1}{n}. $$

Linearity.

$$ E[X] = \sum_{k=1}^{n} E[I_k] = n \cdot \tfrac{1}{n} = 1. $$

The expected number of fixed points is exactly $1$, regardless of $n$. The $I_k$ are not independent — knowing $\pi(1) = 1$ slightly biases the conditional distribution of $\pi(2)$ — but linearity didn't ask.
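
For small $n$ the result can be confirmed by brute force. The sketch below enumerates every permutation (using 0-based positions, an implementation choice) and averages the fixed-point count exactly. It also checks, for $n = 3$, that two of the indicators are not independent.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Exact average number of fixed points over all n! permutations of range(n)."""
    total = 0
    count = 0
    for perm in permutations(range(n)):
        total += sum(1 for k, image in enumerate(perm) if image == k)
        count += 1
    return Fraction(total, count)

# Enumeration is only feasible for small n; 7! = 5040 permutations at most here.
results = {n: expected_fixed_points(n) for n in range(1, 8)}

# Dependence check for n = 3: P(both 0 and 1 fixed) vs P(0 fixed) * P(1 fixed).
perms3 = list(permutations(range(3)))
p_both = Fraction(sum(1 for s in perms3 if s[0] == 0 and s[1] == 1), len(perms3))
p_each = Fraction(sum(1 for s in perms3 if s[0] == 0), len(perms3))

print(results)
print(p_both, p_each * p_each)
```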

Example 5 · Variance of a sum: independent vs dependent

Setup A (independent). Let $X$ and $Y$ be independent rolls of fair dice. Then

$$ \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) = \tfrac{35}{12} + \tfrac{35}{12} = \tfrac{35}{6}. $$

Setup B (perfectly dependent). Now roll only one die and let $X$ be the outcome and $Y = X$ (so $Y$ is the very same roll). Linearity gives $E[X + Y] = 2 \cdot \tfrac{7}{2} = 7$ — same as before. But

$$ \operatorname{Var}(X + Y) = \operatorname{Var}(2X) = 4 \operatorname{Var}(X) = \tfrac{140}{12} = \tfrac{35}{3}, $$

which is exactly twice the independent-case variance. The means added the same way in both setups; the variances did not. Dependence inflated the spread.
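
Both setups can be verified exactly by listing the equally likely values of $X + Y$ in each case. The helper `variance` below assumes its input is a list of equally likely values, which holds for both setups.

```python
from fractions import Fraction
from itertools import product

def variance(values):
    """Variance of a random variable that is uniform over the listed outcomes."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n

faces = range(1, 7)

# Setup A: two independent dice, 36 equally likely (x, y) pairs.
sums_independent = [x + y for x, y in product(faces, repeat=2)]

# Setup B: Y is the same roll as X, so X + Y = 2X over 6 equally likely outcomes.
sums_same_roll = [2 * x for x in faces]

var_a = variance(sums_independent)
var_b = variance(sums_same_roll)

print(var_a, var_b, var_b / var_a)
```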

Example 6 · LOTUS for a continuous transformation

Setup. $X \sim \text{Uniform}[0, 1]$ and we want $E[X^3]$.

LOTUS.

$$ E[X^3] = \int_0^1 x^3 \cdot 1\, dx = \tfrac{x^4}{4}\Big|_0^1 = \tfrac{1}{4}. $$

Note that we did not first derive the distribution of $Y = X^3$. That would also work, using $f_Y(y) = \tfrac{1}{3} y^{-2/3}$ on $(0,1]$, but it's strictly more effort than LOTUS, and any extra effort here is a step at which something can go wrong.
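
A numerical sketch of both routes, using a hand-written midpoint rule (`integrate`, an assumed helper). The midpoint rule never evaluates the integrand at $y = 0$, where the density of $Y$ blows up, so the second integral can be approximated without special handling.

```python
def integrate(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Route 1 (LOTUS): integrate g(x) = x^3 against the density of X, which is 1 on [0, 1].
lotus = integrate(lambda x: x ** 3 * 1.0, 0.0, 1.0)

# Route 2: integrate y against the density of Y = X^3, f_Y(y) = (1/3) y^(-2/3).
def density_y(y):
    return (1.0 / 3.0) * y ** (-2.0 / 3.0)

via_density = integrate(lambda y: y * density_y(y), 0.0, 1.0)

print(lotus, via_density)
```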
