1. What a distribution is
A distribution is a description of the probabilities for all possible values a random variable $X$ can take. It assigns total probability $1$ across the set of outcomes, never more, never less.
The shape of that description depends on what kind of values $X$ can take.
Discrete: a table
If $X$ takes values from a countable set — like $\{0, 1, 2, \dots\}$ — its distribution is a probability mass function (PMF): a rule $p(x) = P(X = x)$ that gives the probability of each specific value. The list of all those probabilities sums to $1$:
$$ \sum_{x} p(x) = 1 $$
You can write it out as a table. For a fair six-sided die:
| $x$ | $1$ | $2$ | $3$ | $4$ | $5$ | $6$ |
|---|---|---|---|---|---|---|
| $P(X = x)$ | $\tfrac{1}{6}$ | $\tfrac{1}{6}$ | $\tfrac{1}{6}$ | $\tfrac{1}{6}$ | $\tfrac{1}{6}$ | $\tfrac{1}{6}$ |
That table is the distribution.
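As a small illustration, the die table can be stored as a lookup from value to probability. This is a minimal sketch; the names `die_pmf`, `total`, and `p_even` are chosen here for illustration and exact fractions are used so the sum is exactly $1$.

```python
from fractions import Fraction

# PMF of a fair six-sided die, stored as a value -> probability table.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# A valid PMF has non-negative entries that sum to exactly 1.
total = sum(die_pmf.values())

# Probabilities of events are sums of table entries, e.g. P(X is even).
p_even = sum(p for face, p in die_pmf.items() if face % 2 == 0)

print(total, p_even)
```

Any event probability for a discrete variable comes from adding up the relevant entries of the table in this way.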
Continuous: a curve
If $X$ takes values from a continuum (like all real numbers, or all positive times) there are uncountably many outcomes, and no assignment of a positive probability to every one of them could sum to $1$. So we describe a continuous random variable by a probability density function (PDF) $f(x)$, with the rule that probability comes from area under the curve:
$$ P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1 $$
The density at a single point isn't a probability; it's a rate of probability per unit of $x$. The probability of any exact value is zero. (We'll harp on this in the pitfalls.)
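To make the "density is not probability" point concrete, here is a rough numerical sketch. The density used (flat on $[0, 0.5]$, so its height is $2$) and the helper `area` are assumptions chosen for illustration; the integral is approximated with a simple midpoint rule.

```python
def density(x):
    """Hypothetical density: flat on [0, 0.5], so the height must be 2."""
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

def area(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

height_at_point = density(0.25)            # a density value, larger than 1
p_interval = area(density, 0.1, 0.3)       # probability of an interval
p_total = area(density, 0.0, 0.5)          # total area under the curve
p_exact_point = area(density, 0.25, 0.25)  # zero-width interval

print(height_at_point, p_interval, p_total, p_exact_point)
```

The density value exceeds $1$, which would be impossible for a probability, while the areas behave as probabilities should: the total is about $1$ and a zero-width interval contributes nothing.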
2. Discrete: Bernoulli, binomial, geometric, Poisson, uniform
A handful of named families cover an enormous share of everyday discrete situations. Bernoulli is a single trial; binomial counts successes in a fixed number of trials; geometric counts trials until the first success; Poisson counts rare events in a window; discrete uniform spreads probability equally across a finite set.
Bernoulli — a single yes/no trial
A Bernoulli random variable models one trial with two outcomes: success (call it $1$) with probability $p$, failure ($0$) with probability $1 - p$.
$$ P(X = 1) = p, \qquad P(X = 0) = 1 - p $$
Mean $\mu = p$, variance $\sigma^2 = p(1-p)$. A single coin flip with $p = 0.5$ is the canonical example, but it works for any binary trial: a click or no click, a defective part or not, a hit or a miss.
Binomial — counting successes in $n$ Bernoulli trials
Run $n$ independent Bernoulli trials, each with probability $p$ of success, and count the number $k$ of successes. That count is binomial:
$$ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad k = 0, 1, \dots, n $$
The binomial coefficient $\binom{n}{k}$ counts the number of ways to choose which $k$ of the $n$ trials are the successes; the $p^k (1-p)^{n-k}$ is the probability of any one specific arrangement. Mean $\mu = np$, variance $\sigma^2 = np(1-p)$.
Figure: sketch of the PMF for $n = 10, p = 0.4$. Bars stand at each integer $k$ and their heights add to $1$.
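The bar heights for $n = 10, p = 0.4$ can be computed directly from the binomial formula. This is a minimal sketch using only the standard library; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.4
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                   # bar heights add to 1
mean = sum(k * pk for k, pk in enumerate(pmf))     # should match n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
mode = max(range(n + 1), key=lambda k: pmf[k])     # position of the tallest bar

for k, pk in enumerate(pmf):
    print(f"{k:2d} {pk:.4f} {'#' * round(pk * 100)}")
```

The computed mean and variance can be compared against $np$ and $np(1-p)$ as a sanity check on the formula.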
Geometric — trials until the first success
Repeat independent Bernoulli($p$) trials and ask: which trial is the first success? Let $X$ count the number of trials needed (so $X \in \{1, 2, 3, \dots\}$). Then $X$ is geometric:
$$ P(X = k) = (1 - p)^{k - 1} \, p, \qquad k = 1, 2, 3, \dots $$
The story is direct: $k - 1$ failures in a row, each with probability $1 - p$, followed by a success with probability $p$. Mean $\mu = 1/p$, variance $\sigma^2 = (1-p)/p^2$. The geometric is the discrete cousin of the exponential and shares its memoryless property: given you've already failed $j$ times, the number of additional trials needed is still geometric with the same $p$.
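The memoryless property can be checked numerically. In this sketch the values $p = 0.2$ and $j = 5$ are arbitrary illustrative choices, and the tail probability $P(X > j) = (1-p)^j$ follows from requiring the first $j$ trials to all fail.

```python
def geometric_pmf(k, p):
    """P(X = k): k - 1 failures followed by one success."""
    return (1 - p) ** (k - 1) * p

def geometric_tail(j, p):
    """P(X > j): the first j trials are all failures."""
    return (1 - p) ** j

p, j = 0.2, 5  # illustrative values

# P(X = j + m | X > j) compared with P(X = m) for a few m.
conditional = [geometric_pmf(j + m, p) / geometric_tail(j, p) for m in range(1, 6)]
unconditional = [geometric_pmf(m, p) for m in range(1, 6)]

# Truncated sum for the mean; the tail beyond k = 2000 is negligible here.
approx_mean = sum(k * geometric_pmf(k, p) for k in range(1, 2001))

print(conditional, unconditional, approx_mean)
```

The two lists agree up to floating-point error, which is the memoryless property in action: past failures do not change the distribution of the remaining wait.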
Poisson — counting rare events in a fixed interval
Sometimes you don't have a fixed $n$ — you have a window of time or space (an hour at a call center, a square meter of fabric, a kilometer of road) and you count how many events fall inside it. If events arrive independently at a constant average rate $\lambda$ per window, the count $X$ is Poisson:
$$ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots $$
Mean $\mu = \lambda$, variance $\sigma^2 = \lambda$. The same number for both, which is itself a clue when you suspect a Poisson in real data. The Poisson is also the limit of the binomial when $n \to \infty$, $p \to 0$, and $np \to \lambda$: many opportunities, each tiny, with a fixed average count.
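The binomial-to-Poisson limit can be watched numerically. This sketch fixes $\lambda = 3$ (an arbitrary choice), sets $p = \lambda / n$, and measures the largest gap between the two PMFs over $k = 0, \dots, 10$ as $n$ grows; the helper `max_gap` is hypothetical.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

lam = 3.0  # illustrative average count per window

def max_gap(n):
    """Largest |binomial - Poisson| over k = 0..10 when p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))

gaps = {n: max_gap(n) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(n, gap)
```

The gap shrinks as $n$ increases with $np$ held fixed, which is what the limit statement predicts.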
Discrete uniform — every outcome equally likely
If $X$ takes one of $n$ equally likely values $\{x_1, x_2, \dots, x_n\}$, it is discrete uniform:
$$ P(X = x_i) = \frac{1}{n}, \qquad i = 1, 2, \dots, n $$
A fair die ($n = 6$, values $\{1, \dots, 6\}$) is the canonical example. For consecutive integers $\{1, 2, \dots, n\}$: mean $\mu = (n+1)/2$, variance $\sigma^2 = (n^2 - 1)/12$.
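The mean and variance formulas for $\{1, 2, \dots, n\}$ can be verified exactly for the die. A minimal sketch with exact fractions:

```python
from fractions import Fraction

n = 6                      # a fair die
values = range(1, n + 1)
p = Fraction(1, n)         # every value equally likely

mean = sum(x * p for x in values)
variance = sum((x - mean) ** 2 * p for x in values)

print(mean, variance)                                 # computed from the PMF
print(Fraction(n + 1, 2), Fraction(n * n - 1, 12))    # closed-form formulas
```

Both lines print the same pair of fractions, so the closed forms agree with the direct computation for $n = 6$.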
A single yes/no trial → Bernoulli. A fixed number $n$ of repeated yes/no trials, count successes → binomial. Repeat trials until the first success, count trials → geometric. Counts of rare events in a continuous window with no fixed $n$ → Poisson. Every outcome in a finite set equally likely → discrete uniform.
3. Continuous: uniform, exponential, normal
For continuous random variables we trade tables for densities. Three families do most of the work.
Uniform — flat density on an interval
If every value in $[a, b]$ is equally likely and nothing outside the interval is possible, $X$ is uniform on $[a, b]$:
$$ f(x) = \begin{cases} \dfrac{1}{b - a} & a \leq x \leq b \\[4pt] 0 & \text{otherwise} \end{cases} $$
The constant height $1/(b-a)$ is exactly what's needed for the total area to be $1$. Mean $\mu = (a+b)/2$, variance $\sigma^2 = (b-a)^2/12$.
Exponential — waiting time for a Poisson event
If events occur as a Poisson process with rate $\lambda$, the waiting time $X$ until the next event is exponential:
$$ f(x) = \lambda e^{-\lambda x}, \qquad x \geq 0 $$
Mean $\mu = 1/\lambda$, variance $\sigma^2 = 1/\lambda^2$. The exponential is "memoryless": knowing you've already waited 10 minutes tells you nothing about how much longer you'll wait. That's not intuitive, but it's exactly the property that makes the exponential the right model for waits between independent Poisson arrivals.
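Memorylessness can be checked with the survival function $P(X > x) = e^{-\lambda x}$, which follows from integrating the density from $x$ to infinity. The rate and waiting times below are assumptions picked for illustration.

```python
from math import exp

def survival(x, lam):
    """P(X > x) for an exponential with rate lam."""
    return exp(-lam * x)

lam = 0.1                  # assumed rate: on average one event per 10 minutes
waited, extra = 10.0, 5.0  # already waited 10 minutes; ask about 5 more

# P(X > waited + extra | X > waited) versus P(X > extra) from a fresh start.
conditional = survival(waited + extra, lam) / survival(waited, lam)
fresh = survival(extra, lam)

print(conditional, fresh)
```

The two numbers match up to rounding: having already waited changes nothing about the chance of waiting another five minutes.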
Normal (Gaussian) — the bell curve
The normal distribution with mean $\mu$ and standard deviation $\sigma$ has density
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$
Symmetric about $\mu$, with $\sigma$ controlling the width. The standard normal is the special case $\mu = 0, \sigma = 1$, denoted $Z$ and tabulated in every statistics textbook because every other normal can be rescaled to it via $Z = (X - \mu)/\sigma$.
The 68–95–99.7 rule is worth memorizing: roughly $68\%$ of the probability sits within $\pm 1\sigma$ of the mean, $95\%$ within $\pm 2\sigma$, and $99.7\%$ within $\pm 3\sigma$.
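The rule can be reproduced with Python's standard-library `statistics.NormalDist`. In this sketch the parameters $\mu = 100, \sigma = 15$ are arbitrary; the helper `within` is a name introduced here.

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)   # an arbitrary normal, for illustration
Z = NormalDist()                   # the standard normal (mu=0, sigma=1)

def within(dist, k):
    """Probability that dist falls within k standard deviations of its mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

coverage = {k: within(X, k) for k in (1, 2, 3)}
standard = {k: within(Z, k) for k in (1, 2, 3)}

for k in (1, 2, 3):
    print(k, round(coverage[k], 4), round(standard[k], 4))
```

The coverage is the same for `X` and `Z`, which is the rescaling $Z = (X - \mu)/\sigma$ at work: the percentages depend only on how many standard deviations you move from the mean.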
4. The normal distribution — why it's everywhere
The normal isn't just one distribution among many — it's the shape that quietly takes over whenever you add up enough small independent random influences. Heights, measurement errors, sample means, sums of dice rolls — the more independent random nudges contribute, the closer the result lies to a bell curve.
This isn't folklore. It's the Central Limit Theorem (CLT): if $X_1, X_2, \dots, X_n$ are independent, identically distributed random variables (from any reasonable distribution) with mean $\mu$ and finite variance $\sigma^2$, then as $n$ grows the standardized sample mean
$$ Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} $$
converges in distribution to the standard normal $N(0, 1)$. The underlying distribution can be wildly non-normal (uniform, skewed, even discrete) and the average still pulls toward the bell. That's why the normal shows up so often in measurements of natural quantities: many of those quantities are, at least approximately, sums of many small independent contributions.
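A quick simulation gives a feel for this. The sketch below is only an illustration under stated assumptions: the underlying distribution is a skewed exponential with rate $1$ (so $\mu = \sigma = 1$), and the sample size $n = 50$, the number of repetitions, and the seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)     # fixed seed so the run is repeatable
mu, sigma = 1.0, 1.0       # mean and standard deviation of Exponential(rate=1)
n, trials = 50, 5000       # sample size and number of repeated samples

def standardized_mean():
    """Draw n skewed values and return Z_n = (xbar - mu) / (sigma / sqrt(n))."""
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return (xbar - mu) / (sigma / n**0.5)

z_values = [standardized_mean() for _ in range(trials)]

z_mean = mean(z_values)
z_sd = pstdev(z_values)
share_within_1 = sum(abs(z) <= 1 for z in z_values) / trials
share_within_2 = sum(abs(z) <= 2 for z in z_values) / trials

print(z_mean, z_sd, share_within_1, share_within_2)
```

If the standardized means behave like a standard normal, the mean should be near $0$, the spread near $1$, and the shares within one and two units should land near $68\%$ and $95\%$, even though each individual draw comes from a strongly skewed distribution.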
The CLT is important enough that the next topic is devoted entirely to it. This page just plants the flag — the explanation belongs there.
Plenty of natural data isn't normal: incomes, file sizes, city populations, and earthquake magnitudes are famously heavy-tailed (often power-law or log-normal). Always reaching for the normal as a default is a beginner's mistake; reach for it when the data-generating process is "many small independent contributions added together."
5. PDF vs CDF
For any random variable, there are two equivalent ways to write down its distribution: the PDF (or PMF, in the discrete case) and the CDF.
The cumulative distribution function (CDF) is the function $F(x) = P(X \leq x)$: the probability that $X$ is at most $x$. It's the running total of probability up to $x$.
The CDF accumulates the PDF:
$$ F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{(continuous)}, \qquad F(x) = \sum_{t \leq x} p(t) \quad \text{(discrete)} $$
So $F$ is just the running area (or running sum) of the PDF (or PMF). Conversely, in the continuous case you recover the PDF by differentiating: $f(x) = F'(x)$ wherever $F$ is differentiable.
Three useful facts:
- $F$ is non-decreasing — probability never goes negative, so the running total can only grow.
- $F(-\infty) = 0$ and $F(\infty) = 1$ — start with no probability, end with all of it.
- $P(a < X \leq b) = F(b) - F(a)$ — interval probabilities are differences of CDF values. This is how statistical tables let you compute probabilities without doing the integral yourself.
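These relationships can be checked numerically for the exponential, whose CDF is $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$ (obtained by integrating its density). The rate $\lambda = 0.5$ and the interval endpoints are illustrative assumptions.

```python
from math import exp

lam = 0.5  # illustrative rate

def pdf(x):
    return lam * exp(-lam * x) if x >= 0 else 0.0

def cdf(x):
    return 1 - exp(-lam * x) if x >= 0 else 0.0

a, b = 1.0, 3.0

# Interval probability by subtraction of CDF values.
by_subtraction = cdf(b) - cdf(a)

# The same probability as area under the PDF (midpoint rule).
steps = 100_000
width = (b - a) / steps
by_integration = sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

# The slope of the CDF at a point approximates the PDF there.
h = 1e-6
slope = (cdf(2.0 + h) - cdf(2.0 - h)) / (2 * h)

print(by_subtraction, by_integration, slope, pdf(2.0))
```

Subtraction and integration give the same interval probability, and the numerical slope of $F$ matches the density, which is the accumulate and differentiate relationship in both directions.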
| | PDF / PMF | CDF |
|---|---|---|
| What it returns | Density (or mass) at a point | Probability of being $\leq$ that point |
| Continuous shape | Curve (bell, flat, decaying, …) | S-shaped — flat, rises, flat again |
| Discrete shape | Bars at integer values | Step function |
| Probability of $[a, b]$ | $\int_a^b f(x)\,dx$ (area) | $F(b) - F(a)$ (subtraction) |
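For the discrete column of the table, a sketch with the fair die shows the step-function CDF and the subtraction rule. The function name `F` mirrors the notation above; exact fractions keep the arithmetic clean.

```python
from fractions import Fraction

faces = list(range(1, 7))
pmf = [Fraction(1, 6)] * 6   # fair die

def F(x):
    """Step-function CDF: total mass on faces that are <= x."""
    return sum((p for face, p in zip(faces, pmf) if face <= x), Fraction(0))

# Between integers the CDF is flat; it jumps by 1/6 at each face.
flat_between = F(2.5) == F(2)

# P(2 < X <= 5) as a difference of CDF values, and as a direct PMF sum.
by_subtraction = F(5) - F(2)
by_sum = sum(p for face, p in zip(faces, pmf) if 2 < face <= 5)

print(F(0), F(2), F(2.5), F(6), by_subtraction, by_sum)
```

The subtraction and the direct sum agree, and the flat stretch between faces is why the discrete CDF plots as a staircase.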