Regression & Correlation

Two quantities can move together for many reasons — or for none. Correlation gives you a single number that says how tightly they move in a straight line. Regression goes one step further and writes down that line, turning the relationship into a formula you can predict with. Used carefully, they are two of the most powerful tools in applied statistics. Used carelessly, they invent stories that aren't there.

What you'll leave with

  • A precise definition of the Pearson correlation coefficient $r$ — what it measures, and the things it stubbornly refuses to measure.
  • Visual intuition for what $r = 0.95, 0.5, 0, -0.5, -0.95$ actually look like.
  • The least-squares derivation of the simple-regression line $\hat{y} = b_0 + b_1 x$, and why the formulas $b_1 = r\,(s_y/s_x)$ and $b_0 = \bar{y} - b_1 \bar{x}$ fall out for free.
  • Residuals, the meaning of "good fit," and $R^2$ as the fraction of variance the line accounts for.
  • The matrix form of multiple regression — the normal equations $X^{\!\top}\! X \boldsymbol\beta = X^{\!\top}\!\mathbf{y}$ — and why decomposition methods like QR turn up in the same conversation.
  • Why Anscombe's quartet should haunt anyone who quotes a correlation without looking at the picture.

1. The correlation coefficient $r$

Suppose you have $n$ paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ — heights and weights, hours studied and exam scores, ad spend and revenue. The most basic question is: do they move together? When $x$ is above its mean, is $y$ also above its mean? When $x$ is below, is $y$ below too — or the opposite?

Multiply the two deviations $(x_i - \bar{x})(y_i - \bar{y})$ for each point. The product is positive when both deviations have the same sign (the pair moves together) and negative when they have opposite signs (the pair moves apart). Average those products and you get the covariance:

$$ \operatorname{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) $$

Covariance has the right sign, but its size depends on whatever units $x$ and $y$ are measured in. If you switch heights from centimeters to meters, the covariance shrinks by a factor of 100 without the relationship changing at all. That's annoying. The fix is to divide out the standard deviations of $x$ and $y$, which strips away the units and pins the answer between $-1$ and $1$.

Pearson correlation coefficient

For paired data $(x_i, y_i)$ with means $\bar{x}, \bar{y}$ and standard deviations $s_x, s_y$:

$$ r \;=\; \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \,\cdot\, \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

Equivalently, $r = \operatorname{Cov}(x, y) \,/\, (s_x s_y)$. The value always lies in $[-1, 1]$.

The standardized form is worth pausing on. If you standardize each variable (subtract its mean and divide by its standard deviation, computed with the same $1/n$ convention as the covariance above), then $r$ is literally the average of the products of the standardized scores: $r = \tfrac{1}{n}\sum z_{x,i}\, z_{y,i}$. Same number, expressed in a way that makes the dependence on units disappear.
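A minimal NumPy sketch of the two equivalent formulas, using four illustrative points (the same ones that appear in the worked examples later on). The helper names `pearson_r` and `pearson_r_zscores` are invented for this sketch; `np.corrcoef` is NumPy's built-in and is included only as a cross-check.

```python
import numpy as np

# Four illustrative points (the tiny dataset from the worked examples).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])


def pearson_r(x, y):
    """Pearson r from the sum-of-deviations formula."""
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))


def pearson_r_zscores(x, y):
    """Same quantity as the mean product of z-scores.

    np.std defaults to ddof=0 (divide by n), which matches the 1/n
    convention needed for the plain average of products to equal r.
    """
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return np.mean(zx * zy)


r_formula = pearson_r(x, y)
r_z = pearson_r_zscores(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_formula, r_z, r_numpy)
```

All three values should agree up to floating-point rounding. If the z-scores were built with the $n-1$ standard deviation instead, the plain average would need to become a sum divided by $n-1$ to give the same number.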

How to read $r$

| $r$ | Linear association |
| --- | --- |
| $+1$ | Perfect positive: every point on a single line with positive slope |
| $\approx +0.9$ | Strong positive trend, modest scatter around the line |
| $\approx +0.5$ | Moderate positive trend, very visible scatter |
| $0$ | No linear association (but possibly other kinds; see below) |
| $\approx -0.5$ | Moderate negative trend |
| $-1$ | Perfect negative: every point on a single line with negative slope |

$r$ is symmetric and unitless

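Swap the roles of $x$ and $y$, and $r$ is unchanged. Shift either variable by a constant, or rescale it by a positive factor (a change of units, say), and $r$ is still unchanged; multiplying one variable by a negative factor flips the sign of $r$ but leaves its magnitude alone. It's a pure number that describes the shape of the cloud of points, not the cloud's position or size.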
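These invariances are easy to check numerically. The sketch below reuses four illustrative points and a hypothetical helper `pearson_r`; the scale and shift constants are arbitrary. One caveat worth seeing in numbers: a negative scale factor flips the sign of $r$ while leaving its magnitude alone.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson r from the sum-of-deviations formula."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

r_base = pearson_r(x, y)
r_swapped = pearson_r(y, x)               # swap the roles of x and y
r_rescaled = pearson_r(100.0 * x, y)      # positive rescaling, e.g. a unit change
r_shifted = pearson_r(x + 50.0, y - 7.0)  # shift both variables
r_negated = pearson_r(-2.0 * x, y)        # negative factor: sign flips

print(r_base, r_swapped, r_rescaled, r_shifted, r_negated)
```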

Calibrate your eye

Most people overestimate weak correlations and underestimate strong ones. The "moderate" $r = 0.5$ cloud in the middle of the gallery is much noisier than newcomers expect; the $r = 0.95$ clouds, on the other hand, look essentially deterministic. Spend a minute training your intuition against these pictures before you ever quote a correlation in real work.

3. What $r$ does not measure

The Pearson coefficient is built specifically to detect linear association. It is silent — sometimes embarrassingly silent — about everything else.

Causation

A correlation of $0.95$ between ice-cream sales and drownings does not mean ice cream causes drowning. Both are driven by a third variable: it's summer. $r$ describes statistical co-movement and says nothing about which way the arrow points, or whether either variable causes the other at all. Establishing causation requires experimental control or careful causal-inference machinery — never just a high $r$.

Nonlinear relationships

Consider the deterministic relationship $y = x^2$, with $x$ values spread symmetrically over $[-3, 3]$. There is a perfect functional dependence (knowing $x$ tells you $y$ exactly) and yet the Pearson correlation is $0$: the mean of $x$ is $0$, so every $x$ above the mean has a mirror image $-x$ below it with the same $y$, and the positive and negative products cancel. The eye sees a parabola; $r$ sees nothing.

$r = 0$ is not "no relationship"

It is "no linear relationship." A U-shape, a sine wave, and a sharp threshold can all produce $r \approx 0$ while being utterly predictable. Always plot the data — that is the only foolproof check.
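The parabola case can be checked in a few lines. This is a sketch that assumes an evenly spaced grid of $x$ values symmetric about zero; the grid size is arbitrary.

```python
import numpy as np

# x values spread symmetrically about 0, so the mean of x is 0.
x = np.linspace(-3.0, 3.0, 61)
y = x**2  # y is a deterministic function of x

r = np.corrcoef(x, y)[0, 1]
print(r)  # zero up to floating-point rounding
```

If the grid were not symmetric (say, $x \in [0, 3]$), the same parabola would give a large positive $r$, so the zero here depends on the symmetry as well as the shape.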

The presence or absence of clusters

Two well-separated clusters of points (say, men and women on a height-vs-weight plot) can produce an $r$ that describes neither cluster individually. Slicing the data by the group label often reveals very different correlations within each group than you see across the pooled data; when the sign of the association actually reverses, this is Simpson's paradox.
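A small hand-made illustration of the cluster effect. The numbers are hypothetical and chosen only to make the effect stark: within each group the trend is perfectly negative, yet the pooled correlation is strongly positive.

```python
import numpy as np

# Hypothetical data: two groups, each with a perfectly negative trend.
x_a = np.array([1.0, 2.0, 3.0])
y_a = np.array([3.0, 2.0, 1.0])
x_b = np.array([6.0, 7.0, 8.0])
y_b = np.array([8.0, 7.0, 6.0])

r_a = np.corrcoef(x_a, y_a)[0, 1]  # within group A
r_b = np.corrcoef(x_b, y_b)[0, 1]  # within group B

# Pool the two groups and ignore the group label.
x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])
r_pooled = np.corrcoef(x_all, y_all)[0, 1]

print(r_a, r_b, r_pooled)
```

The pooled $r$ is driven by the gap between the two clusters, not by anything happening inside them.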

4. Simple linear regression by least squares

Correlation tells you how tightly two variables hug a straight line. Regression writes that line down. The question is: among all possible lines $\hat{y} = b_0 + b_1 x$ you could draw through the cloud of points, which one is the "best" fit?

The dominant answer for the last 200 years has been least squares: choose $b_0$ and $b_1$ to minimize the sum of squared vertical distances between the data points and the line.

Least-squares objective

Given paired data $(x_i, y_i)$, define the predicted value at $x_i$ as $\hat{y}_i = b_0 + b_1 x_i$. The least-squares regression line is the line that minimizes the residual sum of squares,

$$ \operatorname{RSS}(b_0, b_1) \;=\; \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 \;=\; \sum_{i=1}^{n}\bigl(y_i - b_0 - b_1 x_i\bigr)^2. $$

Why squared, and why vertical? Vertical because $y$ is the variable we want to predict from $x$ — horizontal errors would correspond to predicting $x$ from $y$, a different problem. Squared because (just as with variance) the algebra is enormously cleaner: the objective is a smooth quadratic in $b_0$ and $b_1$, so calculus finds the minimum in one shot, with no kinks to worry about.

Deriving the formulas

Minimize $\operatorname{RSS}$ by setting its partial derivatives to zero. For the intercept:

$$ \frac{\partial \operatorname{RSS}}{\partial b_0} \;=\; -2\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) \;=\; 0. $$

Divide by $-2n$ and rearrange — this just says the residuals average to zero, which forces the line to pass through the point of means $(\bar{x}, \bar{y})$:

$$ b_0 \;=\; \bar{y} - b_1 \bar{x}. \quad\text{(intercept formula)} $$

For the slope, take the partial in $b_1$:

$$ \frac{\partial \operatorname{RSS}}{\partial b_1} \;=\; -2\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) \;=\; 0. $$

Substitute $b_0 = \bar{y} - b_1\bar{x}$, do a little algebra, and the slope drops out:

$$ b_1 \;=\; \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \;=\; \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}. $$
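For readers who want the "little algebra" spelled out, here is one way to do it, using only the definitions above. Substituting $b_0 = \bar{y} - b_1\bar{x}$ into the slope condition gives

$$ \sum_{i=1}^{n} x_i\bigl[(y_i - \bar{y}) - b_1 (x_i - \bar{x})\bigr] \;=\; 0. $$

Because deviations from a mean sum to zero, $\sum_{i=1}^{n} \bar{x}\bigl[(y_i - \bar{y}) - b_1(x_i - \bar{x})\bigr] = 0$ as well. Subtracting this from the line above replaces the leading $x_i$ by $(x_i - \bar{x})$:

$$ \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \;=\; b_1 \sum_{i=1}^{n}(x_i - \bar{x})^2. $$

Dividing both sides by $\sum (x_i - \bar{x})^2$ gives the slope formula $b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) \,/\, \sum (x_i - \bar{x})^2$.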

This is exactly the same numerator as in the formula for $r$. Dividing top and bottom by $n s_x s_y$ produces a particularly clean version:

$$ \boxed{\,b_1 \;=\; r \cdot \frac{s_y}{s_x},\qquad b_0 \;=\; \bar{y} - b_1\bar{x}\,} $$

That single line packs in the whole story. The slope is the correlation, scaled by the ratio of standard deviations. The line passes through the mean of the data. Correlation and regression are not two unrelated procedures — they are the same calculation, packaged for different purposes.

Sanity check the slope formula

Imagine standardizing both variables, so $s_x = s_y = 1$. Then $b_1 = r$ exactly. In z-score space, the regression slope is the correlation — a beautiful identity that makes the symbol $r$ feel a lot less arbitrary.
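The boxed formulas and the z-score identity can be sketched in NumPy as follows. The data are four illustrative points; `np.polyfit` with `deg=1` is used only as an independent cross-check, and the variable names are this sketch's own.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

dx, dy = x - x.mean(), y - y.mean()

# Closed-form least-squares estimates.
b1 = np.sum(dx * dy) / np.sum(dx**2)
b0 = y.mean() - b1 * x.mean()

# Same slope via r * (s_y / s_x); np.std uses the 1/n convention by default.
r = np.corrcoef(x, y)[0, 1]
b1_from_r = r * y.std() / x.std()

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first).
slope_np, intercept_np = np.polyfit(x, y, deg=1)

# In z-score space the fitted slope equals r.
zx, zy = dx / x.std(), dy / y.std()
slope_z = np.sum(zx * zy) / np.sum(zx**2)

print(b0, b1, b1_from_r, slope_np, intercept_np, slope_z)
```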

5. Residuals and "good fit"

The residual at point $i$ is what's left over after the line has done its best:

$$ e_i \;=\; y_i - \hat{y}_i. $$

Each residual is the vertical distance from a data point to the fitted line. By construction, the least-squares line makes the sum of squared residuals as small as it can possibly be. The residuals also have two structural properties that fall straight out of the derivation:

  • $\sum e_i = 0$ — residuals average to zero (the partial-derivative condition for $b_0$).
  • $\sum x_i e_i = 0$ — residuals are uncorrelated with $x$ (the condition for $b_1$).

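Geometrically, the fitted line splits each observation $y_i$ into two pieces: the part the model explains ($\hat{y}_i$, lying on the line) and the part it doesn't ($e_i$, sticking up or down). Stacked into vectors over all $n$ observations, the residuals are perpendicular to the fitted values, since $\sum \hat{y}_i e_i = b_0 \sum e_i + b_1 \sum x_i e_i = 0$ by the two conditions above.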
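The two residual conditions, and the orthogonality they imply, can be checked directly. This sketch fits four illustrative points with `np.polyfit` and then inspects the residuals; nothing here is specific to this dataset.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

b1, b0 = np.polyfit(x, y, deg=1)  # slope first, then intercept
y_hat = b0 + b1 * x
e = y - y_hat                     # residuals

sum_e = np.sum(e)                 # condition from the b0 derivative
sum_xe = np.sum(x * e)            # condition from the b1 derivative
dot_fitted_resid = np.dot(y_hat, e)  # fitted values vs residuals, as vectors

print(e, sum_e, sum_xe, dot_fitted_resid)
```

All three sums should be zero up to rounding.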

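[Figure: scatterplot with $x$ and $y$ axes running from 0 to 10, showing observed points $(x_i, y_i)$, predicted points $(x_i, \hat{y}_i)$, residuals $e_i$, and the least-squares fit $\hat{y} = 1.28 + 0.87x$ ($R^2 \approx 0.95$).]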

Eight observations and their least-squares line. The dashed segments are residuals — each one is the vertical gap between an observed point and the line. The fit minimizes the sum of the squares of those segments.

"Good fit" then has a precise visual meaning. The residuals should look like static — scattered randomly above and below the line, with no obvious pattern. If you see curvature in the residuals (they go negative, then positive, then negative as $x$ increases), the relationship probably isn't linear and the model is misspecified. If the residuals fan out (wider on one end than the other), the variance isn't constant and the standard errors you'd quote about $b_1$ will be wrong. The residuals carry as much diagnostic information as the fit itself.

Vertical, not perpendicular

Ordinary least squares minimizes vertical distances because we are predicting $y$ from $x$ and $x$ is treated as known. If you wanted a line that minimizes perpendicular distances — a fundamentally different problem — you would reach for total least squares or principal-component-style methods.
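To see the curvature warning sign from this section in numbers, here is a sketch that fits a straight line to data that are actually quadratic. The data are hypothetical, generated from $y = 10x - x^2$ on five integer $x$ values purely for illustration.

```python
import numpy as np

# Hypothetical curved data: y = 10x - x^2 on x = 0, 1, 2, 3, 4.
x = np.arange(5.0)
y = 10.0 * x - x**2

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# R^2 of the straight-line fit.
r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

print(residuals)  # negative at both ends, positive in the middle
print(r2)
```

The residuals run negative, positive, negative as $x$ increases, which is the systematic pattern that signals a misspecified straight-line model, even though $R^2$ for this fit is high.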

6. $R^2$: the fraction of variance explained

How much of the up-and-down behaviour in $y$ does the line actually capture? Decompose the total variation in $y$ into a "modeled" piece and an "unexplained" piece:

$$ \underbrace{\sum (y_i - \bar{y})^2}_{\text{total (TSS)}} \;=\; \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{\text{explained (ESS)}} \;+\; \underbrace{\sum (y_i - \hat{y}_i)^2}_{\text{unexplained (RSS)}}. $$

This identity is not obvious — it relies on the residuals being orthogonal to the fitted values, which is exactly what the least-squares conditions enforce. Dividing both sides by the total gives the central definition:

Coefficient of determination

$$ R^2 \;=\; \frac{\operatorname{ESS}}{\operatorname{TSS}} \;=\; 1 \;-\; \frac{\operatorname{RSS}}{\operatorname{TSS}}. $$

$R^2$ is the fraction of the variance in $y$ that is explained by the regression. For a least-squares fit with an intercept, evaluated on the data it was fitted to, it lives in $[0, 1]$. For simple linear regression, it has an even more striking form: $R^2 = r^2$ exactly.
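A short numerical check of the decomposition and of $R^2 = r^2$, again on four illustrative points. The variable names `tss`, `ess`, and `rss` simply mirror the labels in the identity.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean())**2)      # total variation
ess = np.sum((y_hat - y.mean())**2)  # explained by the line
rss = np.sum((y - y_hat)**2)         # left in the residuals

r2 = 1.0 - rss / tss
r = np.corrcoef(x, y)[0, 1]

print(tss, ess, rss, r2, r**2)
```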

Some intuition pumps for the number:

  • $R^2 = 1$: the line passes through every point; residuals are all zero; $y$ is a deterministic linear function of $x$.
  • $R^2 = 0$: the line is no better than predicting the mean of $y$ for every $x$; the slope is zero; $x$ contains no linear information about $y$.
  • $R^2 = 0.85$: the line accounts for 85% of the variation in $y$; the remaining 15% lives in the residuals.

"Explained" is statistical, not causal

"$x$ explains 85% of the variance in $y$" means that knowing $x$ shrinks the unexplained scatter to 15% of what it was. It does not mean $x$ causes $y$, and it does not mean predictions will be accurate in a new sample drawn from a different population. $R^2$ is a measure of in-sample fit — useful, but easy to over-interpret.

7. Multiple regression and the normal equations

With more than one predictor, scalar notation gets buried in bookkeeping. Stacking everything into matrices turns multiple regression into a single, compact line.

Let $\mathbf{y} \in \mathbb{R}^n$ be the column of responses, and let $X \in \mathbb{R}^{n \times (p+1)}$ be the design matrix: one row per observation, one column per predictor, plus a leading column of ones for the intercept. The vector $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$ collects all the coefficients. The model becomes

$$ \hat{\mathbf{y}} \;=\; X \boldsymbol{\beta}. $$

Least squares now minimizes the squared length of the residual vector $\mathbf{y} - X\boldsymbol{\beta}$:

$$ \min_{\boldsymbol{\beta}} \;\bigl\|\mathbf{y} - X\boldsymbol{\beta}\bigr\|^2. $$

Taking the gradient with respect to $\boldsymbol{\beta}$ and setting it to zero gives the normal equations:

$$ \boxed{\, X^{\!\top}\! X\, \boldsymbol{\beta} \;=\; X^{\!\top}\! \mathbf{y} \,} $$

When $X^{\!\top}\! X$ is invertible (which it is whenever the columns of $X$ are linearly independent), the closed-form solution is

$$ \hat{\boldsymbol{\beta}} \;=\; \bigl(X^{\!\top}\! X\bigr)^{-1} X^{\!\top}\! \mathbf{y}. $$

Geometrically, $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $X$. The residual vector $\mathbf{y} - \hat{\mathbf{y}}$ is what's left over after that projection — and it is, by construction, perpendicular to every column of $X$. That is precisely the geometric statement of the normal equations.
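The normal equations and the perpendicularity claim can be sketched in NumPy. This example uses one predictor plus an intercept so that the numbers stay readable; the same code works unchanged for more columns. It solves the linear system with `np.linalg.solve` and does not form an explicit inverse.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

# Design matrix: a leading column of ones, then the predictor.
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X
Xty = X.T @ y

# Solve (X^T X) beta = X^T y as a linear system.
beta_hat = np.linalg.solve(XtX, Xty)

y_hat = X @ beta_hat         # projection of y onto the column space of X
residual = y - y_hat
X_dot_resid = X.T @ residual  # one entry per column of X

print(beta_hat)
print(X_dot_resid)
```

Each entry of `X_dot_resid` should be zero up to rounding: the residual vector is perpendicular to every column of $X$.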

In practice, don't invert $X^{\!\top}\! X$

Forming $(X^{\!\top}\! X)^{-1}$ explicitly is numerically fragile when the predictors are highly correlated: $X^{\!\top}\! X$ becomes nearly singular (its condition number is the square of that of $X$), and inversion amplifies rounding error. Real implementations factor $X$ itself, typically via a QR decomposition or an SVD, and solve the resulting well-behaved system. The math is the same; the arithmetic is far better behaved.
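A sketch of the QR route, assuming NumPy only. With $X = QR$ ($Q$ having orthonormal columns, $R$ upper triangular), the normal equations reduce to $R\boldsymbol\beta = Q^{\!\top}\mathbf{y}$. `np.linalg.lstsq` is shown alongside as the library routine one would normally call; it is based on an SVD, not QR.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])
X = np.column_stack([np.ones_like(x), x])

# Reduced QR factorization: Q is n x 2 with orthonormal columns,
# R is 2 x 2 and upper triangular.
Q, R = np.linalg.qr(X)

# Solve R beta = Q^T y. np.linalg.solve is a general solver; a dedicated
# triangular solver (e.g. scipy.linalg.solve_triangular) would exploit
# the structure of R, but is not needed for this small sketch.
beta_qr = np.linalg.solve(R, Q.T @ y)

# Library least-squares routine for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_qr, beta_lstsq)
```

Neither path ever forms $X^{\!\top}\! X$, which is the point of the advice above.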

One predictor recovers the simple-regression formulas. With $p = 1$ predictor and an intercept, $X$ is an $n \times 2$ matrix, $X^{\!\top}\! X$ is a $2 \times 2$ matrix, and a couple of lines of algebra reduce $(X^{\!\top}\! X)^{-1} X^{\!\top}\!\mathbf{y}$ back to the familiar $b_1 = r(s_y/s_x)$, $b_0 = \bar{y} - b_1 \bar{x}$. The matrix formulation is the same idea — just dressed for a bigger room.

8. Anscombe's quartet: always look at the data

In 1973 the statistician Frank Anscombe published four small datasets that have been ruining careless data analysts ever since. All four have:

  • the same mean of $x$ (9) and mean of $y$ (7.5),
  • the same variance of $x$ (11) and variance of $y$ ($\approx 4.12$),
  • the same correlation $r \approx 0.816$,
  • the same regression line $\hat{y} = 3 + 0.5 x$,
  • the same $R^2 \approx 0.67$.

By every summary statistic that regression and correlation provide, the four datasets are essentially identical. And yet only the first looks anything like the linear story those statistics suggest:

  • I. A noisy but genuinely linear cloud. The regression line is a fair description.
  • II. A perfectly clean parabola. A line is the wrong model entirely.
  • III. A perfect line, except for a single outlier that hauls the fitted slope away from the truth.
  • IV. Ten points piled on top of each other at one $x$ value, plus a single high-leverage point that defines the slope all by itself.

Summary statistics lie. Plots tell the truth.

Anscombe's quartet exists to make this point unforgettably. Never quote a correlation or regression line without first looking at the scatterplot. The picture catches nonlinearity, outliers, leverage, and clustering — all of which can quietly hide inside the same numbers.
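The shared summary statistics can be reproduced in a few lines. The values below are the quartet as it is commonly tabulated from Anscombe's 1973 paper; treat the transcription as an assumption of this sketch and check it against a primary source before relying on it. Variances use the $n-1$ convention (`ddof=1`), which is what yields 11 and roughly 4.12.

```python
import numpy as np

# Anscombe's quartet as commonly tabulated (transcription is an assumption).
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                           6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                            6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                         5.25, 12.50, 5.56, 7.91, 6.89])),
}

summary = {}
for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),  # sample variance (divide by n - 1)
        "var_y": y.var(ddof=1),
        "r": r,
        "slope": slope,
        "intercept": intercept,
    }
    print(name, {k: round(float(v), 3) for k, v in summary[name].items()})
```

The printed rows agree to about two decimal places. Telling the four datasets apart requires a scatterplot of each (for example with matplotlib), which this sketch deliberately leaves out.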

9. Common pitfalls

Correlation is not causation

Worth repeating because it is endlessly violated in real life. Two variables can correlate strongly because one causes the other, because the other causes the one, because a third variable causes both, because the data was selected in a biased way, or because of pure coincidence in a small sample. None of those are distinguishable from $r$ alone.

Extrapolation outside the data range

The regression line was fitted to the $x$-values you actually observed. Predicting $\hat{y}$ at an $x$ far outside that range assumes the linear relationship continues to hold, an assumption nothing in the data can support. A line fit to children's heights between ages 5 and 15 will happily predict an impossibly tall 60-year-old if you let it.

Influential outliers

Because residuals are squared, a single point far from the bulk of the data can dominate the fit. "Leverage" is the technical term for how much a particular $x$-value can swing the slope. Always plot the residuals; a point with a huge residual or an extreme $x$ deserves a second look before you trust the line it produced.

Nonlinear relationships hidden behind low $r$

$r = 0$ rules out a linear relationship, not a relationship. A scatter that looks like a clean parabola, a sine wave, or a step function can produce $r$ near zero while being completely predictable. Plot first, summarize second.

$R^2$ never goes down when you add predictors

Throwing more variables into a multiple regression cannot decrease $R^2$ — even if the new variable is pure noise, the algorithm will use it to fit the noise in the training data. Adjusted $R^2$ and out-of-sample validation are the standard defences. A high $R^2$ on the data you trained on tells you very little about how the model will behave on new data.
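A sketch of this effect with simulated data. Everything here is hypothetical: the sample size, the coefficients, and the random seed are arbitrary, and `r_squared` and `adjusted_r_squared` are helper names invented for the example. The extra column is random noise that is unrelated to $y$ by construction.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of a least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 30
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
noise = rng.normal(size=n)      # pure noise predictor

X_base = np.column_stack([np.ones(n), x])
X_more = np.column_stack([np.ones(n), x, noise])

r2_base = r_squared(X_base, y)
r2_more = r_squared(X_more, y)

print(r2_base, r2_more)
print(adjusted_r_squared(r2_base, n, 1), adjusted_r_squared(r2_more, n, 2))
```

`r2_more` can never be smaller than `r2_base`, because the larger model contains the smaller one as a special case. Adjusted $R^2$ carries a penalty for the extra parameter, so it can move in either direction.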

Multicollinearity in multiple regression

If two predictors are themselves highly correlated, the matrix $X^{\!\top}\! X$ is near-singular and the individual coefficients $\hat{\boldsymbol\beta}$ become unstable — a tiny change in the data can flip a coefficient's sign without affecting the model's overall fit much at all. The predictions can still be fine; the per-variable interpretations cannot be trusted.
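One rough way to see near-singularity is the condition number of $X^{\!\top}\! X$. The sketch below uses simulated predictors; the seed, sample size, and the $10^{-4}$ perturbation are arbitrary choices made only to produce a nearly collinear pair.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 50
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 1e-4 * rng.normal(size=n)  # almost a copy of x1

X_ok = np.column_stack([np.ones(n), x1, x2_independent])
X_bad = np.column_stack([np.ones(n), x1, x2_collinear])

# Condition number of X^T X: large values mean small perturbations in the
# data can produce large changes in the solved coefficients.
cond_ok = np.linalg.cond(X_ok.T @ X_ok)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)

print(cond_ok, cond_bad)
```

A condition number many orders of magnitude larger for the collinear design is the numerical signature of the unstable coefficients described above.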

10. Worked examples

Work each one before opening the solution. The point is to feel the formulas in your hands — the recipe is short, but the steps each have to be done carefully.

Example 1 · Compute $r$ for a tiny dataset

Data: $(1, 2), (2, 3), (3, 5), (4, 4)$. So $n = 4$, $\bar{x} = 2.5$, $\bar{y} = 3.5$.

Step 1. Deviations from the mean:

$x_i - \bar{x}$: $-1.5,\, -0.5,\, 0.5,\, 1.5$.
$y_i - \bar{y}$: $-1.5,\, -0.5,\, 1.5,\, 0.5$.

Step 2. Cross-products and squared deviations:

$$ \sum (x_i - \bar{x})(y_i - \bar{y}) = 2.25 + 0.25 + 0.75 + 0.75 = 4. $$

$$ \sum (x_i - \bar{x})^2 = 2.25 + 0.25 + 0.25 + 2.25 = 5. $$

$$ \sum (y_i - \bar{y})^2 = 2.25 + 0.25 + 2.25 + 0.25 = 5. $$

Step 3. Plug into the formula:

$$ r = \frac{4}{\sqrt{5 \cdot 5}} = \frac{4}{5} = 0.8. $$

A strong positive linear association.

Example 2 · Fit the regression line for the same dataset

From Example 1: $\bar{x} = 2.5$, $\bar{y} = 3.5$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 4$, $\sum (x_i - \bar{x})^2 = 5$.

Step 1. Slope:

$$ b_1 = \frac{4}{5} = 0.8. $$

Step 2. Intercept:

$$ b_0 = \bar{y} - b_1 \bar{x} = 3.5 - 0.8(2.5) = 3.5 - 2.0 = 1.5. $$

Step 3. The fitted line:

$$ \hat{y} = 1.5 + 0.8 x. $$

Predict at $x = 5$: $\hat{y} = 1.5 + 0.8(5) = 5.5$. (Note: $x = 5$ is just outside the observed range $[1, 4]$ — already a mild extrapolation.)

Example 3 · Verify $b_1 = r \cdot s_y / s_x$

Using the same data with $n = 4$:

$$ s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{5}{4} = 1.25 \quad\Longrightarrow\quad s_x \approx 1.118. $$

$$ s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n} = \frac{5}{4} = 1.25 \quad\Longrightarrow\quad s_y \approx 1.118. $$

Then:

$$ r \cdot \frac{s_y}{s_x} = 0.8 \cdot \frac{1.118}{1.118} = 0.8. $$

Matches $b_1$ exactly, as the boxed identity promises. Notice how the $s_x$ and $s_y$ factors cancel here because the two variables happen to have the same standard deviation — in general they don't, and the ratio $s_y / s_x$ scales the dimensionless $r$ into the slope's correct units.

Example 4 · Compute $R^2$ and interpret it

Continuing the dataset, the predicted values $\hat{y}_i = 1.5 + 0.8 x_i$ are:

Predicted values $\hat{y}_i$: $2.3,\, 3.1,\, 3.9,\, 4.7$.

Residuals $e_i = y_i - \hat{y}_i$: $-0.3,\, -0.1,\, 1.1,\, -0.7$.

RSS: $0.09 + 0.01 + 1.21 + 0.49 = 1.80$.

TSS: $\sum (y_i - \bar{y})^2 = 5$ (from Example 1).

$R^2$:

$$ R^2 = 1 - \frac{\operatorname{RSS}}{\operatorname{TSS}} = 1 - \frac{1.80}{5} = 0.64. $$

Check against $r^2$: $0.8^2 = 0.64$. ✓ The line explains 64% of the variance in $y$.

Example 5 · The matrix form on a tiny problem

Take the same four points $(1, 2), (2, 3), (3, 5), (4, 4)$. The design matrix and response vector are

$$ X = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 2 \\ 3 \\ 5 \\ 4 \end{pmatrix}. $$

Compute $X^{\!\top}\! X$ and $X^{\!\top}\! \mathbf{y}$:

$$ X^{\!\top}\! X = \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix}, \qquad X^{\!\top}\! \mathbf{y} = \begin{pmatrix} 14 \\ 39 \end{pmatrix}. $$

The normal equations are

$$ \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 14 \\ 39 \end{pmatrix}. $$

Solve: the determinant of $X^{\!\top}\! X$ is $4 \cdot 30 - 10 \cdot 10 = 20$, so

$$ \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix}\begin{pmatrix} 14 \\ 39 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 420 - 390 \\ -140 + 156 \end{pmatrix} = \begin{pmatrix} 1.5 \\ 0.8 \end{pmatrix}. $$

Same answer as Example 2. The matrix machinery is doing exactly the calculation the scalar formulas spelled out — just packaged for arbitrary numbers of predictors.

Example 6 · Why a high $r$ doesn't prove causation

Across U.S. cities, monthly ice-cream sales and monthly drownings show a strong positive correlation — easily $r > 0.8$. Does eating ice cream cause people to drown?

Of course not. Both quantities respond to a hidden third variable: temperature. In hot months, people buy more ice cream, and they also swim more, which produces more drownings. The two effects share a common cause but neither causes the other. Correlation captures only that they move together; it has nothing to say about why.

The cure isn't more correlation. It's experimental design (e.g., a randomized trial), or carefully built causal models that adjust for the hidden third variable. No amount of squeezing the data with summary statistics can substitute for that.
