Supervised, Unsupervised & Reinforcement Learning — Foundations

What you'll leave with

A precise definition of supervised, unsupervised, and reinforcement learning — distinguished by the kind of signal the model gets.
A fourth, modern hybrid — self-supervised learning — and why it powers nearly every foundation model.
The math that all three share: minimize a loss, maximize an expected return — same idea, different objective.
An honest picture of which paradigms still get used in practice, and which mostly survive in textbooks.
The recipe behind modern LLMs (pretraining → SFT → RLHF) and why it cleanly mixes three different paradigms.

1. The signal is what defines the paradigm

From the previous topic: machine learning is "programming by example." The model has parameters $\theta$, and a learning algorithm tunes those parameters to make the model behave correctly on data. Almost all of ML reduces to picking some function $f_\theta$ and minimizing a loss over the data:

$$ \theta^{*} = \arg\min_\theta \; \mathcal{L}\big(f_\theta(\text{data})\big) $$

That formula is silent on one crucial question: what counts as "correct behavior"? What does the loss measure the model against? The answer to that question is the paradigm.

If a human hands you input–output pairs $(x, y)$ and the loss measures whether $f_\theta(x)$ matches $y$ — that's supervised learning.
If you only have inputs $x$ and the loss measures something intrinsic about the data (clusters, density, reconstruction) — that's unsupervised learning.
If the model has to take actions in an environment and the loss is built from numeric rewards that come back — that's reinforcement learning.
And the modern hybrid: if you take unlabeled data, hide part of it, and use the hidden part as the label — that's self-supervised learning. Operationally supervised, but the labels come for free from the data itself.

The rest of this topic is just those four bullets in detail.

2. Supervised learning

Supervised learning

The model is trained on labeled examples — pairs $(x, y)$ where $x$ is an input and $y$ is the desired output. The loss penalizes mismatch between the model's prediction $f_\theta(x)$ and the label $y$. After training, the model can predict $y$ for new inputs $x$ it hasn't seen.

This is the paradigm most people picture when they say "machine learning." A human (or some other oracle) provides the right answer for many examples; the model learns the mapping. The two main flavors differ only in what $y$ looks like:

Classification: $y$ is a discrete label. Examples: spam vs. ham, cat vs. dog, which digit is in this image (10 classes), which of the 1,000 categories in the standard ImageNet benchmark.
Regression: $y$ is a continuous number. Examples: predict tomorrow's temperature, the selling price of a house, the time-to-failure of a machine part.

The objective is the same in both cases: minimize the average loss across the training set.

$$ \mathcal{L}_{\text{sup}}(\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big) $$

For regression, $\ell$ is typically squared error $(\hat{y} - y)^2$. For classification, it's typically cross-entropy. The choice doesn't change the paradigm — both are supervised.
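To make the two per-example losses concrete, here is a minimal sketch in plain Python. The function names and the numbers are illustrative assumptions, not part of any library; the averaging step mirrors $\mathcal{L}_{\text{sup}}$ above.

```python
import math

def squared_error(y_hat, y):
    """Squared error for one regression example."""
    return (y_hat - y) ** 2

def cross_entropy(probs, label):
    """Cross-entropy for one classification example.

    probs: predicted class probabilities (summing to 1).
    label: index of the true class.
    """
    return -math.log(probs[label])

def mean_loss(loss_fn, predictions, targets):
    """Average a per-example loss over the dataset, as in L_sup."""
    total = sum(loss_fn(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)

# Regression: predictions vs. true values (made-up numbers).
reg_loss = mean_loss(squared_error, [2.0, 3.5], [1.0, 3.0])

# Classification: predicted probabilities vs. true class index (made-up numbers).
clf_loss = mean_loss(cross_entropy, [[0.9, 0.1], [0.2, 0.8]], [0, 1])
```

Swapping `squared_error` for `cross_entropy` changes the loss, not the paradigm: both calls consume a prediction and a label.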

from sklearn.linear_model import LogisticRegression

# X_train: feature vectors. y_train: labels (e.g. 0 = "not spam", 1 = "spam").
model = LogisticRegression()
model.fit(X_train, y_train)        # learns the mapping x → y

# At inference, predict labels for new inputs:
predictions = model.predict(X_new)

The line that makes it supervised is model.fit(X_train, y_train) — the algorithm needs both the inputs and the matching labels. No labels, no training. That's the whole game.

When to reach for supervised

Whenever the thing you want to predict is well-defined and you can collect enough labeled examples to teach it. The bottleneck is almost always the labels — they're expensive, slow, and often subjective. The history of applied ML is largely the history of finding clever ways to get labels.

3. Unsupervised learning

Unsupervised learning

The model is trained on inputs $x$ alone — no labels are provided. The objective is to discover structure intrinsic to the data: groups, axes of variation, density estimates, low-dimensional summaries. Useful when labels don't exist, or when the question you're asking is "what's in this data?" rather than "predict $y$ from $x$."

The three classical tasks under this umbrella:

Clustering — group similar examples together. Algorithms: k-means, DBSCAN, Gaussian mixtures. Examples: customer segmentation, anomaly detection by distance to nearest cluster.
Dimensionality reduction — find a low-dimensional representation that preserves most of the information. Algorithms: PCA, t-SNE, UMAP, autoencoders. Examples: visualizing a 100-feature dataset in 2-D; compressing images before downstream learning.
Density estimation — learn the probability distribution $p(x)$ the data came from. Used for anomaly detection (low-probability inputs are anomalies) and as a building block in generative models.

from sklearn.cluster import KMeans

# X_train: feature vectors. No labels — that's the whole point.
model = KMeans(n_clusters=5)
model.fit(X_train)                 # finds 5 cluster centers

cluster_ids = model.predict(X_new) # which cluster does each new x belong to?

Notice fit(X_train) — single argument. There's no y_train. The model isn't being told what the right answer is; it's being told to find something structural about the data on its own.
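Density estimation from the list above can be sketched the same way. The example below assumes 1-D data and a single Gaussian as the density model, which is a deliberate simplification; the readings and the threshold are made-up values chosen for illustration.

```python
import math
import statistics

def fit_gaussian(xs):
    """Estimate mean and standard deviation from unlabeled 1-D data."""
    return statistics.fmean(xs), statistics.pstdev(xs)

def log_density(x, mean, std):
    """Log of the Gaussian density p(x) under the fitted parameters."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))

# Hypothetical sensor readings; nothing marks which ones are "normal".
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mean, std = fit_gaussian(readings)

def is_anomaly(x, threshold=-6.0):
    """Flag inputs whose log-density falls below a hand-picked threshold."""
    return log_density(x, mean, std) < threshold
```

As with k-means, fitting takes only the inputs. The threshold is a judgment call made by a person, not something the algorithm learns.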

A practical reality

Classical unsupervised learning has shrunk in importance over the past decade. K-means and PCA still get used in production — clustering customer behavior, reducing dimensionality before plotting — but the headline-grabbing applications people once expected from unsupervised learning (learning rich representations of language and images without labels) have mostly been claimed by self-supervised learning instead. That's the next section.

4. Self-supervised learning

Self-supervised learning

The model is trained on inputs with no human-provided labels, but the training objective is set up as a supervised problem by generating labels from the inputs themselves. The most common recipe: hide part of the input, ask the model to predict the hidden part from the rest. The "label" was always there — it just had to be revealed.

This is operationally a supervised problem — there's an $(x, y)$ pair, there's a loss measuring prediction against label, gradients flow the usual way. But the labels weren't annotated by a human. They came out of the data. That's the trick that made modern AI possible.

Two recipes dominate:

Next-token prediction (language). Take a piece of text. The model sees the first $n$ tokens; the label is the $(n+1)$-th token. Repeat for every position in every document of a web-scale training corpus. Every modern LLM (GPT, Claude, Gemini, Llama) is pretrained this way.
Masked prediction (vision and language). Take an image (or sentence). Hide a random patch (or word). The model sees the rest; the label is the hidden piece. BERT did this for language; masked autoencoders (MAE) do it for vision.

Why this matters: human-labeled data is bounded by how many annotators you can pay. Self-supervised data is bounded only by how much raw data exists, which is far larger. Hand-labeling enough examples to pretrain a 100-billion-parameter language model would be impractical. Pretraining with next-token prediction over web-scale text is what frontier labs do instead.

# A single training example built from raw text; no human labeler involved.
text = "The quick brown fox jumps over the lazy dog"
tokens = text.split()    # whitespace split as a stand-in for a real tokenizer
inputs = tokens[:-1]     # [The, quick, brown, fox, jumps, over, the, lazy]
labels = tokens[1:]      # [quick, brown, fox, jumps, over, the, lazy, dog]

# Loss: standard cross-entropy between model(inputs) and labels.
# Notice: this is mechanically supervised learning. The "label" just came
# from the same string of text the input did.
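The masked-prediction recipe can be sketched in the same spirit. This is a simplified illustration: the `[MASK]` placeholder, the whitespace tokenization, and the masking probability are assumptions made for the example, and published recipes such as BERT's add further details.

```python
import random

MASK = "[MASK]"

def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Hide random tokens and return (inputs, labels).

    labels[i] holds the original token where inputs[i] was masked and
    None elsewhere, so a loss would be computed only at masked positions.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model does not get to see this token
            labels.append(tok)    # ...so it becomes the prediction target
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

tokens = "The quick brown fox jumps over the lazy dog".split()
inputs, labels = make_masked_example(tokens, mask_prob=0.3, rng=random.Random(0))
```

Which positions get masked depends on the random seed; the invariant is that every label is a token that was already present in the raw text.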

Strictly speaking, self-supervised learning is a sub-flavor of unsupervised learning — no human-annotated labels were involved, so the original taxonomy puts it under the unsupervised umbrella. In practice it's usually called out separately because its objective looks supervised and its impact has been so much larger than classical unsupervised methods.

5. Reinforcement learning

Reinforcement learning (RL)

An agent interacts with an environment: at each step the agent observes a state and picks an action, and the environment responds with a new state and a numeric reward. The model, called a policy, is a function from states to actions, and the training objective is to find the policy that maximizes the total reward collected over time.

RL is structurally different from the previous three. There's no fixed dataset of examples. The data comes from the agent acting — and the agent's actions determine what data shows up next. The "label" for any one action is buried inside a sparse, often-delayed reward signal.

The objective is to find a policy that maximizes the expected discounted return — the total reward collected over time, with future rewards discounted by a factor $\gamma \in [0, 1)$:

$$ \pi^{*} \;=\; \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \right] $$
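To make the quantity inside the expectation concrete, here is a small sketch that computes the discounted return of one finite episode. The reward sequence is made up, and a single episode is only one sample; the objective above averages this quantity over the trajectories the policy generates.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite episode.

    Iterating backwards lets each step reuse the return of the steps after it.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A reward of 1.0 arrives only at the third step (t = 2).
rewards = [0.0, 0.0, 1.0]
g = discounted_return(rewards, gamma=0.9)   # 0.9**2 * 1.0 = 0.81
```

With $\gamma$ closer to 1 the late reward counts almost fully; with $\gamma$ closer to 0 the agent effectively cares only about immediate reward.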

Three things make RL distinctively harder than supervised learning:

Exploration vs. exploitation. The agent only sees data from actions it actually takes. To learn that an action is good, it has to try it. To collect reward, it should keep doing the action it already knows is good. Balancing these is a defining tension in RL.
Credit assignment. A reward might come 100 steps after the action that earned it. The algorithm has to figure out which earlier action deserves the credit.
Non-IID data. The agent's behavior changes during training, so the distribution of states it sees keeps shifting. Supervised learning assumes a fixed data distribution; RL doesn't get that luxury.
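The exploration vs. exploitation tension is easiest to see in a multi-armed bandit, the simplest RL setting (one state, several actions). The sketch below uses epsilon-greedy action selection, one common and simple strategy among many; the arm means, noise level, and hyperparameters are arbitrary assumptions.

```python
import random

def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a bandit with Gaussian rewards.

    Returns the estimated mean reward and the pull count for each arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental average: the estimate moves toward the observed reward.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
```

The agent only learns about an arm by pulling it, so `counts` is both a record of behavior and a description of the training data. That coupling is exactly what a fixed supervised dataset does not have.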

The classical RL successes are games and robotics: AlphaGo, AlphaZero, Atari-playing agents, robot manipulation. The contemporary one is much closer to home: RLHF (reinforcement learning from human feedback) is a standard step in fine-tuning frontier language models for helpfulness and harmlessness. Arguably the biggest commercial application of RL today lives inside assistants like ChatGPT and Claude.

6. The big comparison

Paradigm	What's the signal?	Objective	Canonical example
Supervised	Labels $y$ on input $x$, provided by a human or oracle	Minimize prediction error $\ell(f_\theta(x), y)$	Image classifier trained on ImageNet
Unsupervised	The data $x$ alone — no labels	Find structure: clusters, low-dim axes, density	k-means customer segmentation
Self-supervised	Labels manufactured from the input itself	Predict hidden parts from visible parts	GPT pretraining: predict the next token
Reinforcement	Numeric rewards from an environment, given actions taken	Maximize expected discounted return	AlphaGo learning to play Go from self-play

The thing to anchor on: the model architecture (a neural network, a decision tree, whatever) is largely independent of the paradigm. The training signal is what makes the paradigm. A transformer can be trained supervised (as a classifier), self-supervised (as a language model), or with RL (as RLHF) — same architecture, different signal.

7. How modern systems mix the paradigms

You will rarely meet a production AI system that is "purely supervised" or "purely RL." Real systems stage the paradigms — using each where it's strongest, in sequence. The canonical example is the recipe behind every modern instruction-following language model:

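Pretraining (self-supervised) → instruction tuning, or SFT (supervised) → RLHF (reinforcement learning). Reading left to right: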

Pretraining (self-supervised) gives the model a vast representation of language — the meaning of words, grammar, factual knowledge, the shape of human discourse. This is where the bulk of the compute and data live. The signal is "what comes next?" applied to internet-scale text.
Instruction tuning (supervised) teaches the pretrained model the format of following instructions. Humans write (or curate) good prompt–response pairs; the model learns to mimic them. Comparatively tiny dataset; large effect on usefulness.
RLHF (reinforcement learning) refines the model against preference — humans rank pairs of responses, a reward model is trained on those rankings, and the language model is fine-tuned to maximize that reward. This is where "helpful, harmless, honest" gets shaped.

No single paradigm could produce a Claude or a GPT-4. Self-supervised gives you knowledge but not instruction-following. Pure supervised on instruction data without pretraining would have nothing to draw on. RL alone would have nothing to start from. The recipe needs all three, in that order.

A useful reframe

Don't think of the paradigms as competitors. They're tools with different sweet spots — pick the one whose signal you can actually obtain at the scale you need. The art of modern AI engineering is largely picking the right paradigm for each stage of the pipeline.

8. Common pitfalls

"Self-supervised is just unsupervised"

Strictly, yes — there are no human labels, so it falls under the unsupervised umbrella. Operationally, no — the training loop, loss function, and optimizer look identical to supervised learning. Treating self-supervised as "unsupervised in disguise" hides the most important fact about it: it gets to use the entire supervised-learning toolkit while sidestepping the labeling bottleneck. That's why it scaled.

"RL is just supervised learning with rewards as labels"

It isn't, in three concrete ways. Rewards are sparse (a chess game gives one reward at the end), delayed (the move that won was made dozens of steps earlier), and action-dependent (the agent's behavior determines what data shows up next, breaking the IID assumption supervised learning relies on). Calling RL "supervised with rewards" misses every hard thing about it.

"Unsupervised learning is what you use when labels are expensive"

Half-right. When labels are expensive in 2026, the right move is almost always self-supervised learning, not classical unsupervised. K-means and PCA are fine tools — but if your goal is to learn rich representations of unlabeled text or images, you reach for masked or next-token prediction, not clustering.

"RL is for games and robots"

That used to be the dominant narrative, and the highlight-reel results — AlphaGo, AlphaStar, Atari, dexterous robot hands — still tilt that way. But the biggest commercial impact of RL today is inside every frontier language model. RLHF is the step that turns a raw next-token predictor into a helpful assistant. If you think RL means robotics, you'll miss the version of it that's running on hundreds of millions of devices.

9. Worked examples

For each scenario, identify which paradigm (or combination) applies before opening the explanation.

Example 1 · Predicting house prices from 30 features per house, given a CSV of sold houses with prices

Paradigm: Supervised learning (regression).

Each row has features $x$ (square footage, ZIP code, year built, etc.) and a target label $y$ (the sale price). The model learns $f_\theta(x) \approx y$ by minimizing squared error on the training rows. Continuous target → regression flavor of supervised.
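As a minimal sketch of the regression in Example 1, here is a least-squares fit with a single feature instead of 30, using the closed-form solution for a line. The four houses and their prices are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training rows: square footage (x) and sale price (y).
sqft = [1000, 1500, 2000, 2500]
price = [200_000, 290_000, 410_000, 500_000]

slope, intercept = fit_line(sqft, price)
predicted = slope * 1800 + intercept   # price estimate for an unseen house
```

On this toy data the fit comes out to a slope of 204 dollars per square foot and an intercept of -7,000, so the 1,800 sq ft house is priced at 360,200. With 30 features the same idea applies, but you would use a linear-algebra or ML library.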

Example 2 · Grouping a million customers into segments based on their purchase histories, no marketing categories given

Paradigm: Unsupervised learning (clustering).

There's no "right answer" the algorithm is being trained to reproduce — no labels exist. The job is to find groups of customers whose behavior is similar. K-means, Gaussian mixtures, or hierarchical clustering all fit. Once segments emerge, a human typically inspects them and assigns meaning ("high-value gift-buyers," "back-to-school shoppers"), but that interpretation step is outside the ML.

Example 3 · Training a language model on the contents of the entire Common Crawl by predicting each next word

Paradigm: Self-supervised learning.

No human labeled anything — but at every position, the next token serves as the "label" for the preceding context. Mechanically the training loop is supervised cross-entropy; the difference is that the labels were extracted from the data, not collected from annotators. This is the pretraining step of every modern LLM.

Example 4 · Teaching a bipedal robot to walk by simulating thousands of attempts, giving it a positive reward for forward distance and a negative reward for falling over

Paradigm: Reinforcement learning.

There's no dataset. The agent acts, the simulator responds with new states and rewards, and the policy is updated to favor actions that earned high return. The reward is hand-designed (forward distance accumulates, falling is punished), the data distribution shifts as the policy improves, and credit has to be assigned across many time steps: all the classic RL ingredients.
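A walking robot is far too large for a short example, so the sketch below substitutes a toy stand-in: a five-cell corridor where the agent starts at the left end and is rewarded only on reaching the right end. It uses tabular Q-learning with epsilon-greedy exploration, one classic algorithm among many; the environment, reward, and hyperparameters are all assumptions made for illustration.

```python
import random

N_STATES = 5          # cells 0..4; cell 4 is the goal
LEFT, RIGHT = 0, 1

def step(state, action):
    """Toy simulator: move along the corridor; reward 1.0 only at the goal."""
    if action == RIGHT:
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=200, seed=0):
    """Tabular Q-learning; returns q[state][action]."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _t in range(max_steps):
            if rng.random() < epsilon or q[state][LEFT] == q[state][RIGHT]:
                action = rng.choice((LEFT, RIGHT))   # explore, or break a tie
            else:
                action = LEFT if q[state][LEFT] > q[state][RIGHT] else RIGHT
            next_state, reward, done = step(state, action)
            # Bootstrapped target: immediate reward plus discounted future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q = train()
policy = [RIGHT if q[s][RIGHT] > q[s][LEFT] else LEFT for s in range(N_STATES - 1)]
```

The only reward sits at the goal, yet the learned values in the earlier cells end up favoring `RIGHT` as well. The bootstrapped target passes credit backwards one cell at a time, which is credit assignment in miniature.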

Example 5 · A pretrained language model is fine-tuned by collecting human rankings of pairs of its responses, training a reward model on those rankings, and using PPO to push the language model toward higher rewards

Paradigm: A pipeline that ends in reinforcement learning. Specifically: RLHF.

The reward model is trained supervised (input: pair of responses, label: which one humans preferred). The language model is then fine-tuned via RL with that reward model serving as the environment's reward signal. So this example combines supervised learning (for the reward model) with reinforcement learning (for the final language-model fine-tuning).
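The supervised half of this pipeline can be made concrete with the loss a reward model is commonly trained on: a pairwise logistic (Bradley-Terry style) loss over the scores of the preferred and rejected responses. The sketch below shows only that loss on made-up scalar scores; the reward model itself and the PPO step are out of scope, and the function name is an assumption.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise logistic loss: -log(sigmoid(score_chosen - score_rejected)).

    Low when the reward model scores the human-preferred response higher.
    """
    diff = score_chosen - score_rejected
    return math.log1p(math.exp(-diff))   # equals -log(sigmoid(diff))

good = preference_loss(2.0, 0.0)   # model agrees with the human ranking
tie = preference_loss(1.0, 1.0)    # model is indifferent: loss is log(2)
bad = preference_loss(0.0, 2.0)    # model disagrees with the human ranking
```

The human ranking plays the role of the label $y$, which is why this stage counts as supervised even though its output feeds an RL loop.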

This is the third stage of the LLM training pipeline from section 7 — and the reason RL belongs in any honest discussion of modern AI, not just historical AlphaGo war stories.

Sources & further reading

The taxonomy in this topic is well-established. The sources below are the canonical introductions to each paradigm and the best modern references on how they get mixed in practice.

Reinforcement Learning: An Introduction · Textbook · Sutton & Barto · 2nd edition

The canonical RL textbook. Free PDF on the authors' website. Chapter 1 alone is the clearest statement of what makes RL distinct from supervised learning; chapters 2–6 build up the core algorithms. Don't try to learn RL from anywhere else first.

The Elements of Statistical Learning · Textbook · Hastie, Tibshirani & Friedman · 2nd edition

The reference for supervised and unsupervised learning in the classical (pre-deep-learning) tradition. Free PDF. Use it when you want the rigorous version of regression, classification, clustering, and dimensionality reduction.

Self-Supervised Representation Learning · Article · Lilian Weng

A clear, well-organized survey of self-supervised techniques across vision and language. Great place to go after this topic if you want to understand the family of pretext tasks (next-token, masked, contrastive, etc.) that power modern foundation models.

Training language models to follow instructions with human feedback (InstructGPT) · Paper · Ouyang et al., 2022

The paper that introduced the now-standard pretraining → SFT → RLHF pipeline. Read this to see all three paradigms working in concert on one production system. Section 3 (methodology) is the most useful part.

Machine learning · Encyclopedia · Wikipedia

A well-maintained overview that lays out the paradigm taxonomy and links out to the canonical algorithms inside each. A good cross-check if any vocabulary on this page felt unfamiliar.