AI, ML & Deep Learning — Foundations

What you'll leave with

A precise definition of AI, machine learning, and deep learning — and why they aren't synonyms.
The nested-set picture: $\text{DL} \subset \text{ML} \subset \text{AI}$, with concrete examples sitting in each ring.
Why deep learning came to dominate the conversation, even though most of the underlying ideas are decades old.
The vocabulary fluency to read modern AI writing without getting tripped up by labels.
A useful skepticism: knowing when a system labeled "AI" is doing something genuinely surprising versus something a 1980s researcher would have called "a database with rules on top."

1. Why the vocabulary is confusing

Open any tech article from the last few years and you'll see AI, machine learning, and deep learning swapped in for one another like they're the same thing. They aren't, and the differences matter.

Part of the confusion is historical. The term artificial intelligence was coined by John McCarthy in the proposal for the 1956 Dartmouth workshop, and for decades it meant "any system that does something we'd consider intelligent", including hand-written rules, search algorithms, and logical inference. The term machine learning followed in 1959, when Arthur Samuel used it for programs that learn from data instead of being told what to do. Deep learning is the most recent of the three labels: it's the subset of machine learning that uses neural networks with many layers, and it's what powers nearly every AI system that made the news after 2012.

The reason these three terms get used as synonyms today is that, in the public imagination, nearly every visible "AI" right now happens to be deep learning. ChatGPT is deep learning. Image generators are deep learning. Self-driving perception is deep learning. So the labels collapse together in everyday speech, even though they sit in a strict hierarchy underneath.

A useful test

If you can swap the word "AI" for "deep learning" in a sentence and lose no meaning, the writer probably should have said "deep learning" in the first place. If you can swap it for "any computer program," they shouldn't have said "AI" at all.

2. The nested-set picture

The cleanest way to hold the three terms in your head is as concentric rings — each one strictly contained in the next.

Read the diagram from the inside out and you've named the whole field. Every deep-learning system is a machine-learning system. Every machine-learning system is, in the standard usage, an AI system. The reverse direction doesn't hold: plenty of things that count as AI are not machine learning, and plenty of machine learning isn't deep.

The strict containment matters. When someone says "this is built with AI," they could mean almost anything. When someone says "this is built with deep learning," they've told you something specific: there's a neural network in there, probably with millions or billions of parameters, that learned what to do from a large dataset.

3. AI: the broadest set

Artificial Intelligence (AI)

The study and engineering of systems that perform tasks normally requiring human intelligence — perception, reasoning, planning, language use, decision-making. The definition is deliberately broad: it covers any technique for getting a machine to do something "smart," whether or not learning is involved.

One widely cited definition of AI, from Russell & Norvig's standard textbook, is famously squishy: a system that acts rationally toward a goal, given what it perceives. Under that definition, a thermostat that uses a rule like "if temp < 68°F, turn on the heater" is technically a (very simple) AI agent. So are a chess engine, a flight planner, a spam filter, and ChatGPT.
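A minimal sketch of that thermostat as a rule-based agent may make "acts on what it perceives, with no learning" concrete. The function name and the action strings are illustrative assumptions; only the 68°F threshold comes from the text.

```python
def thermostat_agent(temp_f: float, threshold_f: float = 68.0) -> str:
    """Map a percept (the temperature) to an action using one fixed rule."""
    # The rule is written by hand; nothing here is learned from data.
    if temp_f < threshold_f:
        return "heater_on"
    return "heater_off"


for reading in (61.5, 68.0, 74.2):
    print(reading, "->", thermostat_agent(reading))
```

If the desired behavior changes, someone has to edit the rule by hand, which is exactly the property that separates this kind of system from the learning systems discussed later.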

The historical research that fits under "AI but not machine learning" is mostly built from two ingredients: search (explore possible moves, pick the best one) and symbolic rules (if-then statements composed by hand). A few examples:

Minimax chess engines — explore the game tree to some depth, score each leaf with a hand-crafted evaluation function, propagate the best move back up. Deep Blue, which beat Kasparov in 1997, was overwhelmingly this kind of system.
Expert systems — encyclopedias of "if a patient has X and Y, suspect Z" rules, written by domain experts. MYCIN (1970s) diagnosed bacterial infections this way.
Classical planning — given a set of actions and goal conditions, search for a sequence of actions that achieves the goal. Used in robotics and logistics long before deep learning.
Pathfinding — Dijkstra, A* — the algorithms behind GPS routing and game-world navigation.
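The minimax recipe in the first bullet can be sketched on a game far smaller than chess. The example below assumes a toy game (players alternately take 1 to 3 stones from a pile, and whoever takes the last stone wins) and a hand-written evaluation function for it; the game, the function names, and the scores are all illustrative assumptions, not how any real chess engine is written.

```python
def evaluate(stones: int, maximizing: bool) -> int:
    """Hand-crafted score for a non-terminal leaf, from the maximizer's view.

    Heuristic for this toy game: a pile that is a multiple of 4 is bad
    for whichever player has to move.
    """
    bad_for_mover = stones % 4 == 0
    if maximizing:
        return -1 if bad_for_mover else 1
    return 1 if bad_for_mover else -1


def minimax(stones: int, depth: int, maximizing: bool) -> int:
    """Value of a position from the maximizing player's point of view."""
    if stones == 0:
        # The previous player took the last stone and won.
        return -10 if maximizing else 10
    if depth == 0:
        return evaluate(stones, maximizing)
    values = [
        minimax(stones - take, depth - 1, not maximizing)
        for take in (1, 2, 3)
        if take <= stones
    ]
    # Propagate the best value back up: max on our turn, min on the opponent's.
    return max(values) if maximizing else min(values)


def best_move(stones: int, depth: int = 4) -> int:
    """Pick how many stones to take by searching the game tree to `depth`."""
    moves = [take for take in (1, 2, 3) if take <= stones]
    return max(moves, key=lambda take: minimax(stones - take, depth - 1, False))


print(best_move(10))  # → 2 (leaves 8, a multiple of 4, for the opponent)
```

Every piece of "intelligence" here is either search or a rule someone wrote down; no parameter is fitted to data.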

None of these systems learn. The rules are baked in. If the world changes, a human has to rewrite them. That's the limitation that made researchers reach for the next ring.

4. Machine learning: programming by example

Machine Learning (ML)

A subfield of AI in which the system's behavior is determined by a learning algorithm running over data, not by hand-written rules. You provide examples; the system figures out the pattern.

Tom Mitchell's classic 1997 definition captures the shift cleanly:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

In other words: the program does better at something because it saw more data. The hand-written rule disappears. In its place: a function with adjustable parameters, and an algorithm that tunes those parameters to make the function fit the data.

Mathematically, almost all of ML boils down to picking a function $f_\theta(x)$ with parameters $\theta$, choosing a loss function $\mathcal{L}$ that measures how wrong the function's outputs are, and finding the parameters that minimize the loss:

$$ \theta^* = \arg\min_\theta \; \mathcal{L}\big(f_\theta(x), y\big) $$

That's the recipe — whether $f_\theta$ is a one-line linear regression or a 100-billion-parameter neural network.
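As a rough illustration of that recipe, the sketch below fits a line by gradient descent. The toy data, the learning rate, and the step count are assumptions chosen for the example; the loss is mean squared error, one common choice of $\mathcal{L}$.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def f(theta, x):
    """The model f_theta(x): a line with slope w and intercept b."""
    w, b = theta
    return w * x + b


def loss(theta):
    """Mean squared error between predictions and targets."""
    return sum((f(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(theta):
    """Analytic gradient of the mean squared error with respect to (w, b)."""
    n = len(xs)
    dw = sum(2 * (f(theta, x) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (f(theta, x) - y) for x, y in zip(xs, ys)) / n
    return dw, db


theta = (0.0, 0.0)      # initial parameters
learning_rate = 0.05
for step in range(2000):
    dw, db = gradient(theta)
    # Step downhill: move each parameter against its gradient.
    theta = (theta[0] - learning_rate * dw, theta[1] - learning_rate * db)

print(theta, loss(theta))  # theta approaches (2.0, 1.0); the loss approaches 0
```

Swapping in a bigger $f_\theta$ changes the cost of each step, not the shape of the loop.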

The classical ML toolkit (the stuff that lives in ML-but-not-DL) is rich and still very much in use:

Linear and logistic regression — fit a line (or a soft decision boundary). Still the workhorse of medicine, economics, credit scoring.
Decision trees and random forests — recursive yes/no splits, optionally averaged across many trees. Dominant on tabular data; tree ensembles are a common choice in production fraud-detection systems.
Support vector machines (SVMs) — find the boundary that separates classes with the widest margin. Hot in the 2000s, still strong on small datasets.
k-means and other clustering — unsupervised methods that group similar examples without labels.
Naive Bayes — count word frequencies, apply Bayes' rule. Spam filters for two decades.
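To make the last entry concrete, here is a rough from-scratch sketch of "count word frequencies, apply Bayes' rule". The four training emails are made-up data, and the add-one smoothing is one standard choice among several; treat it as an illustration rather than a production filter.

```python
import math
from collections import Counter

# Tiny hand-made training set (hypothetical data, for illustration only).
train = [
    ("free money click here", "spam"),
    ("win free prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

# "Count word frequencies": one word counter per class, plus class counts.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}


def log_posterior(text: str, label: str) -> float:
    """Unnormalized log P(label | text) under the naive independence assumption."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        # Add-one smoothing keeps unseen words from zeroing out the product.
        count = word_counts[label][word] + 1
        score += math.log(count / (total_words + len(vocabulary)))
    return score


def classify(text: str) -> str:
    """"Apply Bayes' rule": pick the class with the highest posterior."""
    return max(word_counts, key=lambda label: log_posterior(text, label))


print(classify("free prize money"))           # → spam
print(classify("agenda for the team lunch"))  # → ham
```

The "learning" is nothing more than the counting loop: change the training emails and the classifier's behavior changes with them.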

Here's a one-screen version of classical ML in code — a linear classifier trained on labeled data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed inputs: emails_train is a list of raw email strings and
# labels_train is a matching list of "spam" / "ham" labels.

# Turn each email into a vector of TF-IDF-weighted word scores
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(emails_train)

# Fit one layer of weights: one number per word in the vocabulary
model = LogisticRegression()
model.fit(X_train, labels_train)

# Predict on a new email
X_test = vectorizer.transform(["Free money, click here!"])
print(model.predict(X_test))   # e.g. ['spam'], depending on the training data

That's a complete, useful spam filter in eight lines. It's machine learning, it's AI, and it's not deep learning — there's exactly one layer of learned weights.

5. Deep learning: the inner ring

Deep Learning (DL)

A subfield of machine learning in which the function $f_\theta$ is a neural network with many layers stacked between input and output. Each layer learns a transformation of its input; "depth" refers to the number of those layers, which is typically anywhere from a few to a few hundred.

The key idea, and the reason "deep" earns its name, is representation learning. In classical ML, a human has to pick good features — for spam filtering, you'd decide that word counts matter, that capitalization matters, that exclamation points matter. The model only learns the weights on those hand-chosen features.

In deep learning, the model learns the features too. Early layers learn small, local patterns; middle layers compose them into bigger patterns; later layers compose those into concepts. For images: edges → textures → shapes → objects. For text: characters → words → phrases → meaning. Nobody writes down what each layer should do — they emerge from training.

Here's the spam classifier from above, rebuilt as a small deep network. The shape is recognizably similar; the difference is that the layers compose:

import torch.nn as nn

# Assumed value: the number of input features (for example, the size of
# the TF-IDF vocabulary from the previous example).
input_dim = 10_000

model = nn.Sequential(
    nn.Linear(input_dim, 256),   # layer 1: learned features
    nn.ReLU(),
    nn.Linear(256, 128),         # layer 2: features of features
    nn.ReLU(),
    nn.Linear(128, 64),          # layer 3: higher-level features
    nn.ReLU(),
    nn.Linear(64, 1),            # layer 4: final decision
    nn.Sigmoid(),                # squash the output to a probability
)

# Train with gradient descent on the same loss as before:
# binary cross-entropy between model(emails) and labels.

The deep version has four layers of learned weights instead of one, and a few hundred times the parameters (the first layer alone holds 256 weights per input feature, against a single weight per feature in logistic regression). On a bag-of-words task like spam, it doesn't necessarily beat the one-line logistic regression: classical ML is hard to dethrone on small structured data. Where deep learning wins is on perception-heavy tasks (images, audio, raw text) where the right features are hard or impossible to specify by hand.
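The parameter comparison depends on input_dim, which the text leaves unspecified. A quick count, assuming a hypothetical 10,000 input features and counting weights plus biases for each fully connected layer, lets the ratio be checked rather than taken on faith:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases in one fully connected layer."""
    return n_in * n_out + n_out


input_dim = 10_000  # assumed number of input features; not given in the text

# Logistic regression: a single layer mapping input_dim features to 1 output.
logistic = linear_params(input_dim, 1)

# The four-layer network: input_dim -> 256 -> 128 -> 64 -> 1.
layer_sizes = [input_dim, 256, 128, 64, 1]
deep = sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

print(logistic)                # → 10001
print(deep)                    # → 2601473
print(round(deep / logistic))  # → 260
```

Almost all of the deep model's parameters sit in its first layer, so the ratio stays near 256 for any large input_dim.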

The famous deep-learning architectures specialize the basic idea for different data shapes:

Convolutional Neural Networks (CNNs) — designed for images. Each layer learns small spatial filters that slide across the image.
Recurrent Neural Networks (RNNs, LSTMs) — designed for sequences. Each step takes the previous step's hidden state plus the new input. Largely superseded by transformers.
Transformers — the architecture behind nearly every modern large language model. Uses attention to let every token in the input see every other token.
Diffusion models — the approach behind most modern image and video generators, and many audio generators. Learns to reverse a noise process.
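The phrase "every token sees every other token" in the transformer entry can be sketched in a few lines. This is a deliberately stripped-down version of scaled dot-product self-attention: it assumes the queries, keys, and values are the raw token vectors themselves, whereas real transformers first apply learned projections. The three embeddings are made-up numbers.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Score this token against every token in the sequence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Three made-up 2-dimensional token embeddings.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(sequence):
    print([round(x, 3) for x in row])
```

Each output row blends information from the whole sequence in a single step, which is the property that distinguishes attention from the step-by-step hidden state of an RNN.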

You'll see each of these in detail in later topics. For now, the only thing you need to keep: deep learning is machine learning where the function is a multi-layer neural network. Nothing more, nothing less.

Why now?

Neural networks have existed since the 1950s. Backpropagation, the algorithm that trains them, has been in wide use since the 1980s. What changed in the 2010s was the convergence of three things: (1) data: the internet produced enough labeled examples to feed big models; (2) compute: GPUs made the matrix multiplications at the heart of training orders of magnitude cheaper; and (3) architectures that scaled cleanly, such as deep CNNs (AlexNet, 2012) and transformers (2017). CNNs themselves date back to the late 1980s. Deep learning didn't get invented in the 2010s. It got practical.

6. Classifying real systems

The fastest way to internalize the nested-set picture is to take a handful of real systems and place each one. Here are the obvious cases:

System	AI?	ML?	DL?	Why
Deep Blue (1997 chess)	✓	—	—	Search + hand-crafted evaluation. No learning.
Naive-Bayes spam filter	✓	✓	—	Learns word probabilities from labeled email. Not a neural net.
Random-forest fraud detector	✓	✓	—	Hundreds of decision trees, trained on transaction data. Still not deep.
ImageNet classifier (ResNet-50)	✓	✓	✓	50-layer CNN trained on over a million labeled images.
AlphaGo	✓	✓	✓	Deep networks for value/policy, plus Monte Carlo tree search on top.
GPT-4 / Claude	✓	✓	✓	Very large transformer models trained on enormous text corpora. Exact parameter and token counts are not officially disclosed.
Google Maps routing	✓	—	—	Dijkstra/A*-style shortest-path search over a road graph. The pathfinding itself trains no model.
Linear regression on lab data	✓	✓	—	One layer of learned weights. The simplest possible ML.

Notice that AlphaGo and GPT-4 are checked all the way across, but Deep Blue and Google Maps stop at the first column. The deeper the check goes, the more the system learned its behavior rather than having it specified.
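The same table can be written as data to check the containment claim mechanically. The two boolean flags are an assumed encoding of the ML and DL columns (every row is taken to count as AI, as in the table).

```python
# (system, learns_from_data, uses_deep_network): rows copied from the table.
systems = [
    ("Deep Blue", False, False),
    ("Naive-Bayes spam filter", True, False),
    ("Random-forest fraud detector", True, False),
    ("ResNet-50 ImageNet classifier", True, True),
    ("AlphaGo", True, True),
    ("GPT-4 / Claude", True, True),
    ("Google Maps routing", False, False),
    ("Linear regression on lab data", True, False),
]

ai = {name for name, _, _ in systems}
ml = {name for name, learns, _ in systems if learns}
dl = {name for name, _, deep in systems if deep}

# Python's "<" on sets tests for a proper subset, so this checks DL ⊂ ML ⊂ AI.
print(dl < ml < ai)     # → True
print(sorted(ai - ml))  # → ['Deep Blue', 'Google Maps routing']
print(sorted(ml - dl))
```

The two set differences are the "AI but not ML" and "ML but not DL" rings from the nested-set picture.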

7. How the terms get used in practice

Here's the awkward truth: in 2026, "AI" is mostly marketing for "deep learning." The companies selling products call everything "AI" because that's what the market responds to. The researchers building those products usually say "the model" or "the network" — they reserve "AI" for the field, not for individual systems.

A few patterns to recognize:

"AI-powered" in a product description almost always means a deep neural network is doing the work — usually a transformer if it's a language feature, a CNN or transformer if it's a vision feature.
"Machine learning" in a job description or research paper often still implies the classical toolkit (regression, trees, SVMs) — especially in domains like finance, biology, or operations research where the data is tabular and the models stay shallow.
"GenAI" (generative AI) is a marketing-era term for the deep-learning subset that generates new content — text, images, audio, video — rather than just classifying existing content. Every GenAI system is deep learning, but not every deep-learning system is GenAI (an image classifier doesn't generate anything).
"Foundation models" are the very large, general-purpose deep-learning models — GPT, Claude, Gemini, Llama — pre-trained once on enormous datasets and then adapted to many downstream tasks. We'll devote a whole later topic to them.

The labels matter less than the question "what's actually inside?". When you read about a new system, the useful questions are: Does it learn? From what data? With what architecture? What's the training signal? Those answers tell you what the system can and can't do — far better than any three-letter label.

8. Common pitfalls

"AI" = ChatGPT

The most common conflation, especially in journalism: treating ChatGPT-style language models as if they're the entirety of AI. They aren't. Most of the AI quietly running in production — fraud detection, ad targeting, search ranking, ETA estimation, route planning — is not generative, and a lot of it isn't even deep. Treating "AI" as a synonym for chatbots will lead you to miss what's actually happening in the field.

Treating ML and AI as parallel disciplines

You'll occasionally see phrasing like "AI and machine learning", as if they were two siblings. They aren't: ML is a subset of AI. The phrase is technically redundant, though usually harmless. The version that's actually misleading is "AI versus ML", because there's nothing to compare; one contains the other.

"Deeper is always better"

Deep learning needs a lot of data and a lot of compute to pay off. On small, clean, tabular datasets (say, predicting loan default from 30 features and 10,000 examples), gradient-boosted trees (XGBoost, LightGBM) routinely match or beat neural networks. The right ML model depends on the data, not on which model is the newest.

"Layers" means "depth of understanding"

A 100-layer network is not "more thoughtful" than a 10-layer one in any human sense. The number of layers controls the size and expressiveness of the function the network can compute. It says nothing about whether the network reasons, understands, or knows what it's doing. Anthropomorphizing depth is one of the fastest ways to form bad intuitions about what these systems can and can't do.

9. Worked examples

For each system below, place it in the rings: AI, ML, DL — or some subset. Try to answer before opening the explanation.

Example 1 · A regex-based grammar checker (e.g. early Word's green underlines)

Verdict: AI ✓ (broadly) · ML ✗ · DL ✗

A grammar checker that flags "their/there/they're" confusion using hand-written regex patterns is a rule-based system. It performs a task that requires (a little) linguistic intelligence, so it fits the broad definition of AI. But nothing is learned from data — every rule was written by a human. So it's outside the ML ring.
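A rough sketch of what such a rule set might look like, using Python's re module. The three patterns and the short noun list are illustrative assumptions, not rules taken from any real product.

```python
import re

# Each rule is (compiled pattern, suggested replacement for the first word).
RULES = [
    (re.compile(r"\btheir\s+(is|are|was|were)\b", re.IGNORECASE), "there"),
    (re.compile(r"\bthere\s+(car|house|dog|idea)\b", re.IGNORECASE), "their"),
    (re.compile(r"\bthey're\s+(car|house|dog|idea)\b", re.IGNORECASE), "their"),
]


def check(sentence: str) -> list[tuple[str, str]]:
    """Return (flagged text, suggestion) pairs for every rule that fires."""
    findings = []
    for pattern, suggestion in RULES:
        for match in pattern.finditer(sentence):
            findings.append((match.group(0), suggestion))
    return findings


print(check("Their is a problem with there car."))
# → [('Their is', 'there'), ('there car', 'their')]
print(check("There is a problem with their car."))
# → []
```

The checker only catches what its authors anticipated: a sentence like "there bicycle" slips through until someone adds "bicycle" to the list by hand.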

Example 2 · A bank's credit-scoring model using gradient-boosted trees

Verdict: AI ✓ · ML ✓ · DL ✗

The model is trained on historical loan-outcome data — credit history, income, debt-to-income ratio, employment status — and learns to predict default probability. That's machine learning. But the model is an ensemble of decision trees, not a neural network, so it sits in the ML ring without entering the DL one.

It's also one of the most common kinds of production ML system. A great deal of high-stakes ML on tabular data still runs on gradient-boosted trees, not deep nets.

Example 3 · Tesla Autopilot's vision system identifying pedestrians and lane lines

Verdict: AI ✓ · ML ✓ · DL ✓

The perception stack is a deep convolutional and transformer-based network trained on huge amounts of labeled driving footage. It learns directly from raw camera input — no human writes the rules for "this is a pedestrian" or "this is a lane line." Every ring is checked.

The full self-driving stack is more than just perception, though — the planning and control layers on top often include classical algorithms (search, model-predictive control). Real systems mix rings.

Example 4 · AlphaZero playing chess and Go

Verdict: AI ✓ · ML ✓ · DL ✓

AlphaZero learned to play chess and Go from scratch — no opening books, no endgame tables, no hand-crafted evaluation function. It used a deep neural network to evaluate positions and suggest moves, and a Monte Carlo tree search to look ahead. The network was trained by playing millions of games against itself.

It's worth contrasting this with Deep Blue, which beat Kasparov 20 years earlier using only search and hand-crafted evaluation — AI, but not ML at all. The two systems play the same game, but they sit in completely different rings.

Sources & further reading

The terms in this topic have well-established technical definitions; the sources below are the canonical places they're stated and the best-known introductions for each ring.

Artificial Intelligence: A Modern Approach · Textbook · Russell & Norvig · 4th edition

The standard university AI textbook. Defines AI broadly and covers the full classical-AI toolkit (search, planning, logic) that this topic only sketches. Read the first two chapters if you want the formal version of "what AI is."

Deep Learning · Textbook · Goodfellow, Bengio & Courville

The canonical introduction to deep learning. The full text is free online. Chapter 1 ("Introduction") is the clearest written explanation of the ML → DL transition and why representation learning matters.

Artificial intelligence · Encyclopedia · Wikipedia

A well-maintained overview that places the modern deep-learning era in the longer arc of the field. The history section is particularly useful for understanding why "AI" has meant such different things at different times.

But what is a neural network? · Video · 3Blue1Brown

The best visual introduction to what a neural network actually does, frame by frame. Worth watching the full four-video series if you want concrete intuition for what's inside the "deep learning" ring.

Software 2.0 · Article · Andrej Karpathy

A short essay that captures the deeper shift behind the rise of deep learning: from writing instructions to defining objectives and letting the optimizer find the program. Reframes what "programming" means in an ML-first world.
