1. What a neural network is
A parameterized function $f_\theta : \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$ built by composing layers. Each layer takes a vector, multiplies it by a matrix of weights, adds a vector of biases, and applies a fixed nonlinear function. The matrices and biases — together called the parameters $\theta$ — are the adjustable numbers a learning algorithm tunes.
That definition contains almost everything that matters. Strip away the architecture names — CNN, transformer, diffusion model — and at the bottom of every modern AI system is the same recipe: stacked linear maps, separated by nonlinearities, parameterized by numbers you adjust to make the output match the target.
The phrase "neural network" is partly a historical accident — the original 1940s work drew a loose analogy to brain neurons. The math is its own thing now. Modern neural networks owe almost nothing to actual neuroscience beyond the name. Treat the biology metaphor as a label, not a model.
What makes neural networks powerful is composability. The single-neuron computation is straightforward, almost trivial. But when you stack thousands of them across dozens of layers and let training adjust the weights, the resulting function can approximate language, vision, code, music, motor control — all by tuning the same kind of numbers. This topic builds up that composition step by step.
2. The neuron: weighted sum plus activation
The basic unit is the neuron (sometimes called a unit or node). It takes a vector of inputs and produces a single number. Two steps:
- Compute a weighted sum of the inputs, plus a bias.
- Pass the result through a nonlinear function called an activation.
Formally, for input $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, weights $\mathbf{w} = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n$, bias $b \in \mathbb{R}$, and activation $\sigma : \mathbb{R} \to \mathbb{R}$:
$$ a \;=\; \sigma\!\left(\sum_{i=1}^{n} w_i\, x_i \,+\, b\right) \;=\; \sigma\!\left(\mathbf{w}^\top \mathbf{x} + b\right) $$
The preactivation $z = \mathbf{w}^\top \mathbf{x} + b$ is the linear part. The activation function $\sigma$ does the nonlinear work. We'll cover the common choices of $\sigma$ in section 6; for now, the exact form doesn't matter as long as you remember it's nonlinear.
Geometrically, the weight vector $\mathbf{w}$ defines a direction in input space, and the bias $b$ shifts a decision threshold. The neuron answers, roughly, "how much does $\mathbf{x}$ point in the direction $\mathbf{w}$, and is it above or below the threshold $-b$?"
import numpy as np
def relu(z):
    return np.maximum(0, z)
# A neuron with 3 inputs.
w = np.array([0.5, -1.2, 0.3])  # weights (learned)
b = 0.1                         # bias (learned)
x = np.array([1.0, 2.0, 3.0])   # input
z = w @ x + b                   # preactivation: w·x + b = 0.5 - 2.4 + 0.9 + 0.1
a = relu(z)                     # activation
print(z, a)                     # → about -0.9, 0.0 (ReLU zeros it out)
The neuron is a one-line computation. The interesting behavior only emerges when you stack many of them.
3. From neuron to layer
A layer is many neurons computing in parallel from the same input. Each neuron has its own weight vector and its own bias, but they all see the same input $\mathbf{x}$. The outputs of the layer's neurons form a new vector — the layer's output.
If we have $m$ neurons in the layer, stack their weight vectors as the rows of a matrix $W \in \mathbb{R}^{m \times n}$ and their biases as a vector $\mathbf{b} \in \mathbb{R}^m$. The layer output is:
$$ \mathbf{a} \;=\; \sigma\!\left(W \mathbf{x} + \mathbf{b}\right) $$
The activation $\sigma$ is applied element-wise: each component of the preactivation vector $W\mathbf{x} + \mathbf{b}$ goes through $\sigma$ independently. So a layer with $m$ neurons that takes an $n$-dimensional input produces an $m$-dimensional output, using $m \cdot n$ weights plus $m$ biases, for a total of $m(n+1)$ parameters.
A layer is a learned change of representation. It takes a vector in one space, mixes the coordinates linearly (the $W\mathbf{x}+\mathbf{b}$ part), and bends the result through a nonlinearity. Stack enough of these and you can transform raw pixels into "the probability this image is a cat" through a sequence of intermediate representations the model invents for itself.
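To make the row-stacking concrete, here is a minimal NumPy sketch. The sizes ($m = 4$, $n = 3$), the random weights, and the helper name `layer` are illustrative assumptions, not values from the text. It checks that one matrix multiply matches evaluating each neuron separately, and that the parameter count is $m(n+1)$.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer(W, b, x):
    """One dense layer: elementwise activation of W @ x + b."""
    return relu(W @ x + b)

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
m, n = 4, 3                      # 4 neurons, 3 inputs (arbitrary sizes)
W = rng.standard_normal((m, n))  # row i holds the weights of neuron i
b = rng.standard_normal(m)
x = np.array([1.0, 2.0, 3.0])

a_layer = layer(W, b, x)
# Same computation, one neuron at a time.
a_neurons = np.array([relu(W[i] @ x + b[i]) for i in range(m)])

n_params = W.size + b.size       # m*n weights plus m biases
print(a_layer.shape, np.allclose(a_layer, a_neurons), n_params)
```

The per-neuron loop is only there for comparison; in practice the single matrix multiply is what runs.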
4. From layer to network
A neural network is layers stacked: the output of one layer becomes the input of the next. Label the layers $1, 2, \ldots, L$. Let $\mathbf{h}^{(0)} = \mathbf{x}$ be the input. Then for each layer $l$:
$$ \mathbf{h}^{(l)} \;=\; \sigma\!\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right) $$
and the network's output is $\mathbf{h}^{(L)}$. Each layer has its own weights $W^{(l)}$ and biases $\mathbf{b}^{(l)}$, with different shapes for different layers depending on the widths.
This particular architecture — every neuron in one layer connected to every neuron in the next — is called a multi-layer perceptron (MLP), or sometimes a fully-connected or dense network. It's the simplest non-trivial neural architecture, and the building block inside more elaborate ones (transformers contain MLPs; so do CNNs).
The diagram is the textbook picture. Three inputs feed every unit in the first hidden layer; that layer's four outputs feed every unit in the second hidden layer; and so on. The arrows are the weights — each line represents one number in the corresponding $W^{(l)}$ matrix.
Here's the whole forward computation in one short block:
import numpy as np
def relu(z):
    return np.maximum(0, z)
# A 3 → 4 → 4 → 2 MLP. Weights are normally learned;
# we show random ones just to make the shapes concrete.
W1 = np.random.randn(4, 3); b1 = np.zeros(4)
W2 = np.random.randn(4, 4); b2 = np.zeros(4)
W3 = np.random.randn(2, 4); b3 = np.zeros(2)
def forward(x):
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    y = W3 @ h2 + b3  # often no activation on the final layer
    return y
print(forward(np.array([1.0, 2.0, 3.0])))
The "forward pass" (turning an input vector into the network's output) is exactly this sequence of matrix multiplies and elementwise activations. The next topic covers it in more detail; for now, just notice that the entire computation is bookkeeping over matrices and a chosen nonlinearity.
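The same recurrence can also be written as a loop over a list of layers, which works for any depth. This is a sketch under assumptions: the helper names `init_mlp` and `forward_mlp` are hypothetical, the weights are random rather than learned, and the final layer is left linear as in the block above.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def init_mlp(widths, rng):
    """Random (W, b) pairs for consecutive widths, e.g. [3, 4, 4, 2]."""
    return [(rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward_mlp(params, x):
    """Apply h = relu(W h + b) for every layer except the last, which stays linear."""
    h = x
    for i, (W, b) in enumerate(params):
        z = W @ h + b
        h = z if i == len(params) - 1 else relu(z)
    return h

rng = np.random.default_rng(0)
params = init_mlp([3, 4, 4, 2], rng)
y = forward_mlp(params, np.array([1.0, 2.0, 3.0]))
print([W.shape for W, _ in params], y.shape)
```

Each weight matrix has shape (output width, input width), so the shapes chain: the row count of one matrix equals the column count of the next.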
Modern frameworks like PyTorch hide the matrix juggling behind a small declarative API:
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(3, 4), nn.ReLU(),
    nn.Linear(4, 4), nn.ReLU(),
    nn.Linear(4, 2),
)
# model(x) does exactly the same forward pass.
# model.parameters() exposes every W and b for training.
5. Why nonlinearities matter
Here's a question that every reader should be able to answer: what happens if you drop the activation functions and stack only linear layers?
Suppose we have two linear layers with no nonlinearity in between:
$$ \mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1, \qquad \mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2 $$
Substitute the first into the second:
$$ \mathbf{y} \;=\; W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 \;=\; \underbrace{(W_2 W_1)}_{W'} \mathbf{x} + \underbrace{(W_2 \mathbf{b}_1 + \mathbf{b}_2)}_{\mathbf{b}'} $$
The composition of two linear maps is a single linear map. No matter how many linear layers you stack, the whole stack collapses to one. A 100-layer network with no nonlinearities has exactly the expressive power of a 1-layer network: it can only represent linear functions of the input.
The nonlinearity $\sigma$ is the thing that prevents this collapse. Once you apply $\sigma$ between layers, the composition is no longer just another matrix multiply — it's a genuinely richer function. Nonlinearities are what give neural networks expressive power. Without them, you have a fancy linear regression.
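The collapse is easy to confirm numerically. The sketch below uses small hand-picked matrices (illustrative values, not from the text) so the arithmetic can be followed by hand: the stacked linear layers agree with the single layer $W' = W_2 W_1$, $\mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$, while inserting a ReLU between them gives a different output for this input.

```python
import numpy as np

W1 = np.array([[1.0, -1.0], [-1.0, 1.0]]); b1 = np.array([0.5, -0.5])
W2 = np.array([[1.0, 2.0]]);               b2 = np.array([0.25])
x = np.array([1.0, 0.0])

# Two linear layers applied in sequence.
y_stacked = W2 @ (W1 @ x + b1) + b2

# The single equivalent layer.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_single = W_prime @ x + b_prime

# Same weights with a ReLU between the layers: the second hidden
# preactivation is -1.5 here, so ReLU changes the result.
y_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print(np.allclose(y_stacked, y_single))  # → True
print(np.allclose(y_relu, y_single))     # → False
```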
"Stacked linear layers collapse" is one of the most useful sentences in deep learning. It's why every nontrivial architecture has activations between its layers, why "linear layer" alone is rarely a complete answer to anything, and why the choice of $\sigma$ ends up mattering so much.
6. Activation functions
The activation function $\sigma$ is the nonlinear ingredient. It's applied element-wise, so it's a function from $\mathbb{R}$ to $\mathbb{R}$. A handful of choices dominate:
Sigmoid
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Range $(0, 1)$. Historically the default; its $S$-curve resembles a smoothed step function. Why it fell out of favor: the gradient $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ never exceeds $0.25$ and becomes tiny when $|z|$ is large. In deep networks, multiplying many such factors during backpropagation contributes to the vanishing gradient problem. Still used at the output of a binary classifier (where its $(0, 1)$ range represents a probability).
Tanh
$$ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} $$
Range $(-1, 1)$. A rescaled sigmoid centered at zero: $\tanh(z) = 2\sigma(2z) - 1$, where $\sigma$ is the sigmoid. Often outperforms sigmoid in hidden layers because the zero-centered range plays better with optimization. Used in some RNNs and older architectures; mostly replaced by ReLU in modern feedforward networks.
ReLU (Rectified Linear Unit)
$$ \text{ReLU}(z) = \max(0, z) $$
Range $[0, \infty)$. Simple, cheap, and trains fast. The gradient is either $0$ (when $z \le 0$) or $1$ (when $z > 0$), so there is no vanishing gradient while the unit is active. It has been the default activation in most architectures since around 2012.
GELU (Gaussian Error Linear Unit)
$$ \text{GELU}(z) = z \cdot \Phi(z), \quad \text{where } \Phi(z) = P(\mathcal{N}(0,1) \le z) $$
A smooth variant of ReLU: the input is weighted by $\Phi(z)$, the probability that a standard Gaussian variable falls below $z$. GELU is a standard activation inside transformer MLP blocks: GPT and BERT use it, and Llama uses its close cousin SiLU (swish) inside a gated MLP.
| Activation | Range | Where you'll see it |
|---|---|---|
| Sigmoid | $(0, 1)$ | Binary-classifier output, RNN gates |
| Tanh | $(-1, 1)$ | Older RNNs, some specific layers |
| ReLU | $[0, \infty)$ | Default in CNNs and most MLPs |
| GELU / SiLU | GELU $\approx [-0.17, \infty)$; SiLU $\approx [-0.28, \infty)$ | Transformer MLP blocks (GPT, BERT, Llama) |
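The activations above are each a line or two of NumPy. This is a minimal sketch, not a reference implementation: GELU is computed from the exact Gaussian CDF via the standard-library `math.erf` (frameworks often use a faster approximation), and the sample inputs are arbitrary.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Standard normal CDF via the error function: Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
_erf = np.vectorize(math.erf)

def gelu(z):
    return z * 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def silu(z):
    return z * sigmoid(z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, f in [("sigmoid", sigmoid), ("tanh", np.tanh),
                ("relu", relu), ("gelu", gelu), ("silu", silu)]:
    print(f"{name:8s}", np.round(f(z), 4))

# Sigmoid's gradient sigma(z) * (1 - sigma(z)) peaks at 0.25 (z = 0)
# and is tiny for large |z|.
sig_grad = sigmoid(z) * (1.0 - sigmoid(z))
print("sigmoid'", sig_grad)
```

The last line shows the saturation numerically: the gradient at $z = \pm 10$ is several orders of magnitude smaller than at $z = 0$.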
7. Width vs depth
Two ways to make a network bigger:
- Width — increase the number of neurons in each layer. A layer with 1024 units is wider than a layer with 64 units.
- Depth — increase the number of layers. A 24-layer network is deeper than a 4-layer one.
For a long time, the conventional wisdom was that one moderately wide hidden layer could do anything (see section 8 — the universal approximation theorem). In practice, depth turned out to matter much more.
A deep network learns features at multiple levels of abstraction — early layers latch onto small, local patterns; later layers compose them into higher-level features; the final layers make the actual decision. For perceptual data (images, audio, raw text), this layered composition is dramatically more parameter-efficient than packing everything into one very wide layer.
The modern empirical answer is: both, in carefully chosen ratios. Frontier models are deep (tens to hundreds of layers) and wide (tens of thousands of neurons per layer). The total parameter count, billions to trillions, grows roughly as depth times width squared, because a dense layer connecting two width-$d$ layers holds about $d^2$ weights.
Naively deeper networks train worse because gradients shrink as they propagate back through many layers (the vanishing gradient problem). The deep learning revolution of the 2010s wasn't really about going deeper — it was about inventing the tricks (ReLU activations, batch normalization, residual connections, careful initialization) that made deeper networks trainable. Once those tricks landed, depth started paying off.
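A rough back-of-envelope sketch of why depth hurts without those tricks. It is deliberately simplified: the weight matrices are ignored, so it only bounds the activation-derivative factor in the backpropagated gradient. Each sigmoid layer contributes a factor of at most $0.25$, while an active ReLU contributes exactly $1$. The function name `activation_factor` is hypothetical.

```python
# Upper bound on the product of activation derivatives across `depth` layers.
# Simplifying assumption: weights are ignored; only the sigma'(z) factors count.
SIGMOID_MAX_GRAD = 0.25   # max of sigma(z) * (1 - sigma(z)), reached at z = 0
RELU_ACTIVE_GRAD = 1.0    # derivative of ReLU wherever z > 0

def activation_factor(per_layer_grad, depth):
    return per_layer_grad ** depth

for depth in (2, 10, 50):
    print(depth,
          activation_factor(SIGMOID_MAX_GRAD, depth),
          activation_factor(RELU_ACTIVE_GRAD, depth))
```

At depth 10 the sigmoid bound is already below one in a million, which is the intuition behind pairing depth with ReLU-style activations, normalization, and residual connections.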
8. Universal approximation
One of the foundational theoretical results about neural networks:
A feedforward network with a single hidden layer and a non-polynomial activation function can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy, given enough hidden units.
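This is the Universal Approximation Theorem. It was first proven for sigmoid-like activations by Cybenko (1989) and by Hornik, Stinchcombe, and White (1989), and later extended to any non-polynomial activation (Leshno et al., 1993). On the face of it, it makes neural networks sound trivially powerful: any function you want, one hidden layer is enough.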
The catch is buried in "given enough hidden units." The theorem doesn't say how many units you need. For many functions of practical interest, the required number can grow exponentially with the input dimension, which means an unworkably wide one-hidden-layer network that's impossible to train or fit in memory.
Depth changes the picture. Some families of functions that require exponentially many units in a one-hidden-layer network can be represented with polynomially many units in a deeper network. Depth is a representation-efficiency win, not just an aesthetic preference.
Two further caveats worth carrying in your head:
- "Can approximate" ≠ "will learn from data." The theorem says some assignment of weights exists. It says nothing about whether gradient descent on a finite dataset will find it.
- "Continuous functions on compact sets" is broad but not unlimited. Real-world data has structure (locality, hierarchy, invariances) that architectures like CNNs and transformers exploit. UAT applies in principle; in practice, architecture matters enormously.
Treat UAT as a sanity check ("neural networks aren't fundamentally restricted in what they can represent") rather than a practical guide.
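The "given enough hidden units" clause can be seen in one dimension. The sketch below hand-constructs (it does not train) a one-hidden-layer ReLU network that linearly interpolates $\sin$ on $[0, 2\pi]$; the target function, the interval, and the helper names are illustrative assumptions. The maximum error shrinks as units are added, which illustrates the existence claim and says nothing about whether gradient descent would find these weights.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fit_one_hidden_layer(f, a, b, n_units):
    """Hand-construct a 1-hidden-layer ReLU net that linearly interpolates
    f at n_units + 1 evenly spaced knots on [a, b]."""
    knots = np.linspace(a, b, n_units + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)
    # Output weight of unit k is the change in slope at knot k.
    out_w = np.concatenate([[slopes[0]], np.diff(slopes)])
    return knots[:-1], out_w, f(knots[0])

def predict(x, thresholds, out_w, out_b):
    hidden = relu(x[:, None] - thresholds[None, :])  # unit k computes ReLU(x - t_k)
    return hidden @ out_w + out_b

xs = np.linspace(0.0, 2.0 * np.pi, 2001)
errors = {}
for n_units in (4, 16, 64):
    net = fit_one_hidden_layer(np.sin, 0.0, 2.0 * np.pi, n_units)
    errors[n_units] = np.max(np.abs(predict(xs, *net) - np.sin(xs)))
    print(n_units, errors[n_units])
```

In one dimension the unit count grows politely with the target accuracy; the exponential blow-up discussed above concerns how the required width can scale with input dimension.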
9. Common pitfalls
Neural networks do not work like the brain. The biological analogy was a loose inspiration in the 1940s, and the name stuck, but the math has almost nothing to do with actual neuroscience. A neural network neuron is a weighted sum followed by an elementwise function. Real neurons have rich temporal dynamics, spike-based communication, and very different learning rules. Carrying the biology metaphor too far leads to wrong intuitions about what these systems can or can't do.
Adding more linear layers does not add expressive power (see section 5). Two linear layers in a row compose into a single linear layer. Depth only helps if there's a nonlinearity between each pair of layers. This is the single most important sentence in introductory deep learning, and it's easy to miss on a first read.
ReLU is a great default, but not universal. Modern transformers use GELU or SiLU, which tend to work slightly better in very deep architectures, plausibly because their smoothness helps gradient flow. RNNs and LSTMs still use tanh and sigmoid for their gating mechanisms because the bounded ranges play a specific role. Picking the activation is one of the more under-discussed hyperparameter choices.
The universal approximation theorem does not guarantee that a network will learn your function. It only says that somewhere in the weight space of a sufficiently wide one-hidden-layer network, there exists a configuration that approximates your target function. Whether you can find it from data via gradient descent, in finite compute, with a sane number of parameters, is a completely different question, and the answer often hinges on architecture choices (convolutions for images, attention for sequences, residuals for depth). UAT is an existence proof, not an engineering guide.
10. Worked examples
Try each one before reading the solution.
Example 1 · Forward pass through a single neuron with ReLU
A neuron with weights $\mathbf{w} = (0.5, -1.0, 2.0)$, bias $b = -1$, and ReLU activation receives input $\mathbf{x} = (2, 1, 1)$. Compute its output.
Preactivation:
$$ z = \mathbf{w}^\top \mathbf{x} + b = (0.5)(2) + (-1.0)(1) + (2.0)(1) + (-1) = 1 - 1 + 2 - 1 = 1 $$
Activation:
$$ a = \text{ReLU}(1) = \max(0, 1) = 1 $$
The neuron outputs $1$. If the bias had been $-3$ instead, the preactivation would have been $-1$ and the ReLU would have zeroed it out: the neuron would be "off" for that input.
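The same arithmetic as a short NumPy check, covering both the original bias and the $b = -3$ variant mentioned above. The helper name `neuron` is hypothetical.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])
x = np.array([2.0, 1.0, 1.0])

def neuron(w, x, b):
    z = w @ x + b            # preactivation
    return z, max(0.0, z)    # (z, ReLU(z))

z, a = neuron(w, x, b=-1.0)
z_off, a_off = neuron(w, x, b=-3.0)
print(z, a)          # → 1.0 1.0
print(z_off, a_off)  # → -1.0 0.0
```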
Example 2 · Why nonlinearity is required for XOR
The XOR function takes two binary inputs and returns $1$ if exactly one is $1$:
$\text{XOR}(0,0)=0, \quad \text{XOR}(0,1)=1, \quad \text{XOR}(1,0)=1, \quad \text{XOR}(1,1)=0.$
No linear function — no setting of weights $w_1, w_2$ and bias $b$ — can output the right answer at all four points. (If you plot the four inputs in the plane, you can't separate the "1" outputs from the "0" outputs with any straight line.)
A network with one hidden layer of two ReLU units can represent XOR. One canonical solution:
$$ h_1 = \text{ReLU}(x_1 + x_2 - 0.5), \quad h_2 = \text{ReLU}(x_1 + x_2 - 1.5) $$
$$ y = h_1 - 3 h_2 $$
Check all four inputs. At $\mathbf{x} = (0, 0)$, both preactivations are negative, so $h_1 = h_2 = 0$ and $y = 0$. At $\mathbf{x} = (1, 0)$ and, by symmetry, $(0, 1)$: $h_1 = \text{ReLU}(0.5) = 0.5$, $h_2 = \text{ReLU}(-0.5) = 0$, so $y = 0.5$. At $\mathbf{x} = (1, 1)$: $h_1 = \text{ReLU}(1.5) = 1.5$, $h_2 = \text{ReLU}(0.5) = 0.5$, so $y = 1.5 - 1.5 = 0$. (A threshold around $0.25$ maps these to the correct binary outputs.)
This is the classic motivation for hidden layers: the hidden representation lets the model construct features (like "the sum is between 0.5 and 1.5") that aren't linearly available from the raw inputs.
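A quick numerical check of both claims, as a sketch. The ReLU weights below are one well-known exact construction chosen for illustration (kinks at sums of 0 and 1, so no output threshold is needed); they are an assumption of this example and not necessarily the values used above. The linear baseline is an ordinary least-squares fit.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
target = np.array([0.0, 1.0, 1.0, 0.0])

# Best affine fit w1*x1 + w2*x2 + b in the least-squares sense.
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
linear_pred = A @ coef

# Two-unit ReLU network (illustrative weights):
# h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1), y = h1 - 2*h2
s = X.sum(axis=1)
h1 = np.maximum(0.0, s)
h2 = np.maximum(0.0, s - 1.0)
relu_pred = h1 - 2.0 * h2

print(np.round(linear_pred, 3))  # every input gets about 0.5
print(relu_pred)                 # → [0. 1. 1. 0.]
```

The best linear model predicts roughly 0.5 for every input, which is as uninformative as a model can be on this task; two hidden ReLU units are enough to fix it.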
Example 3 · Show that stacked linear layers collapse
Consider two layers without activations:
$$ \mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1, \qquad \mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2 $$
Substitute the first into the second:
$$ \mathbf{y} = W_2(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + W_2 \mathbf{b}_1 + \mathbf{b}_2 $$
Define $W' = W_2 W_1$ and $\mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$. Then $\mathbf{y} = W' \mathbf{x} + \mathbf{b}'$, which is exactly the form of a single linear layer. By induction, any number of stacked linear layers reduces to one linear layer. Depth without nonlinearity buys you nothing.
Example 4 · Count the parameters in a small MLP
Consider a $3 \to 4 \to 4 \to 2$ MLP (the one in the diagram in section 4). Each layer's parameter count is (units) × (input width) + (units):
- Layer 1 (3 → 4): $4 \cdot 3 + 4 = 16$ parameters
- Layer 2 (4 → 4): $4 \cdot 4 + 4 = 20$ parameters
- Layer 3 (4 → 2): $2 \cdot 4 + 2 = 10$ parameters
Total: 46 parameters.
Scale this up: in a transformer with hidden dimension $d = 4096$, assuming the common $4\times$ MLP expansion ($d \to 4d \to d$), each MLP block has two weight matrices of $4 \cdot d^2 \approx 67$ million parameters, or about $134$ million per block. Multiply by 30–100 layers (and add the attention weights) and you reach the parameter counts that frontier LLMs are known for.
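The counting rule is a one-liner, sketched here with a hypothetical helper `mlp_param_count`. The transformer-style block assumes a $d \to 4d \to d$ layout with biases, which is a common convention rather than something fixed by the text; each of its two weight matrices holds $4 \cdot d^2$ parameters.

```python
def mlp_param_count(widths):
    """Weights plus biases for a dense network with the given layer widths."""
    return sum(n_out * n_in + n_out
               for n_in, n_out in zip(widths[:-1], widths[1:]))

small = mlp_param_count([3, 4, 4, 2])
print(small)  # → 46

# Hypothetical transformer-style block: d -> 4d -> d with d = 4096.
d = 4096
one_matrix = 4 * d * d                  # weights in a single d-by-4d matrix
block = mlp_param_count([d, 4 * d, d])  # both matrices plus biases
print(one_matrix, block)  # → 67108864 134238208
```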