The Transformer
The architecture that swallowed the field. Attention, tokenization, positional encoding, the KV cache — and the trick that lets the same model handle text, code, images, and audio.
Attention
The mechanism that lets every token see every other token, and the scaled dot-product math behind it.
The Transformer Architecture
Attention, MLPs, residuals, and norms — assembled into the workhorse model of the era.
Tokenization
How raw text becomes the integers a model actually consumes — BPE, vocabularies, and the failure modes.
Positional Encoding & RoPE
How a model that processes tokens in parallel knows their order.
The KV Cache
The trick that makes each step of autoregressive generation linear in sequence length instead of quadratic.
Mixture of Experts
Routing each token through a small subset of the model's parameters — and the engineering it costs.
Beyond Transformers
State-space models, Mamba, and the architectures challenging the throne.