Chapter · AI

The Transformer

The architecture that swallowed the field. Attention, tokenization, positional encoding, the KV cache — and the trick that lets the same model handle text, code, images, and audio.

Topics

Topic 1

Attention

The mechanism that lets every token see every other token, and the scaled dot-product math behind it.

Planned

Topic 2

The Transformer Architecture

Attention, MLPs, residuals, and norms — assembled into the workhorse model of the era.

Planned

Topic 3

Tokenization

How raw text becomes the integers a model actually consumes — BPE, vocabularies, and the failure modes.

Planned

Topic 4

Positional Encoding & RoPE

How a model that processes tokens in parallel knows their order.

Planned

Topic 5

The KV Cache

The trick that makes each step of autoregressive generation linear in context length instead of quadratic.

Planned

Topic 6

Mixture of Experts

Routing each token through a small subset of the model's parameters — and the engineering it costs.

Planned

Topic 7

Beyond Transformers

State-space models such as Mamba, and the other architectures challenging the throne.

Planned