Chapter · AI

Multimodal & Generative Models

Beyond text: the diffusion models, VAEs, and other architectures that generate images, video, audio, and 3D, and how the same ideas come back as vision-language models and world models.

Topics

Topic 1

Diffusion Models

Generation as iterative denoising — the dominant paradigm for images, audio, and video.

Planned

Topic 2

VAEs & GANs

The pre-diffusion generative landscape, and why these ideas still matter inside modern systems.

Planned

Topic 3

Image Generation

From DALL·E to Stable Diffusion to Flux — the families and what distinguishes them.

Planned

Topic 4

Video Generation

Extending image generation through time: why it's so much harder, and the architectures making it work.

Planned

Topic 5

Audio & Speech Generation

TTS, music, and the techniques that make synthesized audio sound real.

Planned

Topic 6

Vision-Language Models

Teaching a language model to see — architectures, training, and what they can and can't do.

Planned

Topic 7

World Models & Robotics

Models that learn to predict environments well enough to plan inside them.

Planned