Multimodal & Generative Models
Beyond text: the diffusion models, VAEs, and other architectures that generate images, video, audio, and 3D content, and how the same ideas return in vision-language models and world models.
Diffusion Models
Generation as iterative denoising — the dominant paradigm for images, audio, and video.
VAEs & GANs
The pre-diffusion generative landscape, and why these ideas still matter inside modern systems.
Image Generation
From DALL·E to Stable Diffusion to Flux — the families and what distinguishes them.
Video Generation
Extending image generation through time: why video is so much harder than still images, and the architectures that make it work.
Audio & Speech Generation
TTS, music, and the techniques that make synthesized audio sound real.
Vision-Language Models
Teaching a language model to see — architectures, training, and what they can and can't do.
World Models & Robotics
Models that learn to predict environments well enough to plan inside them.