A reading list and collection of resources for building up to frame-autoregressive video generation and world models. The progression: flow matching basics on 2D data, then flow matching for images with a DiT, then flow matching for video with block-causal attention, and finally KV caching for efficient frame-autoregressive generation.
🟢 read — essential reading · 🟡 skim — worth skimming · 🟣 bonus — for deeper exploration · 🔵 code — implementation
Learn the fundamentals of flow matching on a simple 2D dataset (two moons): velocity prediction, training, Euler sampling, noise schedules (uniform, SD3), and classifier-free guidance.
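The core training/sampling loop described above can be sketched in a few lines. This is a minimal, hypothetical implementation (the MLP and function names are illustrative, not from any specific repo): a small network predicts velocity on the linear interpolant between noise and data, and Euler integration turns noise into samples.

```python
import torch
import torch.nn as nn


class VelocityMLP(nn.Module):
    """Tiny MLP predicting velocity v(x_t, t) for 2D flow matching."""

    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # Condition on t by concatenating it as an extra input feature.
        return self.net(torch.cat([x, t[:, None]], dim=-1))


def fm_loss(model, x1):
    # Linear interpolant: x_t = (1 - t) * x0 + t * x1, with x0 ~ N(0, I).
    # The regression target is the constant velocity x1 - x0.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0])
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()


@torch.no_grad()
def euler_sample(model, n=8, steps=20):
    # Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
    x = torch.randn(n, 2)
    for i in range(steps):
        t = torch.full((n,), i / steps)
        x = x + model(x, t) / steps
    return x
```

Noise schedules (e.g. SD3's logit-normal) only change how `t` is drawn during training; the loss and sampler stay the same.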
Build a Diffusion Transformer (DiT) from scratch and train it on MNIST with flow matching, classifier-free guidance, and the Muon optimizer.
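Of the pieces listed above, classifier-free guidance is the most self-contained. A hedged sketch, assuming a class-conditional velocity model with a reserved "null" class id (the function names and the dropout rate are illustrative): train with random label dropout so the model also learns the unconditional distribution, then extrapolate between the two predictions at sampling time.

```python
import torch


def drop_labels(labels, null_id, p=0.1):
    # Replace roughly a fraction p of labels with the null class id
    # during training, so the same network learns both the conditional
    # and the unconditional velocity field.
    mask = torch.rand(labels.shape[0]) < p
    return torch.where(mask, torch.full_like(labels, null_id), labels)


def cfg_velocity(model, x, t, labels, null_id, scale=3.0):
    # Two forward passes: conditional and unconditional.
    v_cond = model(x, t, labels)
    v_uncond = model(x, t, torch.full_like(labels, null_id))
    # Extrapolate away from the unconditional prediction; scale=1.0
    # recovers plain conditional sampling.
    return v_uncond + scale * (v_cond - v_uncond)
```

In practice the two passes are usually batched together (conditional and null labels stacked along the batch dimension) so guidance costs one forward pass over a doubled batch.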
Extend the DiT to video: block-causal attention masks, per-frame conditioning via modulate/gate, action conditioning, and diffusion forcing training.
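The block-causal mask is the structural heart of this step. A minimal sketch (function and argument names are illustrative): tokens within a frame attend to each other fully, while frames attend only to themselves and earlier frames.

```python
import torch


def block_causal_mask(num_frames, tokens_per_frame):
    # mask[q, k] is True where query token q may attend to key token k.
    # All tokens in frame f see frames 0..f; nothing sees the future.
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    return frame_id[:, None] >= frame_id[None, :]
```

The resulting boolean matrix can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`. Diffusion forcing then trains with independent noise levels per frame, which is what makes this mask usable for autoregressive rollout at inference time.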
Add KV caching to the video DiT for efficient autoregressive generation: VideoKVCache with finalize/denoise modes, CachedVideoAttention with RoPE recomputation, and sliding-window eviction.
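The eviction logic can be sketched independently of the attention code. This is a hypothetical simplification of the cache described above (class and method names are illustrative; the actual `VideoKVCache` API may differ): once a frame is fully denoised, its keys/values are finalized into the cache, and the oldest frame is evicted when the sliding window is full.

```python
import torch


class FrameKVCache:
    """Per-layer KV cache over finalized frames with sliding-window eviction."""

    def __init__(self, max_frames):
        self.max_frames = max_frames
        # One (tokens_per_frame, dim) tensor per finalized frame.
        self.k, self.v = [], []

    def append(self, k, v):
        # Called once per frame after denoising completes ("finalize").
        self.k.append(k)
        self.v.append(v)
        # Evict the oldest frame once the window is exceeded.
        if len(self.k) > self.max_frames:
            self.k.pop(0)
            self.v.pop(0)

    def context(self):
        # Concatenated past keys/values for the current frame's attention.
        if not self.k:
            return None, None
        return torch.cat(self.k, dim=0), torch.cat(self.v, dim=0)
```

One subtlety the sketch omits: after eviction, cached keys sit at different absolute positions than when they were computed, which is why the full implementation recomputes RoPE rotations relative to the current window.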