Introduction to frame-autoregressive video models

A reading list and collection of resources for building up to frame-autoregressive video generation and world models. The progression has four steps: flow matching basics on 2D data, flow matching for images with a DiT, flow matching for video with causal attention, and finally KV caching for efficient frame-autoregressive generation.

🟢 read — essential reading · 🟡 skim — worth skimming · 🟣 bonus — for deeper exploration · 🔵 code — implementation

1. Rectified flow matching basics

Learn the fundamentals of flow matching on a simple 2D dataset (two moons): velocity prediction, training, Euler sampling, noise schedules (uniform, SD3), and classifier-free guidance.
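
As a rough orientation for the pieces listed above, here is a minimal sketch in plain Python with no neural network. It assumes the convention that t=0 is data and t=1 is pure noise; the opposite convention is also common, so check the notebook. The function names are illustrative and are not taken from the course code.

```python
def interpolate(x_data, noise, t):
    """Point on the straight line between data (t=0) and noise (t=1)."""
    return [(1 - t) * d + t * n for d, n in zip(x_data, noise)]


def velocity_target(x_data, noise):
    """Regression target for the velocity network: d x_t / dt = noise - data."""
    return [n - d for d, n in zip(x_data, noise)]


def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move `scale` times the (conditional - unconditional) difference."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn, x_noise, num_steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps.

    velocity_fn(x, t) stands in for the trained network (assumption)."""
    x = list(x_noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, the network would see `interpolate(x_data, noise, t)` and `t` as input and be regressed onto `velocity_target(x_data, noise)` with a mean-squared error. With `scale=1.0`, `cfg_velocity` reduces to the conditional prediction.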

2. Flow matching on MNIST

Build a Diffusion Transformer (DiT) from scratch and train it on MNIST with flow matching, classifier-free guidance, and the Muon optimizer.
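
Two ingredients of this step can be sketched without any framework: turning an image into patch tokens for the transformer, and randomly dropping labels during training so that one model learns both the conditional and unconditional velocity for classifier-free guidance. The sketch below is plain Python on nested lists. The patch size, the `NULL_CLASS` index, and all function names are assumptions for illustration, not the notebook's actual code.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch tokens,
    ordered row-major over the patch grid."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def unpatchify(tokens, height, width, patch):
    """Inverse of patchify: write each token back into its patch location."""
    image = [[0.0] * width for _ in range(height)]
    patches_per_row = width // patch
    for idx, token in enumerate(tokens):
        top = (idx // patches_per_row) * patch
        left = (idx % patches_per_row) * patch
        for k, value in enumerate(token):
            image[top + k // patch][left + k % patch] = value
    return image


NULL_CLASS = 10  # hypothetical extra index meaning "no label" (digits are 0-9)


def drop_labels(labels, drop_prob, rng):
    """Replace each label with NULL_CLASS with probability drop_prob."""
    return [NULL_CLASS if rng.random() < drop_prob else y for y in labels]
```

For a 28x28 MNIST digit with `patch=4`, `patchify` yields 49 tokens of 16 values each. At sampling time the unconditional branch would be obtained by passing `NULL_CLASS` as the label.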

3. Frame-autoregressive Pong

Extend the DiT to video: block-causal attention masks, per-frame conditioning via modulate/gate, action conditioning, and diffusion forcing training.
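
The three video-specific ideas above can be illustrated in a few lines of plain Python. This is a sketch under assumptions: tokens are laid out frame by frame, `True` in the mask means "may attend" (the same convention PyTorch uses for boolean attention masks), and the function names are invented for this example.

```python
import random


def block_causal_mask(num_frames, tokens_per_frame):
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens attend freely within their own frame and to all earlier frames,
    but never to later frames."""
    n = num_frames * tokens_per_frame
    return [[(k // tokens_per_frame) <= (q // tokens_per_frame)
             for k in range(n)]
            for q in range(n)]


def modulate_per_frame(tokens, shift, scale, tokens_per_frame):
    """Apply x * (1 + scale) + shift, using the shift/scale vectors of the
    frame each token belongs to. A gate would be applied the same way to the
    block output before the residual add: x + gate * block(x)."""
    out = []
    for idx, token in enumerate(tokens):
        f = idx // tokens_per_frame
        out.append([x * (1 + sc) + sh
                    for x, sc, sh in zip(token, scale[f], shift[f])])
    return out


def sample_frame_timesteps(num_frames, rng):
    """Diffusion-forcing style training: every frame gets its own
    independently sampled noise level in [0, 1)."""
    return [rng.random() for _ in range(num_frames)]
```

Per-frame modulation matters because, under diffusion forcing, frames in the same sequence sit at different noise levels, so a single sequence-wide timestep embedding would not be enough.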

4. KV caching for frame-autoregressive inference

Add KV caching to the video DiT for efficient autoregressive generation: VideoKVCache with finalize/denoise modes, CachedVideoAttention with RoPE recomputation, and sliding-window eviction.
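
To make the caching idea concrete, here is a toy frame-level cache in plain Python. It is not the course's `VideoKVCache`; the class name `ToyFrameKVCache` and its methods are hypothetical, and the meaning given to the two modes is an assumption: while a frame is being denoised its keys and values are only read alongside the cache, and once the frame is clean they are written to the cache, evicting the oldest frame when the window is full.

```python
from collections import deque


class ToyFrameKVCache:
    """Sliding-window cache holding keys/values for the last few clean frames."""

    def __init__(self, window_frames):
        # deque(maxlen=...) silently drops the oldest entry on overflow.
        self.frames = deque(maxlen=window_frames)

    def context(self, new_keys, new_values):
        """Denoise-mode read: cached frames followed by the in-progress
        frame. The cache itself is left unchanged."""
        keys = [k for fk, _ in self.frames for k in fk] + list(new_keys)
        values = [v for _, fv in self.frames for v in fv] + list(new_values)
        return keys, values

    def finalize(self, clean_keys, clean_values):
        """Finalize-mode write: store the clean frame's keys/values."""
        self.frames.append((list(clean_keys), list(clean_values)))

    def frame_positions(self):
        """Frame indices relative to the current window start. After an
        eviction these shift, which is one reason positional rotations may
        need to be recomputed at read time (assumed design)."""
        return list(range(len(self.frames)))
```

A generation loop would call `context` at every denoising step of the current frame, then run one more pass on the clean frame and call `finalize` before moving to the next frame. The cost per step then depends on the window size rather than on the full video length.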