A reading list and collection of resources for building up to frame-autoregressive video generation and world models. The progression goes: autoregressive transformers, then flow matching for images, then flow matching for video, and finally combining both with KV caching for efficient frame-autoregressive generation.
Legend:
- `read` — essential reading
- `skim` — worth skimming
- `bonus` — for deeper exploration
- `code` — implementation
Step 1: Build a decoder-only transformer from scratch and understand KV caching for efficient autoregressive generation.
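The core idea behind KV caching can be shown without a full transformer. The sketch below is a toy, pure-Python illustration (single head, tiny dimensions, hypothetical names): each decode step appends one new key/value pair to an append-only cache, so step t attends over t cached entries instead of recomputing projections for the whole prefix.

```python
import math

def attention(q, keys, values):
    """Single-query softmax attention over cached keys/values (one head, toy dims)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    """Append-only cache: each decode step adds one (key, value) pair,
    so the prefix is never re-projected."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, k, v, q):
        self.keys.append(k)
        self.values.append(v)
        return attention(q, self.keys, self.values)

cache = KVCache()
# Two decode steps; in a real model k, v, q come from projecting the new token.
out1 = cache.step([1.0, 0.0], [1.0, 2.0], [1.0, 0.0])
out2 = cache.step([0.0, 1.0], [3.0, 4.0], [0.0, 1.0])
print(len(cache.keys))  # cache has grown to 2 entries
```

With only one cached entry, attention is a no-op (weights are `[1.0]`), so `out1` equals the first value vector exactly; from the second step on, the query mixes all cached values.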
Step 2: Learn diffusion and flow matching by generating MNIST images. The references below trace the evolution from the original diffusion formulation to modern rectified flow matching.
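The rectified-flow objective you will implement on MNIST fits in a few lines. The sketch below (toy dimensions, hypothetical function names, no real model) shows the two halves: training regresses the model onto the constant velocity of the straight-line noise-to-data path, and sampling Euler-integrates that learned velocity field from t=0 to t=1.

```python
import random

def flow_matching_loss(model, x1, dim=2):
    """One rectified-flow training step on a single data point x1.

    x0 ~ N(0, 1) is noise; the straight-line path x_t = (1 - t) x0 + t x1
    has constant velocity x1 - x0, which the model learns to predict from (x_t, t)."""
    x0 = [random.gauss(0.0, 1.0) for _ in range(dim)]
    t = random.random()
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = model(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / dim

def sample(model, dim=2, steps=8):
    """Euler integration of dx/dt = model(x, t) from noise (t=0) to data (t=1)."""
    x = [random.gauss(0.0, 1.0) for _ in range(dim)]
    for i in range(steps):
        t = i / steps
        v = model(x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x

# A (useless) baseline "model" that always predicts zero velocity:
zero_model = lambda xt, t: [0.0] * len(xt)
loss = flow_matching_loss(zero_model, x1=[1.0, -1.0])
point = sample(zero_model)
```

On MNIST, `x1` would be a flattened digit and `model` a small neural network; the loop structure stays the same.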
Step 3: Move from images to video by generating Pong gameplay with flow-matching-based video models.
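The main structural change from the image case is that a whole clip is denoised jointly: every frame shares the same noise level t, so the model can exploit temporal context across frames. A minimal sketch under those assumptions (clips as lists of flat pixel lists, hypothetical names, no real network):

```python
import random

def video_flow_loss(model, clip):
    """One flow-matching training step on a whole clip.

    `clip` is a list of frames (each a flat pixel list). All frames share the
    same t and are denoised jointly, unlike per-image training."""
    t = random.random()
    noise = [[random.gauss(0.0, 1.0) for _ in frame] for frame in clip]
    xt = [[(1 - t) * n + t * p for n, p in zip(nf, f)] for nf, f in zip(noise, clip)]
    target = [[p - n for n, p in zip(nf, f)] for nf, f in zip(noise, clip)]
    pred = model(xt, t)
    return sum((pp - tp) ** 2
               for pf, tf in zip(pred, target)
               for pp, tp in zip(pf, tf))

zero_model = lambda xt, t: [[0.0] * len(f) for f in xt]
clip = [[0.0, 1.0], [1.0, 0.0]]  # two tiny 2-pixel "frames"
loss = video_flow_loss(zero_model, clip)
```

For Pong, each frame would be a rendered game screen and the model a spatiotemporal network attending across both pixels and frames.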
Step 4: Combine KV caching (from step 1) with flow-matching video generation (from step 3) for efficient frame-autoregressive world models.
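The combined rollout loop can be sketched as follows. This is a schematic, not a real model: each new frame starts from noise and is Euler-integrated to t=1 while the denoiser conditions on the already-generated frames; `context` stands in for the KV cache, which in a real model lets past frames be encoded once rather than re-attended from scratch at every flow step. All names here are hypothetical.

```python
import random

def generate_autoregressive(denoise_step, num_frames=4, dim=2, flow_steps=4):
    """Frame-autoregressive rollout: noise -> frame, conditioned on past frames."""
    context = []  # stands in for cached keys/values of previous frames
    for _ in range(num_frames):
        x = [random.gauss(0.0, 1.0) for _ in range(dim)]  # fresh noise per frame
        for i in range(flow_steps):
            t = i / flow_steps
            v = denoise_step(x, t, context)   # velocity prediction given context
            x = [xi + vi / flow_steps for xi, vi in zip(x, v)]
        context.append(x)  # finished frame joins the (cached) context
    return context

# Hypothetical denoiser that just damps toward zero and ignores its context:
toy = lambda x, t, ctx: [-xi for xi in x]
frames = generate_autoregressive(toy)
print(len(frames))  # 4 frames generated
```

Note the cost structure this buys: without caching, every flow step of every frame would re-encode the full history; with it, each past frame is projected into keys/values once, which is what makes long rollouts affordable.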