Introduction to frame-autoregressive video models

A reading list and collection of resources for building up to frame-autoregressive video generation and world models. The progression goes: autoregressive transformers, then flow matching for images, then flow matching for video, and finally combining autoregressive decoding with flow-matching video generation, using KV caching to make frame-autoregressive generation efficient.

Legend: read = essential reading; skim = worth skimming; bonus = for deeper exploration; code = implementation

1. Decoder-only transformer + KV caching

Build a decoder-only transformer from scratch and understand KV caching for efficient autoregressive generation.
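
As a rough illustration of what KV caching buys, here is a minimal single-head sketch in NumPy. It assumes one attention layer with no batching, no multi-head split, and random weights; `KVCache` and `cached_step` are hypothetical names chosen for this example, not part of any library. The point is only that decoding one token at a time against cached keys and values gives the same outputs as recomputing causal attention over the full sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    """Full recomputation: every position attends to itself and earlier positions."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class KVCache:
    """Hypothetical helper: stores keys/values of all tokens seen so far."""
    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else np.concatenate([self.k, k_new], axis=0)
        self.v = v_new if self.v is None else np.concatenate([self.v, v_new], axis=0)

def cached_step(x_t, Wq, Wk, Wv, cache):
    """Process one new token of shape (1, d), reusing cached keys and values."""
    q = x_t @ Wq
    cache.append(x_t @ Wk, x_t @ Wv)  # only the new token is projected
    scores = q @ cache.k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.v  # no mask needed: the cache holds no future tokens

rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

full = causal_attention(x, Wq, Wk, Wv)
cache = KVCache()
incremental = np.concatenate(
    [cached_step(x[t:t + 1], Wq, Wk, Wv, cache) for t in range(T)], axis=0
)
outputs_match = np.allclose(full, incremental)
```

In this sketch each incremental step projects one token and does a single row of attention, instead of redoing the projections and the full T x T score matrix at every step.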

2. MNIST & class-conditional MNIST using flow matching

Learn diffusion and flow matching by generating images on MNIST. The references below trace the evolution from the original diffusion formulation to modern rectified flow matching.
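
To make the training target concrete, the following is a minimal NumPy sketch of the rectified-flow style objective. It assumes the convention that t = 0 is Gaussian noise and t = 1 is data (some papers use the opposite direction), and it replaces the neural network with a hypothetical `oracle_velocity` that is only valid when the dataset is a single image. The function names and the `label` argument (a placeholder for class conditioning) are illustrative, not taken from any reference implementation.

```python
import numpy as np

def make_training_pair(x1, rng):
    """Sample noise x0 and time t, then build the interpolated input and its target."""
    x0 = rng.normal(size=x1.shape)      # noise endpoint (t = 0)
    t = rng.uniform()                   # t in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight line from noise to data
    target_velocity = x1 - x0           # constant velocity along that line
    return x_t, t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, x0, n_steps, label=None):
    """Integrate dx/dt = v(x, t, label) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    for i in range(n_steps):
        t = i / n_steps
        x = x + velocity_fn(x, t, label) / n_steps
    return x

rng = np.random.default_rng(0)

# Crude stand-in for an MNIST digit: a vertical bar on a 28x28 canvas.
digit = np.zeros((28, 28))
digit[4:24, 13:15] = 1.0

def oracle_velocity(x, t, label=None):
    # Assumption: with a single-image "dataset" the ideal velocity points straight
    # at that image. A real model would be a trained network that also uses `label`.
    return (digit - x) / (1.0 - t)

x_t, t, target = make_training_pair(digit, rng)
loss_at_oracle = flow_matching_loss(oracle_velocity(x_t, t), target)
sample = euler_sample(oracle_velocity, rng.normal(size=digit.shape), n_steps=10)
```

In a real setup `oracle_velocity` would be replaced by a network trained to minimise `flow_matching_loss` over many `(x_t, t, target_velocity)` pairs, with the class label passed in for the class-conditional variant.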

3. Pong using flow matching

Move from images to video: generate Pong games using flow-matching-based video models.
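
The main change from images to video is the shape of the data: a training example becomes a clip of frames rather than a single image. The sketch below is a hypothetical toy renderer (a single bouncing ball, no paddles or scoring) used only to show that shape, followed by the same noise-to-data interpolation applied to a whole clip. It assumes t = 0 is noise and t = 1 is data; `render_pong_clip` and its parameters are made up for this example.

```python
import numpy as np

def render_pong_clip(n_frames=8, size=16, pos=(3, 5), vel=(1, 2)):
    """Render a toy clip of shape (n_frames, size, size) with one bouncing ball pixel."""
    frames = np.zeros((n_frames, size, size), dtype=np.float32)
    y, x = pos
    vy, vx = vel
    for t in range(n_frames):
        frames[t, y, x] = 1.0
        # Reflect off a wall if the next move would leave the grid.
        if not 0 <= y + vy < size:
            vy = -vy
        if not 0 <= x + vx < size:
            vx = -vx
        y, x = y + vy, x + vx
    return frames

rng = np.random.default_rng(0)
clip = render_pong_clip()

# Flow-matching input for a whole clip: one time value t shared by all frames.
noise = rng.normal(size=clip.shape).astype(np.float32)
t = 0.3
noisy_clip = (1.0 - t) * noise + t * clip
target_velocity = clip - noise
```

Sharing one `t` across all frames is only one possible choice; it is used here because it is the most direct extension of the image case.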

4. KV caching for frame-autoregressive transformers

Combine KV caching (from step 1) with flow-matching video generation (from step 3) for efficient frame-autoregressive world models.
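
The control flow of this combination can be sketched as two nested loops: an outer loop over frames and an inner loop of flow-matching denoising steps, with the cache updated once per finished frame. Everything below is a toy under stated assumptions: `toy_velocity` and `toy_encode_kv` are hypothetical stand-ins for a transformer's denoising pass and its key/value projections, frames are flat vectors, and t = 0 is taken to be noise.

```python
import numpy as np

def generate_frames(n_frames, n_steps, frame_dim, velocity_fn, encode_kv, rng):
    """Frame-autoregressive sampling: denoise each frame against cached past frames."""
    cache = []   # one (key, value) entry per finished frame
    frames = []
    for _ in range(n_frames):
        x = rng.normal(size=frame_dim)          # each new frame starts as noise
        for i in range(n_steps):
            t = i / n_steps
            # Past frames are read from the cache, never re-encoded.
            x = x + velocity_fn(x, t, cache) / n_steps
        cache.append(encode_kv(x))              # encode the clean frame exactly once
        frames.append(x)
    return np.stack(frames), cache

encode_calls = 0

def toy_encode_kv(frame):
    # Stand-in for the key/value projections of a finished frame.
    global encode_calls
    encode_calls += 1
    return frame.copy(), frame.copy()

def toy_velocity(x, t, cache):
    # Stand-in for the model: the "next frame" is the previous frame's value plus one,
    # or all zeros when there is no history yet.
    if cache:
        _, last_value = cache[-1]
        target = last_value + 1.0
    else:
        target = np.zeros_like(x)
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
frames, cache = generate_frames(
    n_frames=3, n_steps=4, frame_dim=2,
    velocity_fn=toy_velocity, encode_kv=toy_encode_kv, rng=rng,
)
```

Without the cache, every denoising step of every new frame would have to re-encode all previous frames; in this sketch each frame is encoded once, after it is finished, however many denoising steps later frames take.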