A reading list and collection of resources for building up to frame-autoregressive video generation and world models. The progression: flow matching basics on 2D data, then flow matching for images with a DiT, then flow matching for video with block-causal attention, and finally KV caching for efficient frame-autoregressive generation.
🟢 read — essential reading · 🟡 skim — worth skimming · 🟣 bonus — for deeper exploration · 🔵 code — implementation
Learn the fundamentals of flow matching on a simple 2D dataset (two moons): velocity prediction, training, Euler sampling, noise schedules (uniform, SD3), and classifier-free guidance.
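The core training/sampling loop described above can be sketched in a few lines. This is a minimal, hypothetical implementation (the MLP and function names are illustrative, not from any specific repo): a small network predicts velocity on the linear interpolant between noise and data, and Euler integration turns noise into samples.

```python
import torch
import torch.nn as nn


class VelocityMLP(nn.Module):
    """Tiny MLP predicting velocity v(x_t, t) for 2D flow matching."""

    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # Condition on t by concatenating it as an extra input feature.
        return self.net(torch.cat([x, t[:, None]], dim=-1))


def fm_loss(model, x1):
    # Linear interpolant: x_t = (1 - t) * x0 + t * x1, with x0 ~ N(0, I).
    # The regression target is the constant velocity x1 - x0.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0])
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()


@torch.no_grad()
def euler_sample(model, n=8, steps=20):
    # Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
    x = torch.randn(n, 2)
    for i in range(steps):
        t = torch.full((n,), i / steps)
        x = x + model(x, t) / steps
    return x
```

Noise schedules (e.g. SD3's logit-normal) only change how `t` is drawn during training; the loss and sampler stay the same.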
Build a Diffusion Transformer (DiT) from scratch and train it on MNIST with flow matching, classifier-free guidance, and the Muon optimizer.
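Of the pieces listed above, classifier-free guidance is the most self-contained. A hedged sketch, assuming a class-conditional velocity model with a reserved "null" class id (the function names and the dropout rate are illustrative): train with random label dropout so the model also learns the unconditional distribution, then extrapolate between the two predictions at sampling time.

```python
import torch


def drop_labels(labels, null_id, p=0.1):
    # Replace roughly a fraction p of labels with the null class id
    # during training, so the same network learns both the conditional
    # and the unconditional velocity field.
    mask = torch.rand(labels.shape[0]) < p
    return torch.where(mask, torch.full_like(labels, null_id), labels)


def cfg_velocity(model, x, t, labels, null_id, scale=3.0):
    # Two forward passes: conditional and unconditional.
    v_cond = model(x, t, labels)
    v_uncond = model(x, t, torch.full_like(labels, null_id))
    # Extrapolate away from the unconditional prediction; scale=1.0
    # recovers plain conditional sampling.
    return v_uncond + scale * (v_cond - v_uncond)
```

In practice the two passes are usually batched together (conditional and null labels stacked along the batch dimension) so guidance costs one forward pass over a doubled batch.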
Extend the DiT to video: block-causal attention masks, per-frame conditioning via modulate/gate, action conditioning, and diffusion forcing training.
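The block-causal mask is the structural heart of this step. A minimal sketch (function and argument names are illustrative): tokens within a frame attend to each other fully, while frames attend only to themselves and earlier frames.

```python
import torch


def block_causal_mask(num_frames, tokens_per_frame):
    # mask[q, k] is True where query token q may attend to key token k.
    # All tokens in frame f see frames 0..f; nothing sees the future.
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    return frame_id[:, None] >= frame_id[None, :]
```

The resulting boolean matrix can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`. Diffusion forcing then trains with independent noise levels per frame, which is what makes this mask usable for autoregressive rollout at inference time.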
Add KV caching to the video DiT for efficient autoregressive generation: VideoKVCache with finalize/denoise modes, CachedVideoAttention with RoPE recomputation, and sliding-window eviction.
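The eviction logic can be sketched independently of the attention code. This is a hypothetical simplification of the cache described above (class and method names are illustrative; the actual `VideoKVCache` API may differ): once a frame is fully denoised, its keys/values are finalized into the cache, and the oldest frame is evicted when the sliding window is full.

```python
import torch


class FrameKVCache:
    """Per-layer KV cache over finalized frames with sliding-window eviction."""

    def __init__(self, max_frames):
        self.max_frames = max_frames
        # One (tokens_per_frame, dim) tensor per finalized frame.
        self.k, self.v = [], []

    def append(self, k, v):
        # Called once per frame after denoising completes ("finalize").
        self.k.append(k)
        self.v.append(v)
        # Evict the oldest frame once the window is exceeded.
        if len(self.k) > self.max_frames:
            self.k.pop(0)
            self.v.pop(0)

    def context(self):
        # Concatenated past keys/values for the current frame's attention.
        if not self.k:
            return None, None
        return torch.cat(self.k, dim=0), torch.cat(self.v, dim=0)
```

One subtlety the sketch omits: after eviction, cached keys sit at different absolute positions than when they were computed, which is why the full implementation recomputes RoPE rotations relative to the current window.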