Chris Wendler & Claude Opus 4.6, 03/05/26
We are training a video generation model for Doom gameplay. The pipeline records PvP deathmatch episodes as MP4 video, then encodes each frame into latent space using a pretrained image autoencoder (DC-AE-Lite f32c32) so the video model can train directly on latents. While setting up the encoding pipeline across 5 compute nodes, we noticed something surprising: the encoded latent dataset is barely smaller than the original MP4 recordings. This note works out why, and under what conditions latent representations actually save storage.
Our autoencoder (DC-AE-Lite f32c32) compresses each \(3 \times 480 \times 640\) RGB frame into a \((32, 15, 20)\) float16 latent tensor. That is \(32 \cdot 15 \cdot 20 \cdot 2 = 19{,}200\) bytes per frame, versus \(3 \cdot 480 \cdot 640 = 921{,}600\) bytes for the raw RGB frame.
The recordings are H.264-compressed MP4 video at 35 fps, stored alongside action and reward arrays in WebDataset tar shards. The dataset contains ~2640 PvP episodes (~340 hours of video across both player perspectives).
We measured the storage cost per minute of video for both formats:
| Format | Total size | MB/min | Details |
|---|---|---|---|
| MP4 recordings | 902 GB | ~46 | H.264 video + actions + rewards |
| Latent shards | ~812 GB (projected) | ~40 | Uncompressed float16 numpy arrays |
The latent representation is only ~13% smaller than the compressed MP4. This is initially surprising — we compressed the spatial dimensions by \(32\times\) in each direction, a factor of \(1024\times\) in area. Where did the savings go?
Let \(H, W\) be the video height and width in pixels, \(F\) the frame rate in fps, and \(T\) the duration in seconds. For the autoencoder, let \(s\) be the spatial compression factor, \(c\) the number of latent channels, and \(p\) the bytes per latent value (e.g., \(p = 2\) for float16). For MP4, let \(b\) be the effective bits per pixel, an empirical quantity that captures how efficiently H.264 compresses the content. Finally, let \(t\) denote an optional temporal compression factor (\(t = 1\) means no temporal compression).
The storage cost for uncompressed latents is exact and deterministic:
\[S_{\text{latent}} = c \cdot \frac{H}{s} \cdot \frac{W}{s} \cdot p \cdot F \cdot T\]
For MP4:
\[S_{\text{mp4}} = \frac{b \cdot H \cdot W \cdot F \cdot T}{8}\]
Their ratio is:
\[R = \frac{S_{\text{latent}}}{S_{\text{mp4}}} = \frac{c \cdot p \cdot 8}{b \cdot s^2}\]
Note that \(F\), \(T\), \(H\), and \(W\) all cancel. The storage ratio depends only on the VAE design parameters \((c, s, p)\) and the MP4 compression quality \(b\).
From our measured data, \(b \approx 0.57\) bits/pixel for Doom gameplay at 480p. Plugging in:
\[R = \frac{32 \cdot 2 \cdot 8}{0.57 \cdot 32^2} = \frac{512}{583.7} \approx 0.88\]
This matches our empirical ratio of \(40.3 / 46.0 = 0.88\).
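As a quick sanity check, the ratio falls straight out of the formula (plain Python, using the constants defined above):

```python
# Storage ratio R = (c * p * 8) / (b * s**2) for DC-AE f32c32 vs H.264
# at b ≈ 0.57 bits/pixel (measured on our 480p Doom recordings).
c, s, p = 32, 32, 2          # latent channels, spatial factor, bytes per value (fp16)
b = 0.57                     # effective H.264 bits per pixel

R = (c * p * 8) / (b * s**2)
print(f"R = {R:.2f}")        # ≈ 0.88

# Cross-check against the measured MB/min figures from the table above.
R_measured = 40.3 / 46.0
assert abs(R - R_measured) < 0.01
```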
The autoencoder compresses spatial area by \(s^2 = 1024\times\), but it expands the channel dimension from 3 to \(c = 32\) (a \(10.7\times\) increase) and doubles the bytes per value (float16 vs uint8); meanwhile, the MP4 it competes against was already heavily compressed. The net "raw" compression from the autoencoder is:
\[\text{AE compression} = \frac{3 \cdot H \cdot W \cdot 1\text{ (uint8)}}{c \cdot \frac{H}{s} \cdot \frac{W}{s} \cdot p} = \frac{3 \cdot s^2}{c \cdot p} = \frac{3 \cdot 1024}{32 \cdot 2} = 48\times\]
So relative to raw uncompressed video (3 bytes/pixel), the autoencoder achieves \(48\times\) compression. But H.264 achieves roughly:
\[\text{H.264 compression} = \frac{3 \cdot 8}{b} = \frac{24}{0.57} \approx 42\times\]
The autoencoder and H.264 achieve nearly identical compression ratios through completely different mechanisms. H.264 exploits spatial and temporal redundancy with entropy coding. The autoencoder learns a spatial-only nonlinear transform. The result: a near-tie at 480p.
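Both factors are one-line computations from the same constants (a sketch, not measurement code):

```python
# Compression vs raw uncompressed video (3 bytes/pixel) for both codecs.
c, s, p, b = 32, 32, 2, 0.57

ae_compression = (3 * s**2) / (c * p)   # learned spatial transform: 48x
h264_compression = (3 * 8) / b          # 24 bpp raw vs b bpp encoded: ~42x

print(ae_compression, round(h264_compression, 1))
```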
With a 3D VAE that also compresses along the time axis by factor \(t\):
\[R_{\text{temporal}} = \frac{c \cdot p \cdot 8}{b \cdot s^2 \cdot t}\]
This is the most significant lever. For our setup, even \(t = 2\) would bring the ratio from 0.88 to 0.44 — cutting latent storage in half.
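A small helper makes it easy to scan temporal factors (`storage_ratio` is our own name for the formula above, not a library function):

```python
def storage_ratio(c, s, p, b, t=1):
    """Latent-to-MP4 storage ratio R = (c * p * 8) / (b * s**2 * t)."""
    return (c * p * 8) / (b * s**2 * t)

# Our DC-AE f32c32 setup at 480p (b ≈ 0.57), with hypothetical temporal factors.
for t in (1, 2, 4, 8):
    print(t, round(storage_ratio(c=32, s=32, p=2, b=0.57, t=t), 2))
# t=1 → 0.88, t=2 → 0.44, t=4 → 0.22, t=8 → 0.11
```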
How does our DC-AE compare to Cosmos CV8x8x8, a 3D video tokenizer with temporal compression? We compute the storage ratio \(R\) and per-frame latent size for both at 480p (\(b = 0.57\)):
| VAE | \(s\) | \(c\) | \(t\) | \(p\) | Latent shape (per frame) | Compression vs raw | \(R\) vs MP4 | MB/min (480p, 35fps) |
|---|---|---|---|---|---|---|---|---|
| DC-AE f32c32 (ours) | 32 | 32 | 1 | 2 | \(32 \times 15 \times 20\) | \(48\times\) | 0.88 | 40.3 |
| DC-AE f32c32 + fp8 | 32 | 32 | 1 | 1 | \(32 \times 15 \times 20\) | \(96\times\) | 0.44 | 20.2 |
| Cosmos CV8x8x8 | 8 | 16 | 8 | 2 | \(16 \times 60 \times 80\) | \(48\times\) | 0.88 | 40.3 |
| Cosmos CV8x8x8 + fp8 | 8 | 16 | 8 | 1 | \(16 \times 60 \times 80\) | \(96\times\) | 0.44 | 20.2 |
Cosmos and DC-AE land at exactly the same overall compression ratio at matching precision (\(48\times\) vs raw and \(R = 0.88\) at fp16; \(96\times\) and \(R = 0.44\) at fp8) despite very different designs: the quantity \(c \cdot p / (s^2 \cdot t)\) is \(64/1024 = 1/16\) for DC-AE and \(32/512 = 1/16\) for Cosmos. DC-AE compresses space aggressively (\(32\times\)) but keeps all frames; Cosmos compresses space modestly (\(8\times\)) but stores only one latent frame per 8 video frames. The result is the same bytes on disk, but with very different latent geometries: DC-AE gives tiny spatial grids (\(15 \times 20\)) at full temporal resolution, while Cosmos gives larger spatial grids (\(60 \times 80\)) at \(\frac{1}{8}\) temporal resolution.
For a video diffusion model, the Cosmos-style tradeoff is generally preferable: the \(8\times\) shorter sequence length dramatically reduces the quadratic cost of attention, which more than compensates for the larger per-frame spatial dimensions.
While the ratio \(R\) appears resolution-independent in the formula, this is misleading. In practice, H.264's effective bits-per-pixel \(b\) decreases at higher resolutions because there is more spatial redundancy to exploit (neighboring pixels in a Doom corridor are very similar). Meanwhile, the latent size scales exactly as \((H/s) \times (W/s)\) with no such benefit.
We estimate \(b\) for Doom gameplay content at various resolutions (480p is measured, others estimated from typical H.264 behavior on game content):
| Resolution | H×W | Latent dims | \(b\) (bpp) | MP4 MB/min | Latent MB/min | \(R\) | \(t{=}4\) MB/min | \(R_{t=4}\) | \(t{=}8\) MB/min | \(R_{t=8}\) | \(t{=}8\)+fp8 MB/min | \(R_{t=8,\text{fp8}}\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 240p | 240×426 | 8×14 | 0.80 | 21.5 | 15.1 | 0.70 | 3.8 | 0.18 | 1.9 | 0.09 | 0.9 | 0.04 |
| 360p | 360×640 | 12×20 | 0.68 | 41.1 | 32.3 | 0.78 | 8.1 | 0.20 | 4.0 | 0.10 | 2.0 | 0.05 |
| 480p | 480×640 | 15×20 | 0.57 | 46.0 | 40.3 | 0.88 | 10.1 | 0.22 | 5.0 | 0.11 | 2.5 | 0.05 |
| 720p | 720×1280 | 23×40 | 0.40 | 96.8 | 123.6 | 1.28 | 30.9 | 0.32 | 15.5 | 0.16 | 7.7 | 0.08 |
| 1080p | 1080×1920 | 34×60 | 0.30 | 163.3 | 274.2 | 1.68 | 68.5 | 0.42 | 34.3 | 0.21 | 17.1 | 0.10 |
| 1440p | 1440×2560 | 45×80 | 0.22 | 212.9 | 483.8 | 2.27 | 121.0 | 0.57 | 60.5 | 0.28 | 30.2 | 0.14 |
| 4K | 2160×3840 | 68×120 | 0.17 | 370.1 | 1096.7 | 2.96 | 274.2 | 0.74 | 137.1 | 0.37 | 68.5 | 0.19 |
At 720p and above, raw float16 latents are larger than the MP4 recordings. At 4K, latents would be nearly \(3\times\) the size of MP4. The crossover happens because H.264 gets relatively more efficient at higher resolution while latent size scales proportionally.
Setting \(R_{\text{temporal}} < 1\) and solving for \(t\):
\[t > \frac{c \cdot p \cdot 8}{b \cdot s^2}\]
| Resolution | Minimum temporal compression to beat MP4 |
|---|---|
| 240p – 480p | Already smaller (\(R < 1\)) |
| 720p | \(t \geq 1.3\times\) |
| 1080p | \(t \geq 1.7\times\) |
| 1440p | \(t \geq 2.3\times\) |
| 4K | \(t \geq 3.0\times\) |
Note that these thresholds are very modest — real 3D VAEs (e.g., Cosmos CV8x8x8) typically use \(t = 4\) or \(t = 8\). Any temporal compression beyond these thresholds is pure savings.
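The thresholds follow from the inequality above; the `b` values are the table's estimates (only 480p is measured), and the table rounds the thresholds up:

```python
# Minimum temporal compression t to beat MP4: t > c*p*8 / (b * s**2).
c, s, p = 32, 32, 2
b_per_res = {"240p": 0.80, "360p": 0.68, "480p": 0.57,
             "720p": 0.40, "1080p": 0.30, "1440p": 0.22, "4K": 0.17}

for res, b in b_per_res.items():
    t_min = (c * p * 8) / (b * s**2)   # simplifies to 0.5 / b for our VAE
    status = "already < 1" if t_min < 1 else f"t > {t_min:.2f}"
    print(res, status)
```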
Our full PvP dataset at various hypothetical resolutions (~340 hours of video):
| Resolution | MP4 | Latent (fp16) | Latent (\(t{=}8\), fp8) |
|---|---|---|---|
| 240p | 0.44 TB | 0.31 TB | 19 GB |
| 480p | 0.94 TB | 0.82 TB | 51 GB |
| 720p | 1.97 TB | 2.52 TB | 158 GB |
| 1080p | 3.33 TB | 5.59 TB | 350 GB |
| 4K | 7.55 TB | 22.37 TB | 1.40 TB |
The best-case scenario (temporal compression \(t=8\) + fp8 precision) would reduce our current 480p dataset from 820 GB to 51 GB — small enough to fit entirely in RAM during training.
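The dataset projections are straight multiplication (the MB/min figures come from the tables above; the \(t{=}8\) + fp8 configuration is hypothetical):

```python
# Projected storage for ~340 hours of video (both player perspectives).
minutes = 340 * 60
mb_per_min_fp16 = 40.3       # measured: 480p float16 latents
mb_per_min_t8_fp8 = 2.5      # hypothetical: t=8 temporal compression + fp8

tb_fp16 = mb_per_min_fp16 * minutes / 1e6
gb_t8_fp8 = mb_per_min_t8_fp8 * minutes / 1e3
print(f"fp16 latents:      {tb_fp16:.2f} TB")   # ≈ 0.82 TB
print(f"t=8 + fp8 latents: {gb_t8_fp8:.0f} GB") # ≈ 51 GB
```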
We benchmarked several post-hoc compression strategies on our existing latent shards:
| Method | MB/min | % of raw | Lossless? |
|---|---|---|---|
| Raw float16 (baseline) | 40.3 | 100% | yes |
| zstd | 37.1 | 92.1% | yes |
| Delta encoding + zstd | 36.6 | 90.8% | ~yes |
| Delta + int8 + zstd | 12.7 | 31.5% | no (MSE ~100k) |
Lossless compression yields only ~9% savings. The frame-to-frame delta values are still high-entropy float16 numbers that generic compressors struggle with. Lossy int8 quantization achieves large savings (69%) but introduces unacceptable reconstruction error because the quantization must cover a wide dynamic range of delta values.
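For reference, a bit-exact delta scheme can be sketched as follows. It uses `zlib` from the standard library as a stand-in for zstd, and delta-encodes the raw fp16 bit patterns (which, unlike float16 subtraction, is exactly invertible via wraparound arithmetic). The latents here are synthetic random values, so the printed compression ratio is not representative of real data; the point is the exact roundtrip:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 32, 15, 20)).astype(np.float16)  # (T, C, H, W)

# Delta-encode the uint16 bit patterns (wraps mod 2**16, exactly invertible),
# keeping the first frame so reconstruction has an anchor.
bits = latents.view(np.uint16)
payload = np.concatenate([bits[:1], np.diff(bits, axis=0)], axis=0)
compressed = zlib.compress(payload.tobytes(), 9)
print(f"compressed to {len(compressed) / latents.nbytes:.0%} of raw")

# Reconstruct: cumulative sum modulo 2**16 recovers the original bit patterns.
decoded = np.frombuffer(zlib.decompress(compressed), dtype=np.uint16).reshape(bits.shape)
restored = np.cumsum(decoded, axis=0).astype(np.uint16).view(np.float16)
assert np.array_equal(restored, latents)
```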
Per-channel int8 quantization of deltas (clipping outliers at \(n\sigma\)) improves things somewhat but still accumulates drift over thousands of frames through the cumulative sum reconstruction:
| Clip threshold | % of raw | MSE | Max error |
|---|---|---|---|
| \(3\sigma\) | 45.1% | 102.2 | 227.7 |
| \(4\sigma\) | 42.8% | 35.7 | 232.2 |
| \(6\sigma\) | 39.2% | 18.1 | 216.1 |
| \(8\sigma\) | 36.5% | 13.8 | 94.4 |
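A minimal sketch of the per-channel scheme (synthetic data; the function names and scale handling are illustrative, not our benchmark code):

```python
import numpy as np

def quantize_deltas(latents, n_sigma=6.0):
    """Per-channel int8 quantization of frame deltas, clipped at n_sigma."""
    deltas = np.diff(latents.astype(np.float32), axis=0)   # (T-1, C, H, W)
    sigma = deltas.std(axis=(0, 2, 3), keepdims=True)      # per-channel scale
    scale = n_sigma * sigma / 127.0
    q = np.clip(np.round(deltas / scale), -127, 127).astype(np.int8)
    return latents[:1], q, scale

def reconstruct(first, q, scale):
    """Cumulative-sum reconstruction; quantization error accumulates with T."""
    first = first.astype(np.float32)
    deltas = q.astype(np.float32) * scale
    return np.concatenate([first, first + np.cumsum(deltas, axis=0)], axis=0)

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 32, 15, 20)).astype(np.float16)
first, q, scale = quantize_deltas(latents)
restored = reconstruct(first, q, scale)
mse = float(np.mean((restored - latents.astype(np.float32)) ** 2))
print(f"roundtrip MSE: {mse:.4f}")  # grows with sequence length (drift)
```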
A look at common practice confirms this: we found no major project that further compresses pre-encoded latents. Open-Sora, Cosmos, and Stable Video Diffusion all store raw latent tensors. The consensus approach is to eat the disk cost.
The fundamental issue is that a spatial-only image autoencoder and H.264 are solving overlapping problems. Both reduce spatial redundancy — the autoencoder through a learned nonlinear transform, H.264 through block-based DCT and motion compensation. They arrive at remarkably similar compression ratios (~42–48× vs raw) through completely different mechanisms.
The autoencoder's real value is not storage efficiency but representational: the latent space is structured for generation, not just compression. Training a diffusion model on \(15 \times 20 \times 32\) latent frames is vastly cheaper than on \(480 \times 640 \times 3\) pixel frames, regardless of how they're stored on disk.
For future work, the most impactful change would be switching to a 3D VAE with temporal compression (e.g., Cosmos-style \(t = 8\)).
The storage savings are a side benefit — the real win is the \(8\times\) shorter sequences for the video model.
However, full temporal compression introduces a problem for interactive applications like game world models: during inference, we want to generate and decode one frame at a time (e.g., to render the next frame given a player action), but a 3D VAE with \(t = 8\) can only decode in chunks of 8 frames. This latency is unacceptable for real-time interaction.
A practical middle ground would be to use temporal compression for storage but train an additional single-frame decoder that can map each temporally compressed latent frame directly to pixels. The key idea: we don't undo the temporal compression at inference. Instead, we teach a decoder to work with the compressed representation natively.
Concretely, the design keeps the full temporal encoder and decoder, and adds a second decoding path: a single-frame decoder that maps one compressed latent frame directly to pixels.
During dataset creation, we use the full encoder to produce temporally compressed latents for storage, getting the full \(s^2 \cdot t\) compression benefit. The world model is trained on sequences of these compressed latent frames.
During inference, the world model outputs one compressed latent frame per step, and the single-frame decoder maps it directly to pixels. There is no need to buffer \(t\) frames or run the temporal decompressor, so per-step decode latency stays compatible with real-time interaction.
The single-frame decoder would be trained alongside the full autoencoder, with a reconstruction loss that encourages each compressed latent frame to contain enough information to independently decode to a plausible pixel frame. Since each compressed latent already summarizes \(t\) frames of context, this is not as lossy as it might seem — the compressed representation is richer per frame than uncompressed per-frame latents (more temporal context is baked in). The decoded frame would correspond to a representative frame from the \(t\)-frame window (e.g., the middle or last frame).