Chris Wendler & Claude Opus 4.6, 03/05/26
We are training a video generation model for Doom gameplay. The pipeline records PvP deathmatch episodes as MP4 video, then encodes each frame into latent space using a pretrained image autoencoder (DC-AE-Lite f32c32) so the video model can train directly on latents. While setting up the encoding pipeline across 5 compute nodes, we noticed something surprising: the encoded latent dataset is barely smaller than the original MP4 recordings. This note works out why, and under what conditions latent representations actually save storage.
Our autoencoder (DC-AE-Lite f32c32) compresses each \(3 \times 480 \times 640\) RGB frame into a \((32, 15, 20)\) float16 latent tensor. That is \(32 \cdot 15 \cdot 20 \cdot 2 = 19{,}200\) bytes per frame, versus \(3 \cdot 480 \cdot 640 = 921{,}600\) bytes for the raw RGB frame.
The recordings are H.264-compressed MP4 video at 35 fps, stored alongside action and reward arrays in WebDataset tar shards. The dataset contains ~2640 PvP episodes (~340 hours of video across both player perspectives).
We measured the storage cost per minute of video for both formats:
| Format | Total size | MB/min | Details |
|---|---|---|---|
| MP4 recordings | 902 GB | ~46 | H.264 video + actions + rewards |
| Latent shards | ~812 GB (projected) | ~40 | Uncompressed float16 numpy arrays |
The latent representation is only ~13% smaller than the compressed MP4. This is initially surprising — we compressed the spatial dimensions by \(32\times\) in each direction, a factor of \(1024\times\) in area. Where did the savings go?
Let \(H, W\) be the video height and width in pixels, \(F\) the frame rate in fps, and \(T\) the duration in seconds. For the autoencoder, let \(s\) be the spatial compression factor, \(c\) the number of latent channels, and \(p\) the bytes per latent value (e.g., \(p = 2\) for float16). For MP4, let \(b\) be the effective bits per pixel, an empirical quantity that captures how efficiently H.264 compresses the content. Finally, let \(t\) denote an optional temporal compression factor (\(t = 1\) means no temporal compression).
The storage cost for uncompressed latents is exact and deterministic:
\[S_{\text{latent}} = c \cdot \frac{H}{s} \cdot \frac{W}{s} \cdot p \cdot F \cdot T\]
For MP4:
\[S_{\text{mp4}} = \frac{b \cdot H \cdot W \cdot F \cdot T}{8}\]
Their ratio is:
\[R = \frac{S_{\text{latent}}}{S_{\text{mp4}}} = \frac{c \cdot p \cdot 8}{b \cdot s^2}\]
Note that \(F\), \(T\), \(H\), and \(W\) all cancel. The storage ratio depends only on the VAE design parameters \((c, s, p)\) and the MP4 compression quality \(b\).
From our measured data, \(b \approx 0.57\) bits/pixel for Doom gameplay at 480p. Plugging in:
\[R = \frac{32 \cdot 2 \cdot 8}{0.57 \cdot 32^2} = \frac{512}{583.7} \approx 0.88\]
This matches our empirical ratio of \(40.3 / 46.0 = 0.88\).
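As a quick sanity check, the ratio falls straight out of the formula (plain Python, using the constants defined above):

```python
# Storage ratio R = (c * p * 8) / (b * s**2) for DC-AE f32c32 vs H.264
# at b ≈ 0.57 bits/pixel (measured on our 480p Doom recordings).
c, s, p = 32, 32, 2          # latent channels, spatial factor, bytes per value (fp16)
b = 0.57                     # effective H.264 bits per pixel

R = (c * p * 8) / (b * s**2)
print(f"R = {R:.2f}")        # ≈ 0.88

# Cross-check against the measured MB/min figures from the table above.
R_measured = 40.3 / 46.0
assert abs(R - R_measured) < 0.01
```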
The autoencoder compresses spatial area by \(s^2 = 1024\times\), but it expands the channel dimension from 3 to \(c = 32\) (a \(10.7\times\) increase) and doubles the bytes per value (float16 vs uint8); meanwhile, the MP4 it competes against was already heavily compressed. The net "raw" compression from the autoencoder is:
\[\text{AE compression} = \frac{3 \cdot H \cdot W \cdot 1\text{ (uint8)}}{c \cdot \frac{H}{s} \cdot \frac{W}{s} \cdot p} = \frac{3 \cdot s^2}{c \cdot p} = \frac{3 \cdot 1024}{32 \cdot 2} = 48\times\]
So relative to raw uncompressed video (3 bytes/pixel), the autoencoder achieves \(48\times\) compression. But H.264 achieves roughly:
\[\text{H.264 compression} = \frac{3 \cdot 8}{b} = \frac{24}{0.57} \approx 42\times\]
The autoencoder and H.264 achieve nearly identical compression ratios through completely different mechanisms. H.264 exploits spatial and temporal redundancy with entropy coding. The autoencoder learns a spatial-only nonlinear transform. The result: a near-tie at 480p.
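Both factors are one-line computations from the same constants (a sketch, not measurement code):

```python
# Compression vs raw uncompressed video (3 bytes/pixel) for both codecs.
c, s, p, b = 32, 32, 2, 0.57

ae_compression = (3 * s**2) / (c * p)   # learned spatial transform: 48x
h264_compression = (3 * 8) / b          # 24 bpp raw vs b bpp encoded: ~42x

print(ae_compression, round(h264_compression, 1))
```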
With a 3D VAE that also compresses along the time axis by factor \(t\):
\[R_{\text{temporal}} = \frac{c \cdot p \cdot 8}{b \cdot s^2 \cdot t}\]
This is the most significant lever. For our setup, even \(t = 2\) would bring the ratio from 0.88 to 0.44 — cutting latent storage in half.
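A small helper makes it easy to scan temporal factors (`storage_ratio` is our own name for the formula above, not a library function):

```python
def storage_ratio(c, s, p, b, t=1):
    """Latent-to-MP4 storage ratio R = (c * p * 8) / (b * s**2 * t)."""
    return (c * p * 8) / (b * s**2 * t)

# Our DC-AE f32c32 setup at 480p (b ≈ 0.57), with hypothetical temporal factors.
for t in (1, 2, 4, 8):
    print(t, round(storage_ratio(c=32, s=32, p=2, b=0.57, t=t), 2))
# t=1 → 0.88, t=2 → 0.44, t=4 → 0.22, t=8 → 0.11
```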
How does our DC-AE compare to Cosmos CV8x8x8, a 3D video tokenizer with temporal compression? We compute the storage ratio \(R\) and per-frame latent size for both at 480p (\(b = 0.57\)):
| VAE | \(s\) | \(c\) | \(t\) | \(p\) | Latent shape (per frame) | Compression vs raw | \(R\) vs MP4 | MB/min (480p, 35fps) |
|---|---|---|---|---|---|---|---|---|
| DC-AE f32c32 (ours) | 32 | 32 | 1 | 2 | \(32 \times 15 \times 20\) | \(48\times\) | 0.88 | 40.3 |
| DC-AE f32c32 + fp8 | 32 | 32 | 1 | 1 | \(32 \times 15 \times 20\) | \(96\times\) | 0.44 | 20.2 |
| Cosmos CV8x8x8 | 8 | 16 | 8 | 2 | \(16 \times 60 \times 80\) | \(48\times\) | 0.88 | 40.3 |
| Cosmos CV8x8x8 + fp8 | 8 | 16 | 8 | 1 | \(16 \times 60 \times 80\) | \(96\times\) | 0.44 | 20.2 |
Cosmos and DC-AE land at exactly the same overall compression ratio at matching precision (\(48\times\) vs raw and \(R = 0.88\) at fp16; \(96\times\) and \(R = 0.44\) at fp8) despite very different designs: the quantity \(c \cdot p / (s^2 \cdot t)\) is \(64/1024 = 1/16\) for DC-AE and \(32/512 = 1/16\) for Cosmos. DC-AE compresses space aggressively (\(32\times\)) but keeps all frames; Cosmos compresses space modestly (\(8\times\)) but stores only one latent frame per 8 video frames. The result is the same bytes on disk, but with very different latent geometries: DC-AE gives tiny spatial grids (\(15 \times 20\)) at full temporal resolution, while Cosmos gives larger spatial grids (\(60 \times 80\)) at \(\frac{1}{8}\) temporal resolution.
For a video diffusion model, the Cosmos-style tradeoff is generally preferable: the \(8\times\) shorter sequence length dramatically reduces the quadratic cost of attention, which more than compensates for the larger per-frame spatial dimensions.
While the ratio \(R\) appears resolution-independent in the formula, this is misleading. In practice, H.264's effective bits-per-pixel \(b\) decreases at higher resolutions because there is more spatial redundancy to exploit (neighboring pixels in a Doom corridor are very similar). Meanwhile, the latent size scales exactly as \((H/s) \times (W/s)\) with no such benefit.
We estimate \(b\) for Doom gameplay content at various resolutions (480p is measured, others estimated from typical H.264 behavior on game content):
| Resolution | H×W | Latent dims | \(b\) (bpp) | MP4 MB/min | Latent MB/min | \(R\) | \(t{=}4\) MB/min | \(R_{t=4}\) | \(t{=}8\) MB/min | \(R_{t=8}\) | \(t{=}8\)+fp8 MB/min | \(R_{t=8,\text{fp8}}\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 240p | 240×426 | 8×14 | 0.80 | 21.5 | 15.1 | 0.70 | 3.8 | 0.18 | 1.9 | 0.09 | 0.9 | 0.04 |
| 360p | 360×640 | 12×20 | 0.68 | 41.1 | 32.3 | 0.78 | 8.1 | 0.20 | 4.0 | 0.10 | 2.0 | 0.05 |
| 480p | 480×640 | 15×20 | 0.57 | 46.0 | 40.3 | 0.88 | 10.1 | 0.22 | 5.0 | 0.11 | 2.5 | 0.05 |
| 720p | 720×1280 | 23×40 | 0.40 | 96.8 | 123.6 | 1.28 | 30.9 | 0.32 | 15.5 | 0.16 | 7.7 | 0.08 |
| 1080p | 1080×1920 | 34×60 | 0.30 | 163.3 | 274.2 | 1.68 | 68.5 | 0.42 | 34.3 | 0.21 | 17.1 | 0.10 |
| 1440p | 1440×2560 | 45×80 | 0.22 | 212.9 | 483.8 | 2.27 | 121.0 | 0.57 | 60.5 | 0.28 | 30.2 | 0.14 |
| 4K | 2160×3840 | 68×120 | 0.17 | 370.1 | 1096.7 | 2.96 | 274.2 | 0.74 | 137.1 | 0.37 | 68.5 | 0.19 |
At 720p and above, raw float16 latents are larger than the MP4 recordings. At 4K, latents would be nearly \(3\times\) the size of MP4. The crossover happens because H.264 gets relatively more efficient at higher resolution while latent size scales proportionally.
Setting \(R_{\text{temporal}} < 1\) and solving for \(t\):
\[t > \frac{c \cdot p \cdot 8}{b \cdot s^2}\]
| Resolution | Minimum temporal compression to beat MP4 |
|---|---|
| 240p – 480p | Already smaller (\(R < 1\)) |
| 720p | \(t \geq 1.3\times\) |
| 1080p | \(t \geq 1.7\times\) |
| 1440p | \(t \geq 2.3\times\) |
| 4K | \(t \geq 3.0\times\) |
Note that these thresholds are very modest — real 3D VAEs (e.g., Cosmos CV8x8x8) typically use \(t = 4\) or \(t = 8\). Any temporal compression beyond these thresholds is pure savings.
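The thresholds follow from the inequality above; the `b` values are the table's estimates (only 480p is measured), and the table rounds the thresholds up:

```python
# Minimum temporal compression t to beat MP4: t > c*p*8 / (b * s**2).
c, s, p = 32, 32, 2
b_per_res = {"240p": 0.80, "360p": 0.68, "480p": 0.57,
             "720p": 0.40, "1080p": 0.30, "1440p": 0.22, "4K": 0.17}

for res, b in b_per_res.items():
    t_min = (c * p * 8) / (b * s**2)   # simplifies to 0.5 / b for our VAE
    status = "already < 1" if t_min < 1 else f"t > {t_min:.2f}"
    print(res, status)
```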
Our full PvP dataset at various hypothetical resolutions (~340 hours of video):
| Resolution | MP4 | Latent (fp16) | Latent (\(t{=}8\), fp8) |
|---|---|---|---|
| 240p | 0.44 TB | 0.31 TB | 19 GB |
| 480p | 0.94 TB | 0.82 TB | 51 GB |
| 720p | 1.97 TB | 2.52 TB | 158 GB |
| 1080p | 3.33 TB | 5.59 TB | 350 GB |
| 4K | 7.55 TB | 22.37 TB | 1.40 TB |
The best-case scenario (temporal compression \(t=8\) + fp8 precision) would reduce our current 480p dataset from 820 GB to 51 GB — small enough to fit entirely in RAM during training.
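The dataset projections are straight multiplication (the MB/min figures come from the tables above; the \(t{=}8\) + fp8 configuration is hypothetical):

```python
# Projected storage for ~340 hours of video (both player perspectives).
minutes = 340 * 60
mb_per_min_fp16 = 40.3       # measured: 480p float16 latents
mb_per_min_t8_fp8 = 2.5      # hypothetical: t=8 temporal compression + fp8

tb_fp16 = mb_per_min_fp16 * minutes / 1e6
gb_t8_fp8 = mb_per_min_t8_fp8 * minutes / 1e3
print(f"fp16 latents:      {tb_fp16:.2f} TB")   # ≈ 0.82 TB
print(f"t=8 + fp8 latents: {gb_t8_fp8:.0f} GB") # ≈ 51 GB
```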
We benchmarked several post-hoc compression strategies on our existing latent shards:
| Method | MB/min | % of raw | Lossless? |
|---|---|---|---|
| Raw float16 (baseline) | 40.3 | 100% | yes |
| zstd | 37.1 | 92.1% | yes |
| Delta encoding + zstd | 36.6 | 90.8% | ~yes |
| Delta + int8 + zstd | 12.7 | 31.5% | no (MSE ~100k) |
Lossless compression yields only ~9% savings. The frame-to-frame delta values are still high-entropy float16 numbers that generic compressors struggle with. Lossy int8 quantization achieves large savings (69%) but introduces unacceptable reconstruction error because the quantization must cover a wide dynamic range of delta values.
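For reference, a bit-exact delta scheme can be sketched as follows. It uses `zlib` from the standard library as a stand-in for zstd, and delta-encodes the raw fp16 bit patterns (which, unlike float16 subtraction, is exactly invertible via wraparound arithmetic). The latents here are synthetic random values, so the printed compression ratio is not representative of real data; the point is the exact roundtrip:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 32, 15, 20)).astype(np.float16)  # (T, C, H, W)

# Delta-encode the uint16 bit patterns (wraps mod 2**16, exactly invertible),
# keeping the first frame so reconstruction has an anchor.
bits = latents.view(np.uint16)
payload = np.concatenate([bits[:1], np.diff(bits, axis=0)], axis=0)
compressed = zlib.compress(payload.tobytes(), 9)
print(f"compressed to {len(compressed) / latents.nbytes:.0%} of raw")

# Reconstruct: cumulative sum modulo 2**16 recovers the original bit patterns.
decoded = np.frombuffer(zlib.decompress(compressed), dtype=np.uint16).reshape(bits.shape)
restored = np.cumsum(decoded, axis=0).astype(np.uint16).view(np.float16)
assert np.array_equal(restored, latents)
```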
Per-channel int8 quantization of deltas (clipping outliers at \(n\sigma\)) improves things somewhat but still accumulates drift over thousands of frames through the cumulative sum reconstruction:
| Clip threshold | % of raw | MSE | Max error |
|---|---|---|---|
| \(3\sigma\) | 45.1% | 102.2 | 227.7 |
| \(4\sigma\) | 42.8% | 35.7 | 232.2 |
| \(6\sigma\) | 39.2% | 18.1 | 216.1 |
| \(8\sigma\) | 36.5% | 13.8 | 94.4 |
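A minimal sketch of the per-channel scheme (synthetic data; the function names and scale handling are illustrative, not our benchmark code):

```python
import numpy as np

def quantize_deltas(latents, n_sigma=6.0):
    """Per-channel int8 quantization of frame deltas, clipped at n_sigma."""
    deltas = np.diff(latents.astype(np.float32), axis=0)   # (T-1, C, H, W)
    sigma = deltas.std(axis=(0, 2, 3), keepdims=True)      # per-channel scale
    scale = n_sigma * sigma / 127.0
    q = np.clip(np.round(deltas / scale), -127, 127).astype(np.int8)
    return latents[:1], q, scale

def reconstruct(first, q, scale):
    """Cumulative-sum reconstruction; quantization error accumulates with T."""
    first = first.astype(np.float32)
    deltas = q.astype(np.float32) * scale
    return np.concatenate([first, first + np.cumsum(deltas, axis=0)], axis=0)

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 32, 15, 20)).astype(np.float16)
first, q, scale = quantize_deltas(latents)
restored = reconstruct(first, q, scale)
mse = float(np.mean((restored - latents.astype(np.float32)) ** 2))
print(f"roundtrip MSE: {mse:.4f}")  # grows with sequence length (drift)
```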
A look at common practice confirms this: we found no major project that further compresses pre-encoded latents. Open-Sora, Cosmos, and Stable Video Diffusion all store raw latent tensors. The consensus approach is to eat the disk cost.
The fundamental issue is that a spatial-only image autoencoder and H.264 are solving overlapping problems. Both reduce spatial redundancy — the autoencoder through a learned nonlinear transform, H.264 through block-based DCT and motion compensation. They arrive at remarkably similar compression ratios (~42–48× vs raw) through completely different mechanisms.
The autoencoder's real value is not storage efficiency but representational: the latent space is structured for generation, not just compression. Training a diffusion model on \(15 \times 20 \times 32\) latent frames is vastly cheaper than on \(480 \times 640 \times 3\) pixel frames, regardless of how they're stored on disk.
For future work, the most impactful change would be switching to a 3D VAE with temporal compression (e.g., Cosmos-style \(t = 8\)).
The storage savings are a side benefit — the real win is the \(8\times\) shorter sequences for the video model.
However, full temporal compression introduces a problem for interactive applications like game world models: during inference, we want to generate and decode one frame at a time (e.g., to render the next frame given a player action), but a 3D VAE with \(t = 8\) can only decode in chunks of 8 frames. This latency is unacceptable for real-time interaction.
A practical middle ground would be to use temporal compression for storage but train an additional single-frame decoder that can map each temporally compressed latent frame directly to pixels. The key idea: we don't undo the temporal compression at inference. Instead, we teach a decoder to work with the compressed representation natively.
Concretely, the design keeps the full temporal encoder and decoder, and adds a second decoding path: a single-frame decoder that maps one compressed latent frame directly to pixels.
During dataset creation, we use the full encoder to produce temporally compressed latents for storage, getting the full \(s^2 \cdot t\) compression benefit. The world model is trained on sequences of these compressed latent frames.
During inference, the world model outputs one compressed latent frame per step, and the single-frame decoder maps it directly to pixels. There is no need to buffer \(t\) frames or run the temporal decompressor, so per-step decode latency stays compatible with real-time interaction.
The single-frame decoder would be trained alongside the full autoencoder, with a reconstruction loss that encourages each compressed latent frame to contain enough information to independently decode to a plausible pixel frame. Since each compressed latent already summarizes \(t\) frames of context, this is not as lossy as it might seem — the compressed representation is richer per frame than uncompressed per-frame latents (more temporal context is baked in). The decoded frame would correspond to a representative frame from the \(t\)-frame window (e.g., the middle or last frame).