How do AI models generate videos?
Summary
The article explains the current technical stack behind AI video generation, centred on latent diffusion transformers. It breaks down diffusion models (which learn to reverse noise to create images), latent diffusion (working in a compressed "latent" space to cut compute), and the role of transformers in keeping frames temporally consistent. Recent models such as Veo 3 also generate synchronized audio and video by compressing both modalities together. The piece highlights practical results, energy costs, and training-data bias.
Key Points
- Modern video generators use latent diffusion transformers: diffusion models + transformers working in a compressed latent space.
- Diffusion models learn to reverse noise; guided by a text-conditioning model, they turn random static into images that match a prompt.
- Latent diffusion compresses frames into a smaller representation, making video generation far more compute-efficient than raw-pixel diffusion.
- Transformers provide temporal consistency by treating spatio-temporal chunks of video as sequences, preventing objects from popping between frames.
- Veo 3 and similar models produce audio and video together by compressing them into a single diffusion-ready representation, enabling synced soundtracks.
- Training depends on huge scraped datasets, which introduces bias and content issues, and video generation still consumes substantially more energy than image or text generation.
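The "learn to reverse noise" idea in the bullets above can be sketched as a toy forward (noising) process. Everything here is an illustrative assumption rather than a detail from the article: the step count, the linear beta schedule, and the single scalar standing in for a frame.

```python
import math
import random

# Toy DDPM-style diffusion on a single value, to show the idea behind
# "learn to reverse noise". Schedule and step count are assumptions.
NUM_STEPS = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]

# alpha_bars[t] = product of (1 - beta) up to t: how much clean signal survives.
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def add_noise(x0: float, t: int, rng: random.Random) -> tuple[float, float]:
    """Jump straight from the clean value x0 to its noised version at step t."""
    eps = rng.gauss(0.0, 1.0)
    xt = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = random.Random(0)
xt, eps = add_noise(1.0, NUM_STEPS - 1, rng)
# By the final step almost no signal survives: xt is close to pure noise.
# Training teaches a network to predict eps, so generation can run this
# corruption process in reverse, step by step, from random static.
print(f"signal kept at final step: {alpha_bars[-1]:.5f}")
```

The one-shot jump to step `t` works because adding many small Gaussian noises is equivalent to adding one appropriately scaled Gaussian, which is what makes training on random timesteps cheap.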
Content summary
The article walks a non-specialist reader through how diffusion models work: during training, an image (or frame) is progressively corrupted with noise, and the model learns to reverse that corruption. For video, diffusion must operate over sequences of frames. To tame the astronomical compute cost, models usually work in latent space, where frames are compressed into codes; after generation, those codes are decoded back into watchable video. Transformers then model long-range dependencies across those codes so motion, lighting and objects stay coherent across frames. A recent milestone is generating audio together with video, achieved by compressing both jointly so sound and images are produced in lockstep. The piece also flags ecological and ethical issues: high energy use and biased training data.
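The encode, generate, decode pipeline described above can be sketched with stand-ins for the learned pieces. The 4x-per-axis compression factor, the average-pooling "encoder", and the nearest-neighbour "decoder" are all hypothetical placeholders for a trained autoencoder; the point is only to show why generating in latent space is so much cheaper than in pixels.

```python
# Minimal latent-pipeline sketch: compress a frame into far fewer codes,
# then expand back to pixels. FACTOR and both transforms are hypothetical.
FACTOR = 4  # assumed spatial compression per axis

def encode(frame: list[list[float]]) -> list[list[float]]:
    """Average-pool each FACTOR x FACTOR patch into one latent code."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[y + dy][x + dx] for dy in range(FACTOR) for dx in range(FACTOR))
            / (FACTOR * FACTOR)
            for x in range(0, w, FACTOR)
        ]
        for y in range(0, h, FACTOR)
    ]

def decode(latent: list[list[float]]) -> list[list[float]]:
    """Nearest-neighbour upsample the codes back to pixel resolution."""
    return [
        [latent[y // FACTOR][x // FACTOR] for x in range(len(latent[0]) * FACTOR)]
        for y in range(len(latent) * FACTOR)
    ]

frame = [[float((x + y) % 7) for x in range(64)] for y in range(64)]
latent = encode(frame)
restored = decode(latent)

pixels = 64 * 64
codes = len(latent) * len(latent[0])
print(f"pixels per frame: {pixels}, latent codes: {codes} ({pixels // codes}x fewer)")
```

In a real system the diffusion model runs on `latent` (and on many frames at once), and a learned decoder replaces the crude upsample; the 16x reduction in values per frame is what makes video-scale diffusion tractable.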
Context and relevance
This explainer sits at the intersection of generative-AI progress and practical concerns: creative possibilities (apps now let casual users make videos), technical innovation (latent diffusion + transformers), and societal worries (deepfakes, dataset provenance, and energy footprint). It helps readers understand why demo clips sometimes look so good, why outputs can still be hit-or-miss, and why generating video costs so much more than generating text or images.
Why should I read this?
Quick answer: if you watch, make or worry about AI video (deepfakes, ads, social media), this gives you the nuts-and-bolts without drowning in jargon. It tells you what’s changed recently, why results are better, why models still trip up, and why the carbon bill matters — all in plain terms. Saves you the time of slogging through papers.
Author style
Punchy: clear, to-the-point explanations that cut through hype. The article is useful for anyone who wants to grasp the tech behind the clips you see online and why the field is moving so quickly — and why it still has major downsides.
Source
Source: https://www.technologyreview.com/2025/09/12/1123562/how-do-ai-models-generate-videos/