Yale Song1, Yiwen Song1, Nick Losier1, Nathan Hodson1, Ye Jin1, Rhyard Zhu1,
Yan Xu1, Daniel Vlasic1, Carina Claassen1, Jasmine Leon1, Khanh G. LeViet1,
Zack Chomyn1, Joe Timmons1, Brett Slatkin1, Scott Penberthy1, and Tomas Pfister1
Co-Director generates long-form video in a variety of cinematic styles by exploring diverse creative directions while maintaining rigorous semantic coherence.
Note: All variations share the exact same input prompt. Co-Director autonomously "re-shoots" the narrative by navigating the different creative configurations shown above.
Most generative video tools are great at creating a single "cool shot," but they struggle to tell a consistent story. Small errors in the script often cascade into "identity drift" or broken logic by the final frame.

At Google, we wanted to move beyond the linear "waterfall" approach to video generation. We are excited to share Co-Director, a multi-agent framework that formalizes video storytelling as a global optimization problem. Instead of just stringing clips together, it functions like a professional film crew to ensure visual and narrative coherence from start to finish.

What’s happening under the hood?

- **The Orchestrator:** Uses multi-armed bandit logic to steer the creative direction, balancing the exploration of bold new narrative ideas against the exploitation of configurations it knows will land with the target audience.
- **Hierarchical Factoring:** We don't just give the AI a generic prompt. We disentangle the "vibe" into three axes: Creative Strategy (the message), Narrative Mode (the delivery), and Aesthetic Archetype (the visual look).
- **Local Self-Refinement:** A multimodal feedback loop acts as a built-in auditor, catching and correcting inconsistencies (like a character’s hair color changing or a product teleporting between scenes) before the final render.

To test this, we developed GenAD-Bench: a rigorous new benchmark featuring 400 scenarios across 200 fictional products. By mastering the strict constraints of advertising (brand fidelity, demographics, and tight runtimes), Co-Director proves it can handle almost any cinematic storytelling task.
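The Orchestrator's explore/exploit loop can be illustrated with a classic UCB1 bandit over a handful of creative configurations. This is a minimal, hypothetical sketch rather than the paper's implementation: the arm names and the simulated quality scores below are stand-ins for real MLLM-as-a-Judge feedback.

```python
import math
import random

# Hypothetical arms: candidate creative configurations the
# Orchestrator can choose between (names are illustrative).
ARMS = ["pop-look", "movie-look", "gallery-look", "raw-look"]

counts = {a: 0 for a in ARMS}    # times each arm was tried
values = {a: 0.0 for a in ARMS}  # running mean reward per arm

def select_arm(t):
    # Try every arm once, then pick the highest UCB1 score:
    # mean reward plus an exploration bonus that shrinks as
    # an arm accumulates pulls.
    for a in ARMS:
        if counts[a] == 0:
            return a
    return max(ARMS, key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

def update(arm, reward):
    # Incremental update of the arm's running mean reward.
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Simulated per-arm quality, standing in for judge feedback.
true_quality = {"pop-look": 0.30, "movie-look": 0.90,
                "gallery-look": 0.20, "raw-look": 0.25}

random.seed(0)
for t in range(1, 501):
    arm = select_arm(t)
    update(arm, random.gauss(true_quality[arm], 0.05))

best_arm = max(counts, key=counts.get)  # exploitation concentrates here
```

After a few hundred rounds the pull counts concentrate on the strongest configuration while every arm still receives some exploration, which is exactly the explore/exploit trade-off the Orchestrator relies on.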
While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD-Bench, a 400-scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co-Director significantly outperforms state-of-the-art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives.
Figure 1: Co-Director Multi-Agent Pipeline Overview
We demonstrate Co-Director on digital advertising because it represents a rigorous "creative stress test" for generative AI. While abstract or experimental cinema often allows for loose visual interpretation, professional advertising requires high-fidelity creativity bound by absolute precision: brand identities must remain inviolable, product features must be visually consistent across disparate scenes, and the artistic direction must align with specific demographic expectations—all within a compressed narrative window.
BarBaz-Scrub
FooTrent-Case
Wandom-Yarn
FredBaz-Jam
FredBaz-Mix
Grault-Gown
EveFoo-Scoot
MaloyBaz-Mask
Xyzzy-Bear
GenAD-Bench is a rigorous benchmark designed to evaluate end-to-end generative workflows. It features 400 unique scenarios across 200 fictional products and 50 brands. By utilizing fictional entities, the benchmark prevents models from defaulting to memorized training priors, ensuring performance reflects true reasoning.
We evaluate Co-Director against monolithic models and prior agentic frameworks (AniMaker, MovieAgent) using a multi-dimensional MLLM-as-a-Judge suite. Our multi-armed bandit (MAB) formulation converges efficiently toward high-scoring creative configurations.
This table compares Co-Director against industry-leading models under automated MLLM-as-a-Judge metrics. Our method consistently scores higher on preserving brand identity and on producing videos that resonate with the specified target audience.
| Method | VAF ↑ | DA ↑ | MA ↑ | VQ ↑ | Avg. ↑ |
|---|---|---|---|---|---|
| Proprietary Models | |||||
| Creatify | 23.2 | 16.2 | 19.5 | 29.5 | 22.1 |
| HeyGen | 42.9 | 59.3 | 39.5 | 45.0 | 46.7 |
| Kling 3.0 Omni | 62.0 | 70.3 | 56.0 | 45.3 | 58.4 |
| Veo 3.1 | 60.0 | 80.8 | 63.2 | 50.5 | 63.6 |
| Wan 2.6 | 67.0 | 71.5 | 62.5 | 58.9 | 65.0 |
| Open-Source Models | |||||
| LTX-2.3 | 23.5 | 56.0 | 30.6 | 24.4 | 33.6 |
| AniMaker | 53.1 | 81.3 | 60.3 | 53.9 | 62.2 |
| MovieAgent | 61.2 | 81.3 | 66.4 | 52.4 | 65.3 |
| Base Agentic Pipeline (T=1) | 68.5 | 78.4 | 67.1 | 59.9 | 68.5 |
| Random Search Baseline (T=4) | 77.0 | 85.8 | 75.6 | 64.1 | 75.7 |
| Co-Director (T=4) | 82.1 | 91.4 | 82.0 | 70.2 | 81.4 |
All metrics are scaled [0, 100]. VAF: Visual Asset Fidelity, DA: Demographic Alignment, MA: Marketing Appeal, VQ: Visual Quality.
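The Avg. column in the table above appears to be the unweighted mean of the four dimension scores; as a quick check, here it is recomputed for the Co-Director (T=4) row:

```python
# Co-Director (T=4) row from the table above.
scores = {"VAF": 82.1, "DA": 91.4, "MA": 82.0, "VQ": 70.2}

# Unweighted mean across the four dimensions, rounded to one
# decimal place as in the table.
avg = round(sum(scores.values()) / len(scores), 1)
```

The same computation reproduces the Avg. column for the other rows as well (e.g. Veo 3.1: (60.0 + 80.8 + 63.2 + 50.5) / 4 = 63.6).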
To confirm that the automated scores track human perception, we asked human judges to rate each video on a 1-to-5 scale. The results show that raters consistently prefer Co-Director's outputs over the baselines.
| Method | VAF | DA | MA | VQ | Avg. |
|---|---|---|---|---|---|
| AniMaker | 3.34 | 3.71 | 2.72 | 2.50 | 3.07 |
| MovieAgent | 3.64 | 4.06 | 2.85 | 2.31 | 3.22 |
| Wan 2.6 | 3.84 | 3.65 | 3.13 | 3.29 | 3.48 |
| Kling 3.0 Omni | 3.93 | 4.13 | 3.07 | 3.21 | 3.59 |
| Veo 3.1 | 3.90 | 4.20 | 3.40 | 3.32 | 3.71 |
| Co-Director (Ours) | 4.22 | 4.41 | 3.65 | 3.58 | 3.96 |
Average of 5 human raters per video on a [1–5] scale.
Co-Director's Multi-Armed Bandit navigates a factored creative action space, selecting Aesthetic Archetypes that dictate the sensory realization of the video. This allows the system to "re-shoot" the same product prompt to match different moods and strategic intents.
"The Pop Look": High-key lighting with no dark shadows; clean, vibrant, and exciting.
"The Movie Look": Dramatic chiaroscuro lighting and deep shadows; expensive and legendary.
"The Gallery Look": Macro-zooms and clean backgrounds to highlight tiny textures in a museum-like setting.
"The Raw Look": Dark, moody lighting with handheld or drone motion; authentic and intense.
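The factored action space (Creative Strategy × Narrative Mode × Aesthetic Archetype) can be sketched as a Cartesian product of discrete axes, with each combination forming one bandit arm. The specific axis values below are hypothetical placeholders; only the three-axis factoring itself comes from Co-Director's design.

```python
from itertools import product

# Hypothetical taxonomy: the three axes mirror Co-Director's
# factoring, but these particular values are illustrative.
CREATIVE_STRATEGY = ["emotional-appeal", "feature-showcase", "lifestyle"]
NARRATIVE_MODE = ["testimonial", "day-in-the-life", "problem-solution"]
AESTHETIC_ARCHETYPE = ["pop-look", "movie-look", "gallery-look", "raw-look"]

# Every (strategy, mode, archetype) triple is one creative
# configuration the bandit can select; factoring keeps the
# space small and fully enumerable.
configurations = list(product(CREATIVE_STRATEGY, NARRATIVE_MODE, AESTHETIC_ARCHETYPE))
num_arms = len(configurations)  # 3 * 3 * 4 = 36 arms
```

Factoring matters because the bandit only needs statistics over a few dozen structured arms rather than an open-ended prompt space, which is what makes global exploration tractable.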
Example: Grault-Heel by Grault Design (Stereotypical)
Example: Con-Kibble by Consectetur Co. (Unconventional)
To demonstrate generality beyond advertising, we evaluate Co-Director on ViStoryBench. Our framework maintains robust temporal consistency and character identity across long-horizon sequences, even when constrained to rigid, pre-defined narrative structures.
@misc{song2026codirectoragenticgenerativevideo,
title={Co-Director: Agentic Generative Video Storytelling},
author={Yale Song and Yiwen Song and Nick Losier and Nathan Hodson and Ye Jin and Rhyard Zhu and Yan Xu and Daniel Vlasic and Carina Claassen and Jasmine Leon and Khanh G. LeViet and Zack Chomyn and Joe Timmons and Brett Slatkin and Scott Penberthy and Tomas Pfister},
year={2026},
eprint={2604.24842},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.24842},
}