
Co-Director: Agentic Generative Video Storytelling

Yale Song¹, Yiwen Song¹, Nick Losier¹, Nathan Hodson¹, Ye Jin¹, Rhyard Zhu¹, Yan Xu¹, Daniel Vlasic¹, Carina Claassen¹, Jasmine Leon¹, Khanh G. LeViet¹, Zack Chomyn¹, Joe Timmons¹, Brett Slatkin¹, Scott Penberthy¹, and Tomas Pfister¹

¹Google Inc.
Paper | Code | Data (Coming Soon)

Co-Director generates long-form video in a variety of cinematic styles by exploring diverse creative directions while maintaining rigorous semantic coherence.

Creative Strategy: Informational | Narrative Mode: Analytical | Aesthetic Archetype: Minimalist Focus

Note: All variations share the exact same input prompt. Co-Director autonomously "re-shoots" the narrative by navigating the different creative configurations shown above.

Input Prompt

System Text Input

Most generative video tools are great at creating a single "cool shot," but they struggle to tell a consistent story. Small errors in the script often cascade into "identity drift" or broken logic by the final frame. At Google, we wanted to move beyond the linear "waterfall" approach to video generation. We are excited to share Co-Director, a multi-agent framework that formalizes video storytelling as a global optimization problem. Instead of just stringing clips together, it functions like a professional film crew to ensure visual and narrative coherence from start to finish. What’s happening under the hood? The Orchestrator: Uses "Multi-Armed Bandit" logic to steer the creative direction. It balances exploring bold new narrative ideas with "exploiting" configurations that it knows will land with the target audience. Hierarchical Factoring: We don't just give the AI a generic prompt. We disentangle the "vibe" into three axes: Creative Strategy (the message), Narrative Mode (the delivery), and Aesthetic Archetype (the visual look). Local Self-Refinement: A multimodal feedback loop acts as a built-in auditor, catching and correcting inconsistencies—like a character’s hair color changing or a product teleporting between scenes—before the final render. To test this, we developed GenAD-Bench: a rigorous new benchmark featuring 400 scenarios across 200 fictional products. By mastering the strict constraints of advertising—brand fidelity, demographics, and tight runtimes—Co-Director proves it can handle almost any cinematic storytelling task.

Reference Visuals

Co-Director Logo
System Pipeline

Abstract

While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD-Bench, a 400-scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co-Director significantly outperforms state-of-the-art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives.
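The local self-refinement loop mentioned in the abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: `Shot`, `refine`, `toy_critic`, and the string-based prompt patching are hypothetical, and the source describes a multimodal feedback loop rather than the text check used here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Shot:
    """Hypothetical stand-in for one generated scene."""
    prompt: str
    notes: List[str] = field(default_factory=list)  # corrections applied so far


def refine(shot: Shot, critique: Callable[[Shot], List[str]], max_iters: int = 3) -> Shot:
    """Re-prompt a shot until the critic reports no issues or the budget runs out."""
    for _ in range(max_iters):
        issues = critique(shot)
        if not issues:
            break
        # Fold each reported issue back into the prompt as an explicit constraint.
        shot = Shot(
            prompt=shot.prompt + " | fix: " + "; ".join(issues),
            notes=shot.notes + issues,
        )
    return shot


def toy_critic(shot: Shot) -> List[str]:
    """Toy critic (assumption): flags a shot whose prompt does not pin the hair color."""
    if "hair color" in shot.prompt:
        return []
    return ["keep hair color red"]


refined = refine(Shot(prompt="Scene 2: the character opens the box"), toy_critic)
print(refined.prompt)
```

The loop stops as soon as the critic is satisfied, so a clean shot costs a single critique call; `max_iters` bounds the cost when the critic keeps finding problems.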

Co-Director Multi-Agent Architecture

Figure 1: Co-Director multi-agent pipeline overview.

Qualitative Results

Application: Generative Video Advertising

We demonstrate Co-Director on digital advertising because it represents a rigorous "creative stress test" for generative AI. While abstract or experimental cinema often allows for loose visual interpretation, professional advertising requires high-fidelity creativity bound by absolute precision: brand identities must remain inviolable, product features must be visually consistent across disparate scenes, and the artistic direction must align with specific demographic expectations—all within a compressed narrative window.


BarBaz-Scrub

FooTrent-Case

Wandom-Yarn

FredBaz-Jam

FredBaz-Mix

Grault-Gown

EveFoo-Scoot

MaloyBaz-Mask

Xyzzy-Bear

GenAD-Bench

GenAD-Bench Mosaic

GenAD-Bench is a benchmark for evaluating end-to-end generative video workflows. It features 400 unique scenarios across 200 fictional products and 50 brands. Because the entities are fictional, models cannot fall back on memorized training priors, so performance reflects reasoning over the provided inputs.

Evaluation on GenAD-Bench

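We evaluate Co-Director against monolithic models and prior agentic frameworks (AniMaker, MovieAgent) using an MLLM-as-a-Judge suite that scores the four dimensions listed below. The multi-armed bandit (MAB) formulation is intended to converge efficiently toward high-scoring configurations: at the same budget of T=4, Co-Director outperforms the random search baseline (81.4 vs. 75.7 average).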
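To make the explore/exploit idea concrete, here is a minimal bandit sketch. It uses UCB1, a standard MAB algorithm chosen purely for illustration; the source does not state which bandit algorithm Co-Director uses. The arm names, `toy_judge`, and the reward values are hypothetical stand-ins for creative configurations and judge scores.

```python
import math


class UCB1:
    """Minimal UCB1 bandit over a fixed list of arms (illustrative only)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)   # pulls per arm
        self.means = [0.0] * len(self.arms)  # running mean reward per arm

    def select(self):
        # Pull every arm once before applying the confidence bound.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


def run_bandit(arms, reward_fn, rounds):
    """Run UCB1 for a fixed number of rounds; return the best arm and pull counts."""
    bandit = UCB1(arms)
    for _ in range(rounds):
        i = bandit.select()
        bandit.update(i, reward_fn(bandit.arms[i]))
    best = max(range(len(bandit.arms)), key=bandit.means.__getitem__)
    return bandit.arms[best], bandit.counts


# Hypothetical judge scores in [0, 1] for three made-up configurations.
TOY_SCORES = {"config_a": 0.2, "config_b": 0.9, "config_c": 0.5}


def toy_judge(arm):
    return TOY_SCORES[arm]


best_arm, pulls = run_bandit(list(TOY_SCORES), toy_judge, rounds=30)
print(best_arm, pulls)
```

The square-root bonus shrinks as an arm is pulled more often, so rarely tried configurations still get sampled while the budget shifts toward those with higher observed scores.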

VAF: Visual Asset Fidelity
DA: Demographic Alignment
MA: Marketing Appeal
VQ: Visual Quality

Table 1: Evaluation on GenAD-Bench

Table 1 compares Co-Director with proprietary and open-source baselines on the automated MLLM-as-a-Judge metrics. Co-Director (T=4) achieves the highest score on every metric, including Visual Asset Fidelity (preservation of brand identity) and Demographic Alignment (fit to the target audience).

Method                          VAF ↑   DA ↑   MA ↑   VQ ↑  Avg. ↑

Proprietary Models
Creatify                         23.2   16.2   19.5   29.5    22.1
HeyGen                           42.9   59.3   39.5   45.0    46.7
Kling 3.0 Omni                   62.0   70.3   56.0   45.3    58.4
Veo 3.1                          60.0   80.8   63.2   50.5    63.6
Wan 2.6                          67.0   71.5   62.5   58.9    65.0

Open-Source Models
LTX-2.3                          23.5   56.0   30.6   24.4    33.6
AniMaker                         53.1   81.3   60.3   53.9    62.2
MovieAgent                       61.2   81.3   66.4   52.4    65.3
Base Agentic Pipeline (T=1)      68.5   78.4   67.1   59.9    68.5
Random Search Baseline (T=4)     77.0   85.8   75.6   64.1    75.7
Co-Director (T=4)                82.1   91.4   82.0   70.2    81.4

All metrics are scaled [0, 100]. VAF: Visual Asset Fidelity, DA: Demographic Alignment, MA: Marketing Appeal, VQ: Visual Quality.
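As a quick arithmetic check, the Avg. column can be approximated from the four per-metric scores. This sketch assumes Avg. is the unweighted mean of VAF, DA, MA, and VQ, which the source does not state explicitly; `unweighted_mean` and `rows` are names introduced here.

```python
def unweighted_mean(scores):
    """Arithmetic mean of the per-metric scores, each on a [0, 100] scale."""
    return sum(scores) / len(scores)


# Per-metric scores (VAF, DA, MA, VQ) copied from Table 1.
rows = {
    "MovieAgent": (61.2, 81.3, 66.4, 52.4),
    "Co-Director (T=4)": (82.1, 91.4, 82.0, 70.2),
}

averages = {name: unweighted_mean(vals) for name, vals in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")
```

Under this assumption both rows reproduce the reported averages to one decimal place. Not every row matches exactly: the Random Search Baseline scores average to 75.625 against a reported 75.7, which may reflect averaging of unrounded per-metric scores.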

Mean Opinion Scores (MOS)

To check that the automated scores agree with human perception, we asked human raters to score the videos on a scale of 1 to 5 along the same four dimensions. Co-Director receives the highest mean score on every dimension.

Method                 VAF     DA     MA     VQ   Avg.
AniMaker              3.34   3.71   2.72   2.50   3.07
MovieAgent            3.64   4.06   2.85   2.31   3.22
Wan 2.6               3.84   3.65   3.13   3.29   3.48
Kling 3.0 Omni        3.93   4.13   3.07   3.21   3.59
Veo 3.1               3.90   4.20   3.40   3.32   3.71
Co-Director (Ours)    4.22   4.41   3.65   3.58   3.96

Average of 5 human raters per video on a [1–5] scale.

Digital Directing: Controlling the "Look and Feel"

Co-Director's Multi-Armed Bandit navigates a factored creative action space, selecting Aesthetic Archetypes that dictate the sensory realization of the video. This allows the system to "re-shoot" the same product prompt to match different moods and strategic intents.

Clarity / Energy

"The Pop Look": High-key lighting with no dark shadows; clean, vibrant, and exciting.

Cinematic Premium

"The Movie Look": Dramatic chiaroscuro lighting and deep shadows; expensive and legendary.

Minimalist Focus

"The Gallery Look": Macro-zooms and clean backgrounds to highlight tiny textures in a museum-like setting.

Kinetic Grit

"The Raw Look": Dark, moody lighting with handheld or drone motion; authentic and intense.

Grault-Heel Aesthetic Archetypes

Example: Grault-Heel by Grault Design (Stereotypical)
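The factored creative action space can be sketched as a Cartesian product over the three axes, with each combination acting as one bandit arm. This is a minimal illustration, not the system's actual representation: `CreativeConfig`, `enumerate_configs`, and `direct` are hypothetical names, and only "Informational", "Analytical", and the four aesthetic archetypes come from this page; the remaining axis values are placeholders.

```python
from itertools import product
from typing import NamedTuple


class CreativeConfig(NamedTuple):
    creative_strategy: str
    narrative_mode: str
    aesthetic_archetype: str


# Only the first entry on each of these two axes appears in the source;
# the second entries are hypothetical placeholders.
CREATIVE_STRATEGIES = ["Informational", "<hypothetical strategy>"]
NARRATIVE_MODES = ["Analytical", "<hypothetical mode>"]
# The four archetypes listed in this section.
AESTHETIC_ARCHETYPES = [
    "Clarity / Energy",
    "Cinematic Premium",
    "Minimalist Focus",
    "Kinetic Grit",
]


def enumerate_configs():
    """Every combination of the three axes, one candidate arm per combination."""
    axes = (CREATIVE_STRATEGIES, NARRATIVE_MODES, AESTHETIC_ARCHETYPES)
    return [CreativeConfig(*combo) for combo in product(*axes)]


def direct(prompt, config):
    """Attach a configuration to an unchanged product prompt (a "re-shoot")."""
    return {"prompt": prompt, **config._asdict()}


configs = enumerate_configs()
brief = direct("Grault-Heel by Grault Design", configs[0])
print(len(configs), brief["aesthetic_archetype"])
```

Keeping the prompt fixed and varying only the configuration mirrors the "re-shoot" behaviour described above: the same product brief is paired with a different point in the factored space each time.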

Generalization: ViStoryBench

To demonstrate generality beyond advertising, we evaluate Co-Director on ViStoryBench. Our framework maintains robust temporal consistency and character identity across long-horizon sequences, even when constrained to rigid, pre-defined narrative structures.

Action

Plot

BibTeX

@misc{song2026codirectoragenticgenerativevideo,
      title={Co-Director: Agentic Generative Video Storytelling}, 
      author={Yale Song and Yiwen Song and Nick Losier and Nathan Hodson and Ye Jin and Rhyard Zhu and Yan Xu and Daniel Vlasic and Carina Claassen and Jasmine Leon and Khanh G. LeViet and Zack Chomyn and Joe Timmons and Brett Slatkin and Scott Penberthy and Tomas Pfister},
      year={2026},
      eprint={2604.24842},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.24842}, 
}