The Script is All You Need:
An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Chenyu Mu*,1,2 Xin He*,1 Qu Yang*,1 Wanshun Chen1 Jiadi Yao1 Huang Liu1 Zihao Yi1 Bo Zhao1 Xingyu Chen1 Ruotian Ma1 Fanghua Ye1 Erkun Yang2 Cheng Deng2 Zhaopeng Tu†,1 Xiaolong Li1 Linus1

1 Tencent Hunyuan Multimodal Department    2 Xidian University

* Equal Contribution    † Correspondence

Overview

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution.

To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence.

Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models.

Methodology

Framework Pipeline

Figure 1: The proposed agentic framework pipeline consisting of ScripterAgent, DirectorAgent, and CriticAgent.

ScripterAgent

Translates coarse dialogue into fine-grained, structured cinematic scripts. Trained with a two-stage paradigm (SFT + RL) to align with professional directorial standards.
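To make "fine-grained, structured cinematic script" concrete, here is a minimal sketch of what such a script could look like as a data structure. The schema is an assumption for illustration only: the class names `Shot` and `Scene`, every field name, and the `validate` check are hypothetical and are not taken from the framework.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    # All field names are hypothetical; the source does not specify a schema.
    shot_id: int
    start_s: float      # start time within the scene, in seconds
    end_s: float        # end time within the scene, in seconds
    camera: str         # e.g. "medium close-up, slow push-in"
    action: str         # what the characters physically do
    dialogue: str = ""  # line spoken during the shot, if any


@dataclass
class Scene:
    scene_id: int
    setting: str
    shots: List[Shot] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of structural problems (empty if the scene is well-formed)."""
        problems = []
        prev_end = 0.0
        for shot in self.shots:
            if shot.end_s <= shot.start_s:
                problems.append(f"shot {shot.shot_id}: non-positive duration")
            if abs(shot.start_s - prev_end) > 1e-6:
                problems.append(f"shot {shot.shot_id}: gap or overlap at {shot.start_s}s")
            prev_end = shot.end_s
        return problems
```

A script in this form is "executable" in the sense that a downstream component can iterate over shots and read off camera, action, and timing without re-interpreting free text. The `validate` method only checks that shots tile the scene timeline without gaps or overlaps.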

DirectorAgent

Orchestrates video generation models. Uses a Cross-Scene Continuous Generation strategy with frame-anchoring to ensure seamless visual continuity across scenes and overcome temporal incoherence.
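The frame-anchoring idea can be sketched as a simple loop: each scene is generated with the final frame of the previous scene passed in as a visual anchor. This is a minimal sketch under stated assumptions, not the DirectorAgent implementation. The generator signature `(prompt, anchor_frame) -> clip`, the `Frame` stand-in type, and `toy_generator` are all hypothetical; a real pipeline would call an image-conditioned video model here.

```python
from typing import Callable, List, Optional, Sequence

Frame = str        # stand-in for an image; a real pipeline would use arrays or tensors
Clip = List[Frame]

# Hypothetical generator signature: (scene_prompt, anchor_frame) -> clip.
Generator = Callable[[str, Optional[Frame]], Clip]


def generate_film(scene_prompts: Sequence[str], generate_clip: Generator) -> List[Clip]:
    """Generate scenes in order, anchoring each one on the previous scene's last frame."""
    clips: List[Clip] = []
    anchor: Optional[Frame] = None      # the first scene has no anchor
    for prompt in scene_prompts:
        clip = generate_clip(prompt, anchor)
        if not clip:
            raise ValueError(f"generator returned an empty clip for: {prompt!r}")
        clips.append(clip)
        anchor = clip[-1]               # carry the final frame into the next scene
    return clips


def toy_generator(prompt: str, anchor: Optional[Frame]) -> Clip:
    """Toy stand-in for a video model: starts from the anchor frame if one is given."""
    first = anchor if anchor is not None else f"{prompt}#0"
    return [first, f"{prompt}#1", f"{prompt}#2"]
```

With the toy generator, `generate_film(["s1", "s2"], toy_generator)` produces two clips in which the second clip opens on the frame that closes the first, which is the continuity property the strategy is meant to provide.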

CriticAgent

Evaluates the generated film from both technical and cinematic perspectives, ensuring structural validity and semantic fidelity using automated metrics and VSA.
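The page does not define how VSA is computed. As a rough illustration of what a temporal-semantic alignment score could measure, the sketch below compares each shot description with the frames that fall inside that shot's scheduled time window. Everything here is an assumption: the function name `alignment_score`, the use of cosine similarity over precomputed embeddings, and the input layout are hypothetical and should not be read as the paper's metric.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def alignment_score(
    shots: List[Tuple[float, float, Vector]],  # (start_s, end_s, text embedding)
    frames: List[Tuple[float, Vector]],        # (timestamp_s, frame embedding)
) -> float:
    """Mean similarity between each shot description and the frames in its time window."""
    per_shot = []
    for start, end, text_vec in shots:
        sims = [cosine(text_vec, f_vec) for t, f_vec in frames if start <= t < end]
        # Assumption: a shot with no frames in its window counts as fully misaligned.
        per_shot.append(sum(sims) / len(sims) if sims else 0.0)
    return sum(per_shot) / len(per_shot) if per_shot else 0.0
```

The point of windowing by time is that a video can contain all the right content and still score poorly if events occur in the wrong order or at the wrong moment, which a purely global text-video similarity would not detect.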

Experimental Results

Key Findings

  • Superior Script Generation: ScripterAgent significantly outperforms baselines, with expert ratings confirming higher Dramatic Tension (4.1 vs. 3.7 for the strongest baseline, SEED-Story) and Visual Imagery (4.3 vs. 3.8).
  • Universal Video Improvement: Using our generated scripts boosts performance across all tested SOTA models (including Sora2-Pro and Veo3.1), raising the average human-rated Script Faithfulness from 3.8 to 4.2 (+0.4 points).
  • Trade-off Revealed: Analysis uncovers a trade-off between visual spectacle (e.g., Sora2-Pro) and script adherence (e.g., HYVideo1.5).
  • Enhanced Temporal Fidelity: Our new Visual-Script Alignment (VSA) metric confirms that our framework improves temporal-semantic coherence by over 7 points on average.

Script Generation Performance on ScriptBench Test Set

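AI Rating (0-5): Format Comp., Shot Division, Content Comp., Narrative Coher. Human Rating (0-5): Character Consist., Dramatic Tension, Visual Imagery.

| Method | Format Comp. | Shot Division | Content Comp. | Narrative Coher. | Character Consist. | Dramatic Tension | Visual Imagery |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CHAE | 3.3 | 3.2 | 3.4 | 3.5 | 3.1 | 3.3 | 3.4 |
| MoPS | 3.2 | 3.1 | 3.3 | 3.4 | 3.0 | 3.2 | 3.3 |
| SEED-Story | 3.6 | 3.5 | 3.7 | 3.8 | 3.6 | 3.7 | 3.8 |
| ScriptAgent (SFT only) | 3.9 | 3.6 | 3.8 | 3.9 | 3.7 | 3.6 | 3.8 |
| ScriptAgent (Full) | 4.0 | 3.9 | 4.1 | 4.2 | 4.0 | 4.1 | 4.3 |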

Video Generation Evaluation on ScriptBench Test Set

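AI Rating (0-5): Cam. Artic., Body Block., Visual Fid., Emo. Arc, Pace Tim. Human Rating (0-5): Visual App., Script Faith, Char. Const., Phy. Law, Nar. Coher. Overall Mean: Avg. AI, Avg. Human.

Raw Dialogue (w/o ScripterAgent)

| Model | Cam. Artic. | Body Block. | Visual Fid. | Emo. Arc | Pace Tim. | Visual App. | Script Faith | Char. Const. | Phy. Law | Nar. Coher. | Avg. AI | Avg. Human |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vidu2 | 4.1 | 4.1 | 4.5 | 4.4 | 4.4 | 3.7 | 4.1 | 3.0 | 3.3 | 3.1 | 4.3 | 3.4 |
| Seedance1.5-Pro | 4.0 | 4.0 | 4.5 | 4.2 | 4.3 | 3.5 | 3.7 | 3.2 | 3.1 | 3.5 | 4.2 | 3.4 |
| Kling2.6 | 4.1 | 4.1 | 4.6 | 4.4 | 4.4 | 3.6 | 3.5 | 3.3 | 3.4 | 3.7 | 4.3 | 3.5 |
| Wan2.6 | 4.2 | 4.2 | 4.7 | 4.4 | 4.4 | 3.5 | 3.2 | 3.1 | 3.7 | 3.4 | 4.4 | 3.4 |
| HYVideo1.5 | 4.0 | 4.0 | 4.5 | 4.3 | 4.3 | 4.0 | 4.2 | 4.1 | 3.8 | 4.1 | 4.2 | 4.0 |
| Sora2-Pro | 4.1 | 4.0 | 4.6 | 4.3 | 4.3 | 4.2 | 3.6 | 3.7 | 4.1 | 3.9 | 4.3 | 3.9 |
| Veo3.1 | 4.0 | 3.9 | 4.4 | 4.4 | 4.3 | 3.9 | 4.0 | 4.1 | 3.9 | 4.0 | 4.2 | 4.0 |
| Average | 4.1 | 4.0 | 4.5 | 4.3 | 4.3 | 3.8 | 3.8 | 3.5 | 3.6 | 3.7 | 4.2 | 3.7 |

w/ ScripterAgent (Ours)

| Model | Cam. Artic. | Body Block. | Visual Fid. | Emo. Arc | Pace Tim. | Visual App. | Script Faith | Char. Const. | Phy. Law | Nar. Coher. | Avg. AI | Avg. Human |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vidu2 | 4.2 | 4.4 | 4.7 | 4.5 | 4.5 | 3.9 | 4.3 | 3.7 | 3.9 | 3.8 | 4.5 | 3.9 |
| Seedance1.5-Pro | 4.5 | 4.6 | 4.7 | 4.6 | 4.7 | 4.0 | 4.1 | 4.1 | 3.9 | 4.1 | 4.6 | 4.0 |
| Kling2.6 | 4.3 | 4.5 | 4.6 | 4.5 | 4.6 | 3.9 | 4.1 | 4.0 | 4.2 | 4.1 | 4.5 | 4.1 |
| Wan2.6 | 4.4 | 4.6 | 4.7 | 4.6 | 4.7 | 4.1 | 4.0 | 3.8 | 4.0 | 3.9 | 4.6 | 4.0 |
| HYVideo1.5 | 4.4 | 4.5 | 4.8 | 4.5 | 4.7 | 4.5 | 4.6 | 4.4 | 4.2 | 4.3 | 4.6 | 4.4 |
| Sora2-Pro | 4.1 | 4.4 | 4.7 | 4.5 | 4.6 | 4.8 | 4.2 | 4.3 | 4.5 | 4.1 | 4.5 | 4.4 |
| Veo3.1 | 4.1 | 4.4 | 4.5 | 4.6 | 4.4 | 4.6 | 4.4 | 4.3 | 4.4 | 4.2 | 4.4 | 4.4 |
| Average | 4.3 | 4.5 | 4.7 | 4.5 | 4.6 | 4.3 | 4.2 | 4.1 | 4.2 | 4.1 | 4.5 | 4.2 |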
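The per-model effect of the generated scripts can be read directly off the table. As a small worked example, the snippet below recomputes the change in human-rated Script Faith for each model. The scores are transcribed from the table above; the variable names are only for illustration.

```python
# Human-rated Script Faith scores transcribed from the table above:
# (raw dialogue, with ScripterAgent).
script_faith = {
    "Vidu2":           (4.1, 4.3),
    "Seedance1.5-Pro": (3.7, 4.1),
    "Kling2.6":        (3.5, 4.1),
    "Wan2.6":          (3.2, 4.0),
    "HYVideo1.5":      (4.2, 4.6),
    "Sora2-Pro":       (3.6, 4.2),
    "Veo3.1":          (4.0, 4.4),
}

# Round to one decimal to match the table's precision and avoid float noise.
gains = {m: round(after - before, 1) for m, (before, after) in script_faith.items()}
mean_gain = sum(gains.values()) / len(gains)

# Print models from largest to smallest gain.
for model, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model:<16} +{gain:.1f}")
print(f"mean gain: +{mean_gain:.2f}")
```

Per-model gains range from +0.2 (Vidu2) to +0.8 (Wan2.6). Averaging the rounded per-model gains gives roughly +0.49, while the table's Average row moves from 3.8 to 4.2 (+0.4); the small difference comes from rounding in the published figures.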

Video Generation Demos

Comparing our agentic framework across state-of-the-art video generation models.