The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Chenyu Mu^*,1,2 Xin He^*,1 Qu Yang^*,1 Wanshun Chen¹ Jiadi Yao¹ Huang Liu¹ Zihao Yi¹ Bo Zhao¹ Xingyu Chen¹ Ruotian Ma¹ Fanghua Ye¹ Erkun Yang² Cheng Deng² Zhaopeng Tu^†,1 Xiaolong Li¹ Linus¹

¹ Tencent Hunyuan Multimodal Department ² Xidian University

* Equal Contribution † Correspondence

Overview

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution.

To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence.

Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows that our framework significantly improves script faithfulness and temporal fidelity across all tested video models.

Methodology

Figure 1: The proposed agentic framework pipeline consisting of ScripterAgent, DirectorAgent, and CriticAgent.

ScripterAgent

Translates coarse dialogue into fine-grained, structured cinematic scripts. Trained with a two-stage paradigm (SFT + RL) to align with professional directorial standards.
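To make "fine-grained, structured cinematic script" concrete, the sketch below shows one possible shape for such a script, plus a simple structural check. The field names (`shot_id`, `camera`, `blocking`, `dialogue`, `duration_s`) and the `validate_script` helper are illustrative assumptions, not the schema ScripterAgent actually emits.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    # All fields are hypothetical; the real ScripterAgent schema is not given here.
    shot_id: int
    camera: str        # e.g. "medium close-up, slow push-in"
    blocking: str      # where the characters are and how they move
    dialogue: str      # the line(s) spoken during this shot
    duration_s: float  # intended shot length in seconds


@dataclass
class Script:
    title: str
    shots: List[Shot] = field(default_factory=list)


def validate_script(script: Script) -> List[str]:
    """Return a list of structural problems; an empty list means no problems were found."""
    problems = []
    for expected_id, shot in enumerate(script.shots, start=1):
        if shot.shot_id != expected_id:
            problems.append(f"shot {shot.shot_id}: expected id {expected_id}")
        if shot.duration_s <= 0:
            problems.append(f"shot {shot.shot_id}: non-positive duration")
        if not shot.camera.strip():
            problems.append(f"shot {shot.shot_id}: missing camera instruction")
    return problems
```

A check of this kind is one way a downstream component could reject malformed scripts before any video is generated.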

DirectorAgent

Orchestrates video generation models. Uses a Cross-Scene Continuous Generation strategy with frame-anchoring to ensure seamless visual continuity across scenes and overcome temporal incoherence.
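The frame-anchoring idea can be sketched as a loop in which each clip is conditioned on a frame from the clip before it. This is a minimal sketch under stated assumptions: using the last frame of the previous clip as the anchor is our guess at the mechanism, and `generate_film`, `toy_generator`, and the `Frame`/`Clip` types are hypothetical stand-ins rather than the DirectorAgent implementation or any real video model API.

```python
from typing import Callable, List, Optional, Sequence

Frame = str         # stand-in for an image
Clip = List[Frame]  # stand-in for a generated video clip


def generate_film(
    shot_prompts: Sequence[str],
    generate_clip: Callable[[str, Optional[Frame]], Clip],
) -> List[Clip]:
    """Generate clips in order, anchoring each on the last frame of the previous clip."""
    clips: List[Clip] = []
    anchor: Optional[Frame] = None  # the first shot has nothing to anchor on
    for prompt in shot_prompts:
        clip = generate_clip(prompt, anchor)
        if not clip:
            raise ValueError(f"generator returned an empty clip for prompt: {prompt!r}")
        clips.append(clip)
        anchor = clip[-1]  # assumed anchoring rule: carry the final frame forward
    return clips


def toy_generator(prompt: str, anchor: Optional[Frame]) -> Clip:
    """Toy stand-in for a video model: starts from the anchor frame when one is given."""
    first = anchor if anchor is not None else f"{prompt}#0"
    return [first, f"{prompt}#1", f"{prompt}#2"]


clips = generate_film(["kitchen", "hallway"], toy_generator)
```

With the toy generator, the second clip opens on the exact frame the first clip ended on, which is the continuity property the strategy aims for.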

CriticAgent

Evaluates the generated film from both technical and cinematic perspectives, ensuring structural validity and semantic fidelity using automated metrics and VSA.

Experimental Results

Key Findings

Superior Script Generation: ScripterAgent outperforms all baselines, with expert ratings confirming higher Dramatic Tension (4.1 vs 3.7) and Visual Imagery (4.3 vs 3.8) than the strongest baseline, SEED-Story.
Universal Video Improvement: Using our generated scripts boosts performance across all SOTA models (including Sora2-Pro and Veo3.1), raising average Script Faithfulness from 3.8 to 4.2, with per-model gains of 0.2 to 0.8 points.
Trade-off Revealed: Analysis uncovers a trade-off between visual spectacle and script adherence. With our scripts, Sora2-Pro leads in Visual Appeal (4.8) but trails HYVideo1.5 in Script Faithfulness (4.2 vs 4.6).
Enhanced Temporal Fidelity: Our new Visual-Script Alignment (VSA) metric confirms that our framework improves temporal-semantic coherence by over 7 points on average.

Script Generation Performance on ScriptBench Test Set

Method	AI Rating (0-5)				Human Rating (0-5)
Method	Format Comp.	Shot Division	Content Comp.	Narrative Coher.	Character Consist.	Dramatic Tension	Visual Imagery
CHAE	3.3	3.2	3.4	3.5	3.1	3.3	3.4
MoPS	3.2	3.1	3.3	3.4	3.0	3.2	3.3
SEED-Story	3.6	3.5	3.7	3.8	3.6	3.7	3.8
ScripterAgent (SFT only)	3.9	3.6	3.8	3.9	3.7	3.6	3.8
ScripterAgent (Full)	4.0	3.9	4.1	4.2	4.0	4.1	4.3
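The effect of the RL stage can be read off the last two rows of the table by subtracting the SFT-only scores from the full model's scores. The snippet below does that arithmetic; the scores are copied from the table, while the variable names are ours.

```python
# Scores copied from the last two rows of the table above.
metrics = [
    "Format Comp.", "Shot Division", "Content Comp.", "Narrative Coher.",
    "Character Consist.", "Dramatic Tension", "Visual Imagery",
]
sft_only = [3.9, 3.6, 3.8, 3.9, 3.7, 3.6, 3.8]
full = [4.0, 3.9, 4.1, 4.2, 4.0, 4.1, 4.3]

# Round to one decimal to match the table's precision and avoid float noise.
rl_gain = {m: round(f - s, 1) for m, s, f in zip(metrics, sft_only, full)}

for metric, gain in rl_gain.items():
    print(f"{metric:20s} +{gain:.1f}")
```

By this arithmetic the largest gains (+0.5) fall on the two human-rated creative dimensions, Dramatic Tension and Visual Imagery, while Format Compliance moves least (+0.1).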

Video Generation Evaluation on ScriptBench Test Set

Model	AI Rating (0-5)					Human Rating (0-5)					Overall Mean
Model	Cam. Artic.	Body Block.	Visual Fid.	Emo. Arc	Pace Tim.	Visual App.	Script Faith.	Char. Const.	Phy. Law	Nar. Coher.	Avg. AI	Avg. Human
Raw Dialogue (w/o ScripterAgent)
Vidu2	4.1	4.1	4.5	4.4	4.4	3.7	4.1	3.0	3.3	3.1	4.3	3.4
Seedance1.5-Pro	4.0	4.0	4.5	4.2	4.3	3.5	3.7	3.2	3.1	3.5	4.2	3.4
Kling2.6	4.1	4.1	4.6	4.4	4.4	3.6	3.5	3.3	3.4	3.7	4.3	3.5
Wan2.6	4.2	4.2	4.7	4.4	4.4	3.5	3.2	3.1	3.7	3.4	4.4	3.4
HYVideo1.5	4.0	4.0	4.5	4.3	4.3	4.0	4.2	4.1	3.8	4.1	4.2	4.0
Sora2-Pro	4.1	4.0	4.6	4.3	4.3	4.2	3.6	3.7	4.1	3.9	4.3	3.9
Veo3.1	4.0	3.9	4.4	4.4	4.3	3.9	4.0	4.1	3.9	4.0	4.2	4.0
Average	4.1	4.0	4.5	4.3	4.3	3.8	3.8	3.5	3.6	3.7	4.2	3.7
w/ ScripterAgent (Ours)
Vidu2	4.2	4.4	4.7	4.5	4.5	3.9	4.3	3.7	3.9	3.8	4.5	3.9
Seedance1.5-Pro	4.5	4.6	4.7	4.6	4.7	4.0	4.1	4.1	3.9	4.1	4.6	4.0
Kling2.6	4.3	4.5	4.6	4.5	4.6	3.9	4.1	4.0	4.2	4.1	4.5	4.1
Wan2.6	4.4	4.6	4.7	4.6	4.7	4.1	4.0	3.8	4.0	3.9	4.6	4.0
HYVideo1.5	4.4	4.5	4.8	4.5	4.7	4.5	4.6	4.4	4.2	4.3	4.6	4.4
Sora2-Pro	4.1	4.4	4.7	4.5	4.6	4.8	4.2	4.3	4.5	4.1	4.5	4.4
Veo3.1	4.1	4.4	4.5	4.6	4.4	4.6	4.4	4.3	4.4	4.2	4.4	4.4
Average	4.3	4.5	4.7	4.5	4.6	4.3	4.2	4.1	4.2	4.1	4.5	4.2
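As a worked check on the Script Faithfulness column, the per-model gain from adding ScripterAgent can be computed directly from the two halves of the table. The values are copied from the table; the dictionary and variable names are ours.

```python
from statistics import mean

# Script Faithfulness (human rating) copied from the table above.
raw_dialogue = {
    "Vidu2": 4.1, "Seedance1.5-Pro": 3.7, "Kling2.6": 3.5, "Wan2.6": 3.2,
    "HYVideo1.5": 4.2, "Sora2-Pro": 3.6, "Veo3.1": 4.0,
}
with_scripter = {
    "Vidu2": 4.3, "Seedance1.5-Pro": 4.1, "Kling2.6": 4.1, "Wan2.6": 4.0,
    "HYVideo1.5": 4.6, "Sora2-Pro": 4.2, "Veo3.1": 4.4,
}

# Round to one decimal to match the table's precision and avoid float noise.
gains = {m: round(with_scripter[m] - raw_dialogue[m], 1) for m in raw_dialogue}
best_model = max(gains, key=gains.get)
mean_gain = mean(gains.values())

for model, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model:16s} +{gain:.1f}")
print(f"largest gain: {best_model}, mean gain: {mean_gain:.2f}")
```

Every model improves, with Wan2.6 gaining most (+0.8) and Vidu2 least (+0.2). The mean of the per-model gains comes to about 0.49, slightly above the 0.4 difference between the table's rounded Average rows (3.8 and 4.2), because those averages are themselves rounded to one decimal.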
