1 Tencent Hunyuan Multimodal Department 2 Xidian University
* Equal Contribution † Correspondence
Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution.
To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence.
Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models.
Figure 1: The proposed agentic framework pipeline consisting of ScripterAgent, DirectorAgent, and CriticAgent.
ScripterAgent: Translates coarse dialogue into fine-grained, structured cinematic scripts. It is trained with a two-stage paradigm (SFT followed by RL) to align with professional directorial standards.
DirectorAgent: Orchestrates video generation models. It uses a Cross-Scene Continuous Generation strategy with frame-anchoring to keep visuals continuous across scenes and to counter temporal incoherence.
CriticAgent: Evaluates the generated film from both technical and cinematic perspectives, checking structural validity and semantic fidelity with automated metrics and VSA.
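
The frame-anchoring idea in the DirectorAgent description can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "anchoring" means conditioning each shot on the last frame of the previous shot, and every name here (`generate_film`, `toy_generator`, the `Frame` and `Clip` aliases) is hypothetical. Frames are plain strings so the example runs without any video model.

```python
from typing import Callable, List, Optional, Sequence

Frame = str        # stand-in for an image; a real pipeline would use arrays or tensors
Clip = List[Frame]


def generate_film(
    shot_prompts: Sequence[str],
    generate_clip: Callable[[str, Optional[Frame]], Clip],
) -> List[Clip]:
    """Generate one clip per shot, anchoring each clip on the previous clip's last frame."""
    clips: List[Clip] = []
    anchor: Optional[Frame] = None  # the first shot has nothing to anchor on
    for prompt in shot_prompts:
        clip = generate_clip(prompt, anchor)
        if not clip:
            raise ValueError(f"generator returned no frames for shot: {prompt!r}")
        clips.append(clip)
        anchor = clip[-1]  # the last frame becomes the next shot's starting condition
    return clips


def toy_generator(prompt: str, anchor: Optional[Frame]) -> Clip:
    """Hypothetical stand-in for a video model: starts from the anchor frame if one is given."""
    first = anchor if anchor is not None else f"{prompt}#0"
    return [first, f"{prompt}#1", f"{prompt}#2"]


clips = generate_film(["kitchen", "hallway"], toy_generator)
```

In this toy run the second clip opens on the frame that closed the first, which is the continuity property the strategy is meant to provide. A real generator would take the anchor as an image-conditioning input.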

Script generation quality. All scores are on a 0-5 scale and come from AI and human raters.

| Method | Format Comp. | Shot Division | Content Comp. | Narrative Coher. | Character Consist. | Dramatic Tension | Visual Imagery |
|---|---|---|---|---|---|---|---|
| CHAE | 3.3 | 3.2 | 3.4 | 3.5 | 3.1 | 3.3 | 3.4 |
| MoPS | 3.2 | 3.1 | 3.3 | 3.4 | 3.0 | 3.2 | 3.3 |
| SEED-Story | 3.6 | 3.5 | 3.7 | 3.8 | 3.6 | 3.7 | 3.8 |
| ScripterAgent (SFT only) | 3.9 | 3.6 | 3.8 | 3.9 | 3.7 | 3.6 | 3.8 |
| ScripterAgent (Full) | 4.0 | 3.9 | 4.1 | 4.2 | 4.0 | 4.1 | 4.3 |

Video generation quality. The first five metric columns are AI ratings and the next five are human ratings, all on a 0-5 scale. The last two columns are the overall means of each group.

| Model | Cam. Artic. | Body Block. | Visual Fid. | Emo. Arc | Pace Tim. | Visual App. | Script Faith | Char. Const. | Phy. Law | Nar. Coher. | Avg. AI | Avg. Human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Raw Dialogue (w/o ScripterAgent) | ||||||||||||
| Vidu2 | 4.1 | 4.1 | 4.5 | 4.4 | 4.4 | 3.7 | 4.1 | 3.0 | 3.3 | 3.1 | 4.3 | 3.4 |
| Seedance1.5-Pro | 4.0 | 4.0 | 4.5 | 4.2 | 4.3 | 3.5 | 3.7 | 3.2 | 3.1 | 3.5 | 4.2 | 3.4 |
| Kling2.6 | 4.1 | 4.1 | 4.6 | 4.4 | 4.4 | 3.6 | 3.5 | 3.3 | 3.4 | 3.7 | 4.3 | 3.5 |
| Wan2.6 | 4.2 | 4.2 | 4.7 | 4.4 | 4.4 | 3.5 | 3.2 | 3.1 | 3.7 | 3.4 | 4.4 | 3.4 |
| HYVideo1.5 | 4.0 | 4.0 | 4.5 | 4.3 | 4.3 | 4.0 | 4.2 | 4.1 | 3.8 | 4.1 | 4.2 | 4.0 |
| Sora2-Pro | 4.1 | 4.0 | 4.6 | 4.3 | 4.3 | 4.2 | 3.6 | 3.7 | 4.1 | 3.9 | 4.3 | 3.9 |
| Veo3.1 | 4.0 | 3.9 | 4.4 | 4.4 | 4.3 | 3.9 | 4.0 | 4.1 | 3.9 | 4.0 | 4.2 | 4.0 |
| Average | 4.1 | 4.0 | 4.5 | 4.3 | 4.3 | 3.8 | 3.8 | 3.5 | 3.6 | 3.7 | 4.2 | 3.7 |
| w/ ScripterAgent (Ours) | ||||||||||||
| Vidu2 | 4.2 | 4.4 | 4.7 | 4.5 | 4.5 | 3.9 | 4.3 | 3.7 | 3.9 | 3.8 | 4.5 | 3.9 |
| Seedance1.5-Pro | 4.5 | 4.6 | 4.7 | 4.6 | 4.7 | 4.0 | 4.1 | 4.1 | 3.9 | 4.1 | 4.6 | 4.0 |
| Kling2.6 | 4.3 | 4.5 | 4.6 | 4.5 | 4.6 | 3.9 | 4.1 | 4.0 | 4.2 | 4.1 | 4.5 | 4.1 |
| Wan2.6 | 4.4 | 4.6 | 4.7 | 4.6 | 4.7 | 4.1 | 4.0 | 3.8 | 4.0 | 3.9 | 4.6 | 4.0 |
| HYVideo1.5 | 4.4 | 4.5 | 4.8 | 4.5 | 4.7 | 4.5 | 4.6 | 4.4 | 4.2 | 4.3 | 4.6 | 4.4 |
| Sora2-Pro | 4.1 | 4.4 | 4.7 | 4.5 | 4.6 | 4.8 | 4.2 | 4.3 | 4.5 | 4.1 | 4.5 | 4.4 |
| Veo3.1 | 4.1 | 4.4 | 4.5 | 4.6 | 4.4 | 4.6 | 4.4 | 4.3 | 4.4 | 4.2 | 4.4 | 4.4 |
| Average | 4.3 | 4.5 | 4.7 | 4.5 | 4.6 | 4.3 | 4.2 | 4.1 | 4.2 | 4.1 | 4.5 | 4.2 |
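
The two mean columns can be reproduced from the per-metric scores. The sketch below assumes they are plain arithmetic means of the five AI-rated and five human-rated metrics, rounded to one decimal. The page does not state this, but it matches the reported values for the Vidu2 rows used here. The helper name `mean_1dp` is illustrative.

```python
def mean_1dp(values):
    """Arithmetic mean rounded to one decimal place (assumed rounding rule)."""
    return round(sum(values) / len(values), 1)


# Vidu2 rows copied from the table: five AI-rated metrics, then five human-rated metrics.
vidu2_raw = {"ai": [4.1, 4.1, 4.5, 4.4, 4.4], "human": [3.7, 4.1, 3.0, 3.3, 3.1]}
vidu2_scripted = {"ai": [4.2, 4.4, 4.7, 4.5, 4.5], "human": [3.9, 4.3, 3.7, 3.9, 3.8]}

raw_ai, raw_human = mean_1dp(vidu2_raw["ai"]), mean_1dp(vidu2_raw["human"])
new_ai, new_human = mean_1dp(vidu2_scripted["ai"]), mean_1dp(vidu2_scripted["human"])

# Gain from adding ScripterAgent, computed on the rounded means.
gain_ai = round(new_ai - raw_ai, 1)
gain_human = round(new_human - raw_human, 1)

print(raw_ai, raw_human)    # 4.3 3.4
print(new_ai, new_human)    # 4.5 3.9
print(gain_ai, gain_human)  # 0.2 0.5
```

For this model the human-rated mean rises more than the AI-rated mean (0.5 versus 0.2), in line with the Average rows, where the human mean goes from 3.7 to 4.2 and the AI mean from 4.2 to 4.5.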