MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations

ECCV 2026
Hejia Chen1, Haoxian Zhang2, Xu He2, Xiaoqiang Liu2, Pengfei Wan2, Shoulong Zhang3, Shuai Li1,3
1Beihang University 2Kling Team, Kuaishou Technology 3Zhongguancun Laboratory
✉ Corresponding authors
[Figure: MindFlow teaser]

Grounded in the Ventral-Dorsal dual-pathway cognitive model, MindFlow is a framework for streaming facial animation in dyadic conversations. Its digital avatars perceive conversational emotions while reflexively synchronizing with acoustic rhythms, yielding interactions that are both semantically rich and physically fluid.

Abstract

Generating lifelike facial animation for dyadic conversations requires reconciling high-level cognitive intent with precise low-level motor reflexes, yet existing methods fall short both in semantic understanding of dialogue context and in precise dynamic control. In this paper, we propose MindFlow, a dual-pathway generative framework inspired by the Ventral-Dorsal pathway model in neuroscience. It decouples generation into two collaborative streams, harmonizing deep semantic reasoning with fine-grained control. In the Ventral module, we replace the conventional Sentence-Action approach with a novel Chunk-State approach that models raw acoustic streams as a context-aware, evolving emotional state chain, capturing subtle paralinguistic nuances and mid-utterance emotional shifts that sentence-level modeling misses. The Dorsal module features a conditional autoregressive flow matching network that generates high-fidelity facial motion, driven by high-frequency acoustic cues and modulated by emotion states. It also includes a Selective Acoustic Injector that adaptively gates audio, keeping generation robust to talking-and-listening dynamics without interference. Extensive experiments demonstrate that MindFlow achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines.
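To make the phrase "conditional flow matching" concrete, the following is a minimal sketch of the generic technique, not the MindFlow network itself. It assumes the common straight-line (rectified-flow style) probability path between Gaussian noise and a target motion vector; the abstract does not state which path or architecture the Dorsal module uses. All names here (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `velocity_fn`, `cond`) are hypothetical, and `cond` stands in for whatever conditioning (previous frames, acoustic cues, emotion states) a real model would receive.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """One-sample estimate of the conditional flow matching loss.

    velocity_fn(x_t, t, cond) is a placeholder for a learned network.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                          # time in [0, 1)
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t, cond)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x1)

def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In an autoregressive, streaming setting one would presumably call a sampler like `euler_sample` once per frame or chunk, passing the previously generated motion inside `cond`; that loop is omitted here because the source does not describe it.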

Highlights

  • Ventral-Dorsal dual pathway: MindFlow separates cognitive semantic understanding from reflexive motion generation to address rigid and hollow facial expressions in dyadic conversation animation.
  • Chunk-State approach: The Ventral module uses multimodal LLMs to analyze dynamic emotion states directly from audio streams, preserving prosodic cues and enabling continuous fine-grained expression control.
  • Selective Acoustic Injector: The Dorsal module adaptively gates listening and talking dynamics to generate high-quality facial motion across both conversational roles.
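The idea of adaptively gating audio, as in the last highlight, can be illustrated with a small sketch. This is an assumption-laden toy, not the Selective Acoustic Injector as published: the page does not specify the gate's form, so the example uses a single sigmoid gate computed from the concatenated motion and audio features. The names `selective_inject`, `hidden`, `audio`, `weights`, and `bias` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def selective_inject(hidden, audio, weights, bias):
    """Add audio features to motion features, scaled by a scalar gate.

    hidden, audio: equal-length feature lists (hypothetical layout).
    weights: one weight per element of the concatenation hidden + audio.
    Returns the fused features and the gate value.
    """
    joint = hidden + audio  # list concatenation, not element-wise addition
    gate = sigmoid(sum(w * x for w, x in zip(weights, joint)) + bias)
    fused = [h + gate * a for h, a in zip(hidden, audio)]
    return fused, gate
```

With a gate near 0 the audio contribution is suppressed, which is the behavior one might want while the avatar is listening; with a gate near 1 the audio features pass through almost unchanged, as one might want while it is talking. In a trained model the weights and bias would be learned rather than set by hand.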

BibTeX

@inproceedings{chen2026mindflow,
  title={{MindFlow}: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations},
  author={Chen, Hejia and Zhang, Haoxian and He, Xu and Liu, Xiaoqiang and Wan, Pengfei and Zhang, Shoulong and Li, Shuai},
  booktitle={European Conference on Computer Vision},
  year={2026}
}

Acknowledgments

This work was supported by Zhongguancun Laboratory, the National Key R&D Program of China (2023YFF1203803), and the National Natural Science Foundation of China (62502469, 62525204).