Multimodal Fusion Technology for Analyzing Children’s Emotions Based on the Attention Mechanism

Authors

  • Chuyao Ma Huzhou University, Huzhou 313000, Zhejiang, China
  • Caixia Sun Huzhou University, Huzhou 313000, Zhejiang, China
  • Wei Shen Huzhou University, Huzhou 313000, Zhejiang, China

DOI:

https://doi.org/10.53469/wjimt.2025.08(09).12

Keywords:

Emotion analysis of young children, Cross-modal Transformer architecture, Multi-head self-attention mechanism, Multi-modal fusion

Abstract

To enhance the emotion recognition ability of preschool education dialogue robots, this paper proposes a multimodal fusion model based on a cross-modal Transformer architecture. The model consists of feature extraction, fusion, and output layers. Text features are extracted with BERT, audio features with AFEU units, and video features with the OpenFace toolkit. A multi-head self-attention mechanism is introduced to obtain high-level features, with text serving as the auxiliary modality and audio and video as the main modalities. An improved cross-modal Transformer and an AVFSM module then fuse these features to perform emotion recognition. Experiments on the CH-SIMS dataset and a self-built Tea dataset show that the model outperforms the baseline models on both classification and regression metrics, verifying the effectiveness of each component. The model exhibits good robustness and generalization ability and shows promise for application in preschool education and related fields.
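The cross-modal fusion step described in the abstract can be illustrated with a minimal PyTorch sketch: the main modalities (audio and video) attend to the auxiliary text modality through multi-head attention, and the enriched streams are pooled and concatenated for classification. The module names, feature dimensions, and fusion strategy below are illustrative assumptions only; the paper's AFEU and AVFSM components are represented by generic placeholders, not the authors' actual implementation.

# Illustrative sketch of text-auxiliary cross-modal attention fusion (assumed design).
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """A main modality attends to the auxiliary (text) modality via multi-head attention."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, main: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # Queries come from the main modality; keys and values come from the text modality.
        attended, _ = self.attn(query=main, key=aux, value=aux)
        x = self.norm1(main + attended)
        return self.norm2(x + self.ffn(x))


class FusionHead(nn.Module):
    """Placeholder for an AVFSM-style fusion stage: pool and concatenate the enriched streams."""

    def __init__(self, dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.audio_block = CrossModalBlock(dim)
        self.video_block = CrossModalBlock(dim)
        self.classifier = nn.Linear(dim * 2, num_classes)

    def forward(self, text, audio, video):
        a = self.audio_block(audio, text).mean(dim=1)   # audio enriched by text, pooled over time
        v = self.video_block(video, text).mean(dim=1)   # video enriched by text, pooled over time
        return self.classifier(torch.cat([a, v], dim=-1))


if __name__ == "__main__":
    B, T, D = 2, 16, 128                # batch size, sequence length, feature dim (illustrative)
    text = torch.randn(B, T, D)         # e.g. projected BERT token features
    audio = torch.randn(B, T, D)        # e.g. projected AFEU-style audio features
    video = torch.randn(B, T, D)        # e.g. projected OpenFace frame features
    logits = FusionHead()(text, audio, video)
    print(logits.shape)                 # torch.Size([2, 3])

The key design point sketched here is asymmetry: text supplies keys and values only, so the audio and video streams remain the primary carriers of the emotional signal while being refined by linguistic context.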

References

Peng Kaibei, Sun Xiaoming, Chen Haowei, et al. Voice emotion recognition method for railway stations based on convolutional neural networks [J]. Computer Simulation, 2023, 40(2): 177-180+189.

Gao Lijun, Xue Lei. Speech emotion recognition based on Transformer architecture [J]. Industrial Control Computers, 2023, 36(1): 82-83+86.

Wang Xi, Wang Junbao, Bianba Wangdui. Emotion recognition of Tibetan speech based on convolutional neural network [J]. Information Technology and Informationization, 2022, (11): 202-206.

Cui Chenlu, Cui Lin. Lightweight Speech Emotion Recognition for Data Augmentation [J]. Computer and Modernization, 2023, (4): 83-89+100.

Zhu Yonghua, Feng Tianyu, Zhang Meixian, et al. Convolutional speech emotion recognition network based on incremental method [J]. Journal of Shanghai University (Natural Science Edition), 2023, 29(1): 24-40.

Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [C]. Proceedings of NAACL-HLT, 2019: 4171-4186.

Baltrusaitis T, Zadeh A, Lim Y C, et al. OpenFace 2.0: Facial Behavior Analysis Toolkit [C]. Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, 2018: 59-66

Lu Xueqiang, Tian Chi, Zhang Le, et al. Multimodal sentiment analysis model fusing multi-feature and attention mechanism [J]. Data Analysis and Knowledge Discovery, 2024, 8(5): 91-101.

Kim K, Park S. AOBERT: All-modalities-in-One BERT for multimodal sentiment analysis [J]. Information Fusion, 2023, 92: 37-45.

Junxi Y, Wang Z, Chen C. GCN-MF: A graph convolutional network based on matrix factorization for recommendation [J]. Innovation & Technology Advances, 2024, 2(1): 14-26. https://doi.org/10.61187/ita.v2i1.30

Wang L, Peng J, Zheng C, et al. A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning [J]. Information Processing & Management, 2024, 61(3): 103675.

Fu Y, Zhang Z, Yang R, et al. Hybrid cross-modal interaction learning for multimodal sentiment analysis [J]. Neurocomputing, 2024, 571: 127201.

Zeng Y, Mai S, Hu H F. Which is Making the Contribution: Modulating Unimodal and Cross-Modal Dynamics for Multimodal Sentiment Analysis [C]. Findings of the Association for Computational Linguistics: EMNLP 2021, 2021: 1262-1274.

Wu Y, Lin Z, Zhao Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis [C]. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021: 4730-4738.

Published

2025-09-29

Issue

Section

Articles