Multimodal Fusion Technology for Analyzing Children’s Emotions Based on the Attention Mechanism
DOI: https://doi.org/10.53469/wjimt.2025.08(09).12

Keywords: Emotion analysis of young children, Cross-modal Transformer architecture, Multi-head self-attention mechanism, Multimodal fusion

Abstract
To enhance the emotion recognition ability of dialogue robots for preschool education, this paper proposes a multimodal fusion model based on a cross-modal Transformer architecture. The model consists of feature extraction, fusion, and output layers. Text features are extracted with BERT, audio features with AFEU units, and video features with the OpenFace toolkit. A multi-head self-attention mechanism is introduced to obtain high-level features, with text serving as the auxiliary modality and audio and video as the main modalities. An improved cross-modal Transformer and an AVFSM module are then used to fuse the features and perform emotion recognition. Experiments on the CH-SIMS dataset and a self-built Tea dataset show that the model outperforms baseline models on both classification and regression metrics, verifying the effectiveness of each component. The model exhibits good robustness and generalization, and shows promising application prospects in preschool education and related fields.
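The sketch below illustrates, in PyTorch, the cross-modal fusion idea summarized above: text features (e.g. from BERT) act as the auxiliary modality whose context is injected into the main audio and video streams through multi-head cross-attention, after which the enriched streams are pooled and mapped to an emotion prediction. It is a minimal illustration under assumed dimensions and module names; it does not reproduce the paper's exact AFEU or AVFSM designs.

```python
# Minimal sketch of text-guided cross-modal attention for audio/video fusion.
# Dimensions, pooling, and the regression head are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One cross-modal Transformer block: queries come from the target (main)
    modality, keys/values come from the source (auxiliary) modality."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(target, source, source)  # target attends to source
        x = self.norm1(target + attended)
        return self.norm2(x + self.ffn(x))


class FusionSketch(nn.Module):
    """Audio and video are treated as main modalities; text is the auxiliary
    source that conditions both streams before fusion."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.text_to_audio = CrossModalBlock(dim)
        self.text_to_video = CrossModalBlock(dim)
        self.head = nn.Linear(2 * dim, 1)  # e.g. sentiment-intensity regression

    def forward(self, text, audio, video):
        a = self.text_to_audio(audio, text)  # audio enriched with text context
        v = self.text_to_video(video, text)  # video enriched with text context
        fused = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)  # temporal pooling
        return self.head(fused)


# Dummy usage with (batch, sequence_length, feature_dim) inputs.
model = FusionSketch()
t, a, v = torch.randn(2, 20, 128), torch.randn(2, 50, 128), torch.randn(2, 30, 128)
print(model(t, a, v).shape)  # torch.Size([2, 1])
```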