Multimodal Fusion Technology for Analyzing Children’s Emotions Based on the Attention Mechanism
DOI: https://doi.org/10.53469/wjimt.2025.08(09).12

Keywords: Emotion analysis of young children, Cross-modal Transformer architecture, Multi-head self-attention mechanism, Multimodal fusion

Abstract
To enhance the emotion recognition ability of dialogue robots for preschool education, this paper proposes a multimodal fusion model based on a cross-modal Transformer architecture. The model consists of feature extraction, fusion, and output layers. Text features are extracted with BERT, audio features with AFEU units, and video features with the OpenFace toolkit. A multi-head self-attention mechanism is introduced to obtain high-level features, with text serving as the auxiliary modality and audio and video as the main modalities. An improved cross-modal Transformer and an AVFSM module are then used to fuse the features and perform emotion recognition. Experiments on the CH-SIMS dataset and a self-built Tea dataset show that the model outperforms baseline models on both classification and regression metrics, verifying the effectiveness of each component. The model exhibits good robustness and generalization, and shows promising application prospects in preschool education and related fields.
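The sketch below illustrates, in PyTorch, the cross-modal fusion idea summarized above: text features (e.g. from BERT) act as the auxiliary modality whose context is injected into the main audio and video streams through multi-head cross-attention, after which the enriched streams are pooled and mapped to an emotion prediction. It is a minimal illustration under assumed dimensions and module names; it does not reproduce the paper's exact AFEU or AVFSM designs.

```python
# Minimal sketch of text-guided cross-modal attention for audio/video fusion.
# Dimensions, pooling, and the regression head are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One cross-modal Transformer block: queries come from the target (main)
    modality, keys/values come from the source (auxiliary) modality."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(target, source, source)  # target attends to source
        x = self.norm1(target + attended)
        return self.norm2(x + self.ffn(x))


class FusionSketch(nn.Module):
    """Audio and video are treated as main modalities; text is the auxiliary
    source that conditions both streams before fusion."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.text_to_audio = CrossModalBlock(dim)
        self.text_to_video = CrossModalBlock(dim)
        self.head = nn.Linear(2 * dim, 1)  # e.g. sentiment-intensity regression

    def forward(self, text, audio, video):
        a = self.text_to_audio(audio, text)  # audio enriched with text context
        v = self.text_to_video(video, text)  # video enriched with text context
        fused = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)  # temporal pooling
        return self.head(fused)


# Dummy usage with (batch, sequence_length, feature_dim) inputs.
model = FusionSketch()
t, a, v = torch.randn(2, 20, 128), torch.randn(2, 50, 128), torch.randn(2, 30, 128)
print(model(t, a, v).shape)  # torch.Size([2, 1])
```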