Design of Dimensionality Reduction Algorithm for High-Dimensional Large-Scale Translation Corpora and Lightweight Translation Model Training

Authors

  • Juan Ji, Nantong Institute of Technology, Nantong 226000, China

DOI:

https://doi.org/10.53469/wjimt.2026.09(03).02

Keywords:

Dimensionality Reduction of High-Dimensional Data, Feature Extraction, Lightweight Model, Neural Machine Translation, Knowledge Distillation

Abstract

To address the "curse of dimensionality" caused by the exponential growth of translation corpora, as well as the practical bottlenecks of deploying large models in resource-constrained scenarios (deployment difficulty and high inference latency), this paper designs a dimensionality reduction algorithm that integrates feature selection with deep reconstruction, and builds a lightweight translation model training framework that combines knowledge distillation with structured pruning. First, feature importance is evaluated with a self-attention mechanism to remove noisy and redundant features. Next, a variational autoencoder performs deep reconstruction on the selected features, extracting low-dimensional dense semantic representations. The reduced features are then fed into a lightweight Transformer student network, which learns from the teacher model through knowledge distillation. Finally, structured pruning eliminates redundant attention heads and neurons. Experimental results show that, while maintaining a BLEU score of 28.9, the feature dimension is compressed by 92.5%, the model parameter count is reduced to 26M, and inference speed is improved by a factor of 2.3. The method also outperforms comparative approaches at different compression rates, providing an efficient and feasible technical path for deploying translation systems in resource-constrained environments.
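The knowledge distillation step summarized in the abstract can be sketched as the standard soft-target objective: the student is trained on a weighted sum of (a) KL divergence between temperature-softened teacher and student distributions and (b) cross-entropy against the ground-truth labels. The function names and the α/T weighting below are illustrative assumptions for a minimal NumPy sketch, not the paper's exact formulation:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted for numerical stability."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """Soft-target distillation loss (Hinton-style sketch).

    alpha weights the KL term against the hard-label cross-entropy;
    the T**2 factor keeps soft-target gradients on a comparable scale.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) over the temperature-softened distributions
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    # Hard-label cross-entropy at T = 1
    ce = -np.log(softmax(student_logits)[np.arange(len(target)), target] + 1e-12)
    return float(np.mean(alpha * (T ** 2) * kl + (1.0 - alpha) * ce))
```

When student and teacher logits agree and the target class dominates, the loss approaches zero; in practice the same objective would be written with framework primitives (e.g. a KL-divergence loss over log-probabilities) so it is differentiable end to end.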

Published

2026-03-31

Section

Articles