Analysis on Unsupervised Acquisition Process of Bilingual Vocabulary through Iterative Back-Translation

Takuma Tanigawa, Tomoyosi Akiba, Hajime Tsukada


Abstract
In this paper, we investigate how new bilingual vocabulary is acquired through Iterative Back-Translation (IBT), a data augmentation method for machine translation that uses monolingual data in both the source and target languages. To reveal the acquisition process, we first identify the word translation pairs in the test data that do not appear in the bilingual data but appear only in the two monolingual corpora, and then observe how many of these pairs are successfully translated by the translation model trained through IBT. We conducted experiments in domain adaptation settings on two language pairs. Our experimental evaluation showed that more than 60% of the new bilingual vocabulary is successfully acquired through IBT, along with an improvement in translation quality in terms of BLEU. It also revealed that new bilingual vocabulary was acquired gradually over repeated IBT iterations. Based on these results, we present a hypothesis on the process of new bilingual vocabulary acquisition, in which the context of the words plays a critical role in the success of the acquisition.
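The evaluation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the co-occurrence test used to decide that a pair is absent from the bilingual data, and the toy sentences are all assumptions made for illustration. The sketch selects test word pairs whose two sides occur only in the monolingual corpora, then measures the fraction that a translation output renders with the expected target word.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (source word, target word)


def vocab(sentences: List[str]) -> Set[str]:
    """Collect whitespace-separated tokens (assumed tokenization)."""
    return {tok for sent in sentences for tok in sent.split()}


def cooccurring(pair: Pair, bitext: List[Tuple[str, str]]) -> bool:
    """True if both words appear in the same sentence pair of the bitext."""
    src_word, tgt_word = pair
    return any(src_word in s.split() and tgt_word in t.split() for s, t in bitext)


def find_new_pairs(test_pairs: List[Pair], bitext: List[Tuple[str, str]],
                   mono_src: List[str], mono_tgt: List[str]) -> Set[Pair]:
    """Keep pairs seen in both monolingual corpora but never in the bitext."""
    src_vocab, tgt_vocab = vocab(mono_src), vocab(mono_tgt)
    return {
        (s, t) for s, t in test_pairs
        if s in src_vocab and t in tgt_vocab and not cooccurring((s, t), bitext)
    }


def acquisition_rate(examples: List[Tuple[str, str]], new_pairs: Set[Pair]) -> float:
    """Fraction of new-pair occurrences whose target word is in the hypothesis.

    `examples` holds (source sentence, system hypothesis) tuples.
    """
    hits = total = 0
    for src, hyp in examples:
        src_toks, hyp_toks = set(src.split()), set(hyp.split())
        for s, t in new_pairs:
            if s in src_toks:
                total += 1
                hits += t in hyp_toks
    return hits / total if total else 0.0


# Toy data, invented for illustration only.
bitext = [("the cat sleeps", "die katze schläft")]
mono_src = ["the enzyme binds", "the cat sleeps"]
mono_tgt = ["das enzym bindet", "die katze schläft"]
test_pairs = [("enzyme", "enzym"), ("cat", "katze"), ("ligand", "ligand")]

new_pairs = find_new_pairs(test_pairs, bitext, mono_src, mono_tgt)
examples = [
    ("the enzyme sleeps", "das enzym schläft"),   # target word produced
    ("the enzyme binds", "das protein bindet"),   # target word missed
]
rate = acquisition_rate(examples, new_pairs)
```

Computing the same rate on the model output after each IBT iteration would give the kind of acquisition curve the abstract refers to. A real setup would need word alignments to obtain `test_pairs`, which this sketch takes as given.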
Anthology ID:
2024.lrec-main.80
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
887–892
URL:
https://aclanthology.org/2024.lrec-main.80
Cite (ACL):
Takuma Tanigawa, Tomoyosi Akiba, and Hajime Tsukada. 2024. Analysis on Unsupervised Acquisition Process of Bilingual Vocabulary through Iterative Back-Translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 887–892, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Analysis on Unsupervised Acquisition Process of Bilingual Vocabulary through Iterative Back-Translation (Tanigawa et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.80.pdf