Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms

Wonkee Lee, Seong-Hwan Heo, Jong-Hyeok Lee


Abstract
Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
Anthology ID:
2024.lrec-main.43
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
484–494
Language:
URL:
https://aclanthology.org/2024.lrec-main.43
DOI:
Bibkey:
Cite (ACL):
Wonkee Lee, Seong-Hwan Heo, and Jong-Hyeok Lee. 2024. Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 484–494, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms (Lee et al., LREC-COLING 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.lrec-main.43.pdf