Improving Arabic Diacritization with Regularized Decoding and Adversarial Training

Han Qin, Guimin Chen, Yuanhe Tian, Yan Song


Abstract
Arabic diacritization is a fundamental task for Arabic language processing. Previous studies have demonstrated that automatically generated knowledge can be helpful to this task. However, these studies regard the auto-generated knowledge instances as gold references, which limits their effectiveness since such knowledge is not always accurate and inferior instances can lead to incorrect predictions. In this paper, we propose to use regularized decoding and adversarial training to appropriately learn from such noisy knowledge for diacritization. Experimental results on two benchmark datasets show that, even with quite flawed auto-generated knowledge, our model can still learn adequate diacritics and outperform all previous studies on both datasets.
Anthology ID:
2021.acl-short.68
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Month:
August
Year:
2021
Address:
Online
Editors:
Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
534–542
URL:
https://aclanthology.org/2021.acl-short.68
DOI:
10.18653/v1/2021.acl-short.68
Cite (ACL):
Han Qin, Guimin Chen, Yuanhe Tian, and Yan Song. 2021. Improving Arabic Diacritization with Regularized Decoding and Adversarial Training. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 534–542, Online. Association for Computational Linguistics.
Cite (Informal):
Improving Arabic Diacritization with Regularized Decoding and Adversarial Training (Qin et al., ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-short.68.pdf
Video:
https://aclanthology.org/2021.acl-short.68.mp4
Code:
synlp/AD-RDAT
Data:
Arabic Text Diacritization