Nour Rabih


2024

OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany | Nour Rabih | Ibrahim Mohammed | Abdul Waheed | Bhiksha Raj
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

We present the results of the shared task “Dialect to MSA Translation”, which tackles challenges posed by the diverse Arabic dialects in machine translation. Covering Gulf, Egyptian, Levantine, Iraqi and Maghrebi dialects, the task offers 1001 sentences in both MSA and dialects for fine-tuning, alongside 1888 blind test sentences. Leveraging GPT-3.5, a state-of-the-art language model, our method achieved a BLEU score of 29.61. This endeavor holds significant implications for Neural Machine Translation (NMT) systems targeting low-resource languages with linguistic variation. Additionally, negative experiments involving fine-tuning AraT5 and No Language Left Behind (NLLB) using the MADAR Dataset resulted in BLEU scores of 10.41 and 11.96, respectively. Future directions include expanding the dataset to incorporate more Arabic dialects and exploring alternative NMT architectures to further enhance translation capabilities.

2022

Gulf Arabic Diacritization: Guidelines, Initial Dataset, and Results
Nouf Alabbasi | Mohamed Al-Badrashiny | Maryam Aldahmani | Ahmed AlDhanhani | Abdullah Saleh Alhashmi | Fawaghy Ahmed Alhashmi | Khalid Al Hashemi | Rama Emad Alkhobbi | Shamma T Al Maazmi | Mohammed Ali Alyafeai | Mariam M Alzaabi | Mohamed Saqer Alzaabi | Fatma Khalid Badri | Kareem Darwish | Ehab Mansour Diab | Muhammad Morsy Elmallah | Amira Ayman Elnashar | Ashraf Hatim Elneima | MHD Tameem Kabbani | Nour Rabih | Ahmad Saad | Ammar Mamoun Sousou
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Arabic diacritic recovery is important for a variety of downstream tasks such as text-to-speech. In this paper, we introduce a new Gulf Arabic diacritization dataset composed of 19,850 words based on a subset of the Gumar corpus. We provide a comprehensive set of guidelines for diacritization to enable the diacritization of more data. We also report on diacritization results based on the new corpus using a Hidden Markov Model and character-based sequence-to-sequence models.