A Document-Level Text Simplification Dataset for Japanese

Yoshinari Nagai; Teruaki Oka; Mamoru Komachi

Abstract

Document-level text simplification, a task that combines single-document summarization and intra-sentence simplification, has garnered significant attention. However, studies have primarily focused on languages such as English and German, leaving Japanese and similar languages underexplored because of a scarcity of linguistic resources. In this study, we devised JADOS, the first Japanese document-level text simplification dataset based on newspaper articles and Wikipedia. Our dataset focuses on simplification that enhances readability by reducing the number of sentences and tokens in a document. We conducted several investigations using our dataset. First, we analyzed the characteristics of Japanese simplification by comparing it across different domains and with its English counterparts. Second, we experimentally evaluated the performance of text summarization methods, transformer-based text simplification models, and large language models. In terms of D-SARI scores, the transformer-based models performed best across all domains. Finally, we manually evaluated several model outputs and target articles, demonstrating the need for document-level text simplification models in Japanese.

Anthology ID: 2024.lrec-main.41
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 459–476
URL: https://aclanthology.org/2024.lrec-main.41/
Cite (ACL): Yoshinari Nagai, Teruaki Oka, and Mamoru Komachi. 2024. A Document-Level Text Simplification Dataset for Japanese. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 459–476, Torino, Italia. ELRA and ICCL.
Cite (Informal): A Document-Level Text Simplification Dataset for Japanese (Nagai et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.41.pdf
