CSSWiki: A Chinese Sentence Simplification Dataset with Linguistic and Content Operations

Fengkai Liu, John S. Y. Lee


Abstract
Sentence simplification aims to make sentences easier to read and understand. Because most corpus development has focused on English, annotated data for Chinese remains limited. To address this need, we introduce CSSWiki, an open-source dataset for Chinese sentence simplification based on Wikipedia. The dataset contains 1.6k source sentences paired with their simplified versions. Each sentence pair is annotated with operation tags that distinguish between linguistic and content modifications. We analyze differences in annotation scheme and data statistics between CSSWiki and existing datasets. We then report baseline sentence simplification performance on CSSWiki using zero-shot and few-shot approaches with Large Language Models.
Anthology ID:
2024.lrec-main.375
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
4205–4213
URL:
https://aclanthology.org/2024.lrec-main.375
Cite (ACL):
Fengkai Liu and John S. Y. Lee. 2024. CSSWiki: A Chinese Sentence Simplification Dataset with Linguistic and Content Operations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4205–4213, Torino, Italia. ELRA and ICCL.
Cite (Informal):
CSSWiki: A Chinese Sentence Simplification Dataset with Linguistic and Content Operations (Liu & Lee, LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.375.pdf