A Closer Look at Clustering Bilingual Comparable Corpora

Anna Laskina, Eric Gaussier, Gaelle Calvary


Abstract
In this paper, we study the problem of clustering comparable corpora, building on the observation that such corpora can contain different types of clusters: monolingual clusters, comprising documents in a single language, and bilingual or multilingual clusters, comprising documents written in different languages. Based on a state-of-the-art deep variant of K-means, we propose new clustering models fully adapted to comparable corpora and illustrate their behavior on several bilingual collections (in English, French, German and Russian) created from Wikipedia.
Anthology ID:
2024.lrec-main.12
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
133–142
URL:
https://aclanthology.org/2024.lrec-main.12
Cite (ACL):
Anna Laskina, Eric Gaussier, and Gaelle Calvary. 2024. A Closer Look at Clustering Bilingual Comparable Corpora. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 133–142, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Closer Look at Clustering Bilingual Comparable Corpora (Laskina et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.12.pdf