The Emergence of Semantic Units in Massively Multilingual Models

Andrea Gregor de Varda, Marco Marelli


Abstract
Massively multilingual models can process text in several languages relying on a shared set of parameters; however, little is known about the encoding of multilingual information in single network units. In this work, we study how two semantic variables, namely valence and arousal, are processed in the latent dimensions of mBERT and XLM-R across 13 languages. We report a significant cross-lingual overlap in the individual neurons processing affective information, which is more pronounced when considering XLM-R vis-à-vis mBERT. Furthermore, we uncover a positive relationship between cross-lingual alignment and performance, where the languages that rely more heavily on a shared cross-lingual neural substrate achieve higher performance scores in semantic encoding.
Anthology ID:
2024.lrec-main.1382
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
15910–15921
URL:
https://aclanthology.org/2024.lrec-main.1382
Cite (ACL):
Andrea Gregor de Varda and Marco Marelli. 2024. The Emergence of Semantic Units in Massively Multilingual Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15910–15921, Torino, Italia. ELRA and ICCL.
Cite (Informal):
The Emergence of Semantic Units in Massively Multilingual Models (de Varda & Marelli, LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1382.pdf