Token-length Bias in Minimal-pair Paradigm Datasets

Naoya Ueda, Masato Mita, Teruaki Oka, Mamoru Komachi

Abstract
Minimal-pair paradigm (MPP) datasets have been used as benchmarks to evaluate the linguistic knowledge of models and to provide an unsupervised method of acceptability judgment. Model performance is evaluated as the percentage of minimal pairs in the MPP dataset for which the model assigns a higher sentence log-likelihood to the acceptable sentence than to the unacceptable one. Because sentence length affects sentence log-likelihood, each minimal pair in MPP datasets is controlled so that both sentences contain the same number of words. However, aligning the number of words may be insufficient because recent language models tokenize sentences into subwords. Tokenization can produce a token-length difference within a minimal pair, introducing a token-length bias that skews the evaluation results. This study demonstrates that MPP datasets suffer from token-length bias and therefore fail to evaluate the linguistic knowledge of a language model correctly. Our results show that sentences with shorter token lengths are likely to be assigned higher log-likelihoods regardless of their acceptability, which becomes problematic when comparing models with different tokenizers. To address this issue, we propose a debiased minimal-pair generation method that allows MPP datasets to measure linguistic ability correctly and to provide comparable results across all models.
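To make the scoring scheme concrete, below is a minimal sketch (not the authors' implementation) of how an MPP pair is typically scored with a causal language model. It assumes the Hugging Face transformers library, uses GPT-2 purely as an illustrative model, and the example pair is hypothetical, chosen only to show that two sentences with the same number of words can still tokenize into different numbers of subwords.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Sum of token log-probabilities under the model (higher = more likely)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # `loss` is the mean negative log-likelihood per predicted token,
        # so multiplying by the number of predictions recovers the sum.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# Hypothetical minimal pair: same number of words, but subword
# tokenization may yield different token counts for each sentence.
good = "The cats annoy Tim."
bad = "The cats annoys Tim."
print(tokenizer.tokenize(good), tokenizer.tokenize(bad))

# A pair is scored as correct when the acceptable sentence receives
# the higher sentence log-likelihood; dataset accuracy is the fraction
# of pairs scored correctly.
print("pair scored correctly:", sentence_log_likelihood(good) > sentence_log_likelihood(bad))

If the printed token sequences differ in length, a model's general preference for shorter token sequences can confound the acceptability comparison; this is the token-length bias the paper identifies and debiases.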
Anthology ID:
2024.lrec-main.1410
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
16224–16236
URL:
https://aclanthology.org/2024.lrec-main.1410
Cite (ACL):
Naoya Ueda, Masato Mita, Teruaki Oka, and Mamoru Komachi. 2024. Token-length Bias in Minimal-pair Paradigm Datasets. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16224–16236, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Token-length Bias in Minimal-pair Paradigm Datasets (Ueda et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1410.pdf