How Far Is Too Far? Studying the Effects of Domain Discrepancy on Masked Language Models

Subhradeep Kayal, Alexander Rakhlin, Ali Dashti, Serguei Stepaniants


Abstract
Pre-trained masked language models, such as BERT, perform strongly on a wide variety of NLP tasks and have become ubiquitous in recent years. The typical way to use such models is to fine-tune them on downstream data. In this work, we study how the difference in domain between the pre-trained model and the downstream task affects final performance. We first devise a simple mechanism to quantify the domain difference (using a cloze task) and use it to partition our dataset. Using these partitions of varying domain discrepancy, we focus on answering key questions about the impact of discrepancy on final performance, robustness to out-of-domain test-time examples, and the effect of domain-adaptive pre-training. We base our experiments on a large-scale, openly available e-commerce dataset, and our findings suggest that, in spite of pre-training, the performance of BERT degrades on datasets with high domain discrepancy, especially in low-resource cases. This effect is somewhat mitigated by continued pre-training for domain adaptation. Furthermore, the domain gap also makes BERT sensitive to out-of-domain examples during inference, even in high-resource tasks, and it is prudent to use as diverse a dataset as possible during fine-tuning to make the model robust to domain shift.
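The abstract describes quantifying domain discrepancy with a cloze task but does not detail the procedure here. Below is a minimal, illustrative sketch of one such measure, not the paper's exact method: randomly mask tokens in sentences sampled from a domain and record how often an off-the-shelf BERT checkpoint recovers them; lower cloze accuracy suggests a larger gap from the pre-training domain. The model name, masking rate, and example sentences are assumptions made for illustration only.

```python
# Illustrative sketch: cloze accuracy of a pre-trained masked LM as a rough
# proxy for domain discrepancy. Not the authors' exact implementation.
import random
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed off-the-shelf checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def cloze_accuracy(texts, mask_prob=0.15, seed=0):
    """Fraction of randomly masked tokens the model predicts correctly.
    Lower accuracy suggests a larger gap between the pre-training domain
    and the domain of `texts`."""
    rng = random.Random(seed)
    correct, total = 0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        input_ids = enc["input_ids"].clone()
        labels = enc["input_ids"].clone()
        # Pick non-special token positions to mask.
        special = tokenizer.get_special_tokens_mask(
            labels[0].tolist(), already_has_special_tokens=True
        )
        mask_positions = [
            i for i, s in enumerate(special)
            if s == 0 and rng.random() < mask_prob
        ]
        if not mask_positions:
            continue
        for i in mask_positions:
            input_ids[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(
                input_ids=input_ids, attention_mask=enc["attention_mask"]
            ).logits
        preds = logits.argmax(dim=-1)
        for i in mask_positions:
            correct += int(preds[0, i] == labels[0, i])
            total += 1
    return correct / max(total, 1)

# Hypothetical usage: compare a general-domain sample with an e-commerce sample.
general = ["The committee approved the proposal after a long debate."]
ecommerce = ["2-pack usb-c 65w gan charger w/ pd 3.0 fast charging"]
print("general cloze accuracy:", cloze_accuracy(general))
print("e-commerce cloze accuracy:", cloze_accuracy(ecommerce))
```

In practice one would average over many sentences per partition; the point of the sketch is only that a single forward pass per masked sentence yields a simple, comparable score across domains.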
Anthology ID:
2024.lrec-main.718
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
8192–8199
URL:
https://aclanthology.org/2024.lrec-main.718
Cite (ACL):
Subhradeep Kayal, Alexander Rakhlin, Ali Dashti, and Serguei Stepaniants. 2024. How Far Is Too Far? Studying the Effects of Domain Discrepancy on Masked Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8192–8199, Torino, Italia. ELRA and ICCL.
Cite (Informal):
How Far Is Too Far? Studying the Effects of Domain Discrepancy on Masked Language Models (Kayal et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.718.pdf