Badr M. Abdullah


2023

pdf bib
Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists
Julius Steuer | Johann-Mattis List | Badr M. Abdullah | Dietrich Klakow
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

We present a cross-linguistic study of vowel harmony that aims to quantifies this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis on vowel harmony. On the contrary, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

pdf bib
On the Nature of Discrete Speech Representations in Multilingual Self-supervised Models
Badr M. Abdullah | Mohammed Maqsood Shaik | Dietrich Klakow
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Self-supervision has emerged as an effective paradigm for learning representations of spoken language from raw audio without explicit labels or transcriptions. Self-supervised speech models, such as wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021), have shown significant promise in improving the performance across different speech processing tasks. One of the main advantages of self-supervised speech models is that they can be pre-trained on a large sample of languages (Conneau et al., 2020; Babu et al.,2022), which facilitates cross-lingual transfer for low-resource languages (San et al., 2021). State-of-the-art self-supervised speech models include a quantization module that transforms the continuous acoustic input into a sequence of discrete units. One of the key questions in this area is whether the discrete representations learned via self-supervision are language-specific or language-universal. In other words, we ask: do the discrete units learned by a multilingual speech model represent the same speech sounds across languages or do they differ based on the specific language being spoken? From the practical perspective, this question has important implications for the development of speech models that can generalize across languages, particularly for low-resource languages. Furthermore, examining the level of linguistic abstraction in speech models that lack symbolic supervision is also relevant to the field of human language acquisition (Dupoux, 2018).

2021

pdf bib
Do we read what we hear? Modeling orthographic influences on spoken word recognition
Nicole Macher | Badr M. Abdullah | Harm Brouwer | Dietrich Klakow
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Theories and models of spoken word recognition aim to explain the process of accessing lexical knowledge given an acoustic realization of a word form. There is consensus that phonological and semantic information is crucial for this process. However, there is accumulating evidence that orthographic information could also have an impact on auditory word recognition. This paper presents two models of spoken word recognition that instantiate different hypotheses regarding the influence of orthography on this process. We show that these models reproduce human-like behavior in different ways and provide testable hypotheses for future research on the source of orthographic effects in spoken word recognition.

pdf bib
Familiar words but strange voices: Modelling the influence of speech variability on word recognition
Alexandra Mayn | Badr M. Abdullah | Dietrich Klakow
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We present a deep neural model of spoken word recognition which is trained to retrieve the meaning of a word (in the form of a word embedding) given its spoken form, a task which resembles that faced by a human listener. Furthermore, we investigate the influence of variability in speech signals on the model’s performance. To this end, we conduct of set of controlled experiments using word-aligned read speech data in German. Our experiments show that (1) the model is more sensitive to dialectical variation than gender variation, and (2) recognition performance of word cognates from related languages reflect the degree of relatedness between languages in our study. Our work highlights the feasibility of modeling human speech perception using deep neural networks.

pdf bib
SIGTYP 2021 Shared Task: Robust Spoken Language Identification
Elizabeth Salesky | Badr M. Abdullah | Sabrina Mielke | Elena Klyachko | Oleg Serikov | Edoardo Maria Ponti | Ritesh Kumar | Ryan Cotterell | Ekaterina Vylomova
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP

While language identification is a fundamental speech and language processing task, for many languages and language families it remains a challenging task. For many low-resource and endangered languages this is in part due to resource availability: where larger datasets exist, they may be single-speaker or have different domains than desired application scenarios, demanding a need for domain and speaker-invariant language identification systems. This year’s shared task on robust spoken language identification sought to investigate just this scenario: systems were to be trained on largely single-speaker speech from one domain, but evaluated on data in other domains recorded from speakers under different recording circumstances, mimicking realistic low-resource scenarios. We see that domain and speaker mismatch proves very challenging for current methods which can perform above 95% accuracy in-domain, which domain adaptation can address to some degree, but that these conditions merit further investigation to make spoken language identification accessible in many scenarios.

2020

pdf bib
Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification
Badr M. Abdullah | Jacek Kudera | Tania Avgustinova | Bernd Möbius | Dietrich Klakow
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Deep neural networks have been employed for various spoken language recognition tasks, including tasks that are multilingual by definition such as spoken language identification (LID). In this paper, we present a neural model for Slavic language identification in speech signals and analyze its emergent representations to investigate whether they reflect objective measures of language relatedness or non-linguists’ perception of language similarity. While our analysis shows that the language representation space indeed captures language relatedness to a great extent, we find perceptual confusability to be the best predictor of the language representation similarity.

pdf bib
A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English
Marius Mosbach | Stefania Degaetano-Ortlieb | Marie-Pauline Krielke | Badr M. Abdullah | Dietrich Klakow
Proceedings of the 28th International Conference on Computational Linguistics

Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses especially on semantic knowledge, strongly impacting models’ performance. Our results highlight the importance of (a)model comparison in evaluation task and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.