Daniela Pisoiu


2024

pdf bib
GerDISDETECT: A German Multilabel Dataset for Disinformation Detection
Mina Schütz | Daniela Pisoiu | Daria Liakhovets | Alexander Schindler | Melanie Siegel
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Disinformation has become increasingly relevant in recent years both as a political issue and as object of research. Datasets for training machine learning models, especially for other languages than English, are sparse and the creation costly. Annotated datasets often have only binary or multiclass labels, which provide little information about the grounds and system of such classifications. We propose a novel textual dataset GerDISDETECT for German disinformation. To provide comprehensive analytical insights, a fine-grained taxonomy guided annotation scheme is required. The goal of this dataset, instead of providing a direct assessment regarding true or false, is to provide wide-ranging semantic descriptors that allow for complex interpretation as well as inferred decision-making regarding information and trustworthiness of potentially critical articles. This allows this dataset to be also used for other tasks. The dataset was collected in the first three months of 2022 and contains 39 multilabel classes with 5 top-level categories for a total of 1,890 articles: General View (3 labels), Offensive Language (11 labels), Reporting Style (15 labels), Writing Style (6 labels), and Extremism (4 labels). As a baseline, we further pre-trained a multilingual XLM-R model on around 200,000 unlabeled news articles and fine-tuned it for each category.