Can We Learn Question, Answer, and Distractors All from an Image? A New Task for Multiple-choice Visual Question Answering

Wenjian Ding, Yao Zhang, Jun Wang, Adam Jatowt, Zhenglu Yang


Abstract
Multiple-choice visual question answering (MC VQA) requires selecting the correct answer from a set of candidates that includes distractors, given a question and an image. This setting has attracted wide interest across visual question answering, visual question generation, and visual distractor generation. However, these lines of research have largely remained separate, and how to jointly generate meaningful questions, correct answers, and challenging distractors is still unexplored. In this paper, we introduce a novel task, Visual Question-Answer-Distractors Generation (VQADG), which bridges this research gap and can serve as a cornerstone for improving existing VQA models. For the VQADG task, we present a novel framework that uses a vision-and-language model to encode the given image and generate question-answer-distractor (QAD) triples jointly, and contrastive learning to ensure the consistency of the generated question, answer, and distractors. Empirical evaluations on the benchmark dataset validate the performance of our model on the VQADG task.
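The abstract does not spell out how the contrastive consistency objective is formulated. As a purely illustrative sketch (not the authors' implementation), one common way to encourage consistency between a generated question and its correct answer, while pushing away the distractors, is an InfoNCE-style loss over their embeddings; the function name, tensor shapes, and temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

def qad_consistency_loss(q_emb, a_emb, d_emb, temperature=0.1):
    """Hypothetical contrastive (InfoNCE-style) consistency loss.

    Pulls each question embedding toward its correct-answer embedding and
    pushes it away from the distractor embeddings.

    q_emb: (B, H)    question embeddings
    a_emb: (B, H)    correct-answer embeddings
    d_emb: (B, K, H) distractor embeddings (K distractors per question)
    """
    q = F.normalize(q_emb, dim=-1)                        # (B, H)
    a = F.normalize(a_emb, dim=-1).unsqueeze(1)           # (B, 1, H)
    d = F.normalize(d_emb, dim=-1)                         # (B, K, H)
    candidates = torch.cat([a, d], dim=1)                  # (B, 1+K, H)
    # Similarity of each question to its own answer and distractors.
    logits = torch.einsum("bh,bkh->bk", q, candidates) / temperature
    # The correct answer sits at index 0 of the candidate list.
    targets = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings: batch of 4, hidden size 256, 3 distractors.
loss = qad_consistency_loss(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 3, 256))
```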
Anthology ID: 2024.lrec-main.254
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 2852–2863
URL: https://aclanthology.org/2024.lrec-main.254
Cite (ACL): Wenjian Ding, Yao Zhang, Jun Wang, Adam Jatowt, and Zhenglu Yang. 2024. Can We Learn Question, Answer, and Distractors All from an Image? A New Task for Multiple-choice Visual Question Answering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2852–2863, Torino, Italia. ELRA and ICCL.
Cite (Informal): Can We Learn Question, Answer, and Distractors All from an Image? A New Task for Multiple-choice Visual Question Answering (Ding et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.254.pdf