Wenpin Jiao


2024

pdf bib
Detection, Diagnosis, and Explanation: A Benchmark for Chinese Medial Hallucination Evaluation
Chengfeng Dou | Ying Zhang | Yanyuan Chen | Zhi Jin | Wenpin Jiao | Haiyan Zhao | Yu Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) have made significant progress recently. However, their practical use in healthcare is hindered by their tendency to generate hallucinations. One specific type, called snowballing hallucination, occurs when LLMs encounter misleading information, and poses a security threat to LLMs. To understand how well LLMs can resist these hallucination, we create the Chinese Medical Hallucination Evaluation benchmark (CMHE). This benchmark can be used to evaluate LLMs’ ability to detect medical hallucinations, make accurate diagnoses in noisy conditions, and provide plausible explanations. The creation of this benchmark involves a combination of manual and model-based approaches. In addition, we use ICD-10 as well as MeSH, two specialized glossaries, to aid in the evaluation. Our experiments show that the LLM struggles to identify fake medical terms and makes poor diagnoses in distracting environments. However, improving the model’s understanding of medical concepts can help it resist interference to some extent.

2023

pdf bib
PlugMed: Improving Specificity in Patient-Centered Medical Dialogue Generation using In-Context Learning
Chengfeng Dou | Zhi Jin | Wenpin Jiao | Haiyan Zhao | Yongqiang Zhao | Zhengwei Tao
Findings of the Association for Computational Linguistics: EMNLP 2023

The patient-centered medical dialogue systems strive to offer diagnostic interpretation services to users who are less knowledgeable about medical knowledge, through emphasizing the importance of providing responses specific to the patients. It is difficult for the large language models (LLMs) to guarantee the specificity of responses in spite of its promising performance even in some tasks in medical field. Inspired by in-context learning, we propose PlugMed, a Plug-and-Play Medical Dialogue System, for addressing this challenge. PlugMed is equipped with two modules, the prompt generation (PG) module and the response ranking (RR) module, to enhances LLMs’ dialogue strategies for improving the specificity of the dialogue. The PG module is designed to stimulate the imitative ability of LLMs by providing them with real dialogues from similar patients as prompts. The RR module incorporates fine-tuned small model as response filter to enable the selection of appropriate responses generated by LLMs. Furthermore, we introduce a new evaluation method based on matching both user’s intent and high-frequency medical term to effectively assess the specificity of the responses. We conduct experimental evaluations on three medical dialogue datasets, and the results, including both automatic and human evaluation, demonstrate the effectiveness of our approach.