Feng Wei


2026

Recent advances in large language models (LLMs) have shown remarkable potential for automating machine learning tasks. However, existing LLM-based agents often struggle with low diversity and suboptimal code generation. While recent work (CITATION) has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of the thoughts generated, as well as in the scalar value feedback used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that analyzes the solutions and results of parent and sibling nodes. This enables continuous refinement of nodes in the search tree, enhancing the overall decision-making process. Furthermore, we integrate an LLM-based value model to evaluate each node's solution directly before conducting comprehensive computational rollouts, and we implement a hybrid rewarding mechanism that seamlessly transitions the Q-value from an estimated score to the actual performance score. Applied to a variety of ML tasks, our approach demonstrates a 4% absolute performance improvement over strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems. Resources available at https://github.com/jokieleung/I-MCTS
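A minimal sketch of the hybrid rewarding idea described in the abstract: before any rollout, a node's Q-value falls back to the LLM value model's estimate, and the weight shifts toward empirical rollout scores as they accumulate. The blending schedule and all names here are illustrative assumptions, not the paper's actual formulation.

```python
def hybrid_q(estimated_score, rollout_scores):
    """Blend an LLM-estimated node score with actual rollout scores.

    estimated_score: scalar from the LLM-based value model (assumed in [0, 1]).
    rollout_scores: list of actual performance scores observed so far.
    """
    # With no rollouts yet, trust the value model's estimate entirely.
    if not rollout_scores:
        return estimated_score
    n = len(rollout_scores)
    # Weight on empirical evidence grows toward 1 as rollouts accumulate
    # (an assumed schedule; the paper's transition may differ).
    alpha = n / (n + 1)
    actual = sum(rollout_scores) / n
    return alpha * actual + (1 - alpha) * estimated_score
```

With one rollout the estimate and the observation are weighted equally; after many rollouts the Q-value is dominated by actual performance.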

2025

Historical analogies, which compare known past events with contemporary but unfamiliar events, help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies, and previous studies in the AI community have likewise overlooked them. To fill this gap, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and a specially designed automatic multi-dimensional assessment, we find that LLMs generally have good potential for historical analogies, and that model performance can be further improved with our self-reflection method. Resources of this paper can be found at https://anonymous.4open.science/r/Historical-Analogy-of-LLMs-FC17
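The self-reflection loop mentioned above can be sketched generically: generate a candidate analogy, critique it, and revise until the critique passes or a round budget is exhausted. The three callables and the round budget are assumptions for illustration; the paper's actual prompts and stopping criteria are not reproduced here.

```python
def self_reflect(generate, critique, revise, max_rounds=3):
    """Generate-critique-revise loop (illustrative skeleton).

    generate: () -> draft          initial analogy from the LLM
    critique: draft -> list[str]   issues found (empty list means acceptable)
    revise:   (draft, issues) -> draft
    """
    draft = generate()
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:
            break  # critique found no hallucinations/stereotypes
        draft = revise(draft, issues)
    return draft
```

In practice each callable would wrap an LLM call; here they can be any functions with the stated shapes.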
Automatic generation of WHO International Classification of Diseases (ICD) diagnostic codes is a core challenge in medical informatics, facing technical bottlenecks such as insufficient accuracy in single-label primary-diagnosis classification, incomplete multi-label prediction of secondary diagnoses, and long-tailed label distributions. This paper systematically explores fine-tuning paradigms for large language models on ICD diagnostic coding from Chinese electronic medical records, proposing distinct training strategies for generative fine-tuning, discriminative fine-tuning, and reinforcement learning. In particular, we design a rule-based-reward reinforcement learning framework (RBRs-RL) tailored to the medical domain, which improves the efficiency and performance of the GRPO algorithm through dynamic difficulty calibration, token-level gradient optimization, and overlength reward shaping; combined with the proposed strategy-rotation data-augmentation iterative training (SRADIT) strategy, it raises the performance ceiling of reinforcement fine-tuning. We also systematically compare the performance boundaries of generative and discriminative fine-tuning on Chinese ICD coding, and build an end-to-end clinical decision optimization framework that provides an effective path for reward-based fine-tuning. For inference, we design a temperature-controlled ensemble consensus prediction method (TCECP) that improves the stability and reliability of predictions. Fine-tuning experiments based on Qwen2.5-7B show that our optimized RBR-R1-style reinforcement fine-tuning achieves scores of 80.98 and 82.33 on the A and B leaderboards of the CCL25-Eval task, significantly exceeding the performance ceiling of conventional SFT. Overall, these findings provide a valuable technical reference for practical medical diagnostic coding systems.
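A toy sketch of what a rule-based reward for code prediction could look like, combining a set-level F1 term with the overlength shaping the abstract mentions. The exact rules, reward weights, and length budget in RBRs-RL are not public here, so everything below is an assumed stand-in.

```python
def rule_reward(pred_codes, gold_codes, output_len, max_len=512):
    """Rule-based reward sketch: micro-F1 over ICD code sets minus an
    overlength penalty. All constants are illustrative assumptions."""
    pred, gold = set(pred_codes), set(gold_codes)
    if not pred or not gold:
        f1 = 0.0
    else:
        tp = len(pred & gold)
        precision = tp / len(pred)
        recall = tp / len(gold)
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    # Overlength shaping: linearly penalize output beyond the token budget.
    penalty = max(0.0, (output_len - max_len) / max_len)
    return f1 - penalty
```

A reward of this shape gives the policy dense, verifiable feedback without a learned reward model, which is the general appeal of rule-based rewards.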
Large language models (LLMs) have achieved great success, but their occasional content fabrication, or hallucination, limits their practical application. Hallucination arises because LLMs struggle to admit ignorance, owing to inadequate training on knowledge boundaries. We regard it as a limitation of LLMs that they cannot accurately express their knowledge boundary: answering questions they know while admitting ignorance to those they do not. In this paper, we aim to teach LLMs to recognize and express their knowledge boundary, so that they can reduce hallucinations caused by fabricating answers when they do not know. We propose CoKE, which first probes LLMs' knowledge boundary via internal confidence on a set of questions, and then leverages the probing results to elicit expression of the knowledge boundary. Extensive experiments show that CoKE helps LLMs express their knowledge boundaries, answering known questions while declining unknown ones, and significantly improves both in-domain and out-of-domain performance.
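One common way to probe confidence, sketched here purely as an illustration: sample several answers to the same question and treat the agreement rate of the most frequent answer as confidence, abstaining below a threshold. CoKE's actual probe uses the model's internal confidence; the sampling-agreement proxy and the 0.7 threshold below are assumptions.

```python
from collections import Counter

def answer_or_abstain(sampled_answers, threshold=0.7):
    """Answer with the majority sample if confidence clears the threshold,
    otherwise express ignorance (illustrative confidence proxy)."""
    top, count = Counter(sampled_answers).most_common(1)[0]
    confidence = count / len(sampled_answers)
    return top if confidence >= threshold else "I don't know"
```

High agreement across samples suggests the question lies inside the model's knowledge boundary; scattered answers suggest it does not.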
Program-of-Thought, which uses programs instead of natural language for reasoning, is an important way for LLMs to solve mathematical problems. Since different programming languages excel in different areas, it is natural to use the most suitable language for each problem. However, current research focuses only on single-language PoT, ignoring the differences between programming languages. This paper therefore proposes a multilingual program reasoning method, MultiLingPoT, and explores in depth the impact of multilingual integration on training and inference. The method allows the model to answer questions in multiple languages by fine-tuning on multilingual data, improving each individual language's reasoning accuracy by 2.5%. Additionally, prior and posterior selection methods help the model choose the most suitable language during inference, achieving a further 8% performance gain. Finally, our code metric analysis shows that language differences manifest in encapsulation levels and implementation granularity, and that strategic deviation from language conventions can enhance code performance.
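Posterior selection can be sketched as a vote over the answers produced by each language's program, with ties broken by a fixed language preference. The voting rule and the priority order below are illustrative assumptions, not MultiLingPoT's actual selector.

```python
from collections import Counter

def posterior_select(results, priority=("python", "cpp", "java")):
    """Pick a final answer from per-language program outputs.

    results: mapping language -> answer its generated program produced.
    Majority vote across languages; ties resolved by the assumed priority order.
    """
    votes = Counter(results.values())
    best = max(votes.values())
    candidates = {ans for ans, c in votes.items() if c == best}
    for lang in priority:
        if results.get(lang) in candidates:
            return results[lang]
    # Fall back to any top-voted answer if no priority language produced one.
    return next(iter(candidates))
```

Prior selection would instead predict the best language before execution; the posterior route trades extra compute (running several programs) for the chance to cross-check their answers.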
With the rapid spread of social media, user-generated content has grown exponentially, which has also fueled the spread of hate speech. Effectively detecting hate speech has therefore become a key challenge in natural language processing research. To advance Chinese hate speech detection, this paper proposes a novel fine-tuning framework for large language models that combines dynamic clue-augmented prompting with multi-stage progressive optimization. The proposed method decomposes the complex fine-grained hate speech recognition task into two complementary subtasks: hate tendency classification and hate information extraction. Two specialized training strategies are adopted accordingly: dynamic clue-augmented supervised fine-tuning (DCA-SFT) optimizes the model's classification performance, while dynamic clue-augmented reinforcement learning (DCA-RL) improves its information extraction capability. Specifically, in the DCA-SFT stage, discriminative classification with multi-hot encoded output representations is introduced to improve multi-class classification accuracy. In the DCA-RL stage, chain-of-thought (CoT) knowledge from a closed-source LLM performing hate information extraction is distilled into a smaller model, and rule-based-reward reinforcement fine-tuning is applied to strengthen the smaller model's logical reasoning on the extraction task. Experimental results demonstrate the effectiveness of the method, which ranked second on the preliminary leaderboard of CCL25-Eval Task 10 with an F1 of 0.3864 and third on the final leaderboard with an F1 of 0.3591.
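The multi-hot output representation mentioned for the DCA-SFT stage can be shown in a few lines: each sample maps to a binary vector with a 1 for every applicable label, so several hate categories can be active at once. The label set below is a hypothetical placeholder, not the task's real taxonomy.

```python
# Hypothetical label vocabulary; the real CCL25-Eval Task 10 categories differ.
LABELS = ["race", "region", "gender", "other", "non-hate"]

def to_multi_hot(active_labels):
    """Encode a set of active labels as a multi-hot vector over LABELS."""
    return [1 if label in active_labels else 0 for label in LABELS]
```

Compared with generating label names as free text, a fixed-length multi-hot target makes multi-label outputs unambiguous to score and to supervise.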

2024

There is a growing interest in expanding the input capacity of language models (LMs) across various domains. However, simply increasing the context window does not guarantee robust performance across diverse long-input processing tasks, such as understanding extensive documents and extracting detailed information from lengthy and noisy data. In response, we introduce Segment+, a general framework that enables LMs to handle extended inputs within limited context windows efficiently. Segment+ utilizes structured notes and a filtering module to manage information flow, resulting in a system that is both controllable and interpretable. Our extensive experiments across various model sizes, focusing on long-document question-answering and Needle-in-a-Haystack tasks, demonstrate the effectiveness of Segment+ in improving performance.
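The filtering side of a Segment+-style pipeline can be sketched as scoring per-segment notes against the query and keeping only the top-scoring ones within the context budget. The keyword-overlap scorer and top-k cutoff below are simplifying assumptions; the real system manages information flow with LM-written structured notes.

```python
def filter_segments(segments, query_keywords, keep=2):
    """Keep the segments most relevant to the query (illustrative filter).

    segments: list of segment notes (plain strings here).
    query_keywords: iterable of keywords standing in for the question.
    Relevance = keyword overlap count; Python's stable sort preserves
    document order among equally scored segments.
    """
    scored = sorted(segments, key=lambda s: -sum(kw in s for kw in query_keywords))
    return scored[:keep]
```

Only the surviving segments are handed to the LM, so the effective input stays inside a limited context window regardless of document length.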
Concept reasoning is an important capability for models to understand the world. However, existing datasets, such as those for concept extraction and concept generation, suffer from modeledge leakage and context leakage. To address these limitations, we construct a concept reasoning dataset for large language models (CR-LLM) with modeledge leakage prevention and context leakage prevention, which consists of 2,167 samples and covers different concept types. In addition, we propose a hybrid reasoning method consisting of inductive reasoning, deductive reasoning, and a controller. This method allows large language models to adaptively select the optimal reasoning method for each input sample. Finally, we conduct extensive experiments on CR-LLM with different models and methods. The results show that existing large language models and reasoning methods perform sub-optimally on the concept reasoning task. In contrast, our proposed method significantly improves concept reasoning capability, achieving a 7% increase in accuracy over CoT and demonstrating better granularity. We release CR-LLM and code at https://github.com/Nianqi-Li/Concept-Reasoning-for-LLMs.
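The controller's role can be reduced to a small routing skeleton: score each available reasoning method for the input sample and dispatch to the winner. The scoring function and method names below are stand-ins; the paper's controller is learned, not a hand-written rule.

```python
def hybrid_reason(sample, inductive, deductive, score):
    """Controller sketch: route a sample to the higher-scoring method.

    inductive, deductive: callables sample -> answer.
    score: (sample, method_name) -> float, a stand-in for the controller.
    """
    methods = {"inductive": inductive, "deductive": deductive}
    best = max(methods, key=lambda name: score(sample, name))
    return methods[best](sample)
```

The key design point is per-sample adaptivity: neither reasoning style dominates across all concept types, so selection happens input by input.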

2023

Thanks to the recent success of Pre-trained Language Models (PLMs), it has become a promising research direction to develop a universal model (UIE) that can solve all typical information extraction tasks within one generative framework. Nonetheless, in real-world UIE applications, new data from different IE tasks and domains usually arrive in a stream over time. A desirable UIE system should be capable of continually learning new tasks without forgetting old ones, allowing knowledge and functionality to expand without re-training the whole system. In this paper, we study the UIE system under a more challenging yet practical scenario, i.e., "lifelong learning" settings, and evaluate its abilities in three aspects: knowledge sharing and expansion, catastrophic forgetting prevention, and rapid generalization on few-shot and unseen tasks. To achieve these three goals, we present a novel parameter- and deployment-efficient prompt tuning method named Lottery Prompt Tuning (LPT). LPT freezes the PLM's parameters and sequentially learns compact pruned prompt vectors for each task using a binary prompt mask, while keeping the prompt parameters selected by previous tasks unchanged. Furthermore, we use a simple yet effective method to perform mask selection and show the strong transferability of Lottery Prompts to novel tasks. Extensive experiments demonstrate that LPT consistently achieves state-of-the-art performance in multiple lifelong learning settings of UIE, including the task-incremental setting on seen tasks, few-shot adaptation, and zero-shot generalization on novel tasks.
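The freezing constraint at the heart of LPT, sketched on binary position masks: a new task may only train prompt positions that its own mask selects and that no previous task's mask has claimed. The list-based representation and function name are illustrative; real masks index prompt vectors in a PLM's soft-prompt matrix.

```python
def trainable_positions(new_mask, prev_masks):
    """Which prompt positions the current task may update (illustrative).

    new_mask: binary list, 1 where the new task's pruned prompt is active.
    prev_masks: binary masks of earlier tasks; their positions are frozen
    so earlier tasks' prompts remain insusceptible to interference.
    """
    length = len(new_mask)
    frozen = [any(mask[i] for mask in prev_masks) for i in range(length)]
    return [int(n and not f) for n, f in zip(new_mask, frozen)]
```

Because each task touches only disjoint, compact prompt subsets while the PLM stays frozen, adding a task expands capability without overwriting what earlier tasks learned.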