Shuheng Zhou


2024

pdf bib
Beyond Full Fine-tuning: Harnessing the Power of LoRA for Multi-Task Instruction Tuning
Chunlei Xin | Yaojie Lu | Hongyu Lin | Shuheng Zhou | Huijia Zhu | Weiqiang Wang | Zhongyi Liu | Xianpei Han | Le Sun
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Low-Rank Adaptation (LoRA) is a widespread parameter-efficient fine-tuning algorithm for large-scale language models. It has been commonly accepted that LoRA mostly achieves promising results in single-task, low-resource settings, and struggles to handle multi-task instruction tuning scenarios. In this paper, we conduct a systematic study of LoRA on diverse tasks and rich resources with different learning capacities, examining its performance on seen tasks during training and its cross-task generalization on unseen tasks. Our findings challenge the prevalent assumption that the limited learning capacity will inevitably result in performance decline. In fact, our study reveals that when configured with an appropriate rank, LoRA can achieve remarkable performance in high-resource and multi-task scenarios, even comparable to that achieved through full fine-tuning. It turns out that the constrained learning capacity encourages LoRA to prioritize conforming to instruction requirements rather than memorizing specialized features of particular tasks or instances. This study reveals the underlying connection between learning capacity and generalization capabilities for robust parameter-efficient fine-tuning, highlighting a promising direction for the broader application of LoRA across various tasks and settings.

pdf bib
Probe Then Retrieve and Reason: Distilling Probing and Reasoning Capabilities into Smaller Language Models
Yichun Zhao | Shuheng Zhou | Huijia Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Step-by-step reasoning methods, such as the Chain-of-Thought (CoT), have been demonstrated to be highly effective in harnessing the reasoning capabilities of Large Language Models (LLMs). Recent research efforts have sought to distill LLMs into Small Language Models (SLMs), with a significant focus on transferring the reasoning capabilities of LLMs to SLMs via CoT. However, the outcomes of CoT distillation are inadequate for knowledge-intensive reasoning tasks. This is because generating accurate rationales requires crucial factual knowledge, which SLMs struggle to retain due to their parameter constraints. We propose a retrieval-based CoT distillation framework, named Probe then Retrieve and Reason (PRR), which distills the question probing and reasoning capabilities from LLMs into SLMs. We train two complementary distilled SLMs, a probing model and a reasoning model, in tandem. When presented with a new question, the probing model first identifies the necessary knowledge to answer it, generating queries for retrieval. Subsequently, the reasoning model uses the retrieved knowledge to construct a step-by-step rationale for the answer. In knowledge-intensive reasoning tasks, such as StrategyQA and OpenbookQA, our distillation framework yields superior performance for SLMs compared to conventional methods, including simple CoT distillation and knowledge-augmented distillation using raw questions.

2023

pdf bib
Noise-Robust Training with Dynamic Loss and Contrastive Learning for Distantly-Supervised Named Entity Recognition
Zhiyuan Ma | Jintao Du | Shuheng Zhou
Findings of the Association for Computational Linguistics: ACL 2023

Distantly-supervised named entity recognition (NER) aims at training networks with distantly-labeled data, which is automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. Distant supervision may induce incomplete and noisy labels, so recent state-of-the-art methods employ sample selection mechanism to separate clean data from noisy data based on the model’s prediction scores. However, they ignore the noise distribution change caused by data selection, and they simply excludes noisy data during training, resulting in information loss. We propose to (1) use a dynamic loss function to better adapt to the changing noise during the training process, and (2) incorporate token level contrastive learning to fully utilize the noisy data as well as facilitate feature learning without relying on labels. Our method achieves superior performance on three benchmark datasets, outperforming existing distantly supervised NER models by significant margins.