Automatic Evaluation of Generative Models with Instruction Tuning

Shuhaib Mehri; Vered Shwartz

Abstract

Automatic evaluation of natural language generation has long been an elusive goal in NLP. A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion. Inspired by the generalization ability of instruction-tuned models, we propose a learned metric based on instruction tuning. To test our approach, we collected HEAP, a dataset of human judgements across various NLG tasks and evaluation criteria. Our findings demonstrate that instruction tuning language models on HEAP yields good performance on many evaluation tasks, though some criteria are less trivial to learn than others. Further, jointly training on multiple tasks can yield additional performance improvements, which can be beneficial for future tasks with little to no human annotated data.

Anthology ID: 2023.gem-1.4
Volume: Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month: December
Year: 2023
Address: Singapore
Editors: Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, Hooman Sedghamiz
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 42–52
URL: https://aclanthology.org/2023.gem-1.4/
Cite (ACL): Shuhaib Mehri and Vered Shwartz. 2023. Automatic Evaluation of Generative Models with Instruction Tuning. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 42–52, Singapore. Association for Computational Linguistics.
Cite (Informal): Automatic Evaluation of Generative Models with Instruction Tuning (Mehri & Shwartz, GEM 2023)
PDF: https://aclanthology.org/2023.gem-1.4.pdf

@inproceedings{mehri-shwartz-2023-automatic,
    title = "Automatic Evaluation of Generative Models with Instruction Tuning",
    author = "Mehri, Shuhaib  and
      Shwartz, Vered",
    editor = "Gehrmann, Sebastian  and
      Wang, Alex  and
      Sedoc, Jo{\~a}o  and
      Clark, Elizabeth  and
      Dhole, Kaustubh  and
      Chandu, Khyathi Raghavi  and
      Santus, Enrico  and
      Sedghamiz, Hooman",
    booktitle = "Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.gem-1.4/",
    pages = "42--52",
    abstract = "Automatic evaluation of natural language generation has long been an elusive goal in NLP. A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion. Inspired by the generalization ability of instruction-tuned models, we propose a learned metric based on instruction tuning. To test our approach, we collected HEAP, a dataset of human judgements across various NLG tasks and evaluation criteria. Our findings demonstrate that instruction tuning language models on HEAP yields good performance on many evaluation tasks, though some criteria are less trivial to learn than others. Further, jointly training on multiple tasks can yield additional performance improvements, which can be beneficial for future tasks with little to no human annotated data."
}
