Jungyeul Park

2024

A Linguistically-Informed Annotation Strategy for Korean Semantic Role Labeling
Yige Chen | KyungTae Lim | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Semantic role labeling is an essential component of semantic and syntactic processing of natural languages, which reveals the predicate-argument structure of the language. Despite its importance, semantic role labeling for the Korean language has not been studied extensively. One notable issue is the lack of uniformity among data annotation strategies across different datasets, which often lack thorough rationales. In this study, we suggest an annotation strategy for Korean semantic role labeling that is in line with previously proposed linguistic theories as well as the distinct properties of the Korean language. We further propose a simple yet viable conversion strategy from the Sejong verb dictionary to a CoNLL-style dataset for Korean semantic role labeling. Experimental results using a transformer-based sequence labeling model demonstrate the reliability and trainability of the converted dataset.

An Untold Story of Preprocessing Task Evaluation: An Alignment-based Joint Evaluation Approach
Eunkyul Leah Jo | Angela Yoonseo Park | Grace Tianjiao Zhang | Izia Xiaoxiao Wang | Junrui Wang | MingJia Mao | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Preprocessing tasks such as tokenization and sentence boundary detection (SBD) have commonly been considered NLP challenges that have already been solved. This perception is due to their generally good performance and the presence of pre-tokenized data. However, it’s important to note that the low error rates of current methods are mainly specific to certain tasks, and rule-based tokenization can be difficult to use across different systems. Despite being subtle, these limitations are significant in the context of the NLP pipeline. In this paper, we introduce a novel evaluation algorithm for preprocessing tasks that covers both tokenization and SBD results. This algorithm aims to enhance the reliability of evaluations by reevaluating the counts of true positive cases for F1 measures in both preprocessing tasks jointly. It achieves this through an alignment-based approach inspired by sentence and word alignments used in machine translation. Our evaluation algorithm not only allows for precise counting of true positive tokens and sentence boundaries but also combines these two evaluation tasks into a single organized pipeline. To illustrate and clarify the intricacies of this calculation and integration, we provide detailed pseudo-code configurations for implementation. Additionally, we offer empirical evidence demonstrating how sentence and word alignment can improve evaluation reliability and present case studies to further support our approach.

Evaluating Prompting Strategies for Grammatical Error Correction Based on Language Proficiency
Min Zeng | Jiexin Kuang | Mengyang Qiu | Jayoung Song | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper proposes an analysis of prompting strategies for grammatical error correction (GEC) with selected large language models (LLMs) based on language proficiency. GEC using generative LLMs has been known for overcorrection, where results obtain higher recall measures than precision measures. The writing examples of English language learners may be different from those of native speakers. Given that there are significant differences in second language (L2) learners’ error types by their proficiency levels, this paper attempts to reduce overcorrection by examining the interaction between LLMs’ performance and L2 language proficiency. Our method focuses on zero-shot and few-shot prompting and fine-tuning models for GEC for learners of English as a foreign language based on different proficiency levels. We investigate GEC results and find that overcorrection happens primarily in advanced language learners’ writing (proficiency C) rather than proficiency A (a beginner level) and proficiency B (an intermediate level). Fine-tuned LLMs, and even few-shot prompting with writing examples of English learners, actually tend to exhibit decreased recall measures. To make our claim concrete, we conduct a comprehensive examination of GEC outcomes and their evaluation results based on language proficiency.

Towards Standardized Annotation and Parsing for Korean FrameNet
Yige Chen | Jae Ihn | KyungTae Lim | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Previous research on Korean FrameNet has produced several datasets that serve as resources for FrameNet parsing in Korean. However, these datasets assign annotations at the word level, which is not well suited to the agglutinative nature of Korean. To address this issue, we introduce a morphologically enhanced annotation strategy for Korean FrameNet datasets and parsing by leveraging the CoNLL-U format. We present the results of the FrameNet parsers trained on the Korean FrameNet data in the original format and our proposed format, respectively, and further elaborate on the linguistic rationales of our proposed scheme. We suggest the morpheme-based scheme as the standard for Korean FrameNet data annotation.

2023

We present in this work a new Universal Morphology dataset for Korean. Previously, the Korean language has been underrepresented in the field of morphological paradigms amongst hundreds of diverse world languages. Hence, we propose Universal Morphology paradigms for the Korean language that preserve its distinct characteristics. For our K-UniMorph dataset, we outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata. This dataset adopts the morphological feature schema from CITATION and CITATION for the Korean language as we extract inflected verb forms from the Sejong morphologically analyzed corpus, which is one of the largest annotated corpora for Korean. During the data creation, our methodology also includes investigating the correctness of the conversion from the Sejong corpus. Furthermore, we carry out the inflection task using three different Korean word forms: letters, syllables and morphemes. Finally, we discuss and describe future perspectives on Korean morphological paradigms and the dataset.

2022

In this study, we propose a morpheme-based scheme for Korean dependency parsing and adapt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation for and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically. The effectiveness of the proposed format for Korean dependency parsing is then tested with both statistical and neural models, including UDPipe and Stanza, using our carefully constructed morpheme-based word embedding for Korean. morphUD yields better parsing results for all Korean UD treebanks, and we also present a detailed error analysis.

2019

A New Annotation Scheme for the Sejong Part-of-speech Tagged Corpus
Jungyeul Park | Francis Tyers
Proceedings of the 13th Linguistic Annotation Workshop

In this paper we present a new annotation scheme for the Sejong part-of-speech tagged corpus based on Universal Dependencies style annotation. By using a new annotation scheme, we can produce Sejong-style morphological analysis and part-of-speech tagging results which have been the de facto standard for Korean language processing. We also explore the possibility of doing named-entity recognition and semantic-role labelling for Korean using the new annotation scheme.

Artificial Error Generation with Fluency Filtering
Mengyang Qiu | Jungyeul Park
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

The quantity and quality of training data play a crucial role in grammatical error correction (GEC). However, because obtaining human-annotated GEC data is both time-consuming and expensive, several studies have focused on generating artificial error sentences to boost training data for grammatical error correction, and have shown significantly better performance. The present study explores how fluency filtering can affect the quality of artificial errors. By comparing artificial data filtered by different levels of fluency, we find that artificial error sentences with low fluency can greatly facilitate error correction, while high fluency errors introduce more noise.

Improving Precision of Grammatical Error Correction with a Cheat Sheet
Mengyang Qiu | Xuejiao Chen | Maggie Liu | Krishna Parvathala | Apurva Patil | Jungyeul Park
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we explore two approaches to generating error-focused phrases and examine whether these phrases can lead to better performance in grammatical error correction for the restricted track of the BEA 2019 Shared Task on GEC. Our results show that phrases directly extracted from GEC corpora outperform phrases from a statistical machine translation phrase table by a large margin. Appending error+context phrases to the original GEC corpora yields comparably high precision. We also explore the generation of artificial syntactic error sentences using error+context phrases for the unrestricted track. The additional training data greatly facilitates syntactic error correction (e.g., verb form) and contributes to better overall performance.

2018

Le benchmarking de la reconnaissance d’entités nommées pour le français (Benchmarking for French NER)
Jungyeul Park
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

This paper presents a benchmarking task for named entity recognition (NER) for French. We train and evaluate several sequence labeling algorithms, and we improve the NER results with an approach based on semi-supervised learning and reranking. We obtain up to 77.95%, improving the result by more than 34 points over the baseline model.

Une note sur l’analyse du constituant pour le français (A Note on constituent parsing for French)
Jungyeul Park
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

This paper presents quantitative and qualitative error analyses of constituent parsing results for French. To this end, we extend the approach of Kummerfeld et al. (2012) to French and present the details of the analysis. We train statistical and neural parsing systems on the French treebank and evaluate the parsing results. The French treebank provides fine-grained phrase labels, and the grammatical characteristics of the corpus affect parsing errors.

L’optimisation du plongement de mots pour le français : une application de la classification des phrases (Optimization of Word Embeddings for French : an Application of Sentence Classification)
Jungyeul Park
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

We propose three new methods for building and optimizing word embeddings for French. We use the results of part-of-speech tagging, multiword expression detection, and lemmatization for a continuous vector space. For evaluation, we use these vectors on a sentence classification task and compare them with the vectors of the baseline system. We also explore a domain adaptation approach to building vectors. Despite a small vocabulary and a small training corpus, the domain-specific vectors achieve better performance than the out-of-domain vectors.

Data Anonymization for Requirements Quality Analysis: a Reproducible Automatic Error Detection Task
Juyeon Kang | Jungyeul Park
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Corpus Selection Approaches for Multilingual Parsing from Raw Text to Universal Dependencies
Ryan Hornby | Clark Taylor | Jungyeul Park
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes UALing’s approach to the CoNLL 2017 UD Shared Task using corpus selection techniques to reduce training data size. The methodology is simple: we use similarity measures to select a corpus from available training data (even from multiple corpora for surprise languages) and use the resulting corpus to complete the parsing task. The training and parsing are done with the baseline UDPipe system (Straka et al., 2016). While our approach reduces the size of training data significantly, it retains performance within 0.5% of the baseline system. Due to the reduction in training data size, our system performs faster than the naïve, complete corpus method. Specifically, our system runs in less than 10 minutes, ranking it among the fastest entries for this task. Our system is available at https://github.com/CoNLL-UD-2017/UALING.

Building a Better Bitext for Structurally Different Languages through Self-training
Jungyeul Park | Loïc Dugast | Jeen-Pyo Hong | Chang-Uk Shin | Jeong-Won Cha
Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora

We propose a novel method to bootstrap the construction of parallel corpora for new pairs of structurally different languages. We do so by combining the use of a pivot language and self-training. A pivot language enables the use of existing translation models to bootstrap the alignment, and a self-training procedure makes it possible to achieve better alignment at both the document and sentence levels. We also propose several evaluation methods for the resulting alignment.

Segmentation Granularity in Dependency Representations for Korean
Jungyeul Park
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016

Korean Language Resources for Everyone
Jungyeul Park | Jeen-Pyo Hong | Jeong-Won Cha
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

Generating a Linguistic Model for Requirement Quality Analysis
Juyeon Kang | Jungyeul Park
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

2014

Named Entity Corpus Construction using Wikipedia and DBpedia Ontology
Younggyun Hahm | Jungyeul Park | Kyungtae Lim | Youngsik Kim | Dosam Hwang | Key-Sun Choi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we propose a novel method to automatically build a named entity corpus based on the DBpedia ontology. Most named entity recognition systems require time- and effort-consuming annotation to produce training data, so work on NER has thus far been limited to certain languages, like English, that are resource-abundant in general. As an alternative, we suggest that the NE corpus generated by our proposed method can be used as training data. Our approach uses Wikipedia as raw text and the DBpedia data set for named entity disambiguation. Our method is language-independent and easy to apply to many different languages for which Wikipedia and DBpedia are available. Throughout the paper, we demonstrate that our NE corpus is of comparable quality even to a manually annotated NE corpus.

2013

Towards Fully Lexicalized Dependency Parsing for Korean
Jungyeul Park | Daisuke Kawahara | Sadao Kurohashi | Key-Sun Choi
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

2012

Korean Treebank Transformation for Parser Training
DongHyun Choi | Jungyeul Park | Key-Sun Choi
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

Korean NLP2RDF Resources
YoungGyun Hahm | KyungTae Lim | Jungyeul Park | Yongun Yoon | Key-Sun Choi
Proceedings of the 10th Workshop on Asian Language Resources

Using the International Standard Language Resource Number: Practical and Technical Aspects
Khalid Choukri | Victoria Arranz | Olivier Hamon | Jungyeul Park
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the International Standard Language Resource Number (ISLRN), a new identification schema for Language Resources where a Language Resource is provided with a unique and universal name using a standardized nomenclature. This will ensure that Language Resources be identified, accessed and disseminated in a unique manner, thus allowing them to be recognized with proper references in all activities concerning Human Language Technologies as well as in all documents and scientific papers. This would allow, for instance, the formal identification of potentially repeated resources across different repositories, the formal referencing of language resources and their correct use when different versions are processed by tools.

We present the TiLT software for SMS correction and evaluate its performance on the DELIC SMS corpus. The evaluation uses the Jaccard distance and the BLEU measure. The presentation of the results is followed by a qualitative analysis of the system and its limitations.

2006

Extraction de grammaires TAG lexicalisées avec traits à partir d’un corpus arboré pour le coréen
Jungyeul Park
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

We present here an implementation of a system that extracts not only a lexicalized grammar (LTAG) but also a feature-based LTAG grammar (FB-LTAG) from a treebank. We report practical experiments in which we extract TAG grammars from the Sejong Treebank for Korean. First of all, the 57 syntactic labels and the morphological analyses in the SJTree corpus allow us to extract syntactic features automatically. In addition, we modify the corpus for the extraction of a lexicalized grammar and convert the lexicalized grammars into tree schemata to address the limited lexical coverage of the extracted lexicalized grammars.

Extraction of Tree Adjoining Grammars from a Treebank for Korean
Jungyeul Park
Proceedings of the COLING/ACL 2006 Student Research Workshop

Extracting Syntactic Features from a Korean Treebank
Jungyeul Park
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms