Sijia Ge


2024

Annotate Chinese Aspect with UMR——a Case Study on the Little Prince
Sijia Ge | Zilong Li | Alvin Po-Chun Chen | Guanchao Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Aspect is a valuable tool for determining the perspective from which an event is observed, at both the situation level and the viewpoint level. Uniform Meaning Representation (UMR) seeks to provide a standard, typologically informed representation of aspect across languages: it employs an aspectual lattice that adapts to different languages, with values encompassing both viewpoint aspect and situation aspect. In annotating the Chinese version of The Little Prince, we paid particular attention to the interactions between aspect values and aspect markers, and we examined the effectiveness of, and the challenges in, annotating under the UMR aspectual lattice. During annotation, we identified relationships between aspectual markers and aspect labels, and we categorized and analyzed complex examples that led to low inter-annotator agreement. The factors contributing to disagreement included interpretations of lexical semantics, implications, and the influence of aspectual markers, which relate to the inclination of the situation aspect and the exclusivity between the two aspects’ perspectives. Overall, our work sheds light on the challenges of aspect annotation in Chinese and highlights the need for more comprehensive guidelines.
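Since this abstract centers on low inter-annotator agreement over aspect labels, a chance-corrected measure such as Cohen's kappa is the usual way to quantify it. The following is a minimal sketch, not code from the paper; the two annotators' label sequences are invented placeholders that merely resemble UMR-style aspect values.

```python
# Minimal sketch, not code from the paper: chance-corrected agreement
# (Cohen's kappa) between two annotators' aspect labels. The label strings
# and sequences below are invented placeholders for illustration.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n)
                   for l in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators labeling the same five events.
annotator_1 = ["state", "performance", "activity", "performance", "endeavor"]
annotator_2 = ["state", "performance", "state", "performance", "activity"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))   # ~0.444 for this toy data
```

A kappa well below 1 on pairs like these is the kind of signal that motivates the categorization of disagreement sources described in the abstract.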

Building a Broad Infrastructure for Uniform Meaning Representations
Julia Bonn | Matthew J. Buchholz | Jayeol Chun | Andrew Cowell | William Croft | Lukas Denk | Sijia Ge | Jan Hajič | Kenneth Lai | James H. Martin | Skatje Myers | Alexis Palmer | Martha Palmer | Claire Benet Post | James Pustejovsky | Kristine Stenzel | Haibo Sun | Zdeňka Urešová | Rosa Vallejos | Jens E. L. Van Gysel | Meagan Vigus | Nianwen Xue | Jin Zhao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper reports the first release of the UMR (Uniform Meaning Representation) data set. UMR is a graph-based meaning representation formalism consisting of a sentence-level graph and a document-level graph. The sentence-level graph represents predicate-argument structures, named entities, word senses, aspectuality of events, as well as person and number information for entities. The document-level graph represents coreferential, temporal, and modal relations that go beyond sentence boundaries. UMR is designed to capture both the commonalities and the variations across languages, through a common set of abstract concepts, relations, and attributes combined with concrete concepts derived from the words of individual languages. This UMR release includes annotations for six languages (Arapaho, Chinese, English, Kukama, Navajo, Sanapana) that vary greatly in their linguistic properties and resource availability. We also describe ongoing efforts to enlarge this data set and extend it to other genres and modalities, and we briefly describe the available infrastructure (UMR annotation guidelines and tools) that others can use to create similar data sets.
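To make the two graph levels concrete, here is an illustrative sketch that holds a toy UMR-style annotation as plain Python data. The variable names, concept labels, role names, and attribute values are simplified assumptions for illustration, not the released annotation format.

```python
# Illustrative sketch only (not the released annotation format): a toy
# UMR-style annotation held as plain Python data, to show the two graph
# levels described above. Concept labels, role names, and attribute values
# are simplified assumptions.
sentence_graph = {
    "s1p": {"concept": "paint-01",            # predicate word sense
            ":ARG0": "s1g", ":ARG1": "s1h",   # predicate-argument structure
            ":aspect": "performance"},         # aspectual attribute of the event
    "s1g": {"concept": "person", ":name": "Greta",
            ":refer-number": "singular"},      # person/number information
    "s1h": {"concept": "house"},
}

document_graph = {
    # relations that cross sentence boundaries; "s2p" stands for a variable
    # that would be introduced by a later sentence's graph
    ":coref":    [("s1g", "same-entity", "s2p")],
    ":temporal": [("document-creation-time", "before", "s1p")],
    ":modal":    [("author", "full-affirmative", "s1p")],
}

print(sentence_graph["s1p"][":aspect"])        # -> performance
```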

PropBank goes Public: Incorporation into Wikidata
Elizabeth Spaulding | Kathryn Conger | Anatole Gershman | Mahir Morshed | Susan Windisch Brown | James Pustejovsky | Rosario Uceda-Sosa | Sijia Ge | Martha Palmer
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)

This paper presents the first integration of PropBank role information into Wikidata, providing a novel resource for information extraction that combines Wikidata’s ontological metadata with PropBank’s rich argument structure encoding for event classes. We discuss a technique for augmenting existing eventive Wikidata items with PropBank information, as well as the identification of gaps in Wikidata’s coverage based on manual examination of over 11,300 PropBank rolesets. We propose five new Wikidata properties to integrate PropBank structure into Wikidata so that the annotated mappings can be added en masse. We then outline the methodology and challenges of this integration, including annotation with the combined resources.
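As a purely hypothetical sketch of what a roleset-to-item mapping record might look like once such properties exist, the snippet below uses invented placeholder property names and a placeholder QID; none of these identifiers come from the paper or from Wikidata.

```python
# Purely hypothetical sketch: a record pairing a PropBank roleset with an
# eventive Wikidata item. The "P-NEW-*" property names are invented
# placeholders for the kind of properties the paper proposes, and the QID is
# deliberately left as a placeholder rather than a real identifier.
from dataclasses import dataclass, field

@dataclass
class RolesetMapping:
    roleset_id: str                 # PropBank roleset, e.g. "buy.01"
    wikidata_item: str              # QID of the corresponding eventive item
    role_links: dict = field(default_factory=dict)  # PropBank arg -> property

mapping = RolesetMapping(
    roleset_id="buy.01",
    wikidata_item="Q-PLACEHOLDER",
    role_links={"ARG0": "P-NEW-agent-role", "ARG1": "P-NEW-theme-role"},
)
print(mapping.role_links["ARG0"])   # -> P-NEW-agent-role
```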

2023

UMR-Writer 2.0: Incorporating a New Keyboard Interface and Workflow into UMR-Writer
Sijia Ge | Jin Zhao | Kristin Wright-Bettner | Skatje Myers | Nianwen Xue | Martha Palmer
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

UMR-Writer is a web-based tool for annotating semantic graphs with the Uniform Meaning Representation (UMR) scheme. UMR is a graph-based semantic representation that can be applied cross-linguistically for deep semantic analysis of texts. In this work, we implemented a new keyboard interface in UMR-Writer 2.0 as a powerful complement to the original mouse interface, supporting faster annotation for more experienced annotators and addressing issues with the original interface. Additionally, we demonstrate an efficient workflow for annotation project management in UMR-Writer 2.0, which has been applied to many projects.

2022

Integration of Named Entity Recognition and Sentence Segmentation on Ancient Chinese based on Siku-BERT
Sijia Ge
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

Sentence segmentation and named entity recognition are two significant tasks in ancient Chinese processing, since punctuation and named entity information are important for further research on ancient classics. Both are sequence labeling tasks in essence, so the labels of the two tasks can be tagged for each token simultaneously. Our work evaluates whether such a unified tagging approach outperforms tagging the labels of each task separately with a BERT-based model. We adopt a BERT-based model pre-trained on ancient Chinese text and conduct experiments on the Zuozhuan text. The results show no difference between the two tagging approaches when the types of entities and punctuation are not distinguished. Ablation experiments show that punctuation tokens in the text are useful for NER, and that finer tag sets, such as distinguishing tokens at the end of an entity from those in the middle of an entity, offer a useful feature for NER while negatively impacting sentence segmentation under unified tagging.
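The unified tagging idea can be sketched as follows. This is an assumption-laden illustration rather than the paper's code: the per-character sentence-segmentation tag set (B/I) and the example NER labels are placeholders.

```python
# Minimal sketch (assumptions, not the paper's code): each character of an
# ancient Chinese text gets a sentence-segmentation tag (B = sentence-initial,
# I = otherwise) and a BIO named-entity tag. "Unified" tagging concatenates
# the two into one label per character; "separate" tagging would train one
# tagger per label set.
def unify(seg_tags, ner_tags):
    """Combine the two per-character label sequences into one joint label set."""
    return [f"{s}|{n}" for s, n in zip(seg_tags, ner_tags)]

def split(joint_tags):
    """Recover the two original label sequences from the joint labels."""
    pairs = [t.split("|") for t in joint_tags]
    return [s for s, _ in pairs], [n for _, n in pairs]

chars    = list("晋侯秦伯围郑")               # a six-character Zuozhuan-style clause
seg_tags = ["B", "I", "I", "I", "I", "I"]     # "B" marks a sentence-initial character
ner_tags = ["B-PER", "I-PER", "B-PER", "I-PER", "O", "B-LOC"]

joint = unify(seg_tags, ner_tags)             # one label per character, e.g. "B|B-PER"
for char, tag in zip(chars, joint):
    print(char, tag)
assert split(joint) == (seg_tags, ner_tags)   # the joint scheme loses no information
```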

2020

Integration of Automatic Sentence Segmentation and Lexical Analysis of Ancient Chinese based on BiLSTM-CRF Model
Ning Cheng | Bin Li | Liming Xiao | Changwei Xu | Sijia Ge | Xingyue Hao | Minxuan Feng
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging and named entity recognition. Tasks such as lexical analysis need to build on sentence segmentation because many ancient books are not punctuated; however, step-by-step processing is prone to multi-level error propagation. This paper designs and implements an integrated annotation system for sentence segmentation and lexical analysis. A BiLSTM-CRF neural network model is used to verify the generalization ability and the effect of sentence segmentation and lexical analysis at different label levels on four cross-age test sets. The results show that the integration method improves the F1-scores of sentence segmentation, word segmentation and part-of-speech tagging for ancient Chinese. Across the test sets, the F1-score of sentence segmentation reached 78.95%, an average increase of 3.5%; the F1-score of word segmentation reached 85.73%, an average increase of 0.18%; and the F1-score of part-of-speech tagging reached 72.65%, an average increase of 0.35%.
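As a sketch of what integrated character-level labels might look like under assumed conventions (BMES word-position tags, a per-word POS tag, and a sentence-boundary marker; none of these specifics are taken from the paper), consider:

```python
# Sketch under stated assumptions, not the paper's code: building integrated
# character-level labels for a BiLSTM-CRF tagger. Each character receives a
# word-segmentation position tag (B/M/E/S), its word's POS tag, and a marker
# for whether a sentence boundary follows it; the tag-set granularity can
# then be varied by dropping components.
def integrated_labels(words, pos_tags, sentence_final_index):
    """words: segmented words of one clause; pos_tags: one POS per word."""
    labels, char_index = [], 0
    for word, pos in zip(words, pos_tags):
        for i, _ in enumerate(word):
            if len(word) == 1:
                seg = "S"
            elif i == 0:
                seg = "B"
            elif i == len(word) - 1:
                seg = "E"
            else:
                seg = "M"
            boundary = "L" if char_index == sentence_final_index else "N"
            labels.append(f"{seg}-{pos}-{boundary}")
            char_index += 1
    return labels

# Illustrative unpunctuated clause, segmented and POS-tagged (tags invented here).
words    = ["天下", "皆", "知", "美", "之", "为", "美"]
pos_tags = ["n", "d", "v", "a", "u", "v", "a"]
print(integrated_labels(words, pos_tags, sentence_final_index=7))
# -> ['B-n-N', 'E-n-N', 'S-d-N', 'S-v-N', 'S-a-N', 'S-u-N', 'S-v-N', 'S-a-L']
```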