Arthur Lorenzi


2024

pdf bib
Framed Multi30K: A Frame-Based Multimodal-Multilingual Dataset
Marcelo Viridiano | Arthur Lorenzi | Tiago Timponi Torrent | Ely E. Matos | Adriana S. Pagano | Natália Sathler Sigiliano | Maucha Gamonal | Helen de Andrade Abreu | Lívia Vicente Dutra | Mairon Samagaio | Mariane Carvalho | Franciany Campos | Gabrielly Azalim | Bruna Mazzei | Mateus Fonseca de Oliveira | Ana Carolina Luz | Livia Padua Ruiz | Júlia Bellei | Amanda Pestana | Josiane Costa | Iasmin Rabelo | Anna Beatriz Silva | Raquel Roza | Mariana Souza Mota | Igor Oliveira | Márcio Henrique Pelegrino de Freitas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents Framed Multi30K (FM30K), a novel frame-based Brazilian Portuguese multimodal-multilingual dataset which i) extends the Multi30K dataset (Elliot et al., 2016) with 158,915 original Brazilian Portuguese descriptions, and 30,104 Brazilian Portuguese translations from original English descriptions; and ii) adds 2,677,613 frame evocation labels to the 158,915 English descriptions and to the ones created for Brazilian Portuguese; (iii) extends the Flickr30k Entities dataset (Plummer et al., 2015) with 190,608 frames and Frame Elements correlations with the existing phrase-to-region correlations.

pdf bib
UCxn: Typologically-Informed Annotation of Constructions Atop Universal Dependencies
Leonie Weissweiler | Nina Böbel | Kirian Guiller | Santiago Herrera | Wesley Samuel Scivetti | Arthur Lorenzi | Nurit Melnik | Archna Bhatia | Hinrich Schütze | Lori Levin | Amir Zeldes | Joakim Nivre | William Croft | Nathan Schneider
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements—for example, interrogative sentences with special markers and/or word orders—are not labeled holistically. We argue for (i) augmenting UD annotations with a ‘UCxn’ annotation layer for such meaning-bearing grammatical constructions, and (ii) approaching this in a typologically informed way so that morphosyntactic strategies can be compared across languages. As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns. In addition to findings regarding these particular constructions, our study yields important insights on methodology for describing and identifying constructions in language-general and language-particular ways, and lays the foundation for future constructional enrichment of UD treebanks.

2023

pdf bib
Modeling Construction Grammar’s Way into NLP: Insights from negative results in automatically identifying schematic clausal constructions in Brazilian Portuguese
Arthur Lorenzi | Vânia Gomes de Almeida | Ely Edison Matos | Tiago Timponi Torrent
Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023)

This paper reports on negative results in a task of automatic identification of schematic clausal constructions and their elements in Brazilian Portuguese. The experiment was set up so as to test whether form and meaning properties of constructions, modeled in terms of Universal Dependencies and FrameNet Frames in a Constructicon, would improve the performance of transformer models in the task. Qualitative analysis of the results indicate that alternatives to the linearization of those properties, dataset size and a post-processing module should be explored in the future as a means to make use of information in Constructicons for NLP tasks.

2022

pdf bib
Lutma: A Frame-Making Tool for Collaborative FrameNet Development
Tiago Timponi Torrent | Arthur Lorenzi | Ely Edison Matos | Frederico Belcavello | Marcelo Viridiano | Maucha Andrade Gamonal
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022

This paper presents Lutma, a collaborative, semi-constrained, tutorial-based tool for contributing frames and lexical units to the Global FrameNet initiative. The tool parameterizes the process of frame creation, avoiding consistency violations and promoting the integration of frames contributed by the community with existing frames. Lutma is structured in a wizard-like fashion so as to provide users with text and video tutorials relevant for each step in the frame creation process. We argue that this tool will allow for a sensible expansion of FrameNet coverage in terms of both languages and cultural perspectives encoded by them, positioning frames as a viable alternative for representing perspective in language models.

pdf bib
The Case for Perspective in Multimodal Datasets
Marcelo Viridiano | Tiago Timponi Torrent | Oliver Czulo | Arthur Lorenzi | Ely Matos | Frederico Belcavello
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022

This paper argues in favor of the adoption of annotation practices for multimodal datasets that recognize and represent the inherently perspectivized nature of multimodal communication. To support our claim, we present a set of annotation experiments in which FrameNet annotation is applied to the Multi30k and the Flickr 30k Entities datasets. We assess the cosine similarity between the semantic representations derived from the annotation of both pictures and captions for frames. Our findings indicate that: (i) frame semantic similarity between captions of the same picture produced in different languages is sensitive to whether the caption is a translation of another caption or not, and (ii) picture annotation for semantic frames is sensitive to whether the image is annotated in presence of a caption or not.

pdf bib
Comparing Distributional and Curated Approaches for Cross-lingual Frame Alignment
Collin F. Baker | Michael Ellsworth | Miriam R. L. Petruck | Arthur Lorenzi
Proceedings of the Workshop on Dimensions of Meaning: Distributional and Curated Semantics (DistCurate 2022)

Despite advances in statistical approaches to the modeling of meaning, many ques- tions about the ideal way of exploiting both knowledge-based (e.g., FrameNet, WordNet) and data-based methods (e.g., BERT) remain unresolved. This workshop focuses on these questions with three session papers that run the gamut from highly distributional methods (Lekkas et al., 2022), to highly curated methods (Gamonal, 2022), and techniques with statistical methods producing structured semantics (Lawley and Schubert, 2022). In addition, we begin the workshop with a small comparison of cross-lingual techniques for frame semantic alignment for one language pair (Spanish and English). None of the distributional techniques consistently aligns the 1-best frame match from English to Spanish, all failing in at least one case. Predicting which techniques will align which frames cross-linguistically is not possible from any known characteristic of the alignment technique or the frames. Although distributional techniques are a rich source of semantic information for many tasks, at present curated, knowledge-based semantics remains the only technique that can consistently align frames across languages.

2020

pdf bib
Exploring Crosslinguistic Frame Alignment
Collin F. Baker | Arthur Lorenzi
Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet

The FrameNet (FN) project at the International Computer Science Institute in Berkeley (ICSI), which documents the core vocabulary of contemporary English, was the first lexical resource based on Fillmore’s theory of Frame Semantics. Berkeley FrameNet has inspired related projects in roughly a dozen other languages, which have evolved somewhat independently; the current Multilingual FrameNet project (MLFN) is an attempt to find alignments between all of them. The alignment problem is complicated by the fact that these projects have adhered to the Berkeley FrameNet model to varying degrees, and they were also founded at different times, when different versions of the Berkeley FrameNet data were available. We describe several new methods for finding relations of similarity between semantic frames across languages. We will demonstrate ViToXF, a new tool which provides interactive visualizations of these cross-lingual relations, between frames, lexical units, and frame elements, based on resources such as multilingual dictionaries and on shared distributional vector spaces, making clear the strengths and weaknesses of different alignment methods.