What Is Needed for Intra-document Disambiguation of Math Identifiers?

Takuto Asakura, Yusuke Miyao


Abstract
In automated scientific document analysis, accurately interpreting math formulae is imperative alongside comprehending natural language. Ambiguity in math identifiers within a single document poses significant challenges to understanding math formulae. While disambiguating math identifiers across documents has seen some progress, resolving ambiguity within a document remains inadequately researched due to its complexity and a lack of datasets. Both the difficulty of this task and the information required to accomplish it have been unclear. This study aims to determine which information is necessary for the intra-document disambiguation of math identifiers. Our findings indicate that the position data and local formula structure surrounding the identifiers, including modifiers, are particularly critical. For our study, we expanded a dataset for formula grounding, doubling its size to include annotations for 27,655 math identifier occurrences. We created a multi-layer perceptron model that performs comparably to humans, with 85% accuracy and a kappa value of 0.73, outperforming rule-based baselines. We trained and evaluated the model on papers in natural language processing (NLP). Applying the trained models to papers from various other fields confirmed that our findings also hold outside NLP. These results will aid in improving mathematical language processing, such as mathematical information retrieval.
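The abstract reports two agreement figures, accuracy and a kappa value. As a rough illustration of how such figures relate, the sketch below computes both for two label sequences. It assumes Cohen's kappa (the abstract does not say which kappa variant was used), and the labels `x1`, `x2`, `x3` and both sequences are hypothetical toy data, not taken from the paper's dataset.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the two label sequences agree."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = accuracy(a, b)  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both raters pick the same label
    # if each chose independently according to its own label distribution.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data: concept labels assigned to six identifier occurrences.
gold = ["x1", "x1", "x2", "x2", "x3", "x3"]
pred = ["x1", "x1", "x2", "x3", "x3", "x3"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"kappa    = {cohen_kappa(gold, pred):.3f}")
```

Because kappa discounts agreement expected by chance, it is lower than raw accuracy whenever some agreement could occur at random, which matches the pattern of the reported 85% accuracy and 0.73 kappa.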
Anthology ID:
2024.lrec-main.1522
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
17500–17512
URL:
https://aclanthology.org/2024.lrec-main.1522
Cite (ACL):
Takuto Asakura and Yusuke Miyao. 2024. What Is Needed for Intra-document Disambiguation of Math Identifiers?. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17500–17512, Torino, Italia. ELRA and ICCL.
Cite (Informal):
What Is Needed for Intra-document Disambiguation of Math Identifiers? (Asakura & Miyao, LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1522.pdf