Word embedding, which represents individual words with semantically fixed-length vectors, has made it possible to successfully apply deep learning to natural language processing tasks such as semantic role-modeling, question answering, and machine translation. As math text consists of natural text, as well as math expressions that similarly exhibit linear correlation and contextual characteristics, word embedding techniques can also be applied to math documents. However, while mathematics is a precise and accurate science, it is usually expressed through imprecise and less accurate descriptions, contributing to the relative dearth of machine learning applications for information retrieval in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in word embedding, it is worthwhile to explore their use and effectiveness in math information retrieval tasks, such as math language processing and semantic knowledge extraction. In this paper, we explore math embedding by testing it on several different scenarios, namely, (1) math-term similarity, (2) analogy, (3) numerical concept-modeling based on the centroid of the keywords that characterize a concept, (4) math search using query expansions, and (5) semantic extraction, i.e., extracting descriptive phrases for math expressions. Due to the lack of benchmarks, our investigations were performed using the arXiv collection of STEM documents and carefully selected illustrations on the Digital Library of Mathematical Functions (DLMF: NIST digital library of mathematical functions. Release 1.0.20 of 2018-09-1, 2018). Our results show that math embedding holds much promise for similarity, analogy, and search tasks. However, we also observed the need for more robust math embedding approaches. 
Moreover, we explore and discuss fundamental issues that we believe thwart the progress in mathematical information retrieval in the direction of machine learning.
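
The centroid-based concept modeling of scenario (3) can be illustrated with a minimal Python sketch. The three-dimensional toy vectors, the term list, and the helper names (`centroid`, `cosine`, `rank_by_concept`) are assumptions made for illustration only; they are not taken from the paper, whose investigations use the arXiv collection.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embedding, invented for illustration (hypothetical values).
toy_embedding = {
    "prime": [0.9, 0.1, 0.0],
    "integer": [0.8, 0.2, 0.1],
    "divisor": [0.7, 0.3, 0.0],
    "integral": [0.1, 0.9, 0.2],
    "derivative": [0.0, 0.8, 0.3],
}

def rank_by_concept(embedding, keywords):
    """Rank all non-keyword terms by similarity to the keyword centroid."""
    concept = centroid([embedding[k] for k in keywords])
    candidates = [term for term in embedding if term not in keywords]
    return sorted(candidates,
                  key=lambda term: cosine(embedding[term], concept),
                  reverse=True)

# Model a "number theory" concept by two characterizing keywords.
ranking = rank_by_concept(toy_embedding, ["prime", "integer"])
print(ranking)
```

With these toy values, the term closest to the concept centroid is "divisor"; with a real embedding the ranking depends entirely on the trained vectors.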

@Article{GreinerPetter2020b, title = {Math-word embedding in math search and semantic extraction}, author = {Andr{\'{e}} Greiner-Petter and Abdou Youssef and Terry Ruas and Bruce R. Miller and Moritz Schubotz and Akiko Aizawa and Bela Gipp}, journal = {Scientometrics}, year = {2020}, doi = {10.1007/s11192-020-03502-9}, url = {https://link.springer.com/article/10.1007/s11192-020-03502-9} }

This poster summarizes our contributions to Wikimedia’s processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as coarse-grained PNG images in 2001 to providing modern, semantically enriched, language-independent MathML formulae in 2020. Additionally, we describe our plans to further improve the accessibility and discoverability of mathematical knowledge in Wikimedia projects.

@Article{Schubotz, author = {Moritz Schubotz and Andr{\'{e}} Greiner{-}Petter and Norman Meuschke and Olaf Teschke and Bela Gipp}, title = {Mathematical Formulae in Wikimedia Projects 2020}, journal = {CoRR}, volume = {abs/2003.09417}, year = {2020}, url = {https://arxiv.org/abs/2003.09417}, archivePrefix = {arXiv}, eprint = {2003.09417}, timestamp = {Tue, 24 Mar 2020 16:42:29 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2003-09417.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use-cases, for example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems.

The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries; (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we have made available our source code and data.

@InProceedings{Greiner-PetterS20, author = {Andr{\'{e}} Greiner{-}Petter and Moritz Schubotz and Fabian M{\"{u}}ller and Corinna Breitinger and Howard S. Cohl and Akiko Aizawa and Bela Gipp}, editor = {Yennun Huang and Irwin King and Tie{-}Yan Liu and Maarten van Steen}, title = {Discovering Mathematical Objects of Interest - {A} Study of Mathematical Notations}, booktitle = {{WWW} '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020}, pages = {1445--1456}, publisher = {{ACM} / {IW3C2}}, year = {2020}, url = {https://doi.org/10.1145/3366423.3380218}, doi = {10.1145/3366423.3380218} }

Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. In this work, we apply text embedding techniques to the arXiv collection of STEM documents and explore how these are unable to properly understand mathematics from that corpus, while proposing alternatives to mitigate this situation.

@InProceedings{Greiner-PetterR19, author = {Andr{\'{e}} Greiner{-}Petter and Terry Ruas and Moritz Schubotz and Akiko Aizawa and William I. Grosky and Bela Gipp}, title = {Why Machines Cannot Learn Mathematics, Yet}, booktitle = {Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries {(BIRNDL} 2019) co-located with the 42nd International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval {(SIGIR} 2019), Paris, France, July 25, 2019}, series = {{CEUR} Workshop Proceedings}, volume = {2414}, pages = {130--137}, publisher = {CEUR-WS.org}, year = {2019}, url = {http://ceur-ws.org/Vol-2414/paper14.pdf} }

@Article{Greiner-PetterS19, author = {Andr{\'{e}} Greiner{-}Petter and Moritz Schubotz and Howard S. Cohl and Bela Gipp}, title = {Semantic preserving bijective mappings for expressions involving special functions between computer algebra systems and document preparation systems}, journal = {Aslib J. Inf. Manag.}, volume = {71}, number = {3}, pages = {415--439}, year = {2019}, url = {https://doi.org/10.1108/AJIM-08-2018-0185}, doi = {10.1108/AJIM-08-2018-0185}, ISSN = {2050-3806} }

Mathematical formulae carry complex and essential semantic information in a variety of formats.
Accessing this information with different systems requires a standardized machine-readable format that is capable of encoding presentational and semantic information.
Even though MathML is an official recommendation by W3C and an ISO standard
for representing mathematical expressions, we could identify only very few systems which use the full descriptiveness of MathML.
MathML's high complexity results in a steep learning curve for novice users.
We hypothesize that this complexity is the reason why many community-driven projects refrain from using MathML, and instead develop problem-specific data formats for their purposes.
We provide a user-friendly, open-source application programming interface for controlling MathML data.
Our API is written in Java and allows users to create, manipulate, and efficiently access commonly needed information in presentation and content MathML.
Our interface also provides tools for calculating differences and similarities between MathML expressions.
The API also allows users to determine the distance between expressions using different similarity measures.
In addition, we provide adapters for numerous conversion tools and the canonicalization project.
Our toolkit facilitates processing of mathematics for digital libraries, without the need to obtain XML expertise.

@InProceedings{GreinerPetter2018, author = {Greiner-Petter, Andre and Schubotz, Moritz and Cohl, Howard~S. and Gipp, Bela}, title = {MathTools: An Open API for Convenient MathML Handling}, booktitle = {11th Conference on Intelligent Computer Mathematics CICM, RISC, Hagenberg, Austria}, year = {2018}, month = {8}, address = {RISC, Hagenberg, Austria}, }

We have developed an automated procedure for symbolic and numerical testing
of formulae extracted from the NIST Digital Library of Mathematical Functions
(DLMF). For the NIST Digital Repository of Mathematical Formulae, we have
developed conversion tools from semantic LaTeX to the Computer Algebra System
(CAS) Maple which relies on Youssef's part-of-math tagger. We convert a test data subset
of 4,078 semantic LaTeX DLMF formulae
to the native CAS representation and then apply an automated scheme for symbolic and numerical
testing and verification. Our framework is implemented using Java and Maple.
We describe in detail the conversion process which is required so that the
CAS can correctly interpret the mathematical representation of the formulae.
We describe the improvement of the effectiveness of our automated scheme through
incremental enhancement (making more precise) of the mathematical semantic markup
for the formulae.
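
The idea of numerical testing can be sketched in a few lines of Python: evaluate both sides of a formula at several test points and report the points where they disagree. This is only an illustration under stated assumptions. The framework described above is implemented in Java and Maple, and the helper name `failing_points`, the chosen test points, and the tolerance are hypothetical. The formula tested is the well-known reflection formula for the gamma function, Γ(z)Γ(1−z) = π/sin(πz).

```python
import math

def failing_points(lhs, rhs, test_points, rel_tol=1e-9):
    """Return the test points at which lhs and rhs differ numerically."""
    return [z for z in test_points
            if not math.isclose(lhs(z), rhs(z), rel_tol=rel_tol)]

# Reflection formula for the gamma function (valid for non-integer z).
def lhs(z):
    return math.gamma(z) * math.gamma(1 - z)

def rhs(z):
    return math.pi / math.sin(math.pi * z)

# Deliberately wrong right-hand side, to show how an error is detected.
def rhs_wrong(z):
    return math.pi / math.cos(math.pi * z)

# Hypothetical test points; integers are excluded because the gamma
# function has poles at non-positive integers.
points = [0.1, 0.25, 0.5, 0.75, 1.5, 2.5]

failures = failing_points(lhs, rhs, points)
wrong_failures = failing_points(lhs, rhs_wrong, points)
print("correct formula fails at:", failures)
print("wrong formula fails at:", wrong_failures)
```

A passing numerical test does not prove a formula. The wrong variant above happens to agree with the left-hand side at z = 0.25, which is why several test points are used.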

@InProceedings{CohlGS18, author = {Howard S. Cohl and Andr{\'{e}} Greiner{-}Petter and Moritz Schubotz}, title = {Automated Symbolic and Numerical Testing of {DLMF} Formulae Using Computer Algebra Systems}, booktitle = {Intelligent Computer Mathematics - 11th International Conference, {CICM} 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings}, series = {Lecture Notes in Computer Science}, volume = {11006}, pages = {39--52}, publisher = {Springer}, year = {2018}, url = {https://doi.org/10.1007/978-3-319-96812-4_4}, doi = {10.1007/978-3-319-96812-4_4} }

Mathematical formulae represent complex semantic information in a concise form.
Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions.
Our main contributions are:
(1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics;
(2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions;
(3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions.
Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.

@InProceedings{Schubotz2018, author = {Schubotz, Moritz and Greiner-Petter, Andre and Scharpf, Philipp and Meuschke, Norman and Cohl, Howard and Gipp, Bela}, title = {Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context}, booktitle = {Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL)}, year = {2018}, month = {Jun.}, address = {Fort Worth, USA}, doi = {10.1145/3197026.3197058} }

Document preparation systems like LaTeX offer the ability to render mathematical expressions as
one would write these on paper. Using LaTeX, LaTeXML, and tools generated for use in the National
Institute of Standards and Technology (NIST)
Digital Library of Mathematical Functions, semantically enhanced mathematical LaTeX markup
(semantic LaTeX) is achieved by using a semantic macro set. Computer algebra systems (CAS)
such as Maple and Mathematica use alternative markup to represent mathematical
expressions.
By taking advantage of Youssef's Part-of-Math tagger and CAS internal representations,
we develop algorithms to translate mathematical expressions represented in semantic LaTeX to
corresponding CAS representations and vice versa.
We have also developed tools for translating the entire Wolfram Encoding Continued Fraction
Knowledge and University of Antwerp Continued Fractions for Special Functions datasets,
for use in the NIST Digital Repository of Mathematical Formulae.
The overall goal of these efforts is to provide semantically enriched, standard-conforming
MathML representations to the public for formulae in digital mathematics libraries.
These representations include presentation MathML, content MathML, generic LaTeX,
semantic LaTeX, and now CAS representations as well.

@InProceedings{Cohl17, author = {Howard S. Cohl and Moritz Schubotz and Abdou Youssef and Andr{\'{e}} Greiner{-}Petter and J{\"{u}}rgen Gerhard and Bonita V. Saunders and Marjorie A. McClain and Joon Bang and Kevin Chen}, title = {Semantic Preserving Bijective Mappings of Mathematical Formulae Between Document Preparation Systems and Computer Algebra Systems}, booktitle = {Intelligent Computer Mathematics - 10th International Conference, {CICM}}, pages = {115--131}, year = {2017}, month = {Jul.}, address = {Edinburgh, UK}, doi = {10.1007/978-3-319-62075-6_9} }