Database Systems and Information Management GroupIdentifier namespaces in mathematical notation

# Identifier namespaces in mathematical notation

Alexey Grigorev is currently working on his masterthesis on "Identifier namespaces in mathematical notation", advised by Moritz Schubotz and Juan Soto.

# Background

Introduction

In computer science, a namespace refers to a collection of terms that are managed together because they share functionality or purpose, typically for providing modularity and resolving name conflicts [1]. For exam-ple, XML uses namespaces to prefix element names to ensure uniqueness and remove ambiguity between them [12], and the Java programming language uses packages to organize identifiers into namespaces for modularity [14].

In this thesis we will extend the notion of namespaces to mathematical formulae.

In logic, a formula is defined recursively, and, in essence, it is a collection of variables, functions and other formulas, and formally the symbols for the variables and functions can be chosen arbitrarily [16]. However, in contrast to first order logic, in this work we are interested in the symbols in formulae and in mathematical notations that are used by different research communities. For example, in physics it is common to write the energy-mass relation as E=mc² rather than x=yz². However, the same identifier may be used in different areas but denote different things: For example, E may refer to "energy", "expected value" or "elimination matrix", depending on the domain of the article. Thus, we can note that these identifiers form namespaces, and we refer to such namespaces as identifier namespaces, and to the process of discovering identifier namespaces as namespace disambiguation.

In this thesis we compare different approaches for namespace disambiguation. The first approach is to assume that there is a strong correlation between identifiers in a document and the namespace of the document, and this correlation can be exploited to categorize documents and thus discover namespaces. For example, if we observe a document with two identifiers E, assigned to "energy", and m, assigned to "mass", then it is more likely that the document belongs to the "physics" namespace rather than to "statistics". To use it, we need to map identifiers to their definitions, and this can be done by extracting the definitions from the text that surrounds the formula [2]. Other approaches are based on the text of the documents, rather on the formulae [10], but nonetheless we believe that there is a correlation between the textual content of a document and the namespace of its identifiers.

Related Work

Kristianto et al [3] highlight the importance of interlinking the scientific documents and in their study they do it through annotating mathematical formulae and finding the documents that share the same identifiers. Schöneberg et al [9] propose mathematical-aware part of speech tagger and they discuss how it can be applied for classifying scientific publications.

There are several researches related to extracting textual description of mathematical formulae. One of the earliest works is by Grigore et al [4] that focuses on disambiguation, and Yokoi et al [5] that focuses on ad-vanced mathematical search.

Pagel and Schubotz [2] suggest a Mathematical Language Processing framework - a statistical approach for relating identifiers to definitions. Similar approach is suggested in [5], [3] and [8], where the authors use ma-chine learning methods for extracting the definitions.

Some work is also done in clustering mathematical formulae by Ma et al [7] to facilitate formula search where they propose features that can be extracted from the formulae.
In computational linguistics there is a related concept called semantic field or semantic domain: it describes a group of terms that are highly related and often are used together. Words that appear frequently in same documents are likely to be in the same semantic field, and this idea is successfully used for text categorization and word disambiguation [11].

# Goals

Formulas comprise an integral part of a mathematical corpus and the main objective of this study is to discover namespaces of identifiers based on these formulas. We expect the namespaces to be meaningful, in the sense that they can be related to real-world areas of knowledge, such as physics, linear algebra or statistics.

Once such namespaces are found, they can give good categorization of scientific documents based on formulas and notation used in them. We believe that this may facilitate better user experience: for instance, it will allow users to navigate easily between documents of the same
category and see in which other documents a particular identifier is used, how it is used, how it is derived, etc. Additionally, it may give a way to avoid ambiguity. If we follow the XML approach [11] and prepend namespace to the identifier, e.g. “physics.E”, then it will give additional context and make it clear that “physics.E” means “energy” rather than “expected value”.

We also see that using namespaces is beneficial for relating identifiers to definitions. Thus, as an application of namespaces, we would like to be able to use them for better definition extraction. It may help to overcome some of the current problems in this area, for example, the problem of dangling identifiers - identifiers that are used in formulas but never defined in the document. Such identifiers may be defined in other documents that share the same namespace, and thus we can take the definition from the namespace and assign it to the dangling identifier.

To achieve these objectives we define the following research tasks:

1. To identify similarities with computational linguistics, computer science and mathematics

2. To study existing solutions for clustering textual and mathematical data and how to use them to discover meaningful namespaces

3. To evaluate these approaches in order to find the best

4. To incorporate the found namespaces to the existing MLP framework (described in [1])

# Realization

To accomplish the proposed goal, we plan the following.
First, we would like to study and analyze existing approaches and recognize similarities and differences with identifier namespaces. From the linguistics point of view, the theory of semantic fields [14] and semantic domains [10] are the most related areas. Then, namespaces are well studied in computer science, e.g. in programming languages such as Java [13] or markup languages such as XML [11]. XML is an especially interesting in this respect, because it serves as the foundation for knowledge representation languages like OWL (Web Ontology Language) [12] that use the notion of namespaces as well.
Because we have limited resources, we believe that the namespaces should be discovered in an unsupervised manner. Thus, we would like to try the following methods for finding namespaces: categorization based on the textual data [9], on semantic domains [10], on keywords extracted from the documents [8] or on definitions extracted from the formulas in the documents [1].
Then we check if the discovered namespaces make sense. The meaningfulness can be evaluated by sampling some documents of the same category and examining them manually. We expect that within one discovered category we should be able to observe documents that can be related to some real-world category. For example, if we sample two documents and get “Ordinary Least Squares” and “Kernel Regression”, then we can relate them to the same area (e.g. “Statistics”) and this is the result we would like to achieve. On the other hand, if we observe “Ordinary Least Squares” and “Dirac comb” within the same category, then it will make it harder to explain such categorization.
Additionally, we plan to see to what extent the namespaces are beneficial for the keyword extraction, and therefore, we plan to incorporate them into the MLP framework [1] to see if the results give better precision and recall. Thus, the results by Pagel and Schubotz [1] will serve as the baseline for this evaluation.
The data set that we plan to use is a subset of English wikipedia articles - all those that contain the [itex] tag. The textual dataset can potentially be quite big: for example, the English wikipedia contains 4.5 million articles, and many thousands of them contain mathematical formulas. This is why it is important to think of ways to parallelize it. Thus, the algorithms will be implemented in Apache Flink [5].
At the end, we expect the following deliverables:

1. List of possible ways to cluster documents

2. Implementation of promising algorithms on Apache Flink

3. Implementation of the MLP project that includes the found namespaces

# References

[1] Pagael, R., Schubotz, M. (2014). Mathematical Language Processing Project. arXiv preprint arXiv:1407.0167.

[2] Kristianto, G. Y., Aizawa, A. (2014). Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers. D-Lib Magazine, 20(11), 9.

[3] Grigore, M., Wolska, M., Kohlhase, M. (2009). Towards context-based disambiguation of mathematical expressions. In The Joint Conference of ASCM (pp. 262-271).

[4] Yokoi, K., Nghiem, M. Q., Matsubayashi, Y., Aizawa, A. (2011). Contextual analyis of mathematical expressions for advanced mathematical search. Polibits, (43), 81-86.

[6] Ma, K., Hui, S. C., Chang, K. (2010). Feature extraction and clustering-based retrieval for mathematical formulas. In Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on (pp. 372-377). IEEE.

[7] Kristianto, G. Y., Nghiem, M. Q., Matsubayashi, Y., Aizawa, A. (2012). Extracting definitions of mathematical expressions in scientific papers. In The 26th Annual Conference of JSAI.

[8] Schöneberg, U., Sperber, W. (2014). POS Tagging and its Applications for Mathematics. In Intelligent Computer Mathematics (pp. 213-223). Springer International Publishing.

[9] Sebastiani, F. (2002). Machine learning in automated text categorization. ACM computing surveys (CSUR), 34(1), 1-47.

[10] Gliozzo, A., Strapparava, C. (2009). Semantic domains in computational linguistics. Springer.

[11] Bray, T., Hollander, D., Layman, A. (1999). Namespaces in XML. World Wide Web Consortium Recommendation REC-xml-names-19990114. http://www.w3.org/TR/1999/REC-xml-names-19990114.

[12] McGuinness, D. L., Van Harmelen, F. (2004). OWL web ontology language overview. W3C recommendation, 10(10), 2004.

[13] Gosling J., Joy B., Steele G., Bracha G., Buckley A. (2014) The Java Language Specification, Java SE 8 Edition. In Java Series. Addison-Wesley Professional.

[14] Vassilyev, L. M. (1974). The theory of semantic fields: A survey. Linguistics, 12(137), 79-94.