TU Berlin

Database Systems and Information Management Group

Evaluating Document Similarity Measures on Wikipedia

Malte Schwarzer is currently working on his bachelor thesis on "Evaluating Document Similarity Measures on Wikipedia", advised by Moritz Schubotz and Norman Meuschke from Universität Konstanz.


Scientific Background

Literature research is an important part of scientific work. Finding relevant information and related papers is often essential for the success of later results. The growing number and availability of scientific papers makes literature research ever more important. According to a study from 2012, about 1.8 million documents are published every year [1]. Evaluating such a volume by hand is practically impossible. Literature recommender systems have therefore been built to help researchers find relevant papers. These systems follow an automated approach to determining document relevance. Determining relevance, however, is a complex challenge because relevance is partly subjective.

Relevance can be defined as consisting of two main components: the largely objective topical relevance and the purely subjective user relevance [2]. Subject experts can judge the topical relevance of a document. In contrast, user relevance depends heavily on the user's information need and on whether a piece of information satisfies that individual need. It can differ from user to user, which makes an automated approach a complex challenge.

Current recommender systems try to determine relevance by using document similarity as an approximation of relevance [3]. Document similarity can be described in several ways. One approach is based on the citations and references used in scientific documents. This method has been shown to be helpful for literature research [4].

In 1963 Kessler established Bibliographic coupling, a similarity measure that uses citation analysis to determine a similarity relationship between documents [5]. Documents are bibliographically coupled if they cite one or more documents in common. The basic idea is that documents that cite the same works are more likely to have the same subject. Ten years later, in 1973, Small and Marshakova-Shaikevich independently developed another citation-based similarity measure named Co-Citation (CoCit) [6], [7]. Instead of focusing on what a document cites, this approach evaluates the citations a document receives. The number of papers citing two documents together equals the degree of their similarity. As a result, Co-Citation has a forward-looking perspective compared to Bibliographic coupling. The degree of document similarity measured by Co-Citation can change over time as new citing papers appear, whereas with Bibliographic coupling the degree is static.
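The two measures can be illustrated with a minimal Python sketch. The toy citation graph and the function names are assumptions made for this illustration only; they are not part of the thesis implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy citation graph: citing document -> set of cited documents.
citations = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Y", "Z"},
}

def bibliographic_coupling(citations, doc1, doc2):
    """Number of references that doc1 and doc2 have in common."""
    return len(citations.get(doc1, set()) & citations.get(doc2, set()))

def co_citation_counts(citations):
    """For every pair of cited documents, count the documents citing both."""
    counts = defaultdict(int)
    for cited in citations.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return dict(counts)

coupling_ab = bibliographic_coupling(citations, "A", "B")
cocit = co_citation_counts(citations)
```

Bibliographic coupling only looks at the reference lists of the two documents being compared, so its value is fixed once both are published. The Co-Citation counts grow whenever a new citing document is added to the graph.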

Gipp and Beel introduced an improvement to Co-Citation called Co-Citation Proximity Analysis (CPA) in 2006 [8]. Their approach does not just determine similarity by counting citations; it also evaluates the position of the citations in the citing documents. Documents that are cited together in the same sentence are considered more similar than documents cited in the same paragraph. Previous studies showed that CPA provides better recommendations for scientific documents than Co-Citation [9].
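The proximity idea can be sketched as follows. This is a simplified illustration, not the weighting scheme of [8]: the three proximity levels and their weights (1.0, 0.5, 0.25) are assumed values, and the nested-list document representation is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumed proximity weights (illustrative only): closer co-citations count more.
PROXIMITY_WEIGHTS = {"sentence": 1.0, "paragraph": 0.5, "document": 0.25}

def proximity_scores(document):
    """Score each co-cited pair in one citing document by its closest co-occurrence.

    document: list of paragraphs; paragraph: list of sentences;
    sentence: list of cited document ids.
    """
    scores = defaultdict(float)

    def record(cited_ids, level):
        for pair in combinations(sorted(set(cited_ids)), 2):
            scores[pair] = max(scores[pair], PROXIMITY_WEIGHTS[level])

    document_ids = []
    for paragraph in document:
        paragraph_ids = []
        for sentence in paragraph:
            record(sentence, "sentence")
            paragraph_ids.extend(sentence)
        record(paragraph_ids, "paragraph")
        document_ids.extend(paragraph_ids)
    record(document_ids, "document")
    return dict(scores)

def cpa_similarity(documents):
    """Sum the per-document proximity scores over all citing documents."""
    totals = defaultdict(float)
    for document in documents:
        for pair, weight in proximity_scores(document).items():
            totals[pair] += weight
    return dict(totals)

# Two hypothetical citing documents.
citing_documents = [
    [[["X", "Y"], ["Z"]], [["W"]]],  # X and Y share a sentence; Z shares their paragraph
    [[["X"], ["Y"]]],                # X and Y share only a paragraph
]
similarity = cpa_similarity(citing_documents)
```

Plain Co-Citation would count the pair X and Y twice, once per citing document. The proximity-weighted variant distinguishes the sentence-level co-citation in the first document from the paragraph-level co-citation in the second.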

In this thesis I want to compare the Co-Citation Proximity Analysis with other similarity measures. Instead of determining the similarity of scientific documents, the similarity of Wikipedia articles will be determined.

Purpose Of This Thesis

The purpose of this work is to show that the Co-Citation Proximity Analysis (CPA) approach developed by Bela Gipp and Jöran Beel can be applied to Wikipedia articles. My main objective is to answer the following research questions:

1) Can CPA be applied to Wikipedia articles?

The usefulness of CPA for scientific documents has already been shown. Now I want to extend the scope of this similarity measure to cover Wikipedia articles as well. It is an open question, however, whether citations and hyperlinks can be treated equally in this scenario.

2) Is CPA superior to CoCit in this context?

The empirical comparison of CPA and CoCit should show which approach is superior and whether link recommender systems can be improved by migrating to CPA.

3) How can 1) and 2) be evaluated automatically?

Evaluating the text corpus of Wikipedia has the benefit that the research can be done on the whole corpus, not just a sample. But this also makes the evaluation difficult. With limited resources it is impossible to process all 7 million documents and 172 million links of Wikipedia by hand. Therefore the evaluation has to be done automatically.

Wikipedia articles contain a so-called “See also” section. According to the Wikipedia guideline, this section should include internal links to related Wikipedia articles. The purpose of “See also” links is to enable readers to explore tangentially related topics [10]. They can be used to assist readers in finding related articles. In my context, those links are equivalent to recommendations for relevant articles made by subject experts. In other words, “See also” links approximate a “gold standard” for recommender systems.

By comparing the similarity measures with this standard, I can test the quality of their results. “See also” links are machine-readable and available for a large number of Wikipedia articles. This enables me to do a large-scale evaluation of similarity measures.
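As a rough illustration of what machine-readable means here, the following sketch pulls the “See also” link targets out of an article's wikitext with regular expressions. It assumes the usual wikitext conventions (a `== See also ==` level-2 heading and `[[Target|label]]` links). The sample text and function name are hypothetical, and a real dump would need more robust parsing.

```python
import re

# Level-2 "See also" heading, then everything up to the next level-2 heading.
SEE_ALSO_SECTION = re.compile(
    r"^==[ \t]*See also[ \t]*==[ \t]*$(.*?)(?=^==[^=]|\Z)",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,
)
# Target part of an internal link, without the label or section anchor.
INTERNAL_LINK = re.compile(r"\[\[([^\]|#]+)")

def see_also_links(wikitext):
    """Return the link targets listed in the "See also" section, if any."""
    match = SEE_ALSO_SECTION.search(wikitext)
    if match is None:
        return []
    return [target.strip() for target in INTERNAL_LINK.findall(match.group(1))]

# Hypothetical article text for illustration.
sample = """Intro text with [[Other link]].

== See also ==
* [[Bibliographic coupling]]
* [[Co-citation|Co-citation analysis]]

== References ==
* [[Not a recommendation]]
"""
links = see_also_links(sample)
```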


The similarity measures will be implemented in the MapReduce programming model, a pattern for processing and generating large data sets with a parallel, distributed algorithm on a cluster [11]. The data processing engine Apache Flink will be used as the platform. In addition, the results will be compared with two standards.
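To make the MapReduce pattern concrete, the sketch below counts co-linked article pairs with explicit map, shuffle and reduce steps in plain Python. It does not use the Apache Flink API; the function names and the toy link data are assumptions for illustration, and on a cluster the shuffle step would be handled by the framework.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(article, links):
    """Emit ((target_a, target_b), 1) for every pair of links in one article."""
    for pair in combinations(sorted(set(links)), 2):
        yield pair, 1

def shuffle(emitted):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    return grouped

def reduce_phase(pair, values):
    """Sum the counts for one pair of co-linked articles."""
    return pair, sum(values)

# Hypothetical input: article title -> outgoing links.
articles = {
    "A": ["X", "Y"],
    "B": ["X", "Y", "Z"],
    "C": ["Y", "Z"],
}

emitted = [kv for title, links in articles.items() for kv in map_phase(title, links)]
co_link_counts = dict(reduce_phase(pair, values) for pair, values in shuffle(emitted).items())
```

Each map call only needs a single article, and each reduce call only needs the values for a single pair, which is what allows the work to be distributed.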

Comparing citation-based and lexical document similarity

Lexical similarity is another dimension of document similarity. It measures the degree of similarity based on the set of words in two documents. Two documents sharing a large number of words are considered similar. Apache Lucene’s “MoreLikeThis” function is an example of this approach [12], [13]. Because of its simplicity, this method delivers quite satisfactory results for every kind of text document. Therefore I want to use “MoreLikeThis” as a baseline and compare it with CoCit and CPA.
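The idea of word-set overlap can be sketched with a Jaccard coefficient. This is a deliberately simple stand-in and not how “MoreLikeThis” works internally; the tokenizer and function names are assumptions for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(text_a, text_b):
    """Shared words divided by all distinct words in both texts."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

score = jaccard_similarity("the cat sat", "the cat ran")
```

Here two of the four distinct words are shared, so the score is 0.5. A score of 1.0 means both texts use exactly the same vocabulary, and 0.0 means they share no words.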

Comparison with user-generated “gold standard”

To examine the quality of the results of the similarity measures, the user-generated “See also” links are defined as the “gold standard” (as mentioned above). An article recommended by a similarity measure is judged as good when the article is also found in the “See also” links. The measure that shares the most results with the “See also” links can be treated as the superior document recommender system.
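A minimal sketch of this comparison, assuming each measure returns a ranked list of recommended article titles. The titles, list contents and function names below are hypothetical and only illustrate the counting, not actual results.

```python
def see_also_hits(recommended, see_also):
    """Recommended articles that also appear in the "See also" links."""
    return [article for article in recommended if article in see_also]

def precision(recommended, see_also):
    """Fraction of the recommendations confirmed by the "See also" links."""
    if not recommended:
        return 0.0
    return len(see_also_hits(recommended, see_also)) / len(recommended)

# Hypothetical gold standard and recommendations for one source article.
gold = {"Co-citation", "Bibliometrics"}
recommendations = {
    "CPA": ["Co-citation", "Bibliometrics", "Citation index"],
    "CoCit": ["Citation index", "Co-citation", "Impact factor"],
}

hits = {name: len(see_also_hits(recs, gold)) for name, recs in recommendations.items()}
best_measure = max(hits, key=hits.get)
```

For a full evaluation, the hit counts would be aggregated over all articles that have a “See also” section.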


[1] M. Ware and M. Mabe, “The STM report,” Int. Assoc. Sci. Tech. Med. Publ., no. November, 2009.

[2] R. Lachica, D. Karabeg, and S. Rudan, “Quality, Relevance and Importance in Information Retrieval with Fuzzy Semantic Networks Defining Quality, Relevance and Importance.”

[3] J. Lin and W. J. Wilbur, “PubMed related articles: a probabilistic topic-based model for content similarity,” BMC Bioinformatics, vol. 8, p. 423, Jan. 2007.

[4] M. Cristo, E. S. De Moura, and N. Ziviani, “Link Information as a Similarity Measure in Web Classification,” Lect. Notes Comput. Sci., vol. 2857, no. String Processing and Information Retrieval, pp. 43–55, 2003.

[5] M. Kessler, “Bibliographic coupling between scientific papers,” Am. Doc., vol. 97, no. January, 1963.

[6] H. Small, “A New Measure of the Relationship Between Two Documents,” vol. 24, no. 4, pp. 28–31, 1973.

[7] I. V. Marshakova, “System of document connections based on references,” … Informatsiya Seriya 2…, no. 6, pp. 3–8, 1973.

[8] B. Gipp and J. Beel, “Citation Proximity Analysis (CPA): A new approach for identifying related work based on Co-Citation Analysis,” in Proc. 12th Int. Conf. Sci. Inf., B. Larsen and J. Leta, Eds., vol. 2, no. July, pp. 571–575, 2009.

[9] S. Liu and C. Chen, “The effects of co-citation proximity on co-citation analysis,” Proc. ISSI, 2011.

[10] “Wikipedia:Manual of Style/Layout,” Wikipedia, 2014. [Online]. Available: en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Layout. [Accessed: 11-Nov-2014].

[11] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Commun. ACM, vol. 51, no. 1, pp. 1–13, 2008.

[12] G. Salton, A. Wong, and C. Yang, “A Vector Space Model for Automatic Indexing,” Commun. ACM, vol. 18, no. 11, 1975.

[13] Johnson, “How MoreLikeThis Works in Lucene,” Blog, 2008. [Online]. Available: http://cephas.net/blog/2008/03/30/howmorelikethisworksinlucene/. [Accessed: 11-Nov-2014].

