
Juan Soto

17.03.2016: The paper titled "Evaluating Link-based Recommendations for Wikipedia" was accepted for publication at the upcoming ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL)

We are very happy to announce that the full paper

Evaluating Link-based Recommendations for Wikipedia by Malte Schwarzer, Moritz Schubotz, Norman Meuschke, Corinna Breitinger, Volker Markl and Bela Gipp

was accepted for publication at JCDL 2016.

Literature recommender systems (LRS) support users in filtering the vast and increasing amount of documents in digital libraries and on the Web. For academic literature, research has proven the value of employing citation-based document similarity measures, such as Co-Citation (CoCit), as part of LRS. The CoCit measure considers two documents to be more strongly related the more frequently they are co-cited, i.e., cited together, in other documents. CoCit assigns equal weight to each pair of co-cited documents regardless of where in the citing document the citations occur. The large-scale digital availability of academic full texts has enabled refining the CoCit measure by weighting co-citations according to the distance between the involved citations in the citing document. This approach considers two sources that are co-cited in close proximity, e.g., in the same sentence, more strongly related than two sources that, for example, are cited in different sections. Co-Citation Proximity Analysis (CPA) was the first measure reflecting this approach. For academic literature, CPA’s ability to improve recommendation quality has been demonstrated.
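The proximity-weighting idea can be sketched in a few lines. The scheme below, which weights each co-citation pair by the inverse of the distance between its citations, is an illustrative assumption for demonstration; the paper and the original CPA work use their own weighting functions.

```python
# Sketch of Co-Citation Proximity Analysis (CPA) scoring.
# The inverse-distance weight below is an illustrative choice, not the
# exact function from the paper. Plain CoCit would instead add a constant
# 1.0 per co-occurrence, ignoring citation positions entirely.
from itertools import combinations
from collections import defaultdict

def cpa_scores(citing_docs):
    """citing_docs: list of documents, each a list of (cited_id, position)
    tuples, where position is, e.g., a running sentence index within the
    citing document. Returns relatedness scores per co-cited pair."""
    scores = defaultdict(float)
    for citations in citing_docs:
        for (a, pos_a), (b, pos_b) in combinations(citations, 2):
            if a == b:
                continue
            pair = tuple(sorted((a, b)))
            # Citations in the same sentence (distance 0) contribute 1.0;
            # more distant citations contribute proportionally less.
            scores[pair] += 1.0 / (1 + abs(pos_a - pos_b))
    return dict(scores)
```

For example, two sources cited in the same sentence of one document and one sentence apart in another would accumulate a score of 1.0 + 0.5, while a pair cited five sentences apart in a single document would score only 1/6.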

In this paper, we perform a large-scale investigation of the performance of the CPA approach in generating literature recommendations for Wikipedia, which is fundamentally different from the academic literature domain. We analyze links instead of citations to generate article recommendations. We evaluate CPA, CoCit, and the Apache Lucene MoreLikeThis (MLT) function, which represents a traditional text-based similarity measure. We use a test collection of 4.6 million Wikipedia articles, the Big Data processing framework Apache Flink, and a ten-node computing cluster. To enable our large-scale evaluation, we derive two quasi-gold standards from the links in Wikipedia’s “See also” sections and a comprehensive clickstream dataset.
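Extracting “See also” links from an article, the basis of one of the quasi-gold standards, can be sketched as below. This is a simplified wikitext parse for illustration only; it is not the paper's extraction pipeline, and real articles are better handled with a full wikitext parser.

```python
# Illustrative extraction of "See also" link targets from raw wikitext.
# Simplified for demonstration; production code should use a proper
# wikitext parser rather than regular expressions.
import re

def see_also_links(wikitext):
    """Return link targets listed in the "See also" section of an article."""
    # Capture everything between "== See also ==" and the next level-2
    # heading (or the end of the article).
    m = re.search(r"==\s*See also\s*==(.*?)(?:\n==[^=]|\Z)", wikitext, re.S)
    if not m:
        return []
    # [[Target]] or [[Target|label]] -> Target
    return [t.split("|")[0].strip()
            for t in re.findall(r"\[\[(.*?)\]\]", m.group(1))]
```

Applied to an article whose “See also” section lists `[[Apache Flink]]` and `[[Recommender system|LRS]]`, this returns `["Apache Flink", "Recommender system"]`.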

We find that MLT and CPA have complementary strengths. MLT performs well in identifying closely related articles. The CPA approach, which consistently outperformed CoCit, is better suited to identifying a broader spectrum of related articles, as well as popular articles, which typically exhibit higher quality. Additional benefits of the CPA approach are its lower runtime requirements and its language independence, which enables cross-language retrieval of articles. We present a manual analysis of exemplary articles to demonstrate and discuss our findings. The raw data and source code of our study, along with a manual on how to use them, are openly available at: