TU Berlin

Database Systems and Information Management GroupPublications

Logo FG DIMA-new  65px

Page Content

to Navigation


Near-duplicate detection for web-forums
Citation key DBLP:conf/ideas/MuthmannBBL09
Author Klemens Muthmann, Wojciech M. Barczynski, Falk Brauer, Alexander Löser
Pages 142-151
Year 2009
Journal IDEAS '09. Proceedings of the 2009 International Database Engineering & Applications Symposium
Abstract Current forum search technologies lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicated search results and prefer to create new threads without trying to find existing ones. In this paper we therefore identify common reasons leading to near-duplicates and develop a new near-duplicate detection algorithm for forum threads. The algorithm is implemented using a large case study of a real-world forum serving more than one million users. We compare this work with current algorithms, similar to [4, 5], for detecting near-duplicates on machine generated web pages. Our preliminary results show, that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.
Link to original publication Download Bibtex entry


Quick Access

Schnellnavigation zur Seite über Nummerneingabe