direkt zum Inhalt springen

direkt zum Hauptnavigationsmenü

Sie sind hier

TU Berlin

Page Content


Near-duplicate detection for web-forums
Citation key DBLP:conf/ideas/MuthmannBBL09
Author Klemens Muthmann, Wojciech M. Barczynski, Falk Brauer, Alexander Löser
Pages 142-151
Year 2009
Journal IDEAS '09. Proceedings of the 2009 International Database Engineering & Applications Symposium
Abstract Current forum search technologies lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicated search results and prefer to create new threads without trying to find existing ones. In this paper we therefore identify common reasons leading to near-duplicates and develop a new near-duplicate detection algorithm for forum threads. The algorithm is implemented using a large case study of a real-world forum serving more than one million users. We compare this work with current algorithms, similar to [4, 5], for detecting near-duplicates on machine generated web pages. Our preliminary results show, that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.
Link to original publication Download Bibtex entry

Zusatzinformationen / Extras

Quick Access:

Schnellnavigation zur Seite über Nummerneingabe