Near-duplicate detection for web-forums
Citation key DBLP:conf/ideas/MuthmannBBL09
Authors Klemens Muthmann, Wojciech M. Barczynski, Falk Brauer, Alexander Löser
Pages 142-151
Year 2009
Journal IDEAS '09. Proceedings of the 2009 International Database Engineering & Applications Symposium
Abstract Current forum search technologies lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicated search results and prefer to create new threads without trying to find existing ones. In this paper we therefore identify common reasons leading to near-duplicates and develop a new near-duplicate detection algorithm for forum threads. The algorithm is implemented using a large case study of a real-world forum serving more than one million users. We compare this work with current algorithms, similar to [4, 5], for detecting near-duplicates on machine-generated web pages. Our preliminary results show that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.
