TU Berlin

Database Systems and Information Management Group (DIMA)

Scalable Support Vector Machine Training on Parallel Data Processing Systems



Candidate: Igor Viskovic

Advisor: Alexander Alexandrov

Degree sought: Master


Knowledge gained from large, complex and often unstructured data is the key to many scientific and business discoveries. Support Vector Machines (SVMs) are an advanced machine learning method known for achieving state-of-the-art results in many pattern recognition tasks, particularly classification. With kernel methods, they can be extended to handle various non-linear dependencies and complex data types. Getting good results with these methods requires extensive model selection and parameter tuning, which is why SVM training is often done on clusters.

In this work we present a pipeline for training and evaluating SVM classifiers using the Stratosphere framework. We show that the framework handles such tasks efficiently, even though they are not usually regarded as the traditional "big data" problems it was designed for. The pipeline does not rely on third-party software packages, is easily extensible by users, and can be adapted to run on a wide range of systems. Popular kernels for vectors and strings are included, along with interfaces for implementing additional data types and kernels. Adapting the pipeline to a new use case requires only domain-specific changes and no knowledge of SVM training or Stratosphere. We demonstrate heuristics designed to improve performance, as well as parameters that give users additional control over training behaviour. We show their impact on running time, solution quality and resource utilization. We also address scalability, showing that the pipeline scales well with problem size.
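To make the idea of pluggable kernels for vectors and strings concrete, here is a minimal Python sketch. It is not the pipeline's actual interface and does not use the Stratosphere API. All names (`Kernel`, `rbf_kernel`, `spectrum_kernel`, `gram_matrix`) are hypothetical, and the choice of an RBF kernel and an n-gram spectrum kernel as the "popular" kernels is an assumption made for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical kernel interface: any function mapping two items of the
# same data type to a similarity score. Not taken from the thesis.
Kernel = Callable[[T, T], float]


def rbf_kernel(gamma: float) -> Kernel:
    """Gaussian (RBF) kernel on equal-length numeric vectors."""
    def k(x: Sequence[float], y: Sequence[float]) -> float:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-gamma * squared_distance)
    return k


def spectrum_kernel(n: int) -> Kernel:
    """Spectrum kernel on strings: inner product of n-gram count vectors."""
    def ngram_counts(s: str) -> Counter:
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    def k(s: str, t: str) -> float:
        counts_s, counts_t = ngram_counts(s), ngram_counts(t)
        # A Counter returns 0 for n-grams it has not seen.
        return float(sum(c * counts_t[g] for g, c in counts_s.items()))
    return k


def gram_matrix(kernel: Kernel, items: Sequence[T]) -> List[List[float]]:
    """Pairwise kernel values; an SVM trainer only needs these numbers."""
    return [[kernel(a, b) for b in items] for a in items]


if __name__ == "__main__":
    print(gram_matrix(rbf_kernel(0.5), [[0.0, 0.0], [1.0, 1.0]]))
    print(gram_matrix(spectrum_kernel(2), ["abab", "ab", "ba"]))
```

Because the training code in such a design only ever calls the kernel function, supporting a new data type comes down to supplying one more function of this shape, which is the kind of domain-specific change the abstract describes.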




