Knowledge gained from large, complex and often
unstructured data is the key to many scientific and business
discoveries. Support Vector Machines (SVMs) are an advanced machine
learning method known for achieving state-of-the-art results in many
pattern recognition tasks, particularly classification. With kernel
methods they can be extended to handle various non-linear dependencies
and complex data types. Getting good results with these methods
requires extensive model selection and parameter tuning, which is why
SVM training is often done on clusters.
In this work we present a pipeline for training and evaluating
SVM classifiers using the Stratosphere framework. We show that the
framework can handle such tasks efficiently, even though they are not
seen as the traditional "big data" problems it was designed for.
The pipeline does not rely on third-party software packages, is easily
extensible by users, and can be adapted to work on a wide range of systems.
Popular kernels for vectors and strings are included along with
interfaces for implementing additional data types and kernels.
Adapting the pipeline to new use cases requires only domain-specific
changes and no knowledge of SVM training or Stratosphere. We
demonstrate heuristics designed to improve performance, along with
parameters that give users additional control over training behaviour, and
show their impact on running time, solution quality and resource
utilization. We also address scalability, showing that the pipeline
scales well with the problem size.
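
To illustrate what kernels for vectors and strings behind a common interface might look like, the following is a minimal sketch in Python. It is not the pipeline's actual API: the function names (`rbf_kernel`, `spectrum_kernel`, `gram_matrix`) and the parameters `gamma` and `k` are assumptions chosen for illustration. The Gaussian (RBF) kernel and the k-spectrum string kernel are used here only as common examples of each kernel family.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian (RBF) kernel for vectors: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def spectrum_kernel(s: str, t: str, k: int = 2) -> float:
    """k-spectrum kernel for strings: inner product of k-mer count vectors."""
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Only k-mers present in both strings contribute to the sum.
    return float(sum(c * counts_t[kmer] for kmer, c in counts_s.items()))


def gram_matrix(items: Sequence[T], kernel: Callable[[T, T], float]) -> List[List[float]]:
    """Evaluate a kernel on all pairs of items (hypothetical helper).

    An SVM trainer only needs these pairwise values, so the same code
    works for any data type for which a kernel function is supplied.
    """
    return [[kernel(a, b) for b in items] for a in items]


if __name__ == "__main__":
    print(gram_matrix([[0.0, 0.0], [1.0, 1.0]], rbf_kernel))
    print(gram_matrix(["abab", "bab"], spectrum_kernel))
```

Because the trainer sees the data only through kernel evaluations, supporting a new data type in such a design amounts to supplying one more function with the same two-argument shape.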