Knowledge gained from large, complex and often
unstructured data is the key to many scientific and business
discoveries. Support Vector Machines (SVMs) are an advanced machine
learning method known for achieving state-of-the-art results in many
pattern recognition tasks, particularly classification. With kernel
methods they can be extended to handle various non-linear dependencies
and complex data types. Getting good results with these methods
requires extensive model selection and parameter tuning, which is why
SVM training is often done on clusters.
In this work we present a pipeline for training and
evaluating SVM classifiers using the Stratosphere framework. We show
that the framework can handle such tasks efficiently, even though they
are not the traditional "big data" problems that it was
designed for. The pipeline does not rely on third-party software
packages, is easily extensible by users, and can be adapted to work on a wide
range of systems. Popular kernels for vectors and strings are
included along with interfaces for implementing additional data types
and kernels. Adapting the pipeline to new use cases involves only
domain-specific work and does not require knowledge of SVM training or
Stratosphere. We demonstrate heuristics designed to improve
performance and parameters that give users additional control over
training behaviour. We show their impact on the running time, quality
of the solution and resource utilization. We also address
scalability, showing that the pipeline scales well with the problem
size.
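
To make the idea of a kernel interface for different data types concrete, the following is a minimal sketch in Python. It is not the pipeline's actual API: the names `Kernel`, `RBFKernel`, `SpectrumKernel` and `gram_matrix` are hypothetical, and the choice of an RBF kernel for vectors and a k-spectrum kernel for strings is an assumption made for illustration, since the kernels shipped with the pipeline are not named here.

```python
from abc import ABC, abstractmethod
from collections import Counter
import math
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class Kernel(ABC, Generic[T]):
    """Hypothetical interface: a kernel maps two samples to a similarity score."""

    @abstractmethod
    def __call__(self, a: T, b: T) -> float:
        ...


class RBFKernel(Kernel[Sequence[float]]):
    """Gaussian (RBF) kernel for real-valued vectors: exp(-gamma * ||a - b||^2)."""

    def __init__(self, gamma: float) -> None:
        self.gamma = gamma

    def __call__(self, a: Sequence[float], b: Sequence[float]) -> float:
        squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-self.gamma * squared_distance)


class SpectrumKernel(Kernel[str]):
    """k-spectrum kernel for strings: dot product of k-mer count vectors."""

    def __init__(self, k: int) -> None:
        self.k = k

    def _kmer_counts(self, s: str) -> Counter:
        # Count every contiguous substring of length k (empty if len(s) < k).
        return Counter(s[i:i + self.k] for i in range(len(s) - self.k + 1))

    def __call__(self, a: str, b: str) -> float:
        counts_a, counts_b = self._kmer_counts(a), self._kmer_counts(b)
        return float(sum(n * counts_b[kmer] for kmer, n in counts_a.items()))


def gram_matrix(kernel: Kernel[T], samples: Sequence[T]) -> List[List[float]]:
    """Evaluate the kernel on all sample pairs; an SVM solver needs only these values."""
    return [[kernel(a, b) for b in samples] for a in samples]
```

In a design of this kind, supporting a new data type amounts to writing one more `Kernel` subclass, while the training code, which sees only kernel values, stays unchanged.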