Design, Implementation and Evaluation of Naive Bayes Classification for the Apache Flink Big Data Analytics Platform
Currently, Jonathan Hasenburg is working on his Bachelor's Thesis on "Design, Implementation and Evaluation of Naive Bayes Classification for the Apache Flink Big Data Analytics Platform", advised by Juan Soto (TU Berlin) and Till Rohrmann (now at Data Artisans).
On the 12th of January 2015, the Apache Software Foundation announced that Apache Flink became a "Top Level Project." This means that Flink has a mature community that involves numerous companies that are ac-tively using it and a team of 86 contributors (as of 20.04.2015). Some of Flink’s key features include speed, reliability, simplicity, scalability and program-execution-ubiquity, which means that Flink programs can be exe-cuted locally, embedded (e.g. web container) or in clusters.
People intending to analyze their data not only want a platform to run their jobs, but also a library of prede-fined methods to quickly jump into the value generating process. To offer those instruments the "Flink Ma-chine Learning Library" was started. This library has the vision to make machine learning easily accessible to a wide audience and offer popular algorithms that are optimized for use in Apache Flink.
An algorithm for the library is implemented as a learner. That way these implementations can be integrated easily in pipelines. A pipeline consists of transfomers which transform the data, e.g. scaling, whitening, feature extraction or outlier detection, and a learner which fits a model to the given input data. The obtained model is again a transformer which does its prediction by transforming the input data.
Naive Bayes Classifier
The Flink Machine Learning Library will consist of many different algorithms. This Bachelor’s thesis focuses on the Naive Bayes Classifier algorithm because it is widely used for text categorization (does a document belong to one category or to another) even though it has some weaknesses in its initial form. After its introduction in the 1960s several researchers came up with ideas how to improve the algorithm to make it compete with more advanced methods such as support vector machines.
In machine learning a feature is a characteristic that distinguishes one object from another. The feature of the Naive Bayes Classifier is the frequency of each word. For example, a positive text might contain words like "good", "nice" and "well" more often, while a negative text might consist of words like "bad", "sadly" and "worse". The algorithm is based on Bayes' theorem, which relates current probability to prior probability.
Goals of the Thesis
One key goal of this Bachelor thesis is the implementation of Naive Bayes Classifier for the Flink Machine Learning Library as a learner. Because of its performance the classifier is very popular, even though it is not always accurate. Thus ideas of previous researches around the Naive Bayes Classifier will be evaluated and some will be implemented, such as improvements introduced by:
The assumption of feature independence is not valid in documents since words in sentences are dependent. Karl-Michael Schneider proposed many small improvements, which can be divided in two approaches (Schneider 2005):
1. Modification of the data to better fit the assumptions made by Naive Bayes.
2. Modification of the classifier or the probabilistic model.
Sang-Bum Kim et al.
Kim, et al. (2002) are not concerned with the probability values calculated using Bayes formula, instead they propose introducing scores to rank document relevance.
Jason Rennie et al.
Rennie, et al. (2003) investigated the reasons behind the sometimes poor results of the Naive Bayes Classifier and proposed a simple heuristic solution for each of the problems.
The next step is to evaluate and compare the different versions of the algorithm.
Benchmark results generated by the outcome of the different versions will be compared to each other to spec-ify how well the algorithm is suited for Flink in general and to identify whether the improvements indeed en-hance the algorithm. The conclusion will depend on two factors:
• Accuracy: Are the ascertained categories correct?
• Performance: How long did it take to process the data?
It is expected to achieve great performance and accuracy improvements with the implemented changes. To determine which change really pushes the algorithm to a more advanced level will be another key goal.
The classifier will be tested locally and in a Flink cluster. Thus an access to a cluster is needed to generate benchmarks for the different versions. Appropriate datasets to use and test the algorithm, such as "20 News-group" or "Reuters-21578" will be selected.
Plan of Implementation
The implementation plan consists of seven steps.
Step 1: Select and install an environment for the development of programs written in Scala. In addition, gain greater familiarity in the Scala programming language. Scala is used, because the whole Machine Learning Library is implemented in that language.
Step 2: Read the literature and implement the algorithm without Flink. This is a necessary step to understand the coherences without the need of working with a complex system such as Flink simultaneously.
Step 3: Select an appropriate, large-scale data set. Test the algorithm.
Step 4: Gain greater familiarity with the Apache Flink platform, review documentation and examine existing programs.
Step 5: Implement the Naive Bayes Classifier for the Machine Learning Library.
Step 6: Identify improvements and include them in the local version and later in the Flink version. This is an iterative process, because each improvement will be reflected with an own version so that that the advantages can be observed.
Step 7: Conduct experiments, evaluate the results and produce plots that highlight key findings.
Kim, Sang-Bum, Hae-Chang Rim, DongSuk Yook, and Heui-Seok Lim. "Effective Methods for Improving Naive Bayes Text Classifiers." In Trends in Artificial Intelligence, edited by Mitsuru Ishizuka and Abdul Sattar, 414-423. Tokyo: Springer Berlin Heidelberg, 2002.
Markl, Volker. "Gesprengte Ketten." Informatik-Spektrum, 02 01, 2005: 10-15.
Rennie, Jason D. M., Lawrence Shih, Jaime Teevan, and David R. Karger. "Tackling the Poor Assumptions of Naive Bayes Text Classifiers." In In Proceedings of the Twentieth International Conference on Machine Learning, edited by Tom Fawcett and Nina Mishra, 616--623. Menlo Park: AAAI Press, 2003.
Schneider, Karl-Michael. "Techniques for Improving the Performance of Naive Bayes for Text Classification." In Computational Linguistics and Intelligent Text Processing, edited by Alexander Gelbukh, 682-693. Mexico City: Springer Berlin Heidelberg, 2005.