TU Berlin

Database Systems and Information Management Group

An Evolutionary Algorithm for column family schema optimization in HBase

Short info

Candidate: Fangzhou Yang

Advisors: Johannes Kirschnick, Dr. Dragan Milosevic (ZANOX AG)


Apache HBase is the Hadoop database, modeled on Google's Bigtable. As one of the most popular members of the NoSQL database family, it has been widely adopted by many companies.

HBase uses an on-disk, column-oriented storage format in which columns are grouped into column families. Physically, all columns of one column family are stored together on the file system, so the way columns are divided into column families directly affects the response time of a row query. Two approaches are commonly used to build the column family structure: storing everything in a single column family, or dividing columns by type or semantic group. Both have drawbacks: with a single family, a query reads columns it does not need, while a split by type or semantics ignores which columns are actually queried together.
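The effect of the column family layout on a row query can be illustrated with a toy cost model. The sketch below is not taken from the thesis; it assumes that a query has to read every column family containing at least one requested column and pays for all bytes stored in those families. The function `read_cost`, the column names, and the sizes are hypothetical.

```python
# Hypothetical toy cost model (an assumption, not the thesis' model):
# a query reads every column family that holds at least one requested
# column, and pays for all bytes stored in those families.

def read_cost(schema, column_sizes, query):
    """Return bytes read per row.

    schema:       dict mapping column name -> column family id
    column_sizes: dict mapping column name -> bytes per row
    query:        set of requested column names
    """
    touched = {schema[col] for col in query}
    return sum(size for col, size in column_sizes.items()
               if schema[col] in touched)

# Illustrative column sizes in bytes per row (made up).
column_sizes = {"id": 8, "name": 32, "clicks": 8, "payload": 4096}
query = {"id", "clicks"}

# Layout 1: everything in one column family.
single_family = {col: 0 for col in column_sizes}
# Layout 2: columns that are queried together share a family.
by_query = {"id": 0, "clicks": 0, "name": 1, "payload": 1}

print(read_cost(single_family, column_sizes, query))  # 4144
print(read_cost(by_query, column_sizes, query))       # 16
```

Under this simplified model, the single-family layout makes the query read the large `payload` column although it was never requested, which is the kind of overhead a query-aware layout tries to avoid.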

In this thesis, a new evolutionary algorithm based on a genetic algorithm is designed and applied to construct an optimized column family structure for a given query set, in which data is evenly distributed and the number of required column families is minimized. The read performance of the optimized column family structure is evaluated on a real dataset provided by ZANOX AG, which contains 2.6 million rows of storage data and 1.3 million queries. The results show a statistically significant improvement in HBase read performance: for the examined query load, the average response time is reduced by 34% to 72%, depending on the reference column family structure used for comparison.
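The general idea of searching for a column family layout with a genetic algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the thesis: the chromosome encoding (one family index per column), the fitness function (bytes read over all queries plus a penalty per family used), tournament selection, uniform crossover, and all parameter values are hypothetical choices. The even-distribution criterion mentioned above is not modeled here.

```python
import random

def fitness(chrom, sizes, queries, family_penalty):
    """Lower is better. Assumed cost: bytes read over all queries,
    plus a penalty for each column family in use."""
    cost = 0
    for q in queries:
        touched = {chrom[c] for c in q}
        cost += sum(s for c, s in enumerate(sizes) if chrom[c] in touched)
    return cost + family_penalty * len(set(chrom))

def evolve(sizes, queries, max_families=3, family_penalty=5,
           pop_size=30, generations=60, mutation_rate=0.1, seed=0):
    """Return the best chromosome found; chrom[i] is the family of column i."""
    rng = random.Random(seed)
    n = len(sizes)

    def score(chrom):
        return fitness(chrom, sizes, queries, family_penalty)

    # Start from the single-family baseline plus random layouts.
    pop = [[0] * n]
    pop += [[rng.randrange(max_families) for _ in range(n)]
            for _ in range(pop_size - 1)]

    for _ in range(generations):
        pop.sort(key=score)
        next_pop = pop[:2]  # elitism: the two best layouts survive unchanged
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = min(rng.sample(pop, 3), key=score)
            p2 = min(rng.sample(pop, 3), key=score)
            # Uniform crossover, then per-gene mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [rng.randrange(max_families)
                     if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    return min(pop, key=score)

# Illustrative input: four columns (sizes in bytes) and three queries,
# each given as a set of column indices. All values are made up.
sizes = [8, 32, 8, 4096]
queries = [{0, 2}, {0, 2}, {1, 3}]

best = evolve(sizes, queries)
print(best, fitness(best, sizes, queries, 5))
```

Because the initial population contains the single-family layout and the best individuals are carried over unchanged, the returned layout is never worse than that baseline under the assumed cost function. A realistic fitness function would have to be calibrated against measured HBase response times.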

