Apache HBase is the Hadoop database, modeled on
Google's Bigtable. As one of the most popular members of the
NoSQL database family, it has been widely adopted by many companies.
HBase uses an on-disk, column-oriented storage format in which
columns are grouped into column families. Physically, all columns of
one column family are stored together on the file system, so the
division of columns into families is closely tied to the response time
for a specific row query. Two approaches are commonly used to build
the column family structure: one stores everything in a single
column family; the other divides columns by their types
or semantic groups. Both approaches have drawbacks.
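
The trade-off between the two layouts above can be illustrated with a toy cost model. The sketch below is an assumption made for illustration only, not HBase's actual I/O behavior and not the cost function used in the thesis: reading any column of a family is charged the family's full width plus a fixed per-family overhead. The column names, the `FAMILY_OVERHEAD` constant, and the `query_cost` helper are all hypothetical.

```python
# Toy cost model (an assumption, not HBase's real I/O behavior):
# touching a family costs its full width plus a fixed overhead.
FAMILY_OVERHEAD = 2  # hypothetical per-family access cost


def query_cost(layout, query):
    """layout: list of sets of column names; query: set of column names."""
    touched = [family for family in layout if family & query]
    return sum(len(family) + FAMILY_OVERHEAD for family in touched)


# Hypothetical columns, laid out in the two common ways.
columns = {"id", "name", "email", "clicks", "views", "revenue"}
single_family = [set(columns)]
semantic_groups = [{"id", "name", "email"}, {"clicks", "views", "revenue"}]

narrow_query = {"clicks"}           # needs one column only
mixed_query = {"name", "revenue"}   # spans both semantic groups

for label, layout in [("single", single_family), ("semantic", semantic_groups)]:
    print(label, query_cost(layout, narrow_query), query_cost(layout, mixed_query))
```

Under these assumptions the single family makes the narrow query pay for columns it does not need, while the semantic grouping makes the mixed query pay for two families. Neither layout is tuned to the actual query set.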
In this thesis, a new evolutionary algorithm based on the genetic
algorithm is designed and applied to construct the optimal column
family structure for a given query set, in which data is evenly
distributed and the number of required column families is minimized. The
read performance of the optimized column family structure is
evaluated on a real dataset provided by ZANOX AG, which contains 2.6
million rows of stored data and 1.3 million queries. The results show
that, with such an optimized column family structure, the read
performance of HBase improves by a statistically significant
margin. For the examined query load, the average response time is
reduced by 34% to 72%, depending on the reference
column family structure used for comparison.
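
To make the general idea concrete, a minimal genetic-algorithm sketch for assigning columns to families is given below. It is not the algorithm developed in the thesis: the chromosome encoding (one family index per column), the truncation selection, the one-point crossover, the mutation rate, and the toy cost function are all assumptions chosen for brevity, and the even-distribution objective mentioned above is not modeled. The names `total_cost`, `evolve`, and `FAMILY_OVERHEAD` are hypothetical.

```python
import random

FAMILY_OVERHEAD = 2  # hypothetical per-family access cost


def total_cost(assignment, columns, queries):
    """Sum, over all queries, the width of every family a query touches."""
    families = {}
    for column, family_id in zip(columns, assignment):
        families.setdefault(family_id, set()).add(column)
    cost = 0
    for query in queries:
        for members in families.values():
            if members & query:
                cost += len(members) + FAMILY_OVERHEAD
    return cost


def evolve(columns, queries, max_families=3, pop_size=30, generations=60, seed=0):
    """Search for a low-cost column-to-family assignment."""
    rng = random.Random(seed)
    n = len(columns)

    def fitness(assignment):
        return total_cost(assignment, columns, queries)

    # Start from the single-family layout plus random assignments.
    population = [[0] * n] + [
        [rng.randrange(max_families) for _ in range(n)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]  # truncation selection, keeps the best
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.3:  # mutation: reassign one column to a random family
                child[rng.randrange(n)] = rng.randrange(max_families)
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)


# Hypothetical columns and query set.
columns = ["id", "name", "email", "clicks", "views", "revenue"]
queries = [{"id", "name"}, {"clicks", "views"}, {"revenue"}, {"id", "email"}]
best = evolve(columns, queries)
print(best, total_cost(best, columns, queries))
```

Because the best individuals always survive and the initial population contains the single-family layout, the returned assignment never costs more than that baseline under this toy model.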