DBT: Scalable Data Processing
Format and Credit: Lecture Series with Exercises (VL+UE, 3+1 SWS, 6 ECTS)
Goals: The global data volume is increasing
dramatically each year. Understanding how to store, process and manage
these huge amounts of data efficiently is a key requirement for
software engineers and data analysts in the modern IT world. This
course will teach students both the fundamentals of data processing in
traditional single-node database systems and how to scale out these
techniques to huge amounts of data in large-scale, distributed
environments.
The lecture is accompanied by an optional
implementation project, in which students will get hands-on experience
with the main data processing techniques by implementing several
components of a relational database system and by using parallel
programming platforms like Apache Hadoop or Nephele/PACT.
Audience: This course is the base course for master's students with a focus on database systems and information management and should be attended in the first semester of the master's program. In contrast to the introduction to database systems (MPGI5/DBS), which looks at database systems from an application programmer's point of view, this class focuses on the internals of database systems. To participate, students are required to have successfully completed a Bachelor's degree in computer science with a focus on database systems (participation in the Datenbankpraktikum, Datenbankprojekt). Knowledge of data modeling, relational algebra, and SQL, as well as a very good command of programming in Java, or possibly C/C++/C#, is required to participate in the course. For capacity reasons, the class is limited to at most 60 participants.
Content: The course is split into two parts, each
covering roughly one half of the semester. During the first part, the
students become acquainted with the fundamentals of query processing
in traditional relational database systems. This includes the general
architecture of a DBMS, file and buffer management, query
processing, indexing, metadata management, query optimization,
locking, recovery and transaction management. The second part
of the course covers the basics of parallel data
processing, with a focus on large-scale, distributed systems and
“cloud computing”. Topics include parallel processing platforms
like MapReduce, distributed data storage and retrieval – e.g., via
DHTs –, techniques for distributed locking and transaction handling,
multi-tenancy and software as a service. The course consists of a
lecture and theoretical, written exercises.
In the optional
project, students will implement components of a relational database
system and get hands-on experience with a parallel data processing
platform. The components actually implemented may vary from year to year, but
will include a parser, a query optimizer, an execution engine, index
structures and a storage system.
Your Contributions:
- Successful completion of all theoretical exercises
- Final Exam
Literature:
[1] Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: Database Systems - The Complete Book, Pearson Education International, 2002.
[2] Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: Database Systems - The Complete Book, Prentice Hall, 2000.