TU Berlin

DBT: Scalable Data Processing

Format and Credit: Lecture Series with Exercises (VL+UE, 3+1 SWS, 6 ECTS)

Goals: The global data volume is increasing dramatically each year. Understanding how to store, process and manage these huge amounts of data efficiently is a key requirement for software engineers and data analysts in the modern IT world. This course will teach students both the fundamentals of data processing in traditional single-node database systems and how to scale out these techniques to huge amounts of data in large-scale, distributed environments.
The lecture is accompanied by an optional implementation project, in which students will get hands-on experience with the main data processing techniques, by implementing several components of a relational database system and by using parallel programming platforms like Apache Hadoop or Nephele/PACT.

Audience: This is the foundational course for master's students focusing on database systems and information management, and it should be attended in the first semester of the master's program. In contrast to the introductory database systems course (MPGI5/DBS), which looks at database systems from an application programmer's point of view, this class focuses on the internals of database systems. To participate, students must have successfully completed a Bachelor's degree in computer science with a focus on database systems (participation in the Datenbankpraktikum, Datenbankprojekt). Knowledge of data modeling, relational algebra, and SQL, as well as a very good command of Java (or possibly C/C++/C#), is required. For capacity reasons, the class is limited to 60 participants.

Content: The course is split into two parts, each covering roughly one half of the semester. During the first part, students become acquainted with the fundamentals of query processing in traditional relational database systems. This includes the general architecture of a DBMS, file and buffer management, query processing, indexing, metadata management, query optimization, locking, recovery, and transaction management. The second part covers the basics of parallel data processing, with a focus on large-scale, distributed systems and "cloud computing". Topics include parallel processing platforms like MapReduce, distributed data storage and retrieval (e.g., via DHTs), techniques for distributed locking and transaction handling, multi-tenancy, and software as a service. The course consists of a lecture and theoretical, written exercises.
In the optional project, students will implement components of a relational database system and get hands-on experience with a parallel data processing platform. The actual components implemented may vary each year, but will include the parser, query optimizer, execution engine, index structures, and storage system.

Your Contributions:

  • Successful completion of all theoretical exercises
  • Final Exam

Literature:

[1] Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: Database Systems - The Complete Book, Pearson Education International, 2002.

[2] Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: Database Systems - The Complete Book, Prentice Hall, 2000.
