direkt zum Inhalt springen

direkt zum Hauptnavigationsmenü

Sie sind hier

TU Berlin

Inhalt des Dokuments

Publications

Multi-Dimensional Genomic Data Management for Region-Preserving Operations
Citation key DBLP:conf/icde/HorlovaKMC19
Author Olha Horlova, Abdulrahman Kaitoua, Volker Markl, Stefano Ceri
Year 2019
Journal 35th IEEE International Conference on Data Engineering, ICDE 2019
Abstract In previous work, we presented GenoMetric Query Language (GMQL), an algebraic language for querying genomic datasets, supported by Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets are represented as genomic regions (i.e. intervals of the genome, included within a start and stop position) with an associated value, representing the signal associated to that region (the most typical signals represent gene expressions, peaks of expressions, and variants relative to a reference genome.) GMQL can process queries over billions of regions, organized within distinct datasets. In this paper, we focus on the efficient execution of region-preserving GMQL operations, in which the regions of the result are a subset of the regions of one of the operands; most GMQL operations are region-preserving. Chains of region-preserving operations can be efficiently executed by taking advantage of an array-based data organization, where region management can be separated from value management. We discuss this optimization in the context of the current GDMS system which has a row-based (relational) organization, and therefore requires dynamic data transformations. A similar approach applies to other application domains with interval-based data organization.
Link to original publication Download Bibtex entry

Zusatzinformationen / Extras

Quick Access:

Schnellnavigation zur Seite über Nummerneingabe

Auxiliary Functions