
Juan Soto

"Multi-Dimensional Genomic Data Management for Region-Preserving Operations" paper accepted at ICDE 2019

'''Multi-Dimensional Genomic Data Management for Region-Preserving Operations''', Olha Horlova, Abdulrahman Kaitoua, Volker Markl, Stefano Ceri, ICDE 2019, Macau, China.

In previous work, we presented the GenoMetric Query Language (GMQL), an algebraic language for querying genomic datasets, supported by the Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets are represented as genomic regions (i.e., intervals of the genome delimited by a start and a stop position), each with an associated value representing the signal for that region; the most typical signals are gene expression, peaks of expression, and variants relative to a reference genome. GMQL can process queries over billions of regions, organized within distinct datasets.

In this paper, we focus on the efficient execution of region-preserving GMQL operations, in which the regions of the result are a subset of the regions of one of the operands; most GMQL operations are region-preserving.
Chains of region-preserving operations can be executed efficiently by taking advantage of an array-based data organization, in which region management is separated from value management. We discuss this optimization in the context of the current GDMS, which has a row-based (relational) organization and therefore requires dynamic data transformations. A similar approach applies to other application domains with interval-based data organization.
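
To make the idea of separating region management from value management more concrete, the following is a minimal Python sketch, not the GDMS implementation. All names (`Region`, `ArrayDataset`, `from_rows`, `run_chain`) and the sample data are hypothetical, and the sketch assumes that samples share the same set of regions. It converts row-based records into an array-like layout, runs a chain of region-preserving selections on the region coordinates only, and gathers the values once at the end.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Row-based record (hypothetical layout): (sample_id, chrom, start, stop, value)
Row = Tuple[str, str, int, int, float]


@dataclass(frozen=True, order=True)
class Region:
    chrom: str
    start: int
    stop: int


@dataclass
class ArrayDataset:
    """Illustrative array-based layout: coordinates stored once, values kept apart."""
    regions: List[Region]                 # one entry per distinct region
    samples: List[str]                    # one entry per sample
    values: List[List[Optional[float]]]   # values[k][i]: sample k, region i


def from_rows(rows: Sequence[Row]) -> ArrayDataset:
    """Transform row-based records into the array layout (None marks a missing value)."""
    regions = sorted({Region(c, s, e) for _, c, s, e, _ in rows})
    region_index = {r: i for i, r in enumerate(regions)}
    samples = sorted({row[0] for row in rows})
    sample_index = {name: k for k, name in enumerate(samples)}
    values: List[List[Optional[float]]] = [[None] * len(regions) for _ in samples]
    for sample, c, s, e, v in rows:
        values[sample_index[sample]][region_index[Region(c, s, e)]] = v
    return ArrayDataset(regions, samples, values)


def run_chain(ds: ArrayDataset, predicates: Sequence[Callable[[Region], bool]]) -> ArrayDataset:
    """Apply a chain of region-preserving selections.

    Each step only narrows the list of surviving region indices; the value
    arrays are read once, after the whole chain has been evaluated.
    """
    keep = list(range(len(ds.regions)))
    for pred in predicates:
        keep = [i for i in keep if pred(ds.regions[i])]
    return ArrayDataset(
        regions=[ds.regions[i] for i in keep],
        samples=list(ds.samples),
        values=[[row[i] for i in keep] for row in ds.values],
    )


# Made-up example data: two samples measured on the same three regions.
rows: List[Row] = [
    ("s1", "chr1", 100, 200, 0.5),
    ("s1", "chr1", 300, 450, 1.2),
    ("s1", "chr2", 50, 80, 0.9),
    ("s2", "chr1", 100, 200, 0.7),
    ("s2", "chr1", 300, 450, 0.1),
    ("s2", "chr2", 50, 80, 2.0),
]

dataset = from_rows(rows)
result = run_chain(
    dataset,
    [
        lambda r: r.chrom == "chr1",          # keep regions on chr1
        lambda r: r.stop - r.start >= 120,    # keep regions at least 120 bases long
    ],
)
```

In this sketch the result contains only regions that were already present in the input, which is what makes the chain region-preserving. With a row-based layout, each selection would instead have to scan every (sample, region, value) record.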

A preprint version is available.