Datenschutzerklärung|Data Privacy

A. Borusan

14.12.2015, 14 Uhr s.t. TU Berlin, EN building, seminar room EN 719 (7th floor), Einsteinufer 17, 10587 Berlin: "Specification and Optimization of Analytical Data Flows" (Fabian Hueske PhD Thesis Defense)

Since a few years, a trend is evolving that has many implications on the field of data analysis. Compared to the past, much larger data sets are being analyzed using more elaborate and diverse analysis methods such as information extraction techniques, data mining algorithms, and machine learning methods. This evolution has implications on the requirements on data processing systems. Due to the growing size of possibly unstructured data sets and the increasing computational complexity and diversity of analysis methods, data must be processed by user-defined functions in a massively parallel fashion. Many traditional database systems are not flexible enough to satisfy these requirements. However, there are many technologies such as declarative query specification and query optimization that have proven themselves as crucial for the usability and performance of data processing systems. In this thesis we address three challenges that arise in the context of specifying and optimizing scalable data analysis programs that include user-defined functions.

First, we propose a declarative and parallel programming model to specify data analysis tasks as data flow programs. In this model, data processing operators are composed of a system-provided second-order function and a user-defined first-order function. A cost-based optimizer compiles data flow programs specified in this abstraction into parallel data flows. The optimizer borrows many techniques from relational optimizers and ports them to the domain of general-purpose parallel programming models. Second, we propose an approach to enhance the optimization of data flow programs that include UDF operators with unknown semantics. We identify operator properties and conditions to reorder neighboring UDF operators without changing the semantics of the program. We show how to automatically extract these properties from UDF operators by leveraging static code analysis techniques. Our approach is able to emulate many relational optimizations such as filter and join reordering and holistic aggregation push-down while not being limited to relational operators. Finally, we analyze the impact of changing execution conditions such as varying predicate selectivities and memory budgets on the performance of relational query plans. We identify plan patterns that cause significantly varying execution performance for changing execution conditions. Plans that include such risky patterns are prone to cause problems in presence of imprecise optimizer estimates. Based on our findings, we introduce an approach to avoid risky plan choices. Moreover, we present a method to assess the risk of a query execution plan using a machine-learned prediction model. Experiments show that the prediction model outperforms risk predictions which are computed from optimizer estimates.