A. Borusan

12 July 2012, 2 p.m. sharp (s.t.), TU Berlin, EN building, seminar room EN 719 (7th floor), Einsteinufer 17, 10587 Berlin: "Architectures for large-scale continuous data management" (Dionysios Logothetis, Telefonica Research Barcelona)

The ability to do rich analytics on massive sets of unstructured data drives the operation of many organizations today. These "big data" analytics have given rise to a new class of data-intensive computing systems, like MapReduce, that can scale to very large data simply by employing more compute power. While these systems have been very successful, it is becoming apparent that scalability alone is not enough. Many analytics today are update-driven, and this brute-force approach is inefficient when trying to keep analytics up-to-date as data change continuously.

In the first part of the talk, I will present a new approach for programming analytics that takes the continuous nature of data into consideration. A fundamental requirement for efficient processing of continuous data is the ability to incrementally update the analytics by maintaining computation state. I will argue that state should be a first-class abstraction and present Continuous Bulk Processing (CBP), a model and architecture that integrates data-parallelism for scalability with state for efficient update-driven analytics. The model lends itself to several analytics, like incremental algorithms and iterative analysis. Through real-world applications, I will show how the integration of state in the programming model affords several optimizations in the underlying system, reducing processing time and resource usage relative to current practice.
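
To make the idea of maintaining computation state concrete, here is a minimal Python sketch that contrasts recomputing a per-key count from scratch with folding each new batch into saved state. It is only an illustration of incremental updating in general, not the CBP model or its API; the function names `recompute` and `incremental_update` and the sample batches are assumptions made for this example.

```python
from collections import Counter


def recompute(all_batches):
    """Brute-force baseline: rescan every batch seen so far."""
    counts = Counter()
    for batch in all_batches:
        counts.update(batch)
    return counts


def incremental_update(state, new_batch):
    """Stateful alternative: fold only the new batch into the saved state."""
    state.update(new_batch)
    return state


# Hypothetical input: three batches of keys arriving over time.
batches = [["a", "b", "a"], ["b", "c"], ["a"]]

state = Counter()
for i, batch in enumerate(batches, start=1):
    state = incremental_update(state, batch)
    # The maintained state always matches a full recomputation.
    assert state == recompute(batches[:i])
```

The incremental version touches only the newly arrived batch on each step, while the baseline rescans all earlier input, so its cost grows with the total amount of data seen.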

While integrating state in the programming model allows efficient incremental programs, it may be challenging to design incremental algorithms for complex analytics, like iterative graph mining and machine learning. In the second part, I will talk about ongoing work on a system that can incrementally compute this class of analytics in a manner that is transparent to the user.