Andreas Kunft

Andreas Kunft presented "Efficiently Executing R Dataframes on Flink" at Flink Forward 2017

Andreas Kunft presented the talk "Efficiently Executing R Dataframes on Flink" at this year's Flink Forward in Berlin.


While dataflow engines offer scalability, their programming abstractions are often unfamiliar to data scientists, which are used to Python and R. To provide a more convenient interface, dataflow engines like Spark provide an R-like dataframe abstraction. While operations without user-defined code can be executed efficiently, the execution of UDFs is dominated by serialized data exchange between the dataflow engine and an external R process that evaluates the code. We present a new approach to execute user-defined functions by using the Truffle/Graal compiler infrastructure, which enables efficient execution of dynamic languages on the JVM. Based on fastR, the R language provided by this infrastructure, we exemplify the execution of R scripts directly inside the data pipelines of Flink, without data serialization and inter-process communication. Furthermore, we discuss future opportunities and problems, and compare our approach to native Flink, Spark, and SparkR.