Dirk Hesse: Introduction to Spark and Data Science using Spark

Dirk Hesse (Data Scientist at Intelligent Communication) will give a seminar in the lunch area, 8th floor, Niels Henrik Abels hus, at 14:15.

Title: Introduction to Spark and Data Science using Spark

Abstract: Today's data scientists are expected not only to gain insights from exploring large data sets and to build models on them, but also to prepare these sets, which more often than not come from non-traditional and heterogeneous data sources. Models should be highly performant and able to work even on many terabytes of data, in batches as well as on live streams of data. Deployment and use should be easy and highly automated. In the data science and big data communities, Apache Spark currently receives a lot of attention because it addresses all of the points mentioned above. In June 2015, IBM announced a major commitment to advancing Spark, thereby paving the way for wide adoption in more conservative industry settings. Many expect Spark to be one of the most important big data technologies for many years to come. In this presentation, I will give an overview of Spark's capabilities, starting from some basic MapReduce use cases and gradually advancing to topics like Spark's machine learning and streaming capabilities, getting data in and out of Spark, as well as testing and deployment. I will give a number of live code examples in Python (Spark also supports R, Java, and Scala), highlight the ease of use of Spark, and demonstrate that one can build and deploy robust models with a few lines of very readable code.
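
For readers unfamiliar with the MapReduce pattern mentioned in the abstract, the following is a minimal sketch of a word count in plain Python. It is not taken from the talk and does not use Spark itself; the function names map_phase, reduce_phase, and word_count are hypothetical and chosen for illustration only. In PySpark, the same idea is commonly written with the RDD methods flatMap, map, and reduceByKey.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def reduce_phase(pairs):
    """Shuffle and reduce step: group the pairs by key, then sum each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Run the map and reduce phases on an in-memory list of lines."""
    return reduce_phase(map_phase(lines))


if __name__ == "__main__":
    # Illustrative input only; a real job would read from a distributed source.
    print(word_count(["spark is fast", "Spark is easy"]))
```

The sketch runs on a single machine and keeps everything in memory; a framework such as Spark distributes the same two phases across a cluster.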

Published Jan. 20, 2016 3:06 PM - Last modified Mar. 11, 2019 9:03 AM