CSC's trainings and events have moved

Find our upcoming trainings and events at www.csc.fi.

This site is an archive version and is no longer updated.

Big Data Analysis with Apache Spark

Date:	27.11.2019 9:00 - 28.11.2019 16:00
Location details:	The event is organised at the CSC Training Facilities located in the premises of CSC at Keilaranta 14, Espoo, Finland. The best way to reach us is by public transportation; more detailed travel tips are available.
Language:	English
Lecturers:	Apurva Nandan (CSC), Anni Pyysing (CSC)
Price:	Free of charge for Finnish academics. Free of charge for others.
	The course materials, lunches as well as morning and afternoon coffees are free of charge.

Registration is closed.

Seats are filled in order of registration. Please inform us of any cancellations at least five (5) business days prior to the course.

Additional Information

This course is part of the PRACE Training Centre activity. Please visit the PRACE Training portal for further information about the course.
For content please contact: apurva.nandan@csc.fi
Practicalities and wait list: patc@csc.fi

Description

Data is everywhere, and as the volume of data used in analysis tasks grows rapidly, it becomes increasingly challenging to process with standard methods. One typically runs into several problems: too little memory or CPU, waiting forever for a job to complete, or starting all over again if a job fails. Enter Spark, a high-performance distributed computing framework that lets us tackle big-data problems by distributing the workload across a cluster of machines. Say goodbye to all those painful workloads forever.

The two-day course covers the technical architecture and use cases of Spark, writing Spark code in Python, and using Spark's machine learning library for ML tasks. We will then look at methods for running a Spark cluster on CSC's container cloud Rahti, along with ways to manage and fine-tune your cluster. The course will also demonstrate how to work with real-time data.

The first day covers the overview, architectural concepts, programming with Spark's fundamental data structure (RDD), and Spark's machine learning library. The second day focuses on analysing data by running SQL queries in Spark, working with real-time data streams, and setting up and managing a Spark cluster.

Please NOTE: This is not a regular programming course. Participants are expected to learn emerging concepts in the field of big data and distributed processing, which may be completely different from the concepts of a general-purpose programming language.

Learning outcome

After the course, participants should be able to write simple to intermediate programs in Spark using RDDs and DataFrames.

Intended Audience and Prerequisites

The course is intended for researchers, students, and professionals with programming skills, preferably in Python, as the exercises are in Python. Some knowledge of SQL is also recommended.

NOTE: This is a beginner's course for Spark. If you are already familiar with Spark, please have a look at the agenda or email us to find out whether the course content suits you.

Program

Day 1, Wednesday 27.11

09.00 – 09.45 Overview and architecture of Spark
09.45 – 10.30 Basics of RDDs and Demo
10.30 – 10.45 Coffee break
10.45 – 11.30 RDD: Transformations and Actions
11.30 – 12.00 Exercises
12.00 – 13.00 Lunch
13.00 – 13.30 Word Count Example
13.30 – 14.00 Exercises
14.00 – 14.30 Short overview of Spark's machine learning library
14.30 – 14.45 Coffee break
14.45 – 15.30 Exercises
15.30 – 15.45 Wrap-up and further topics
15.45 – 16.00 Summary of the first day & exercises walk-through

Day 2, Thursday 28.11

09.00 – 09.30 Spark Dataframes and SQL Overview
09.30 – 10.15 Exercises
10.15 – 10.30 Coffee break
10.30 – 10.45 Dataframes and SQL (contd.)
10.45 – 12.00 Exercises
12.00 – 13.00 Lunch
13.00 – 14.00 Setting up a Spark cluster
14.00 – 14.30 Exercises
14.00 – 14.30 Best practices and other useful stuff
14.30 – 14.45 Coffee break
14.45 – 15.00 Brief overview of Spark Streaming
15.00 – 15.15 Demo: Processing live Twitter stream data
15.15 – 16.00 Summary of the course & exercises walk-through
