2 – Apache Spark + Apache Storm + Kafka

Note: This course is mapped to the Cloudera CCA 175 certification program.

Apache Spark with Scala / Python

Objective

Apache Spark is an open-source cluster computing framework for data analytics. Spark is not tied to the two-stage MapReduce paradigm and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. It provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms.

The participants will start by learning the why and what of data analytics using Spark, and how Spark's in-memory processing benefits iterative machine learning algorithms.

The participants will work with different datasets to understand the various examples in Spark and will also learn:

  1. The need for Spark in the modern Data Analytical Architecture
  2. Concepts and features of RDD
  3. Transformations in Spark
  4. Actions in Spark
  5. Need for Hadoop 2 and its installation
  6. Introduction to YARN and its integration with Hadoop
  7. Spark SQL
  8. Spark Streaming and how it differs from Apache Storm
  9. Working with Jupyter and Zeppelin Notebooks

 

Note: The course comprises 30% theoretical discussion and 70% hands-on work.

Duration: 30 to 32 hours

Audience

This course is designed for anyone who is:

  1. Looking to architect a project using Spark.
  2. An ETL or data warehousing developer looking for an alternative approach to data analysis and storage.
  3. A data engineer.

 

Pre-Requisites

  1. Basic knowledge of Java.
  2. Basic understanding of Hadoop.

 

Course Outline

1        Introduction to Data Analysis and Spark

2        Getting Started with Apache Spark

3        Spark Language Basics

4        Spark Core Programming

5        Understanding Notebooks

6        Hadoop 2 & YARN Overview

7        Working with Key/Value Pairs

8        Loading and Saving Data in Spark

9        Working with Spark SQL

10      Working with Spark Streaming

11      What is Apache Storm?

12      What is new in Spark 2?

13      What is Kafka?