2 – Apache Spark + Apache Storm + Kafka
Note: This course is mapped to the CCA 175 Cloudera Certification Program
Apache Spark with Scala / Python
Objective
Apache Spark is an open-source cluster computing framework for data analytics. Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms.
The participants will start by learning the why and what of data analytics using Spark, and how Spark’s in-memory processing benefits iterative machine learning algorithms.
The participants will work with different datasets to understand the various examples in Spark, and will also learn:
- The need for Spark in modern data analytics architectures
- Concepts and features of RDD
- Transformations in Spark
- Actions in Spark
- Need for Hadoop 2 and its installation
- Introduction to YARN and its integration with Hadoop
- Spark SQL
- Spark Streaming and how it differs from Apache Storm
- Working with Jupyter and Zeppelin Notebooks
Note: The course is 30% theoretical discussion and 70% hands-on work.
Duration: 30 to 32 hours
Audience
This course is designed for anyone who is:
- Looking to architect a project using Spark.
- An ETL or data warehousing developer exploring alternative approaches to data analysis and storage.
- A data engineer.
Pre-Requisites
- Basic knowledge of Java.
- Basic understanding of Hadoop.
Course Outline
1 Introduction to Data Analysis and Spark
- What is Apache Spark
- Understanding Lambda Architecture for Big Data Solutions
- Role of Apache Spark in an ideal Lambda Architecture
- Understanding Apache Spark Stack
- Spark Versions
- Storage Layers in Spark
2 Getting Started with Apache Spark
- Downloading Apache Spark
- Installing Spark on a Single Node
- Understanding Spark Execution Modes
- Batch Analytics
- Real Time Analytics Options
- Exploring Spark Shells
- Introduction to Spark Core
- Setting up Spark as a Standalone Cluster
- Setting up Spark on a Hadoop YARN Cluster
3 Spark Language Basics
- Basics of Python
- Basics of Scala
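A taste of the Scala idioms the Spark API builds on, as one might type them in the Scala REPL (a minimal sketch; the course covers the Python equivalents as well):

    // An immutable value and a named function
    val greeting: String = "Hello, Spark"
    def square(x: Int): Int = x * x

    // Anonymous functions and collection operations: the same shapes Spark's RDD API uses
    val numbers = List(1, 2, 3, 4, 5)
    val evens   = numbers.filter(_ % 2 == 0)    // List(2, 4)
    val squares = numbers.map(n => square(n))   // List(1, 4, 9, 16, 25)
    val total   = numbers.reduce(_ + _)         // 15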
4 Spark Core Programming
- Understanding the Basic Component of Spark: the RDD
- Creating RDDs
- Operations on RDDs
- Creating functions in Spark and passing parameters
- Understanding RDD Transformations and Actions
- Understanding RDD Persistence and Caching
- Examples for RDDs (see the sketch below)
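A minimal spark-shell sketch of the operations above; sc is the SparkContext the shell provides, and the data is made up for illustration:

    // Create an RDD from a local collection
    val nums = sc.parallelize(1 to 100)

    // Transformations are lazy; nothing executes yet
    val evens   = nums.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // Persist so that repeated actions reuse the in-memory copy
    doubled.cache()

    // Actions trigger execution
    println(doubled.count())                  // 50
    println(doubled.take(5).mkString(", "))   // 4, 8, 12, 16, 20

    // Passing a named function instead of an anonymous one
    def isSmall(n: Int): Boolean = n < 10
    println(nums.filter(isSmall).collect().mkString(", "))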
5 Understanding Notebooks
- Installation of Anaconda Python
- Installation and working with Jupyter Notebook
- Installation of Zeppelin
- Working with Zeppelin notebooks
6 Hadoop 2 & YARN Overview
- Anatomy of a Hadoop Cluster; Installing and Configuring Plain Hadoop
- Batch vs. Real-Time Processing
- Limitations of Hadoop
7 Working with Key/Value Pairs
- Understanding the Key/Value Pair Paradigm
- Creating a Pair RDD
- Understanding Transformations on Pair RDDs
- Understanding Actions on Pair RDDs
- Understanding Data Partitioning in RDDs
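A minimal sketch of the pair-RDD operations above (spark-shell assumed; the sample lines are invented):

    import org.apache.spark.HashPartitioner

    val lines = sc.parallelize(Seq("big data big insights", "spark makes big data simple"))

    // Build a pair RDD of (word, 1) and aggregate by key
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Transformations on pair RDDs
    val scaled = counts.mapValues(_ * 10)
    val sorted = counts.sortByKey()

    // Control data partitioning explicitly
    val partitioned = counts.partitionBy(new HashPartitioner(4)).persist()

    // Actions on pair RDDs
    println(counts.collectAsMap())   // e.g. Map(big -> 3, data -> 2, ...)
    println(counts.countByKey())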
8 Loading and Saving Data in Spark
- Understanding Default File Formats supported in Spark
- Understanding File systems supported by Spark
- Loading data from the local file system
- Loading data from HDFS using the default mechanism
- Spark Properties
- Spark UI
- Logging in Spark
- Checkpoints in Spark
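A sketch of the default load/save mechanisms and checkpointing covered above (all paths and the namenode host are hypothetical placeholders):

    // Load from the local file system and from HDFS (the URI scheme selects the file system)
    val local    = sc.textFile("file:///tmp/input.txt")                 // hypothetical path
    val fromHdfs = sc.textFile("hdfs://namenode:8020/data/input.txt")   // hypothetical host/path

    // Save results back out as text
    local.map(_.toUpperCase).saveAsTextFile("file:///tmp/output")       // hypothetical path

    // Checkpointing truncates a long lineage by materializing to reliable storage
    sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")             // hypothetical path
    val words = fromHdfs.flatMap(_.split(" "))
    words.checkpoint()
    words.count()   // the action that actually writes the checkpoint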
9 Working with Spark SQL
- Creating a HiveContext
- Inferring schema with case classes
- Programmatically specifying the schema
- Understanding how to load and save data in Parquet, JSON, an RDBMS, and arbitrary sources (JDBC/ODBC)
- Understanding DataFrames
- Working with DataFrames
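A minimal Spark 1.x-style sketch matching the topics above (the HiveContext era named in this module; paths are hypothetical). In spark-shell:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Inferring the schema with a case class
    case class Person(name: String, age: Int)
    val people = sc.parallelize(Seq(Person("Ana", 34), Person("Raj", 29))).toDF()

    // DataFrame operations and raw SQL side by side
    people.filter($"age" > 30).show()
    people.registerTempTable("people")
    hiveContext.sql("SELECT name FROM people WHERE age > 30").show()

    // Loading and saving in JSON and Parquet
    val fromJson = hiveContext.read.json("file:///tmp/people.json")   // hypothetical path
    people.write.parquet("file:///tmp/people.parquet")                // hypothetical path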
10 Working with Spark Streaming
- Understanding the role of Spark Streaming
- Batch versus Real-time data processing
- Architecture of Spark Streaming
- First Spark Streaming program in Java, with packaging and deployment (a Scala sketch follows below)
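The module's first program is written in Java; below is a compact Scala equivalent of the same socket word count (host and port are placeholders, fed for instance by nc -lk 9999):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batch interval of 5 seconds, reusing the shell's SparkContext
    val ssc = new StreamingContext(sc, Seconds(5))

    // Ingest lines from a socket (placeholder host/port)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word count over each micro-batch
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()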
11 What is Apache Storm?
- Why Apache Storm
- Architecture of Apache Storm
- Setting up a single node Storm cluster
- Word count from a simulated streaming file, with a code walkthrough (see the sketch below)
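A sketch of the word-count topology wiring in Scala over Storm's Java API. SentenceSpout, SplitSentenceBolt, and WordCountBolt are hypothetical classes standing in for the course's spout and bolt implementations; package names follow Storm 1.x (older releases use backtype.storm):

    import org.apache.storm.{Config, LocalCluster}
    import org.apache.storm.topology.TopologyBuilder
    import org.apache.storm.tuple.Fields

    // Wire a spout that replays lines from the simulated file into two bolts
    val builder = new TopologyBuilder()
    builder.setSpout("lines", new SentenceSpout(), 1)      // hypothetical spout class
    builder.setBolt("split", new SplitSentenceBolt(), 2)   // hypothetical bolt class
           .shuffleGrouping("lines")
    builder.setBolt("count", new WordCountBolt(), 2)       // hypothetical bolt class
           .fieldsGrouping("split", new Fields("word"))    // same word always hits the same task

    // Run in-process for the single-node exercise
    val conf = new Config()
    conf.setDebug(true)
    val cluster = new LocalCluster()
    cluster.submitTopology("word-count", conf, builder.createTopology())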
12 What is new in Spark 2?
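For orientation, the headline change in Spark 2 is the unified SparkSession entry point, which subsumes SQLContext and HiveContext, with DataFrames/Datasets as the primary API. A minimal sketch:

    import org.apache.spark.sql.SparkSession

    // One entry point replaces SparkContext + SQLContext + HiveContext
    val spark = SparkSession.builder()
      .appName("spark2-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Typed Dataset API, optimized by the Catalyst planner
    val ds = Seq(("spark", 2), ("storm", 1)).toDS()
    ds.filter(_._2 > 1).show()

    // The underlying SparkContext is still there when RDDs are needed
    val rdd = spark.sparkContext.parallelize(1 to 10)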
13 What is Kafka?
- Why Kafka and Architecture of Kafka
- Setting up a 2-broker Kafka cluster
- Writing a basic producer/consumer example (see the sketch below)
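A minimal producer/consumer sketch in Scala using Kafka's Java client; the broker addresses and topic name are placeholders for the 2-broker setup built in class:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    // Producer: send one message to the topic
    val pProps = new Properties()
    pProps.put("bootstrap.servers", "broker1:9092,broker2:9092")   // placeholder brokers
    pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](pProps)
    producer.send(new ProducerRecord[String, String]("demo-topic", "key1", "hello kafka"))
    producer.close()

    // Consumer: subscribe and poll for records
    val cProps = new Properties()
    cProps.put("bootstrap.servers", "broker1:9092,broker2:9092")   // placeholder brokers
    cProps.put("group.id", "demo-group")
    cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    cProps.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer[String, String](cProps)
    consumer.subscribe(Collections.singletonList("demo-topic"))
    val records = consumer.poll(1000L)   // timeout in ms (newer clients take a java.time.Duration)
    for (r <- records.asScala) println(s"${r.key} -> ${r.value}")
    consumer.close()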