Venkat Krishnan

2 – Spark [Cloudera Certification] Plus Apache Storm

Course Outline

  1. YARN Overview

  • Anatomy of Hadoop Cluster, Installing and Configuring Plain Hadoop
  • What is Big Data Analytics
  • Batch vs. Real-Time
  • Limitations of Hadoop
  • Storm for Real-Time Analytics
  2. Spark Basics
  • Batch Analytics
  • Real-Time Analytics Options
  • Basics of Python
  • Basics of Scala
  • In-Memory Data – Spark
  3. Spark Installation
  • Spark Installation
  • Overview of Spark on a cluster
  • Spark Standalone Cluster
  4. Working with RDD
  • RDDs
  • Transformations in RDD
  • Actions in RDD
  • Spark Application Components – Executor, Context, Driver
  • Loading Data in RDD
  • Saving Data through RDD
  • Spark Properties
  • Spark UI
  • Spark Partitioning / Parallelism
  • Logging in Spark
  • Checkpoints in Spark
  • Key-Value Pair RDD
  • Hadoop Integration Hands-On
  • Pair RDDs
  5. Data Persistence
  • Avoiding Re-Computation
  • RDD Persist Options
  • Choosing Storage Level
  • Checkpointing vs. Caching
  6. Spark Applications
  • Creating an application
  • Submitting to the Cluster
  • YARN Cluster
  • Setting Configuration at Runtime
  7. Spark Advanced Features
  • Accumulators
  • Broadcast Variables
  • Advanced Partitioning & Operations
  8. Spark SQL
  • Spark SQL
  • SQLContext vs. HiveContext
  • Why DataFrames?
  • Different ways of creating DataFrames
  • DataFrame operations
  • Saving DataFrames
  • UDFs
  9. Spark Streaming
  • Architecture
  • DStream Overview
  • Receivers
  10. Spark MLlib
  • Supervised Learning [Classification, Regression]
  • Unsupervised Learning [Clustering, K-Means]