Apache Spark Training

This Apache Spark training covers Spark at basic, intermediate, and advanced levels, including its implementation and application development.

Apache Spark Introduction

Spark Architecture

Big Data Introduction

In Memory Data Model

Distributed Computing



Spark Driver introduction

How Spark works with Java, Scala, Python, and R

Spark Setup

Setting up Spark in Standalone Mode

Java JDK

Spark Development Environment (Java/Scala)

Spark REPL

Scala (optional; can be swapped with Java)

Scala Intermediate Topics

Why Scala for Spark?

Scala in other frameworks

Introduction to Scala REPL

Basic Scala operations

Variable Types in Scala

Control Structures in Scala

Foreach loop, Functions and Procedures

Collections in Scala: Array, ArrayBuffer, Map, Tuples, Lists, and more

Class in Scala


Getters and Setters

Extending a Class

Overriding Methods

Traits as Interfaces and Layered Traits

Functional Programming

Higher Order Functions

Anonymous Functions, and more

Scala and SBT (can be swapped with Java/Maven)

Scala SBT Setup

IDE setup


Java Maven (POM) setup

IDE Setup

Spark Architecture

Elements and Features of Spark

Resilient Distributed Datasets (RDD)

Data Frames


Driver Application

MapReduce

Interacting with MapReduce

Spark Shell

Spark in Standalone

Spark in Distributed mode

Spark with Hadoop and YARN

Overview of Functional programming


Data State


Functional programming advantages

Higher order functions

Stateless data processing over distributed network


Spark RDD

Deep dive into Spark RDDs

Creating RDDs

RDD Data Loading

RDD partitioning

RDD Transformation & functions

Cache intermediate RDDs

The RDD general operations

A read-only partitioned collection of records

RDD for faster and efficient data processing

RDD Actions and Functions: Collect, Count, Collection Map, List

Save RDD results as text files

Pair RDD functions

RDD Lineage

Key-Value pairs in RDDs

Spark MapReduce with RDD

Spark Internals with RDD, Immutability

RDD Persistence

RDD persistence overview

Spark execution flow & Spark terminology

Distributed shared memory vs. RDD

RDD limitations

Distributed persistence

RDD lineage

Key/Value pair functions for sorting and aggregation, available through implicit conversion, such as CountByKey, ReduceByKey, SortByKey, AggregateByKey
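The key/value functions named above can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: the helpers `count_by_key`, `reduce_by_key`, and `sort_by_key` are hypothetical stand-ins that mimic what the corresponding pair RDD functions compute, and the `sales` data is made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def count_by_key(pairs):
    """Count how many records exist for each key."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)


def reduce_by_key(pairs, func):
    """Combine all values that share a key using a two-argument function."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def sort_by_key(pairs, ascending=True):
    """Return the pairs ordered by key (stable for equal keys)."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)


# Hypothetical sample data: (product, quantity) pairs.
sales = [("pear", 3), ("apple", 2), ("pear", 5), ("fig", 1), ("apple", 4)]

print(count_by_key(sales))
print(reduce_by_key(sales, lambda a, b: a + b))
print(sort_by_key(sales))
```

In Spark the same operations run per partition and are then merged across the cluster, which is why the function passed to a reduce-style operation should be associative and commutative.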

Day 2

Data Frame and Spark SQL

Spark SQL Overview

Spark SQL Architecture

SQL Context in Spark SQL

Data Frames & Datasets

Architecture of Data Frameworks

JSON support in Spark SQL, working with XML data

Parquet files

Creating HiveContext

Writing Data Frame to Hive

Reading data through JDBC

Understanding the Data Frames in Spark

Creating Data Frames, manual schema inference

Working with CSV files

Reading JDBC tables

Data Frame to JDBC

User defined functions in Spark SQL

Shared variables and accumulators

Learning to query and transform data in Data Frames

How Data Frames provide the benefits of both Spark RDD and Spark SQL

Deploying Hive on Spark as the execution engine

Partitioning (Advanced)

Learning about the scheduling and partitioning in Spark

Hash partition, range partition

Scheduling within and around applications

Static partitioning

Dynamic sharing

Fair scheduling

Map partition with index

Zip, GroupByKey

Spark master high availability

Standby Masters with Zookeeper

Single Node Recovery With Local File System

Higher-Order Functions
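As a rough illustration of the difference between hash and range partitioning covered in this section, the following plain Python sketch assigns keys to partitions both ways. It does not use Spark: the functions `hash_partition`, `range_partition`, and `assign` are hypothetical, the integer keys and the boundaries are made up, and Spark's own partitioners pick range boundaries by sampling the data rather than taking them as input.

```python
import bisect
from collections import defaultdict


def hash_partition(key, num_partitions):
    """Pick a partition from the key's hash; spreads keys without preserving order."""
    return hash(key) % num_partitions


def range_partition(key, boundaries):
    """Pick a partition from sorted upper bounds; keeps nearby keys together.

    Keys <= boundaries[0] go to partition 0, keys <= boundaries[1] to
    partition 1, and anything larger to the last partition.
    """
    return bisect.bisect_left(boundaries, key)


def assign(keys, partitioner):
    """Group keys by the partition index the partitioner returns."""
    layout = defaultdict(list)
    for key in keys:
        layout[partitioner(key)].append(key)
    return dict(layout)


# Hypothetical integer keys (small non-negative ints hash to themselves in Python).
keys = [3, 14, 25, 8, 19, 30]

hash_layout = assign(keys, lambda k: hash_partition(k, 3))
range_layout = assign(keys, lambda k: range_partition(k, [10, 20]))

print(hash_layout)
print(range_layout)
```

Range partitioning keeps each partition's keys within a contiguous interval, which is what makes a globally sorted output possible; hash partitioning only guarantees that equal keys land in the same partition.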

Hive Integration

Integrating Hive

Hive Context

Hive SQL

Spark Streaming

Spark Stream Introduction

Batch Processing

Micro Batch

Window and Time Slice

Spark Streaming architecture

Create DStreams

Create a simple Spark Streaming application

DStream operations

Apply DStream operations

Use Spark SQL to query DStreams

Define window operations

Describe how DStreams are fault-tolerant
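The micro-batch and window topics above can be pictured with a small plain Python sketch. It is not the Spark Streaming API: `sliding_windows` is a hypothetical helper, each inner list stands for one micro-batch, and as a simplification only full windows are emitted. The window length and slide interval are expressed in number of batches.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Merge consecutive micro-batches into overlapping windows.

    window_length and slide_interval are counted in batches. Each window
    is the flattened content of the last `window_length` batches, and a
    new window is produced every `slide_interval` batches.
    """
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        windows.append([record for batch in window for record in batch])
    return windows


# Hypothetical stream: five micro-batches of integer records.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]

windows = sliding_windows(batches, window_length=3, slide_interval=2)
window_sums = [sum(w) for w in windows]

print(windows)
print(window_sums)
```

Because the slide interval is shorter than the window length here, the third batch appears in both windows; this overlap is the defining property of a sliding window.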

Monitor Spark Application

Use the SparkUI to monitor a Spark application

Debug and tune Spark applications

Day 3

Spark MLlib

What is Machine Learning?

Where is Machine Learning Used?

Different Types of Machine Learning Techniques

Understanding MLlib

Distributed Architecture for MLlib

Features of MLlib and MLlib Tools

Various ML algorithms supported by MLlib

K-Means Clustering & How It Works with MLlib

Use cases with Kaggle Data Set

Kafka and Spark

Kafka Overview

Integrating Kafka Streams into Spark

Kafka Connect

Ingest from Kafka Stream

Spark to Kafka Stream


Hadoop Integration

HDFS File System Access


Spark and Zeppelin Integration

Live Data View

Creating charts and tables using Zeppelin

Spark Performance Tuning



Memory Management



Gopalakrishnan Subramani


Mentor, Trainer, Consultant & Architect for IoT, Azure IoT & Cloud, Node, React, Angular, Scala, Apache Kafka, and Apache Spark, with deep expertise in building Industrial SCADA and IIoT solutions and web, mobile, and backend applications on on-premises and cloud infrastructure.

Mentor and Consultant for SCADA and HMI device applications using modern web technologies, with expertise in IoT, sensors, M2M connectivity, building connected wired and wireless device applications, cloud enablement, data streaming, RESTful architecture, and Progressive Web Applications using Hadoop, Spark, Kafka, AMQP, MQTT, MQTT-SN, CoAP, JavaScript, MongoDB, MySQL, Cassandra & Java.