Spark Tutorial: Real Time Cluster Computing Framework

Apache Spark is an open-source cluster computing framework for real-time processing. It is of the most successful projects in the Apache Software Foundation. Spark has clearly evolved as the market leader for Big Data processing. Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo! Many organizations run Spark on clusters with thousands of nodes. We are excited to begin this exciting journey through this Spark Tutorial blog. This blog is the first blog in the upcoming Apache Spark blog series which will include Spark Streaming, Spark Interview Questions, Spark MLlib and others.

When it comes to Real Time Data Analytics, Spark stands as the go-to tool across all other solutions. Through this blog, I will introduce you to this new exciting domain of Apache Spark and we will go through a complete use case, Flight Data Analytics using Spark.

The following are the topics covered in this Spark Tutorial blog:

Spark Tutorial: Real Time Analytics

Before we begin, let us have a look at the amount of data generated every minute by social media leaders.

Data Generated - Spark Tutorial - Edureka

Figure: Amount of data generated every minute

As we can see, there is a colossal amount of data that the internet world necessitates to process in seconds. We will go through all the stages of handling big data in enterprises and discover the need for a Real Time Processing Framework called Apache Spark.

To begin with, let me introduce you to few domains using real-time analytics big time in today’s world.

Real Time Analytics - Spark Tutorial - Edureka

Figure: Spark Tutorial – Examples of Real Time Analytics

We can see that Real Time Processing of Big Data is ingrained in every aspect of our lives. From fraud detection in banking to live surveillance systems in government, automated machines in healthcare to live prediction systems in the stock market, everything around us revolves around processing big data in near real time.

Let us look at some of these use cases of Real Time Analytics:

Healthcare: Healthcare domain uses Real Time analysis to continuously check the medical status of critical patients. Hospitals on the look out for blood and organ transplants need to stay in a real-time contact with each other during emergencies. Getting medical attention on time is a matter of life and death for patients.
Government: Government agencies perform Real Time Analysis mostly in the field of national security. Countries need to continuously keep a track of all the military and police agencies for updates regarding threats to security.
Telecommunications: Companies revolving around services in the form of calls, video chats and streaming use real-time analysis to reduce customer churn and stay ahead of the competition. They also extract measurements of jitter and delay in mobile networks to improve customer experiences.
Banking: Banking transacts with almost all of the world’s money. It becomes very important to ensure fault tolerant transactions across the whole system. Fraud detection is made possible through real-time analytics in banking.
Stock Market: Stockbrokers use real-time analytics to predict the movement of stock portfolios. Companies re-think their business model after using real-time analytics to analyze the market demand for their brand.

GET STARTED WITH SPARK

Spark Tutorial: Why Spark when Hadoop is already there?

The first of the many questions everyone asks when it comes to Spark is, “Why Spark when we have Hadoop already?“.

To answer this, we have to look at the concept of batch and real-time processing. Hadoop is based on the concept of batch processing where the processing happens of blocks of data that have already been stored over a period of time. At the time, Hadoop broke all the expectations with the revolutionary MapReduce framework in 2005. Hadoop MapReduce is the best framework for processing data in batches.

This went on until 2014, till Spark overtook Hadoop. The USP for Spark was that it could process data in real time and was about 100 times faster than Hadoop MapReduce in batch processing large data sets.

The following figure gives a detailed explanation of the differences between processing in Spark and Hadoop.

Figure: Spark Tutorial – Differences between Hadoop and Spark

Here, we can draw out one of the key differentiators between Hadoop and Spark. Hadoop is based on batch processing of big data. This means that the data is stored over a period of time and is then processed using Hadoop. Whereas in Spark, processing can take place in real-time. This real-time processing power in Spark helps us to solve the use cases of Real Time Analytics we saw in the previous section. Alongside this, Spark is also able to do batch processing 100 times faster than that of Hadoop MapReduce (Processing framework in Apache Hadoop). Therefore, Apache Spark is the go-to tool for big data processing in the industry.

Spark Tutorial: What is Apache Spark?

Apache Spark is an open-source cluster computing framework for real-time processing. It has a thriving open-source community and is the most active Apache project at the moment. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Apache Spark - Spark Interview Questions - Edureka Figure: Spark Tutorial – Real Time Processing in Apache Spark

It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations.

Spark Tutorial: Features of Apache Spark

Spark has the following features:

Figure: Spark Tutorial – Spark Features

Let us look at the features in detail:

Polyglot:

Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and Python shell through ./bin/pyspark from the installed directory.

Spark Features - Spark Tutorial - Edureka

Speed:

Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed data processing with minimal network traffic.

Multiple Formats:

Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra apart from the usual formats such as text files, CSV and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data though Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.

Spark Features 3 - Spark Tutorial - Edureka

Lazy Evaluation:

Apache Spark delays its evaluation till it is absolutely necessary. This is one of the key factors contributing to its speed. For transformations, Spark adds them to a DAG (Directed Acyclic Graph) of computation and only when the driver requests some data, does this DAG actually gets executed.

Real Time Computation:

Spark’s computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability and the Spark team has documented users of the system running production clusters with thousands of nodes and supports several computational models.

Spark Features 4 - Spark Tutorial - Edureka

Spark Features 5 - Spark Tutorial - Edureka

Hadoop Integration:

Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, while Spark has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling.

Machine Learning:

Spark’s MLlib is the machine learning component which is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.

Spark Features 6 - Spark Tutorial - Edureka

Spark Tutorial: Getting Started With Spark

The first step in getting started with Spark is installation. Let us install Apache Spark 2.1.0 on our Linux systems (I am using Ubuntu).

Installation:

The prerequisites for installing Spark is having Java and Scala installed.

Download Java in case it is not installed using below commands.

sudo apt-get install python-software-properties
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Download the latest Scala version from Scala Lang Official page. Once installed, set the scala path in ~/.bashrc file as shown below.
```
export SCALA_HOME=Path_Where_Scala_File_Is_Located
export PATH=$SCALA_HOME/bin:PATH
```
Download Spark 2.1.0 from the Apache Spark Downloads page. You can also choose to download a previous version.
Extract Spark tar using below command.
```
tar -xvf spark-2.1.0-bin-hadoop2.7.tgz
```

Set the Spark_Path in ~/.bashrc file.

export SPARK_HOME=Path_Where_Spark_Is_Installed
export PATH=$PATH:$SPARK_HOME/bin

Before we move further, let us start up Apache Spark on our systems and get used to the main concepts of Spark like Spark Session, Data Sources, RDDs, DataFrames and other libraries.

Spark Shell:

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively.

Spark Session:

In earlier versions of Spark, Spark Context was the entry point for Spark. For every other API, we needed to use different contexts. For streaming, we needed StreamingContext, for SQL sqlContext and for hive HiveContext. To solve this issue, SparkSession came into the picture. It is essentially a combination of SQLContext, HiveContext and future StreamingContext.

Data Sources:

The Data Source API provides a pluggable mechanism for accessing structured data though Spark SQL. Data Source API is used to read and store structured and semi-structured data into Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.

RDD:

Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Dataset:

A Dataset is a distributed collection of data. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.

DataFrames:

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases or existing RDDs.

Spark Tutorial: Using Spark with Hadoop

The best part of Spark is its compatibility with Hadoop. As a result, this makes for a very powerful combination of technologies. Here, we will be looking at how Spark can benefit from the best of Hadoop.

Figure: Spark Tutorial – Spark Features

Hadoop components can be used alongside Spark in the following ways:

HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
YARN: Spark applications can be made to run on YARN (Hadoop NextGen).
Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.

LEARN SPARK FROM EXPERTS

Spark Tutorial: Spark Components

Spark components are what make Apache Spark fast and reliable. A lot of these Spark components were built to resolve the issues that cropped up while using Hadoop MapReduce. Apache Spark has the following components:

Spark Core
Spark Streaming
Spark SQL
GraphX
MLlib (Machine Learning)

Spark Core

Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Further, additional libraries which are built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:

Memory management and fault recovery
Scheduling, distributing and monitoring jobs on a cluster
Interacting with storage systems

Spark Streaming

Spark Streaming is the component of Spark which is used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is DStream which is basically a series of RDDs (Resilient Distributed Datasets) to process the real-time data.

Spark Streaming - Spark Tutorial - Edureka Figure: Spark Tutorial – Spark Streaming

Spark SQL

Spark SQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools where you can extend the boundaries of traditional relational data processing.

Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.

The following are the four libraries of Spark SQL.

Data Source API
DataFrame API
Interpreter & Optimizer
SQL Service

A complete tutorial on Spark SQL can be found in the given blog: Spark SQL Tutorial Blog

GraphX

GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.

The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex have user defined properties associated with it. Here, the parallel edges allow multiple relationships between the same vertices. At a high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

MlLib (Machine Learning)

MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.

Use Case: Flight Data Analysis using Spark GraphX

Now that we have understood the core concepts of Spark, let us solve a real-life problem using Apache Spark. This will help give us the confidence to work on any Spark projects in the future.

Problem Statement: To analyse Real-Time Flight data using Spark GraphX, provide near real-time computation results and visualize the results using Google Data Studio

Use Case – Computations to be done:

Compute the total number of flight routes
Compute and sort the longest flight routes
Display the airport with the highest degree vertex
List the most important airports according to PageRank
List the routes with the lowest flight costs

We will use Spark GraphX for the above computations and visualize the results using Google Data Studio.

Use Case – Dataset:

US Flight Dataset - Spark Tutorial - Edureka Figure: Use Case – USA Flight Dataset

Click here to download the complete dataset: USA Flight Dataset – Spark Training – Edureka

Use Case – Flow Diagram:

The following illustration clearly explains all the steps involved in our Flight Data Analysis.

Figure: Use Case – Flow diagram of Flight Data Analysis using Spark GraphX

Use Case – Spark Implementation:

Moving ahead, now let us implement our project using Eclipse IDE for Spark.

Find the Pseudo Code below:

//Importing the necessary classes
import org.apache.spark._
...
import java.io.File

object airport {

 def main(args: Array[String]){

//Creating a Case Class Flight
case class Flight(dofM:String, dofW:String, ... ,dist:Int)

//Defining a Parse String function to parse input into Flight class
def parseFlight(str: String): Flight = {
val line = str.split(",")
Flight(line(0), line(1), ... , line(16).toInt)
}
val conf = new SparkConf().setAppName("airport").setMaster("local[2]")
val sc = new SparkContext(conf)
//Load the data into a RDD

val textRDD = sc.textFile("/home/edureka/usecases/airport/airportdataset.csv")

//Parse the RDD of CSV lines into an RDD of flight classes
val flightsRDD = Map ParseFlight to Text RDD

//Create airports RDD with ID and Name
val airports = Map Flight OriginID and Origin
airports.take(1)

//Defining a default vertex called nowhere and mapping Airport ID for printlns
val nowhere = "nowhere"
val airportMap = Use Map Function .collect.toList.toMap

//Create routes RDD with sourceID, destinationID and distance
val routes = flightsRDD. Use Map Function .distinct
routes.take(2)

//Create edges RDD with sourceID, destinationID and distance
val edges = routes.map{( Map OriginID and DestinationID ) => Edge(org_id.toLong, dest_id.toLong, distance)}
edges.take(1)

//Define the graph and display some vertices and edges
val graph = Graph( Airports, Edges and Nowhere )
graph.vertices.take(2)
graph.edges.take(2)

//Query 1 - Find the total number of airports
val numairports = Vertices Number

//Query 2 - Calculate the total number of routes?
val numroutes = Number Of Edges

//Query 3 - Calculate those routes with distances more than 1000 miles
graph.edges.filter { Get the edge distance )=> distance > 1000}.take(3)

//Similarly write Scala code for the below queries
//Query 4 - Sort and print the longest routes
//Query 5 - Display highest degree vertices for incoming and outgoing flights of airports
//Query 6 - Get the airport name with IDs 10397 and 12478
//Query 7 - Find the airport with the highest incoming flights
//Query 8 - Find the airport with the highest outgoing flights
//Query 9 - Find the most important airports according to PageRank
//Query 10 - Sort the airports by ranking
//Query 11 - Display the most important airports
//Query 12 - Find the Routes with the lowest flight costs
//Query 13 - Find airports and their lowest flight costs
//Query 14 - Display airport codes along with sorted lowest flight costs

Click here to get the full source code of Flight Analysis using Apache Spark.

Use Case – Visualizing Results:

We will be using Google Data Studio to visualize our analysis. Google Data Studio is a product under Google Analytics 360 Suite. We will use Geo Map service to map the Airports on their respective locations on the USA map and display the metrics quantity.

Display the total number of flights per Airport
Display the metric sum of Destination routes from every Airport
Display the total delay of all flights per Airport

Visualizing Results - Spark Tutorial - Edureka

Now, this concludes the Apache Spark blog. I hope you enjoyed reading it and found it informative. By now, you must have acquired a sound understanding of what Apache Spark is. The hands-on examples will give you the required confidence to work on any future projects you encounter in Apache Spark. Practice is the key to mastering any subject and I hope this blog has created enough interest in you to explore learning further on Apache Spark.

GET SPARK CERTIFIED TODAY

We recommend the following Apache Spark Training videos from Edureka to begin with:

What is Apache Spark | Apache Spark Training | Edureka

This video series on Spark Tutorial provide a complete background into the components along with Real-Life use cases such as Twitter Sentiment Analysis, NBA Game Prediction Analysis, Earthquake Detection System, Flight Data Analytics and Movie Recommendation Systems. We have personally designed the use cases so as to provide an all round expertise to anyone running the code.

Got a question for us? Please mention it in the comments section and we will get back to you at the earliest.

If you wish to learn Spark and build a career in domain of Spark to perform large-scale Data Processing using RDD, Spark Streaming, SparkSQL, MLlib, GraphX and Scala with Real Life use-cases, check out our interactive, live-online Apache Spark Certification Training here, that comes with 24*7 support to guide you throughout your learning period.

The post Spark Tutorial: Real Time Cluster Computing Framework appeared first on Edureka Blog.