Apache Spark is an open-source cluster computing framework for real-time processing. It is one of the most successful projects in the Apache Software Foundation and has emerged as a market leader for Big Data processing. Today, Spark is used by major players such as Amazon, eBay, and Yahoo!, and many organizations run Spark on clusters with thousands of nodes. This Spark Tutorial is the first blog in our upcoming Apache Spark series, which will also cover Spark Streaming, Spark Interview Questions, Spark MLlib and more.
When it comes to Real Time Data Analytics, Spark stands out as the go-to tool among the available solutions. Through this blog, I will introduce you to the exciting domain of Apache Spark, and we will go through a complete use case, Flight Data Analytics using Spark.
The following are the topics covered in this Spark Tutorial blog:
- Real Time Analytics
- Why Spark when Hadoop is already there?
- What is Apache Spark?
- Spark Features
- Getting Started with Spark
- Using Spark with Hadoop
- Spark Components
- Use Case: Analyze Flight Data using Spark GraphX
Spark Tutorial: Real Time Analytics
Before we begin, let us have a look at the amount of data generated every minute by social media leaders.
Figure: Amount of data generated every minute
As we can see, the internet has to process a colossal amount of data within seconds. We will go through all the stages of handling big data in enterprises and discover the need for a Real Time Processing Framework called Apache Spark.
To begin with, let me introduce you to a few domains that rely heavily on real-time analytics today.
Figure: Spark Tutorial – Examples of Real Time Analytics
We can see that Real Time Processing of Big Data is ingrained in every aspect of our lives. From fraud detection in banking to live surveillance systems in government, automated machines in healthcare to live prediction systems in the stock market, everything around us revolves around processing big data in near real time.
Let us look at some of these use cases of Real Time Analytics:
- Healthcare: The healthcare domain uses real-time analysis to continuously monitor the medical status of critical patients. Hospitals on the lookout for blood and organ transplants need to stay in real-time contact with each other during emergencies. Getting medical attention on time is a matter of life and death for patients.
- Government: Government agencies perform real-time analysis mostly in the field of national security. Countries need to continuously keep track of all their military and police agencies for updates regarding security threats.
- Telecommunications: Companies revolving around services in the form of calls, video chats and streaming use real-time analysis to reduce customer churn and stay ahead of the competition. They also extract measurements of jitter and delay in mobile networks to improve customer experiences.
- Banking: Banks handle almost all of the world’s money, so it is very important to ensure fault-tolerant transactions across the whole system. Real-time analytics also makes fraud detection possible in banking.
- Stock Market: Stockbrokers use real-time analytics to predict the movement of stock portfolios. Companies re-think their business model after using real-time analytics to analyze the market demand for their brand.
Spark Tutorial: Why Spark when Hadoop is already there?
The first of the many questions everyone asks when it comes to Spark is, “Why Spark when we have Hadoop already?”
To answer this, we have to look at the concepts of batch and real-time processing. Hadoop is based on batch processing, where processing happens on blocks of data that have already been stored over a period of time. When it arrived in 2005, Hadoop exceeded all expectations with its revolutionary MapReduce framework. Hadoop MapReduce is the best framework for processing data in batches.
This continued until 2014, when Spark overtook Hadoop MapReduce. The USP for Spark was that it could process data in near real time and run up to 100 times faster than Hadoop MapReduce when batch processing large data sets in memory.
The following figure gives a detailed explanation of the differences between processing in Spark and Hadoop.
Figure: Spark Tutorial – Differences between Hadoop and Spark
Here, we can draw out one of the key differentiators between Hadoop and Spark. Hadoop is based on batch processing of big data: the data is stored over a period of time and is then processed using Hadoop. In Spark, by contrast, processing can take place in real time. This real-time processing power helps us solve the Real Time Analytics use cases we saw in the previous section. Alongside this, Spark can also do batch processing up to 100 times faster than Hadoop MapReduce (the processing framework in Apache Hadoop). This is why Apache Spark has become the go-to tool for big data processing in the industry.
Spark Tutorial: What is Apache Spark?
Apache Spark is an open-source cluster computing framework for real-time processing. It has a thriving open-source community and, at the time of writing, is the most active Apache project. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Figure: Spark Tutorial – Real Time Processing in Apache Spark
Spark is not built on top of Hadoop MapReduce. It has its own execution engine, which extends the MapReduce model to efficiently support more types of computations, such as interactive queries and stream processing.
Spark Tutorial: Features of Apache Spark
Spark has the following features:
Figure: Spark Tutorial – Spark Features
Let us look at the features in detail:
Polyglot:
Spark provides high-level APIs in Java, Scala, Python and R, and Spark code can be written in any of these four languages. It provides interactive shells for Scala and Python. From the installation directory, the Scala shell is started with ./bin/spark-shell and the Python shell with ./bin/pyspark.
Speed:
Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It achieves this speed through controlled partitioning: data is managed in partitions that help parallelize distributed data processing with minimal network traffic.
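To make the idea of partitions concrete, here is a minimal sketch, assuming a local Spark installation. The object name PartitionSketch, the application name and the choice of four partitions are illustrative assumptions, not part of the original tutorial.

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  // Returns (number of partitions, number of elements held by each partition).
  def run(): (Int, Seq[Int]) = {
    val spark = SparkSession.builder()
      .appName("partition-sketch") // hypothetical application name
      .master("local[4]")          // assumption: local mode with 4 worker threads
      .getOrCreate()
    try {
      // Explicitly ask for 4 partitions; each one can be processed by its own task.
      val numbers = spark.sparkContext.parallelize(1 to 100, 4)
      // Count the elements inside every partition without moving data between them.
      val sizes = numbers.mapPartitions(iter => Iterator(iter.size)).collect().toSeq
      (numbers.getNumPartitions, sizes)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (count, sizes) = run()
    println(s"partitions = $count, elements per partition = ${sizes.mkString(", ")}")
  }
}
```

Because mapPartitions works on each partition independently, no data has to cross the network for this computation, which is the kind of locality the paragraph above refers to.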
Multiple Formats:
Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra, in addition to the usual formats such as text files, CSV and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
Lazy Evaluation:
Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors contributing to its speed. Transformations are only added to a DAG (Directed Acyclic Graph) of computation, and this DAG actually gets executed only when the driver requests some data.
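Lazy evaluation can be observed directly. The following is a small sketch, assuming local mode; the object name LazyEvalSketch and the accumulator name mapCalls are made up for illustration. It counts how often the function passed to map has run, before and after an action is called.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  // Returns (map calls seen before the action, map calls seen after it, result of count()).
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .appName("lazy-eval-sketch") // hypothetical application name
      .master("local[2]")          // assumption: local mode
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val calls = sc.longAccumulator("mapCalls")

      // Transformation: only recorded in the DAG, nothing is computed yet.
      val doubled = sc.parallelize(1 to 5).map { n => calls.add(1); n * 2 }
      val before = calls.value.longValue

      // Action: the driver asks for a result, so the DAG is executed now.
      val total = doubled.count()
      // Note: accumulators updated inside transformations can over-count if tasks
      // are retried; in this small local run no retries are expected.
      val after = calls.value.longValue
      (before, after, total)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The accumulator stays at zero until count() is called, which shows that map on its own did no work.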
Real Time Computation:
Spark’s computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability: the Spark team has documented users running production clusters with thousands of nodes, and the system supports several computational models.
Hadoop Integration:
Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it can also run on top of an existing Hadoop cluster using YARN for resource scheduling.
Machine Learning:
Spark’s MLlib is the machine learning component, which is handy when it comes to big data processing. It eliminates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
Spark Tutorial: Getting Started With Spark
The first step in getting started with Spark is installation. Let us install Apache Spark 2.1.0 on our Linux systems (I am using Ubuntu).
Installation:
- The prerequisites for installing Spark are Java and Scala.
- Install Java, in case it is not already installed, using the commands below.
    sudo apt-get install python-software-properties
    sudo apt-add-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer
- Download the latest Scala version from Scala Lang Official page. Once installed, set the Scala path in the ~/.bashrc file as shown below.
    export SCALA_HOME=Path_Where_Scala_File_Is_Located
    export PATH=$SCALA_HOME/bin:$PATH
- Download Spark 2.1.0 from the Apache Spark Downloads page. You can also choose to download a previous version.
- Extract the Spark tar file using the command below.
    tar -xvf spark-2.1.0-bin-hadoop2.7.tgz
- Set the Spark path in the ~/.bashrc file.
    export SPARK_HOME=Path_Where_Spark_Is_Installed
    export PATH=$PATH:$SPARK_HOME/bin
Before we move further, let us start up Apache Spark on our systems and get used to the main concepts of Spark like Spark Session, Data Sources, RDDs, DataFrames and other libraries.
Spark Shell:
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively.
Spark Session:
In earlier versions of Spark, SparkContext was the entry point for Spark. Every other API needed its own context: StreamingContext for streaming, SQLContext for SQL and HiveContext for Hive. To solve this issue, SparkSession came into the picture. It is essentially a combination of SQLContext, HiveContext and, in future, StreamingContext.
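As a rough sketch of what this looks like in code, a single SparkSession can serve both RDD and SQL work. The object name SessionSketch and the application name are assumptions made for this illustration, and local mode is assumed.

```scala
import org.apache.spark.sql.SparkSession

object SessionSketch {
  // Returns (application name, RDD element count, number of rows from a SQL query).
  def run(): (String, Long, Long) = {
    // One builder call replaces the separate SQLContext / HiveContext set-up.
    val spark = SparkSession.builder()
      .appName("session-sketch") // hypothetical application name
      .master("local[2]")        // assumption: local mode
      .getOrCreate()
    try {
      // The classic SparkContext is still reachable through the session.
      val sc = spark.sparkContext
      val rddCount = sc.parallelize(Seq(1, 2, 3)).count()

      // SQL goes through the same session object.
      val sqlRows = spark.sql("SELECT 1 AS one").count()
      (sc.appName, rddCount, sqlRows)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In the interactive shells this set-up is already done for you: spark-shell exposes a ready-made session as spark and its context as sc.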
Data Sources:
The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. It is used to read structured and semi-structured data into Spark SQL and to store it back. Data sources can be more than just simple pipes that convert data and pull it into Spark.
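The sketch below shows the same data being written and read back in two formats through the Data Source API. It assumes local mode and a writable temporary directory on the local file system; the object name, column names and the two sample rows are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object DataSourceSketch {
  // Returns (rows read back from JSON, column names read back from Parquet).
  def run(): (Long, Seq[String]) = {
    val spark = SparkSession.builder()
      .appName("datasource-sketch") // hypothetical application name
      .master("local[2]")           // assumption: local mode
      .getOrCreate()
    import spark.implicits._
    try {
      // Assumption: a temporary local directory stands in for HDFS or other storage.
      val dir = Files.createTempDirectory("spark-datasource-sketch").toString

      // Hypothetical sample rows, not taken from the tutorial's dataset.
      val flights = Seq(("SFO", "JFK", 2586), ("ORD", "ATL", 606))
        .toDF("origin", "dest", "dist")

      // The same DataFrame can be stored through different data sources.
      flights.write.json(s"$dir/flights-json")
      flights.write.parquet(s"$dir/flights-parquet")

      // Reading goes through the same pluggable API.
      val fromJson = spark.read.json(s"$dir/flights-json")
      val fromParquet = spark.read.parquet(s"$dir/flights-parquet")
      (fromJson.count(), fromParquet.columns.toSeq)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Parquet stores the schema with the data, so the column names and order come back as written, while the JSON reader infers a schema from the records.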
RDD:
Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
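Here is a minimal sketch of working with an RDD, assuming local mode. The object name RddSketch and the two input strings are made up for illustration. Each transformation returns a new RDD and leaves the original untouched, which is what immutability means in practice.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Returns the number of occurrences of each word in a tiny in-memory collection.
  def run(): Map[String, Int] = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch") // hypothetical application name
      .master("local[2]")    // assumption: local mode
      .getOrCreate()
    try {
      // Hypothetical input lines, distributed as an RDD of strings.
      val lines = spark.sparkContext.parallelize(
        Seq("spark makes rdds", "rdds are immutable"))

      // flatMap, map and reduceByKey each build a new RDD; `lines` is never modified.
      val counts = lines
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // collect() is the action that brings the result back to the driver.
      counts.collect().toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```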
Dataset:
A Dataset is a distributed collection of data. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.
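As an illustration, the sketch below builds a Dataset from JVM objects and applies typed functional transformations to it. The case class RouteRecord, the object name and the sample values are hypothetical, and local mode is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class RouteRecord(origin: String, dest: String, dist: Int)

object DatasetSketch {
  // Returns the routes longer than 900 miles, formatted as "ORIGIN->DEST".
  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch") // hypothetical application name
      .master("local[2]")        // assumption: local mode
      .getOrCreate()
    import spark.implicits._
    try {
      // A Dataset constructed from ordinary JVM objects.
      val routes = Seq(
        RouteRecord("SFO", "JFK", 2586),
        RouteRecord("ORD", "ATL", 606),
        RouteRecord("LAX", "SEA", 954)).toDS()

      // filter and map operate on RouteRecord values, checked at compile time.
      routes
        .filter(_.dist > 900)
        .map(r => s"${r.origin}->${r.dest}")
        .collect()
        .toSeq
        .sorted
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```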
DataFrames:
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases or existing RDDs.
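A short sketch of the table-like nature of a DataFrame follows, assuming local mode. The object name, column names and sample rows are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  // Returns the number of departing flights per origin airport, ordered by origin.
  def run(): Seq[(String, Long)] = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch") // hypothetical application name
      .master("local[2]")          // assumption: local mode
      .getOrCreate()
    import spark.implicits._
    try {
      // Hypothetical rows organized into the named columns origin, dest and dist.
      val flights = Seq(
        ("SFO", "JFK", 2586),
        ("SFO", "LAX", 337),
        ("ORD", "ATL", 606)).toDF("origin", "dest", "dist")

      // Column-based operations, much like GROUP BY / ORDER BY on a relational table.
      flights
        .groupBy("origin")
        .count()
        .orderBy("origin")
        .as[(String, Long)]
        .collect()
        .toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Because the operations are expressed on named columns rather than opaque functions, Spark can optimize the query plan before running it.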
Spark Tutorial: Using Spark with Hadoop
The best part of Spark is its compatibility with Hadoop. As a result, this makes for a very powerful combination of technologies. Here, we will be looking at how Spark can benefit from the best of Hadoop.
Figure: Spark Tutorial – Spark Features
Hadoop components can be used alongside Spark in the following ways:
- HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
- MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
- YARN: Spark applications can be made to run on YARN (Hadoop NextGen).
- Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.
Spark Tutorial: Spark Components
Spark components are what make Apache Spark fast and reliable. A lot of these Spark components were built to resolve the issues that cropped up while using Hadoop MapReduce. Apache Spark has the following components:
- Spark Core
- Spark Streaming
- Spark SQL
- GraphX
- MLlib (Machine Learning)
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Additional libraries built atop the core allow diverse workloads for streaming, SQL, and machine learning. Spark Core is responsible for:
- Memory management and fault recovery
- Scheduling, distributing and monitoring jobs on a cluster
- Interacting with storage systems
Spark Streaming
Spark Streaming is the component of Spark which is used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is DStream which is basically a series of RDDs (Resilient Distributed Datasets) to process the real-time data.
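The following is a minimal sketch of a DStream, assuming local mode with a one-second batch interval. Instead of a live source it feeds a single hand-made RDD through queueStream, which is intended for testing; the object name DStreamSketch, the variable names and the input strings are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.concurrent.TrieMap
import scala.collection.mutable

object DStreamSketch {
  // Returns the word counts produced by the streaming job for one queued batch.
  def run(): Map[String, Int] = {
    val conf = new SparkConf()
      .setAppName("dstream-sketch") // hypothetical application name
      .setMaster("local[2]")        // assumption: local mode with 2 threads
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
    val results = TrieMap.empty[String, Int]
    try {
      // Assumption: a queue of RDDs stands in for a live source such as a socket.
      val batches = mutable.Queue[RDD[String]]()
      batches += ssc.sparkContext.parallelize(Seq("spark streaming", "spark core"))

      // A DStream is a sequence of RDDs, one per batch interval.
      val lines = ssc.queueStream(batches)
      val counts = lines
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // foreachRDD runs on the driver once per batch; record each batch's counts.
      counts.foreachRDD { rdd =>
        rdd.collect().foreach { case (word, n) => results.put(word, n) }
      }

      ssc.start()
      ssc.awaitTerminationOrTimeout(4000) // let a few batches run, then stop
      results.toMap
    } finally ssc.stop(stopSparkContext = true, stopGracefully = false)
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The transformations are the same ones used on plain RDDs, which is why a DStream can be thought of as a series of RDDs processed batch by batch.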
Figure: Spark Tutorial – Spark Streaming
Spark SQL
Spark SQL is a module in Spark that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools, while letting you extend the boundaries of traditional relational data processing.
Spark SQL also provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.
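Weaving SQL with code transformations can be sketched as follows, assuming local mode. The view name flights, the object name SqlSketch and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  // Returns long-haul routes as "ORIGIN:DISTANCE" strings, longest first.
  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .appName("sql-sketch") // hypothetical application name
      .master("local[2]")    // assumption: local mode
      .getOrCreate()
    import spark.implicits._
    try {
      // Hypothetical sample rows registered as a temporary SQL view.
      val flights = Seq(
        ("SFO", "JFK", 2586),
        ("ORD", "ATL", 606),
        ("LAX", "SEA", 954)).toDF("origin", "dest", "dist")
      flights.createOrReplaceTempView("flights")

      // Relational part: a plain SQL query over the view.
      val longHaul = spark.sql(
        "SELECT origin, dist FROM flights WHERE dist > 900 ORDER BY dist DESC")

      // Functional part: continue with ordinary Scala code on the query result.
      longHaul
        .as[(String, Int)]
        .map { case (origin, dist) => s"$origin:$dist" }
        .collect()
        .toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```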
The following are the four libraries of Spark SQL.
- Data Source API
- DataFrame API
- Interpreter & Optimizer
- SQL Service
A complete tutorial on Spark SQL can be found in the given blog: Spark SQL Tutorial Blog
GraphX
GraphX is the Spark API for graphs and graph-parallel computation. At a high level, it extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.
Because the property graph is a multigraph, it can have multiple edges in parallel between the same pair of vertices, which allows multiple relationships between the same vertices. Every edge and vertex has user-defined properties associated with it.
To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages, which replaced the older mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
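A small property graph can be sketched like this, assuming local mode. The vertex names, edge labels and the object name PropertyGraphSketch are invented for illustration; the example includes two parallel edges between the same pair of vertices and uses the subgraph operator.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object PropertyGraphSketch {
  // Returns (vertex count, edge count, edge count of the "follows" subgraph, in-degrees).
  def run(): (Long, Long, Long, Map[Long, Int]) = {
    val conf = new SparkConf()
      .setAppName("property-graph-sketch") // hypothetical application name
      .setMaster("local[2]")               // assumption: local mode
    val sc = new SparkContext(conf)
    try {
      // Vertices carry a String property (a hypothetical user name).
      val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

      // Edges carry a String property; vertices 1 and 2 are linked by two parallel edges.
      val edges = sc.parallelize(Seq(
        Edge(1L, 2L, "follows"),
        Edge(1L, 2L, "emails"),
        Edge(2L, 3L, "follows")))

      // The third argument is the default property for vertices that appear only in edges.
      val graph = Graph(vertices, edges, "unknown")

      // subgraph keeps only the edges that satisfy the predicate.
      val followsOnly = graph.subgraph(epred = triplet => triplet.attr == "follows")

      (graph.numVertices, graph.numEdges, followsOnly.numEdges,
        graph.inDegrees.collect().toMap)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Vertex 2 ends up with an in-degree of 2 because both parallel edges from vertex 1 are counted, which is the multigraph behaviour described above.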
MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
Use Case: Flight Data Analysis using Spark GraphX
Now that we have understood the core concepts of Spark, let us solve a real-life problem using Apache Spark. This will help give us the confidence to work on any Spark projects in the future.
Problem Statement: To analyse Real-Time Flight data using Spark GraphX, provide near real-time computation results and visualize the results using Google Data Studio
Use Case – Computations to be done:
- Compute the total number of flight routes
- Compute and sort the longest flight routes
- Display the airport (vertex) with the highest degree
- List the most important airports according to PageRank
- List the routes with the lowest flight costs
We will use Spark GraphX for the above computations and visualize the results using Google Data Studio.
Use Case – Dataset:
Figure: Use Case – USA Flight Dataset
Click here to download the complete dataset: USA Flight Dataset – Spark Training – Edureka
Use Case – Flow Diagram:
The following illustration clearly explains all the steps involved in our Flight Data Analysis.
Figure: Use Case – Flow diagram of Flight Data Analysis using Spark GraphX
Use Case – Spark Implementation:
Moving ahead, now let us implement our project using Eclipse IDE for Spark.
Find the Pseudo Code below:
    //Importing the necessary classes
    import org.apache.spark._
    ...
    import java.io.File

    object airport {
      def main(args: Array[String]){
        //Creating a Case Class Flight
        case class Flight(dofM:String, dofW:String, ... ,dist:Int)

        //Defining a Parse String function to parse input into Flight class
        def parseFlight(str: String): Flight = {
          val line = str.split(",")
          Flight(line(0), line(1), ... , line(16).toInt)
        }

        val conf = new SparkConf().setAppName("airport").setMaster("local[2]")
        val sc = new SparkContext(conf)

        //Load the data into a RDD
        val textRDD = sc.textFile("/home/edureka/usecases/airport/airportdataset.csv")

        //Parse the RDD of CSV lines into an RDD of flight classes
        val flightsRDD = Map ParseFlight to Text RDD

        //Create airports RDD with ID and Name
        val airports = Map Flight OriginID and Origin
        airports.take(1)

        //Defining a default vertex called nowhere and mapping Airport ID for printlns
        val nowhere = "nowhere"
        val airportMap = Use Map Function .collect.toList.toMap

        //Create routes RDD with sourceID, destinationID and distance
        val routes = flightsRDD. Use Map Function .distinct
        routes.take(2)

        //Create edges RDD with sourceID, destinationID and distance
        val edges = routes.map{( Map OriginID and DestinationID ) => Edge(org_id.toLong, dest_id.toLong, distance)}
        edges.take(1)

        //Define the graph and display some vertices and edges
        val graph = Graph( Airports, Edges and Nowhere )
        graph.vertices.take(2)
        graph.edges.take(2)

        //Query 1 - Find the total number of airports
        val numairports = Vertices Number

        //Query 2 - Calculate the total number of routes?
        val numroutes = Number Of Edges

        //Query 3 - Calculate those routes with distances more than 1000 miles
        graph.edges.filter { Get the edge distance )=> distance > 1000}.take(3)

        //Similarly write Scala code for the below queries
        //Query 4 - Sort and print the longest routes
        //Query 5 - Display highest degree vertices for incoming and outgoing flights of airports
        //Query 6 - Get the airport name with IDs 10397 and 12478
        //Query 7 - Find the airport with the highest incoming flights
        //Query 8 - Find the airport with the highest outgoing flights
        //Query 9 - Find the most important airports according to PageRank
        //Query 10 - Sort the airports by ranking
        //Query 11 - Display the most important airports
        //Query 12 - Find the Routes with the lowest flight costs
        //Query 13 - Find airports and their lowest flight costs
        //Query 14 - Display airport codes along with sorted lowest flight costs
      }
    }
Click here to get the full source code of Flight Analysis using Apache Spark.
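To give a feel for how some of these queries might look as runnable code, here is a small sketch on made-up data. It is not the tutorial's full solution: the four airports, five routes, distances, vertex IDs and the FlightStats type are all hypothetical stand-ins for the real dataset, and local mode is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical summary type used only for this sketch.
case class FlightStats(
  numAirports: Long,
  numRoutes: Long,
  longRoutes: Long,
  longestRoute: (String, String, Int),
  busiestArrival: String,
  topByPageRank: String)

object FlightGraphSketch {
  def run(): FlightStats = {
    val conf = new SparkConf().setAppName("flight-graph-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical airports (ID, code) and routes (source ID, destination ID, distance).
      val airports = sc.parallelize(Seq((1L, "SFO"), (2L, "JFK"), (3L, "ORD"), (4L, "ATL")))
      val routes = sc.parallelize(Seq(
        Edge(1L, 2L, 2586), Edge(3L, 2L, 740), Edge(4L, 2L, 760),
        Edge(1L, 3L, 1846), Edge(2L, 4L, 760)))

      // Default vertex "nowhere" covers IDs that appear in routes but not in airports.
      val graph = Graph(airports, routes, "nowhere")
      val airportMap = airports.collect().toMap

      // Routes with distances of more than 1000 miles.
      val longRoutes = graph.edges.filter(_.attr > 1000).count()

      // Longest route, with airport codes taken from the edge triplets.
      val longest = graph.triplets
        .map(t => (t.srcAttr, t.dstAttr, t.attr))
        .sortBy(_._3, ascending = false)
        .first()

      // Airport with the most incoming flights (highest in-degree).
      val (busiestId, _) = graph.inDegrees.reduce((a, b) => if (a._2 >= b._2) a else b)

      // Most important airport according to PageRank (0.001 is the convergence tolerance).
      val (topId, _) = graph.pageRank(0.001).vertices
        .reduce((a, b) => if (a._2 >= b._2) a else b)

      FlightStats(graph.numVertices, graph.numEdges, longRoutes, longest,
        airportMap(busiestId), airportMap(topId))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With the real dataset, the airports and routes RDDs would be derived from the parsed Flight records instead of being hard-coded as they are here.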
Use Case – Visualizing Results:
We will be using Google Data Studio to visualize our analysis. Google Data Studio is a product under the Google Analytics 360 Suite. We will use its Geo Map service to plot the airports at their respective locations on a map of the USA and display the following metrics:
- Display the total number of flights per Airport
- Display the metric sum of Destination routes from every Airport
- Display the total delay of all flights per Airport
Now, this concludes the Apache Spark blog. I hope you enjoyed reading it and found it informative. By now, you must have acquired a sound understanding of what Apache Spark is. The hands-on examples will give you the required confidence to work on any future projects you encounter in Apache Spark. Practice is the key to mastering any subject and I hope this blog has created enough interest in you to explore learning further on Apache Spark.
We recommend the following Apache Spark Training videos from Edureka to begin with:
What is Apache Spark | Apache Spark Training | Edureka
This video series on Spark Tutorial provides a complete background on the components, along with real-life use cases such as Twitter Sentiment Analysis, NBA Game Prediction Analysis, Earthquake Detection System, Flight Data Analytics and Movie Recommendation Systems. We have personally designed the use cases so as to provide all-round expertise to anyone running the code.
If you wish to learn Spark and build a career in the Spark domain, performing large-scale data processing using RDD, Spark Streaming, SparkSQL, MLlib, GraphX and Scala with real-life use cases, check out our interactive, live-online Apache Spark Certification Training, which comes with 24*7 support to guide you throughout your learning period.
The post Spark Tutorial: Real Time Cluster Computing Framework appeared first on Edureka Blog.