Drilling Down On Apache Drill, The New-Age Query Engine (Part 2)
In this second Apache Drill blog post, we will learn how to integrate Hive and HBase with Apache Drill. Apache Drill provides built-in storage plugins for Hive and HBase integration. We just need to edit...
Spark Accumulators Explained: Apache Spark
Contributed by Prithviraj Bose. Here’s a blog on the stuff that you need to know about Spark accumulators. What are accumulators? Accumulators are variables that are used for aggregating information...
Distributed Caching With Broadcast Variables: Apache Spark
Contributed by Prithviraj Bose. Broadcast variables are useful when large datasets need to be cached in executors. This blog explains how to get started. What are Broadcast Variables? Broadcast...
Stateful Transformations with Windowing in Spark Streaming
Contributed by Prithviraj Bose. In this blog we will discuss the windowing concept of Apache Spark’s stateful transformations. What is stateful transformation? Spark streaming uses a micro batch...
Spark Functional Features
The extraordinary functional capabilities of Apache Spark make it a standalone project from the Apache Software Foundation, which comes with high processing speed and efficiency like never before. Let’s...
5 Reasons to Learn Apache Spark
Those who have been into Big Data probably know about Spark, popularly known as the Swiss Army knife of Big Data analytics. We have talked about the different features of Spark in our previous posts....
Why Scala is getting Popular?
The market for Scala is growing at a very fast pace. There are several reasons why Scala is the sought-after choice of programmers: Developers want more flexible languages to improve their productivity....
Hive & Yarn Get Electrified By Spark
In this blog, let us see how to build Spark for a specific Hadoop version. We will also learn how to build Spark with HIVE and YARN. Considering that you have Hadoop, jdk, mvn and git pre-installed...
Hive and Yarn Examples on Spark
We have learnt how to build Hive and Yarn on Spark. Now let us try out Hive and Yarn examples on Spark. Hive Example on Spark: We will run an example of Hive on Spark. We will create a table, load data...
Big Data Processing with Spark and Scala
Understanding Spark & Scala: In this era of ever-growing data, the need for analyzing it for meaningful business insights becomes more and more significant. There are different Big Data processing...
Apache Spark vs Hadoop MapReduce
Anybody working in the area of Big Data will know what MapReduce does and what its shortcomings are. It is not completely fair to say that there are shortcomings because MapReduce along with HDFS was...
Cumulative Stateful Transformation In Apache Spark Streaming
Contributed by Prithviraj Bose. In my previous blog I discussed stateful transformations using the windowing concept of Apache Spark Streaming. You can read it here. In this post I am going to...
Spark SQL Tutorial – Understanding Spark SQL With Examples
Apache Spark is a lightning-fast cluster computing framework designed for fast computation. It is one of the most successful projects in the Apache Software Foundation. Spark SQL is a new module in Spark...
Spark Tutorial: Real Time Cluster Computing Framework
Apache Spark is an open-source cluster computing framework for real-time processing. It is one of the most successful projects in the Apache Software Foundation. Spark has clearly evolved as the market...
Spark Streaming Tutorial – Sentiment Analysis Using Apache Spark
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming can be used to stream live data and...
Spark MLlib – Machine Learning Library Of Apache Spark
Spark MLlib is Apache Spark’s Machine Learning component. One of the major attractions of Spark is the ability to scale computation massively, and that is exactly what you need for machine learning...
Spark GraphX Tutorial – Graph Analytics In Apache Spark
GraphX is Apache Spark’s API for graphs and graph-parallel computation. GraphX unifies ETL (Extract, Transform & Load) process, exploratory analysis and iterative graph computation within a single...
Data Scientist Skills – What Does It Take To Become A Data Scientist?
Data Scientist Skills: Data science is an umbrella term that encompasses data analytics, data mining, Artificial Intelligence, machine learning, Deep Learning and several other related disciplines. In...
Introduction to Spark with Python – PySpark for Beginners
Apache Spark is one of the most widely used frameworks when it comes to handling and working with Big Data, and Python is one of the most widely used programming languages for Data Analysis, Machine...