
Introduction to Big Data Processing with Apache Spark


We’re entering a new world in which data may be more important than software.

Most of us are active on social media platforms like Facebook, Twitter, LinkedIn and Instagram. Just imagine how much data several million people generate at every moment, in the form of images, videos, text and more. Every second, collectively, we are creating Big Data. Google alone handles around 40,000 search queries every second, which adds up to roughly 3.5 billion searches per day and 1.2 trillion searches per year. By 2020 there will be over 6.1 billion smartphone users globally, each phone packed with sensors capable of collecting all kinds of data, not to mention the data the users create themselves. According to Forbes, at the moment less than 0.5% of all data is ever analysed and used; just imagine the potential here.

Big data is not a new concept; we have been living with it for a while now. The challenge is to process the raw data, analyse it, connect the dots and build a story out of it. We have been using MapReduce-based frameworks for this processing, but they only go so far because they are disk-based.

Here, I will introduce you to one such framework, which is memory-based and processes huge datasets at lightning speed. Read on!

What is Apache Spark?

Apache Spark is a fast and general-purpose open-source distributed cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark is an in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

Now, to unpack this heavyweight definition, let us look at the meaning of the terms used.

What does open-source mean? Open-source software is software whose original source code is made freely available and may be redistributed and modified. The source code comes with a license in which the copyright holder grants anyone the rights to study, change and distribute the software for any purpose.

Now comes distributed. When huge volumes of data need to be processed quickly, the same task should be performed on different machines in parallel. Spark is typically used to pull large distributed datasets into memory to gain performance benefits, while the original copy on disk stays intact. When you call an action, Spark distributes the data into the nodes' memory (spilling to disk if it does not fit), performs the calculations and transfers the results back to the driver. For this, Spark uses a concept called the Resilient Distributed Dataset (RDD), which we will discuss later in this post.
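
To make this concrete, here is a minimal Scala sketch of pulling a file into executor memory and triggering the work with an action. The file name, application name and local master are placeholders of my own, not something prescribed by the article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]")               // placeholder: run locally using all cores
      .getOrCreate()
    val sc = spark.sparkContext

    // "events.log" is a hypothetical input file; the copy on disk stays intact.
    val lines = sc.textFile("events.log")

    // Ask Spark to keep the partitions in executor memory,
    // spilling to disk for any partition that does not fit.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    // Nothing has been read yet. count() is the action that distributes the work
    // across the executors and brings the result back to the driver.
    println(lines.count())

    spark.stop()
  }
}
```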

What does cluster-computing framework mean? Data is growing faster than processing speeds, and the only solution is to parallelize the work across large clusters. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Hadoop and Spark:

On a high level, Spark can run on top of Hadoop, benefiting from Hadoop's cluster manager (Yet Another Resource Negotiator, aka YARN) and underlying storage like HDFS or HBase. Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark is designed to run on top of Hadoop as an alternative to the traditional batch map/reduce model, one that can be used for real-time stream processing and fast interactive queries that finish within seconds. So Hadoop supports both traditional map/reduce and Spark. Spark builds on three concepts, namely RDDs, transformations and actions, which drastically increase performance compared to other MapReduce systems. One question we will return to, given that RDDs are resilient and held in main memory, is how they compare with distributed shared memory.

Spark is an alternative to Hadoop MapReduce in a range of circumstances. Spark is not a replacement for Hadoop, but is instead a great companion to a modern Hadoop cluster deployment.
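
As an illustration of how the map/reduce model looks in Spark, here is a hedged sketch of the classic word count, written for the interactive spark-shell (which provides `sc`). The HDFS paths are placeholders; the point is that the familiar map and reduce steps become a few lines of Scala, with the action at the end triggering the distributed job.

```scala
// Word count in the spark-shell (sc is provided by the shell).
// The HDFS paths below are placeholders.
val lines = sc.textFile("hdfs:///data/input.txt")

// Transformations: these only describe the computation (the classic map and reduce steps).
val counts = lines
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Action: triggers the distributed job and writes the results back to HDFS.
counts.saveAsTextFile("hdfs:///data/wordcount-output")
```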

What are the benefits of Apache Spark?

The amount of data on the planet is set to grow 10-fold by 2020, from around 4.4 zettabytes to 44 zettabytes. That is according to IDC's annual Digital Universe study, which also predicted that, by 2020, the amount of information produced by machines will account for about 10% of the data on earth. When the volume and variety of data are this large, big data processing comes to your rescue.

The numerous advantages of Apache Spark make it a very attractive big data framework. Spark takes MapReduce to the next level with less expensive shuffles during data processing. In-memory data storage and near real-time processing are just a few of the many features of Spark that make its performance several times faster than other big data technologies.

Lazy evaluation of big data queries is another important feature of Spark, which helps with optimization of the steps in data processing workflows and provides a simpler solution to a complex problem. On the in-memory side, Spark holds intermediate results in memory rather than on disk, which makes it up to 10x faster than a system that writes to disk at every step. It is designed as an execution engine that works both in-memory and on-disk: Spark operators perform external operations when data does not fit in memory, so Spark can process datasets larger than the aggregate memory of a cluster.
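
The following spark-shell sketch shows lazy evaluation and in-memory reuse in action; the data and names are my own illustrative choices. Nothing runs until the first action is called, and the cached intermediate result is reused by the second action instead of being recomputed.

```scala
// Lazy evaluation in the spark-shell (sc is provided by the shell).
val numbers = sc.parallelize(1 to 1000000)

// These transformations only build up an execution plan; no work happens yet.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Keep the intermediate result in memory so later actions reuse it.
squares.cache()

// The first action runs the whole pipeline once and caches the result;
// the second action is served from memory instead of being recomputed.
println(squares.count())
println(squares.reduce(_ + _))
```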

Other features include:

  • Speed increased by up to 100 times.
  • Ease of use.
  • A unified engine with a packaged solution.
  • Support for more than just the Map and Reduce functions.
  • Lazy evaluation of big data queries.
  • Concise and consistent APIs in Scala, Java and Python.
  • An interactive shell for Scala and Python (not yet available for Java).
  • Comprehensive support from the Spark community.

Big Data & Analytics Course by Digital Vidya

Free Big Data & Analytics Webinar

Date: 22nd Mar, 2018 (Thursday)
Time: 3 PM to 4 PM (IST/GMT +5:30)

Spark Ecosystem

The following components of the Apache Spark ecosystem provide additional capabilities in the areas of big data analytics and machine learning.

Figure: The Spark ecosystem

These libraries include:

  • Spark Core:
    • Spark Core is the distributed execution engine of the Spark platform, on which all other functionality is built. It provides task dispatching, scheduling, input-output operations and in-memory computing capabilities to deliver speed. It uses the RDD, which is the basic building block of Spark.
  • Spark Streaming:
    • Spark Streaming is the Spark component that enables processing of real-time streaming data as a high-throughput, fault-tolerant stream. It enables powerful interactive and analytical applications across both streaming and historical data while inheriting Spark's ease of use and fault-tolerance characteristics. It uses the DStream, which is essentially a series of RDDs, to process real-time data.
  • Spark SQL:
    • Spark SQL is Spark's package for structured data processing. Many professionals rely on SQL queries to explore data and to process structured data. Spark SQL allows users to ETL their data from different formats, transform it and expose it for ad-hoc querying. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine, letting SQL as well as Hadoop Hive queries run up to 100x faster on existing deployments and data (a small DataFrame sketch follows this list).
  • Spark MLlib:
    • MLlib is Spark's scalable machine learning library, which has quickly emerged as a critical piece in mining big data for actionable insights. Built on top of Spark, it delivers both high-quality algorithms and high speed. MLlib provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as supporting functionality such as model evaluation and data import.
  • Spark GraphX:
    • GraphX is a library for parallel graph computation built on top of Spark. At a high level, like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
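
As referenced in the Spark SQL item above, here is a small, hedged DataFrame sketch. The file people.json is a hypothetical input with one JSON object per line, such as {"name":"Ada","age":36}; the app name and local master are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-sketch")
  .master("local[*]")                        // placeholder master
  .getOrCreate()

// Read semi-structured data into a DataFrame with an inferred schema.
val people = spark.read.json("people.json")  // hypothetical input file
people.printSchema()

// Register the DataFrame as a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()
```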

Spark Architecture

Over the past few years we have witnessed significant progress in technologies for processing huge data sets and in distributed computing. Apache Spark has pushed this further with its in-memory data processing capabilities; if resources are configured correctly, it can process data 10 to 100 times faster than traditional Hadoop MapReduce applications. Performance is where Spark scores.

On a high level, Spark uses a master/slave architecture. It is designed to be used with a range of programming languages and on a variety of architectures. A driver connects to a single coordinator called the master, which manages the workers on which executors run. The driver and the executors each run in their own Java processes. We can run them all on the same machine (horizontal cluster), on separate machines (vertical cluster) or in a mixed machine configuration.

Figure: Spark architecture (SparkContext, cluster manager and executors)

When a user or client submits Spark application code, the driver comes into the picture. The driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, the driver program also performs certain optimizations and then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. These tasks are then bundled and sent to the Spark cluster.
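
You can peek at this lineage, the logical DAG Spark builds up, from the spark-shell with toDebugString; the sketch below uses the shell's predefined `sc` and a tiny made-up dataset.

```scala
// Build a small lineage in the spark-shell (sc is provided by the shell).
val pairs = sc.parallelize(Seq("a", "b", "a", "c"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// toDebugString prints the RDD lineage; the shuffle introduced by reduceByKey
// is where the driver splits the plan into separate stages.
println(pairs.toDebugString)

// Only this action turns the plan into stages and tasks that run on executors.
pairs.collect().foreach(println)
```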

After that, the driver program talks to the cluster manager to negotiate resources. The cluster manager then launches executors on the worker nodes, and the driver sends tasks to the executors based on data placement. Before the executors begin execution, they register themselves with the driver program so that the driver has an overall view of all of them. The executors then start executing the tasks assigned by the driver program, and the driver monitors them for as long as the Spark application is running. When the driver program's main() method exits, or when it calls the stop() method of the SparkContext, all the executors are terminated and the resources are released back to the cluster manager.
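
A minimal driver program, sketched below, makes this lifecycle visible: creating the SparkSession starts the driver's SparkContext and brings up executors through the cluster manager, while stop() releases them. The app name is a placeholder, and in practice the master URL is usually supplied via spark-submit rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

object DriverLifecycleSketch {
  def main(args: Array[String]): Unit = {
    // Creating the session starts the driver's SparkContext and, through the
    // cluster manager, brings up the executors.
    val spark = SparkSession.builder()
      .appName("driver-lifecycle-sketch")   // placeholder name
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // These tasks are scheduled by the driver and run on the executors.
      val total = sc.parallelize(1 to 100).reduce(_ + _)
      println(s"total = $total")
    } finally {
      // stop() terminates the executors and releases the resources held
      // with the cluster manager, as described above.
      spark.stop()
    }
  }
}
```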

Resilient Distributed Datasets: An Introduction

A Resilient Distributed Dataset (RDD) is an immutable, distributed collection of objects.

Resilient: Can be reconstructed in case of failure.

Distributed: Operations are parallelizable and distributed across the nodes.

Dataset: Data loaded and partitioned across the cluster nodes (executors).

Spark RDDs are resilient, or fault tolerant, which enables Spark to recover an RDD in the face of failures. The ability to always recompute an RDD is actually why RDDs are called "resilient": when a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparently to the user. An RDD is Spark's representation of a dataset that is distributed across the RAM, or memory, of many machines. The "distributed" nature of RDDs works because an RDD only contains a reference to the data, whereas the actual data is contained within partitions across the nodes in the cluster.

To better understand an RDD, think of it as a large array of integers distributed across machines. It is in fact a dataset that has been partitioned across the cluster, and the partitioned data can come from HDFS (Hadoop Distributed File System), an HBase table, a Cassandra table or Amazon S3.
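
Continuing the array analogy, the spark-shell sketch below creates an RDD of integers explicitly split into four partitions and inspects how the elements are spread across them; the numbers and partition count are illustrative choices.

```scala
// Partitioning in the spark-shell (sc is provided by the shell).
// The "large array of integers", explicitly split into 4 partitions.
val numbers = sc.parallelize(1 to 1000, numSlices = 4)

println(numbers.getNumPartitions)   // 4

// Each partition can live on a different executor and is processed independently;
// here we count how many elements landed in each one.
numbers
  .mapPartitionsWithIndex((index, it) => Iterator((index, it.size)))
  .collect()
  .foreach { case (index, size) => println(s"partition $index -> $size elements") }
```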


In this article, we looked at how the Apache Spark framework helps with big data processing through its standard APIs. We also looked at how Spark compares with other MapReduce frameworks, at its architecture and at RDDs. Spark is still a relatively young ecosystem, and the community is improving it every day. Hopefully we will soon see newer versions with further improvements in areas like security and integration with BI tools.

Deepak is a Big Data technology-driven professional and blogger in open source data engineering and data science. He works extensively in data gathering, modeling, analysis, validation and architecture/solution design to build next-generation analytics platforms.
