MapReduce: A Scalable Programming Paradigm For Processing Big Data

Data: A Digital Makeover

In this digital age, we are virtually floating in oceans of data. But how did we get here? Techniques such as MapReduce and frameworks such as Hadoop have met many of our data needs, but let us first walk through some thoughts and facts as we build a holistic view of the world we currently inhabit.

Isn’t it a fact that everyone depends, directly or indirectly, on data for most of the things we do in our lives? Think about it for a second.

Let me help you: that little device in your hand, your smartphone, is one example of how you consume data from various sources. But that is not the point. The question is how data is processed to make it valuable and meaningful for us.

The Current Data Landscape

Concerns about ever-increasing data have become widespread with the advent of the Internet of Things (IoT) and other recent technologies. The growth of data is outpacing the capacity of traditional computing. Many sources predict exponential growth of data towards 2020 and beyond, and they broadly agree that the size of this digital ocean will at least double every two years, a 50-fold growth from 2010 to 2020. Human- and machine-generated data is growing about 10x faster than traditional business data, and machine data is growing even more rapidly, at 50x the growth rate.

How can we consume these humongous data sources and transform them into actionable information? Acquiring data, analyzing it, and turning it into meaningful insights is a complex workflow that extends beyond data centers into the cloud, in a seamless hybrid environment. Alongside this, the concept of “Big Data” evolved and gained wide acceptance as the nature of data changed from structured to unstructured. The goal is to find valuable insights, trends, and patterns that can help businesses, research, industry, and humanity at large. Many techniques appeared as part of the solution; a few gained acceptance while others faded away. It is therefore important to understand the current state of affairs so that we can prepare for the reality that is taking shape.

The Big Data Picture: Emergence of MapReduce

Big data has emerged as a concrete concept and is blossoming, but its origins are uncertain. Diebold (2012) argues that the term “big data” probably originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid-1990s, in which John Mashey figured prominently.

Technically, however, big data is insignificant in a vacuum. Its actual potential is unlocked only when it is used to drive decision making. To enable such evidence-based decision making, organizations need efficient processes to turn high volumes of fast-moving and diverse data into meaningful insights.

Undoubtedly, techniques such as RDBMSs and grid computing contributed invaluably to processing big data, but they did not fit well as a complete solution. The need for new techniques gave rise to various alternatives. The most popular of these is MapReduce, which implements the map and reduce concepts as a programming model. Let us dig into what led to the evolution of MapReduce programming.

MapReduce: The Evolution

The Internet came into full bloom in the 1990s, and with the later advent of big data, MapReduce emerged as one of the most effective solutions. Let us follow, on a timeline, how the MapReduce technique evolved.

The evolutionary timeline:

1997: Doug Cutting, who would later work at Yahoo!, started writing the first version of Lucene (used for indexing web pages).

2001: Lucene was open-sourced. Mike Cafarella, a University of Washington graduate, teamed up with Doug Cutting to index the entire web; their effort yielded a new Lucene sub-project called Apache Nutch.

While indexing, Cutting and Cafarella faced the following issues with the existing file system:

No schema (no concept of rows and columns)
Lack of endurance, i.e. durability (once written, data should never be lost)
No fault tolerance (for CPU, memory, or network failures)
No automatic rebalancing (of disk space consumption)

2003: Google published its Google File System (GFS) paper. Building on that design, Cutting and Cafarella wrote a new file system in Java, called NDFS (Nutch Distributed File System).

It solved a few of these problems, but the following remained unresolved:

Endurance
Fault tolerance

To resolve these issues, the idea of distributed processing came up. Implementing it required an algorithm that would let NDFS run processing in parallel on multiple nodes at the same time.

2004: Google published a paper titled “MapReduce: Simplified Data Processing on Large Clusters.” The algorithm described in the paper solved problems such as:

Parallelization
Distribution
Fault-tolerance

MapReduce evolved as a framework for writing applications that process enormous amounts of structured and unstructured data. The name combines two distinct words, “Map” and “Reduce.”

In the next section, let us demystify what it is and what it is used for.

What is MapReduce? The Concept

MapReduce is a linearly scalable programming model. Put simply, it is a framework for writing applications that process massive amounts of data (multi-terabyte data sets and more) in parallel on large clusters (thousands of nodes and more) of commodity hardware in a reliable, fault-tolerant manner.

MapReduce programming is primarily based on two functions:

A Map function and
A Reduce function.

Each of these functions defines a mapping from one set of key-value pairs to another. A key-value pair (KVP) is a set of two linked data items:

Key: a unique identifier for some data item, and
Value: either the data itself or a pointer to the location of that data.
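
To make the two functions concrete, here is a minimal Python sketch of their shapes. The type aliases MapFn and ReduceFn and the functions count_map and count_reduce are names made up for this illustration; they are not part of any MapReduce library.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

K1 = TypeVar("K1")  # input key type
V1 = TypeVar("V1")  # input value type
K2 = TypeVar("K2")  # intermediate key type
V2 = TypeVar("V2")  # intermediate value type

# map(k1, v1) -> list(k2, v2): one input pair becomes zero or more pairs.
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce(k2, list(v2)) -> list(v2): all values for one key are merged.
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]


def count_map(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical Map function: emit (word, 1) for each word in a line."""
    for word in value.split():
        yield (word, 1)


def count_reduce(key: str, values: Iterable[int]) -> Iterator[int]:
    """Hypothetical Reduce function: add up every count seen for one word."""
    yield sum(values)
```

Note that neither function mentions files, nodes, or data size; that is left entirely to the framework that calls them.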

These functions are unaware of the size of the data or of the cluster they run on, so they work unchanged on small and massive data sets alike. More importantly, if the size of the input data doubles, a job takes roughly twice as long. But if you also double the size of the cluster, the job runs about as fast as the original one.
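
The scaling claim can be checked with a little arithmetic. The sketch below assumes an idealized model in which every node processes data at the same fixed rate and there is no coordination overhead; the function name estimated_runtime and the figures used are illustrative assumptions, not measurements.

```python
def estimated_runtime(data_tb: float, nodes: int, tb_per_node_hour: float = 1.0) -> float:
    """Idealized job time in hours: total work divided by total cluster throughput."""
    return data_tb / (nodes * tb_per_node_hour)


baseline = estimated_runtime(data_tb=10, nodes=100)     # original job
double_data = estimated_runtime(data_tb=20, nodes=100)  # twice the data, same cluster
double_both = estimated_runtime(data_tb=20, nodes=200)  # twice the data, twice the cluster

print(baseline, double_data, double_both)  # 0.1 0.2 0.1
```

Real clusters fall somewhat short of this ideal because of scheduling, network transfer, and stragglers, but the proportionality is what “linearly scalable” refers to.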

Power of MapReduce

The power of MapReduce shows in its influence. Many relational databases have started incorporating ideas from MapReduce (such as Aster Data’s and Greenplum’s databases), and many higher-level query languages are built on top of it (such as Pig and Hive). This makes the MapReduce framework an obvious choice and more approachable for traditional database programmers.

MapReduce: The Workflow

Overall, a MapReduce program consists of two phases:

The Map phase:

The master node takes the input.
It divides the input into smaller sub-problems.
The master node distributes these smaller sub-problems to worker nodes.
A worker node may do this again, leading to a multi-level tree structure.
Each worker node processes its smaller problem.
It passes the answer back to its master node.

The Reduce phase:

The master node collects the answers to all the sub-problems from the worker nodes.
It combines all the answers to form the output for the original problem.
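
The two phases above can be imitated on a single machine. This is only a sketch: it assumes that threads stand in for worker nodes and that the problem is summing a list of numbers, and the names split_input, worker, and master are made up for the illustration. A real cluster would also handle data placement and node failures.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(numbers, parts):
    """Master: divide the input into roughly equal sub-problems."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def worker(chunk):
    """Worker: solve one sub-problem and return the answer."""
    return sum(chunk)


def master(numbers, workers=4):
    # Map phase: split the input and hand each piece to a worker.
    chunks = split_input(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_answers = list(pool.map(worker, chunks))
    # Reduce phase: combine the partial answers into the final output.
    return sum(partial_answers)


total = master(list(range(1, 101)))
print(total)  # 5050
```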

The detailed steps followed in the MapReduce technique are as follows:

Prepare the Map () input: The MapReduce system selects the Map processors, assigns each one an input key value K1 to work on, and provides that processor with all the input data associated with that key value.
Execute the user-provided Map () code: Map () is run exactly once for each K1 key value, generating output that is organized by key values K2.
Shuffle the output of Map () to the Reduce processors: The MapReduce system selects the Reduce processors, assigns each one a K2 key value to work on, and provides that processor with all the data generated by Map () that is associated with that key value.
Execute the user-provided Reduce () code: Reduce () is run exactly once for each K2 key value produced by the Map step.
Produce the final output: The MapReduce system collects all the output generated by Reduce () and sorts it by key value K2 to produce the final outcome.
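
The steps above can be sketched as a tiny single-process engine. This is an illustration under simplifying assumptions (everything fits in memory and nothing runs in parallel). The name run_mapreduce and the sample job, which groups words by their length, are invented for this sketch.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce job over an in-memory list of (k1, v1) records."""
    # Prepare the Map input and execute Map once per (k1, v1) record.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle: bring together every value that shares the same k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Execute Reduce once per k2 and produce the output sorted by k2.
    return [(k2, reduce_fn(k2, groups[k2])) for k2 in sorted(groups)]


def length_map(line_number, text):
    # k1 is a line number, k2 is a word length: the key changes in Map.
    return [(len(word), word) for word in text.split()]


def unique_reduce(length, words):
    return sorted(set(words))


records = [(1, "map reduce"), (2, "big data map")]
result = run_mapreduce(records, length_map, unique_reduce)
print(result)  # [(3, ['big', 'map']), (4, ['data']), (6, ['reduce'])]
```

The example deliberately uses different key types for K1 and K2 to show that the Map output is organized by a new key, not the input key.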

How Does MapReduce Work? An Example

Consider the example of laptops for sale in a shop:

Apple, Hp, Lenovo, Fujitsu, Sony, Samsung, Asus

Two data sets with different combinations of laptops are:

Data set 1: Asus, Sony, Lenovo, Lenovo, Fujitsu, Hp, Sony, Apple, Samsung
Data set 2: Asus, Fujitsu, Lenovo, Asus, Hp, Sony, Hp, Apple, Asus

Map step

Each data set is split into records of three laptops. For each record, map (String key, String value), i.e. map (k1, v1) –> list (k2, v2), produces the following:

Asus Sony Lenovo –> {(“Asus”, “1”), (“Sony”, “1”), (“Lenovo”, “1”)}
Lenovo Fujitsu Hp –> {(“Lenovo”, “1”), (“Fujitsu”, “1”), (“Hp”, “1”)}
Sony Apple Samsung –> {(“Sony”, “1”), (“Apple”, “1”), (“Samsung”, “1”)}
Asus Fujitsu Lenovo –> {(“Asus”, “1”), (“Fujitsu”, “1”), (“Lenovo”, “1”)}
Asus Hp Sony –> {(“Asus”, “1”), (“Hp”, “1”), (“Sony”, “1”)}
Hp Apple Asus –> {(“Hp”, “1”), (“Apple”, “1”), (“Asus”, “1”)}

Next, let’s see what happens in the reduce step:

Reduce step

For each key in the Map results above, reduce (String key, Iterator values), i.e. reduce (k2, list (v2)) –> list (v2), produces the following:

reduce (“Apple”, <1, 1>) –> 2
reduce (“Hp”, <1, 1, 1>) –> 3
reduce (“Lenovo”, <1, 1, 1>) –> 3
reduce (“Fujitsu”, <1, 1>) –> 2
reduce (“Sony”, <1, 1, 1>) –> 3
reduce (“Samsung”, <1>) –> 1
reduce (“Asus”, <1, 1, 1, 1>) –> 4
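
The laptop example can be reproduced in a few lines of Python. This is a sketch, not a cluster implementation: it runs in one process, uses a sort followed by itertools.groupby as the shuffle, and emits the integer 1 instead of the string “1” so the counts can be summed. The names map_record and reduce_brand are made up for the illustration.

```python
from itertools import groupby
from operator import itemgetter

# The six three-laptop records from the two data sets.
records = [
    "Asus Sony Lenovo",
    "Lenovo Fujitsu Hp",
    "Sony Apple Samsung",
    "Asus Fujitsu Lenovo",
    "Asus Hp Sony",
    "Hp Apple Asus",
]


def map_record(record):
    """Map step: emit (brand, 1) for every laptop in the record."""
    return [(brand, 1) for brand in record.split()]


def reduce_brand(brand, counts):
    """Reduce step: add up the 1s collected for a single brand."""
    return sum(counts)


# Map every record, then sort so that equal keys sit next to each other.
pairs = [pair for record in records for pair in map_record(record)]
pairs.sort(key=itemgetter(0))

# Shuffle and reduce: one reduce call per distinct brand.
totals = {
    brand: reduce_brand(brand, [count for _, count in group])
    for brand, group in groupby(pairs, key=itemgetter(0))
}
print(totals)
# {'Apple': 2, 'Asus': 4, 'Fujitsu': 2, 'Hp': 3, 'Lenovo': 3, 'Samsung': 1, 'Sony': 3}
```

The totals add up to 18, the number of laptops across the two data sets, which is a quick sanity check on any hand-computed result.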

Counting laptops in a shop is easy with MapReduce programming, and the same pattern scales up. MapReduce programming is at the core of the distributed programming model in many applications that solve big data problems across diverse industries in the real world. For many challenging problems, such as data analysis, log analytics, recommendation engines, fraud detection, and user behavior analysis, among others, MapReduce serves as a practical solution.

MapReduce: Its Applications

Its ability to process massive data in parallel has led to MapReduce being implemented in various data-intensive environments and used across many industries.

A brief introduction to some of these applications follows:

Distributed Pattern-based Searching: A distributed grep, implemented as a MapReduce job, searches for a pattern in text that is distributed over a network.
Geo-spatial Query Processing: With the technological advances in location-based services, MapReduce can help compute the shortest path to a given location in mapping services such as Google Maps.
Distributed Sort: A distributed sort in MapReduce arranges data that is split across multiple sites into sorted order.
Web Link Graph Traversal: MapReduce programming is used to process a large-scale graph of web links, also known as the web graph.
Machine Learning Applications: MapReduce aids machine learning because it helps to build systems that learn from large volumes of data without the need for explicit programming for every condition.
Data Clustering: MapReduce tackles the computational complexity that voluminous data brings to clustering by dividing the complete data set into small subsets based on certain criteria.
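
As an illustration of the first application in the list, here is a minimal sketch of a distributed grep expressed as Map and Reduce functions. It runs in a single process; the file names, log lines, and the function names grep_map and grep_reduce are all hypothetical, and each entry in splits stands in for a chunk of text stored on a different node.

```python
import re
from collections import defaultdict


def grep_map(filename, text, pattern):
    """Map: emit (filename, (line number, line)) for every matching line."""
    for number, line in enumerate(text.splitlines(), start=1):
        if re.search(pattern, line):
            yield (filename, (number, line))


def grep_reduce(filename, matches):
    """Reduce: nothing to aggregate, so just return the matches in line order."""
    return sorted(matches)


# Hypothetical input: each entry plays the role of a split held by one node.
splits = {
    "node1.log": "job started\nerror: disk full\njob finished",
    "node2.log": "job started\njob finished",
    "node3.log": "error: timeout\nerror: retry failed",
}

# Shuffle: group the mapped matches by file name.
grouped = defaultdict(list)
for filename, text in splits.items():
    for key, match in grep_map(filename, text, r"^error"):
        grouped[key].append(match)

results = {name: grep_reduce(name, found) for name, found in grouped.items()}
print(results)
```

Because each split is searched independently, the Map calls could run on separate nodes at the same time; only the matching lines need to travel over the network.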

A highly scalable data store with a good parallel programming model has been a challenge for the industry for some time. It is fair to say that the MapReduce programming model does not solve every problem, yet it is a strong solution for many tasks in the field of data and related domains. Data needs keep changing with time as our world rapidly moves towards the extremes of digitization, and MapReduce will have to evolve with them.

MapReduce: The Future

The MapReduce framework, combined with other technologies, will bring new and more accessible programming techniques for working on massive data stores with both structured and unstructured data.
Many organizations are building their own variants, so expect various MapReduce frameworks with added features in the coming years.
A lot of research is under way on extending MapReduce with new functionality and mechanisms to adapt it to new sets of problems.
The use of MapReduce looks set to grow as it offers comprehensive solutions to more day-to-day data problems.
Every parallel technology makes claims about scalability. MapReduce and its implementations have shown genuine scalability, and recent releases are pushing the limit on the number of nodes beyond 4,000.
