Demystifying Data Science and Big Data


The growth of data, and with it the power of data to drive business value and competitive advantage, has been striking.


Data Science and Big Data Infographic

In the data space, this growth has led to a number of innovations in technology, processes, and models. The terminology, often created by insiders, causes a lot of confusion for those outside the data domain. The industry has used the terms Data Analytics and Data Science loosely and created a lot of ambiguity around them. Big Data, which is really defined by the characteristics of the data, got caught up in this comparison, and plenty of posts on the internet try only vaguely to distinguish between the terms. As more professionals aspire to join the community that solves data problems, it is important to look one level deeper. We hope to disambiguate these terms.




Let us first look at Data and Big Data. Data in digital form has existed for over half a century; however, its growth in the last 5 years exceeds that of the entire earlier period. The term Big Data was coined when existing systems could not cope with this growth, and technology innovation was needed to process and extract insights from data of this new size.

Data (of any size) has the following characteristics:

Type of data – Structured, Numeric, Text, Unstructured, Pictures, Audio, Video

Size of data – Small, Medium, Big, Very Big, Humongous

Data state – Data in motion, Data at rest

Big Data is defined in particular by the following characteristics:

The industry did a good job of accepting a standard way to define the dimensions of Big Data, originally proposed by Gartner; however, it could not resist augmenting these with several more of its own. We will look at the 5 Vs that are broadly used by the pundits of Big Data now.

Volume: Data size is moving swiftly from gigabytes to terabytes, petabytes, exabytes, and zettabytes. These volumes cannot be stored and analyzed using the systems that were traditionally used to process data.
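As a rough illustration of the scale involved, the sketch below names the largest decimal unit that a given byte count reaches. The function name `unit_for` and the sample values are illustrative assumptions, not part of any Big Data tool, and the sketch assumes decimal (power-of-1000) units.

```python
# Decimal (SI) storage units, each 1000 times larger than the previous one.
UNITS = [
    "byte", "kilobyte", "megabyte", "gigabyte",
    "terabyte", "petabyte", "exabyte", "zettabyte",
]


def unit_for(num_bytes: int) -> str:
    """Return the name of the largest decimal unit that num_bytes reaches."""
    if num_bytes < 0:
        raise ValueError("a byte count cannot be negative")
    index = 0
    threshold = 1000  # size of the next unit up, in bytes
    # Integer arithmetic keeps the comparison exact even for very large counts.
    while index < len(UNITS) - 1 and num_bytes >= threshold:
        threshold *= 1000
        index += 1
    return UNITS[index]


if __name__ == "__main__":
    for size in (5 * 10**9, 5 * 10**12, 10**21):
        print(size, "bytes is in the", unit_for(size), "range")
```

Each step up the list is a thousandfold jump, which is why a system sized for gigabytes does not simply stretch to petabytes.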

Velocity: The speed at which data is generated (per hour, per minute, per second) dwarfs that of all the data that existed before the social media era. Storing and processing data at this velocity required innovation over traditional means.

Variety: The variety of data that the social media era introduced, in the form of unstructured text, videos, photos, and sensor data, changed the earlier picture of structured transactional data that fit neatly into structured databases. Handling this variety of data called for innovation.

Veracity: The dictionary meaning of veracity is ‘conformity to facts; accuracy’. This V was added with the realization that the data in Big Data is often not all true; it needs to be considered a large basket of all kinds of information and therefore needs to be treated accordingly. Even though this V does not define magnitude, it definitely calls for technology solutions to deal with the problem.

Value: This V applies to any kind of data. However, it probably earned its place because Big Data initiatives need to be tied to a business use case and avoid the trap of using technology for its “buzz”.

Now that we have looked at Data and Big Data characteristics, let us look at the science of working with data.

Data Science

Data Science is an interdisciplinary field combining machine learning, statistics, advanced analysis, and programming. It is the process of inspecting, cleansing, transforming, and modeling data with the objective of discovering useful information, suggesting conclusions, and supporting decision making. When we apply Data Science to Big Data, it is still Data Science, though the technology platform would be a Big Data platform.
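The four steps named above (inspecting, cleansing, transforming, modeling) can be sketched on a toy dataset. This is a minimal sketch using only the Python standard library; the records, field names, and helper functions are hypothetical, and the straight-line fit is just one simple stand-in for the modeling step.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from a CSV export.
RAW = [
    {"ad_spend": "10", "sales": "25"},
    {"ad_spend": "20", "sales": "45"},
    {"ad_spend": "", "sales": "30"},     # missing value
    {"ad_spend": "30", "sales": "65"},
    {"ad_spend": "40", "sales": "n/a"},  # unusable value
]


def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def count_bad_rows(rows):
    """Inspect: count rows with at least one non-numeric field."""
    return sum(1 for row in rows if not all(is_number(v) for v in row.values()))


def cleanse(rows):
    """Cleanse: keep only rows whose fields are all numeric."""
    return [row for row in rows if all(is_number(v) for v in row.values())]


def transform(rows):
    """Transform: convert text fields into (x, y) pairs of floats."""
    return [(float(row["ad_spend"]), float(row["sales"])) for row in rows]


def fit_line(points):
    """Model: ordinary least-squares fit of y = slope * x + intercept."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


if __name__ == "__main__":
    print("rows needing attention:", count_bad_rows(RAW))
    slope, intercept = fit_line(transform(cleanse(RAW)))
    print("fitted line: sales =", slope, "* ad_spend +", intercept)
```

Dropping bad rows is only one possible treatment; in practice the cleansing choice (drop, impute, or flag) depends on the domain.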

Skills and Roles: The landscape is very large and growing rapidly. What follows is an outline of some of the most popular roles and platforms; it is in no way comprehensive.

Data Scientist

Having read serious and humorous attempts to define the difference between a Data Analyst and a Data Scientist, I would summarize it as similar to how the software industry distinguishes a Software Engineer from a Senior Software Engineer, or a Solution Architect from a Senior/Chief Solution Architect. The Venn diagram proposed by Drew Conway in 2010 defined the Data Scientist as the intersection of programming, statistics, and domain skills. The industry often calls junior team members data analysts: they are able to explore and analyze data but do not yet know their machine learning models.

Technology skills:

  • Analytics tools like Advanced Excel, and/or
  • Data Warehousing and SQL for querying and filtering data, and/or
  • Programming skills using R, Python, or SAS
  • Statistics Knowledge
  • Statistical modelling and Machine Learning
  • Tuning and testing machine learning algorithms
  • Visualization using programming libraries
  • Visualization using Business Intelligence (BI) tools
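To give a flavor of the SQL querying and filtering skill in the list above, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a data warehouse. The `orders` table, its columns, and its values are made up for illustration.

```python
import sqlite3

# An in-memory database stands in for a real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 200.0), ("east", 40.0)],
)

# Filter out small orders, then total the rest per region, largest first.
rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE amount >= 50
    GROUP BY region
    ORDER BY total DESC
    """
).fetchall()

for region, total in rows:
    print(region, total)

conn.close()
```

The same filter, group, and sort pattern carries over to warehouse-scale SQL engines, although dialect details differ between products.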

Domain skills:

  • Data Science is about discovery and building information: knowing the Where, How, and What to ask of the data in the given domain
  • Skills to create motivating questions about the domain and build hypotheses

Big Data Engineer:

  • Foundation Technology Platform: Apache Hadoop, HDFS, MapReduce
  • Databases: HBase or Cassandra or MongoDB or Apache CouchDB
  • Hosting Platform Vendors: Cloudera, Hortonworks, AWS, Google Cloud Platform
  • Data Engineer: Apache Hive, Apache Pig, Apache Sqoop, Apache Flume

Big Data Application Engineer + Data Scientist:

  • Foundation Technology Platform: Apache Hadoop, HDFS, MapReduce, HBase
  • Databases: HBase / Cassandra / MongoDB / Apache CouchDB
  • Hosting Platform Vendors: Cloudera / Hortonworks / AWS / Google Cloud Platform
  • Application Engineer Platforms: Apache Spark / Apache Storm / Apache Flink
  • Programming Languages (1 or 2 is good to know): Scala, Java, Python, R
  • Statistical modelling, Machine Learning – using Apache Spark’s machine learning library (Spark MLlib), Apache Flink’s machine learning library (FlinkML), or H2O
  • Graph Database, Graph Analytics
  • Scaling up Machine Learning Algorithms
  • Apache Mahout – Knowledge of premade algorithms for Scala + Apache Spark / H2O / Apache Flink

