The growth of data, and with it the power of data to drive business value and competitive advantage, has been striking.
In the data space, this growth has led to a number of innovations in technology, processes and models. It has also produced terminology, often coined by insiders, that causes a lot of confusion for those outside the data domain. The industry has used the terms Data Analytics and Data Science loosely and created a lot of ambiguity around them. Big Data, which is really defined by the characteristics of the data, got caught up in this comparison, and plenty of posts on the internet only vaguely distinguish between the three. As more professionals aspire to join the community that solves data problems, it is important to look one level deeper at each term. We hope to disambiguate them here.
Let us first look at Data and Big Data. Data in digital form has existed for over half a century; however, its growth in the last 5 years outpaces the whole of its earlier evolution. The term Big Data was coined when existing systems could not cope with this growth, and technology innovation was needed to process and extract insights from data of this new size.
Data (of any size) has characteristics:
Type of data – Structured, Numeric, Text, Unstructured, Pictures, Audio, Video
Size of data – Small, Medium, Big, Very Big, Humongous
Data state – Data in motion, Data at rest
Big Data is particularly defined by the following characteristics:
The industry did a good job of accepting a standard way to describe the dimensions of Big Data, originally proposed by Gartner as three Vs (Volume, Velocity and Variety), but it could not resist augmenting the list with several more of its own. We will look at the five Vs that are now broadly used by Big Data pundits.
Volume: Data sizes are moving swiftly from gigabytes to terabytes, petabytes, exabytes and zettabytes. Such volumes cannot be stored and analyzed using the systems that were traditionally used to process data.
Velocity: The speed at which data is generated (per hour, per minute, per second) makes all the data that existed before the social media era look relatively small. Storing and processing data at this velocity needed innovation over traditional means.
Variety: The variety of data that the social media era introduced, in the form of unstructured text, videos, photos and sensor data, changed the notion of structured transactional data that fits neatly into structured databases. Handling this variety of data also called for innovation.
Veracity: The dictionary meaning of veracity is ‘conforming to facts, accuracy’. This V was added with the realization that the data in Big Data is often not all true. It has to be regarded as a large basket of all kinds of information, and therefore needs to be cleaned and validated before use. Even though this V does not describe magnitude, it definitely calls for technology solutions to deal with the problem.
Value: This V applies to any kind of data. It probably earned its place because Big Data initiatives need to be tied to a business use case and must avoid the trap of using technology for its “buzz”.
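To make the jump in scale described under Volume more concrete, here is a minimal Python sketch. It assumes decimal (SI) units, where each unit is 1000 times the previous one, and a hypothetical 1 TB disk; the function names are illustrative and not taken from any library.

```python
# Decimal (SI) storage units; each one is 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_size(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the last unit in the table.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

def disks_needed(num_bytes, disk_bytes=10**12):
    """Disks of a given capacity (default 1 TB, an assumption) needed to hold the data."""
    return -(-num_bytes // disk_bytes)  # ceiling division on integers

one_petabyte = 10**15
print(human_size(one_petabyte), "needs", disks_needed(one_petabyte), "disks of 1 TB")
```

Even a single petabyte spans a thousand such disks, which is why storage and processing had to move beyond a single traditional machine.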
Now that we have looked at Data and Big Data characteristics, let us look at the science of working with data.
Data Science
Data Science is an interdisciplinary field combining machine learning, statistics, advanced analysis and programming. It is the process of inspecting, cleansing, transforming and modeling data with the objective of discovering useful information, suggesting conclusions and supporting decision making. When we apply Data Science to Big Data, it is still Data Science; only the technology platform changes to a Big Data platform.
Skills and Roles: The landscape is very large and growing rapidly. What follows is an outline of some of the most popular skills and platforms; it is in no way comprehensive.
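The inspect, cleanse, transform and model steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made-up (spend, sales) pairs, `None` marks a missing value, and a hand-written least-squares line stands in for the modeling step. A real project would more likely use libraries such as pandas or scikit-learn.

```python
from statistics import mean

# Hypothetical raw records: (advertising spend, sales); None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), (4.0, None), (5.0, 9.8)]

def inspect(records):
    """Inspect: count the records and how many have a missing field."""
    missing = sum(1 for r in records if None in r)
    return {"rows": len(records), "rows_with_missing": missing}

def cleanse(records):
    """Cleanse: drop records that have a missing field."""
    return [r for r in records if None not in r]

def transform(records):
    """Transform: split the records into separate x and y columns."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    return xs, ys

def fit_line(xs, ys):
    """Model: ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

summary = inspect(raw)
xs, ys = transform(cleanse(raw))
slope, intercept = fit_line(xs, ys)
# Support a decision: estimate sales for a spend level not yet observed.
predicted = slope * 6.0 + intercept
print(summary, round(slope, 3), round(intercept, 3), round(predicted, 2))
```

The same four steps apply whether the data fits in a list like this or sits on a Big Data platform; only the tooling behind each function changes.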
Data Scientist
Having read serious and humorous attempts to define the difference between a Data Analyst and a Data Scientist, I would summarize it as similar to how the software industry distinguishes a Software Engineer from a Senior Software Engineer, or a Solution Architect from a Senior or Chief Solution Architect. The Venn diagram proposed by Drew Conway in 2010 defined the Data Scientist as the intersection of programming, statistics and domain skills. The industry often calls junior team members data analysts: they are able to explore and analyze data, but do not yet know their machine learning models.
Technology skills:
- Analytics tools like Advanced Excel, and/or
- Data Warehousing and SQL for querying and filtering data, and/or
- Programming skills using R, Python or SAS
- Statistics Knowledge
- Statistical modelling and Machine Learning
- Tuning and testing machine learning algorithms
- Visualization using programming libraries
- Visualization using Business Intelligence (BI) tools
Domain skills:
- Data Science is about discovery and building information, which takes knowing the where, how and what of the data for the given domain
- Skills to create motivating questions about the domain, and build hypotheses
Big Data Engineer:
- Foundation Technology Platform: Apache Hadoop, HDFS, MapReduce
- Databases: HBase or Cassandra or MongoDB or Apache CouchDB
- Hosting Platform Vendors: Cloudera, Hortonworks, AWS, Google Cloud Platform
- Data Engineering Tools: Apache Hive, Apache Pig, Apache Sqoop, Apache Flume
Big Data Application Engineer + Data Scientist:
- Foundation Technology Platform: Apache Hadoop, HDFS, MapReduce, HBase
- Databases: HBase / Cassandra / MongoDB / Apache CouchDB
- Hosting Platform Vendors: Cloudera / Hortonworks / AWS / Google Cloud Platform
- Application Engineer Platforms: Apache Spark / Apache Storm / Apache Flink
- Programming Languages (1 or 2 is good to know): Scala, Java, Python, R
- Statistical modelling, Machine Learning – using Apache Spark’s machine learning library (Spark MLlib), Apache Flink’s machine learning library (FlinkML) or H2O
- Graph Database, Graph Analytics
- Scaling up Machine Learning Algorithms
- Apache Mahout – Knowledge of premade algorithms for Scala + Apache Spark / H2O / Apache Flink