Human civilization started to curate, interpret, and summarize data very early on, in order to bring out its value and insights. Patañjali (prior to 400 CE) compiled the Yoga Sūtras in 196 sutras, in an attempt to distill the yogic science. An even earlier example of applied data science is classification: the Greek philosopher Aristotle (384–322 BCE) was among the first to classify areas of human knowledge into distinct disciplines such as mathematics, biology, and ethics.
Coming forward to the early 1900s, the statistician William Sealy Gosset, working for the Guinness brewery, applied his statistical knowledge – both in the brewery and on the farm – to the selection of the best-yielding varieties of barley. Statistical modeling largely stayed with mathematicians and statisticians into the 21st century; only recently have software tools like Excel and SAS, and programming languages like R and Python, enabled practitioners from myriad fields to apply statistical algorithms using readily available methods and libraries.
The Evolution of the Data Landscape & the Technologies that Rule
The Birth of SAS
The digital era of data science was launched in the 1960s, when North Carolina State University's (NCSU) agricultural department began the project that became the foundation of the Statistical Analysis System (SAS). In the early 1970s, the software was primarily leased to other agricultural departments in order to analyze the effect that soil, weather, and seed varieties had on crop yields. SAS software became very popular with the banking, finance, insurance, and pharmaceutical industries and has continued its dominance in the new age of data. It has evolved to support the requirements of text analytics, natural language processing, high-performance analytics, and visualization, and it continues to rule the statistical modeling of large businesses.
The R Programming Language by Statisticians
One of the most popular programming languages for data analysts was not born in a software research lab, but rather from the need of statistics professors for a technology suited to their statistics students. Ross Ihaka and Robert Gentleman lectured on statistics at the University of Auckland. At first, developing the language was a mere hobby for the professors, considering that neither of them had deep computer science training. But starting in 1991, both Ihaka and Gentleman began working full time on developing their new software. R went on to become one of the most widely used programming languages among statisticians and data analysts, due to its rich set of open-source contributed libraries, and it continues to be a mainstay of the data science industry.
Microsoft Excel became the Jack of all Trades, Literally
In the last three decades, this tool has been used pretty much everywhere – in homes, for personal use, in small and medium companies, in large organizations and industries, and across all functions and roles. O'Reilly conducted a survey in Europe in 2016 and published its report in 2017. When the survey asked which tools data analysts used in their roles, 70% of respondents listed Excel. This is a large endorsement for Excel-based analysis, and it suggests that building advanced skills in Excel, combined with its Data Analysis add-in (the Analysis ToolPak), can position professionals to do their jobs significantly better and to make informed decisions based on data. HR specialists, recruiters, marketers, salespeople, supply-chain specialists, procurement, operations, and delivery professionals – to name a subset of roles that handle data – can very well leverage the power of data using a tool that makes analytics accessible to them.
Python Pandas and Scikit-Learn
As data analytics shifted into mainstream business, the field began to draw in computer science academics and working professionals. Developers at the financial management firm AQR started building a tool to perform analysis on financial data, which was then open-sourced as pandas, giving one of the most popular general-purpose programming languages a strong foundation for data analytics. Python's machine learning capabilities were consolidated with the contribution of scikit-learn, initially developed by David Cournapeau as a Google Summer of Code project. Python's unique combination – a widely used general-purpose language that lends itself to powerful data analytics and machine learning – has led software professionals turned data scientists to adopt it heavily in the area of data science.
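To make the pairing concrete, here is a minimal sketch (with made-up data and column names, not from the text) of the workflow this paragraph describes: pandas holds the tabular data, and scikit-learn fits a model on it.

```python
# Illustrative only: a tiny made-up table of crop yield versus fertilizer dose.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"dose": [1.0, 2.0, 3.0, 4.0],
                   "yield_t": [2.1, 3.9, 6.1, 7.9]})

# pandas selects the feature and target columns; scikit-learn fits the model.
model = LinearRegression().fit(df[["dose"]], df["yield_t"])
prediction = model.predict(pd.DataFrame({"dose": [5.0]}))[0]
```

The same two-library pattern – a DataFrame feeding an estimator's `fit`/`predict` interface – scales from this toy example to most tabular machine learning work in Python.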
The Foundation of Big Data Technologies
Data has grown along the famous five Vs – Volume (driven by the interconnected world), Velocity (generation per second), Variety (from traditional structured transactional data to text, pictures, visuals, and videos), and Veracity (the truthfulness of the data) – leading to the most important, outcome-driven V: Value. Hadoop-based frameworks have increasingly become the choice of organizations needing to deal with two or more of these Vs, with an array of tools enabling data engineering on Big Data.
Apache Spark has become the prominent cluster-computing framework, known for its high-performance streaming and machine learning libraries. Spark Core exposes its distributed task dispatching, scheduling, and basic I/O functionality through an application programming interface supported in a large array of languages: Java, Python, Scala, and R.
In summary, choosing a technology on which to build a data science career is an ambiguous question without context. The decision needs to be based on one's background, industry domain, and prospects. The technologies will continue to evolve, as they must, to meet the demands of the growth and complexity of the data humans are generating. The important point to note is that all the technologies listed here are widely used in industry and have huge potential.
One extremely simplified way to think about it: Python for programmers and other software folks who want to pursue a programming-based tool for data analysis; R for statisticians and mathematicians; SAS for those in the domains of pharma, healthcare, insurance, and finance. And for all functional roles under the sun that deal with data, learning advanced techniques in Excel is a clear virtue.
Remember, simplification is not always accurate, but helpful.
The Evolution of the Analyst's Role in Data Science
The evolution of data in businesses has meant that organizations need a variety of roles – functional and technology-based – in order to harness data for their businesses.
Organizations needed domain experts who understand their business and are involved in identifying problems, needs, and opportunities for improvement at all levels of an organization. This led to the position of Business Analyst, whose role is business- and function-centric.
With data becoming the new natural resource that needs to be mined, the evolution of analyst roles opened the floodgates to a new set of skills: the data analyst. The key difference to note is that this role is data-driven.
The journey to becoming a Data Scientist – this is where the marriage of all the above roles becomes possible. In 2010, Drew Conway, a social-behavioral problem solver, created a Venn diagram that has become one of the most used visuals to demonstrate the competencies that different roles bring together to make a competent Data Scientist.
The diagram was an outcome of an O'Reilly-hosted discussion on the skills needed for a data career, and many years later, with data roles far more common across organizations, these competencies still provide an excellent reference skill set. Let me summarize them:
a.) The data hacker – the person skilled at gathering, collecting, and cleaning data, and at applying and building algorithms.
b.) The statistician – even though data-centric roles are no longer confined to statisticians, the data analyst is expected to acquire reasonable-to-strong knowledge of statistical models and their application, and to be able to train machine learning models and improve them.
c.) The domain expert – the skill that comes from deep insight into the industry, such that domain data can be curated and the right leading questions asked to discover the next actions for the business.
Data is the new 'gold' or 'oil', and it is imperative that all the different business functions be able to access and discover data, understand and interpret it, apply statistical modeling, and fine-tune their methods to gain the right level of insight and enable truth-based decision making. The industry is ready; by acquiring the right level of skill in statistics, armed with the right programming tool or an advanced spreadsheet and visualization, one can mine the gold and make a fine outcome of it, impacting one's domain and industry.