Shweta Gupta, VP – Technology at Digital Vidya – As a Data Scientist, you continuously need to be in learning mode. Upskilling, relearning and staying on top of industry trends and analysis is very important, or may I say vital, to staying relevant. It’s hard to read a lot of blogs and reports every single day, but we have a unique menu where we interview top-notch Big Data and Analytics influencers who speak from the heart about their personal experiences and opinions of the space. It’s a great way to stay informed and updated, and I would recommend reading these interviews to all those who want to catch up on the latest in the space in a short amount of time. This week we had the privilege of speaking to Ambuj Kathuria from Birlasoft. Let’s look at what he has to say.
In his current capacity, Ambuj Kathuria is Global Head – Data & Analytics service line at Birlasoft, a leading provider of consulting and IT services. He is accountable for managing the three main pillars of a practice: people management, service offerings management/innovation lab and sales enablement. He is also responsible for P&L.
Birlasoft partners with its clients to solve business problems proactively, deliver the right ROI and manage the data-to-decision journey end to end through a synergy of business, technology and math. Additionally, it is strengthening its presence in new technology areas by delivering business value through strategic solutions focused on Artificial Intelligence, Data Science, Machine Learning and optimizing data pipelines on cloud platforms.
Ambuj has over 18 years of experience, all of it involving large and complex transformation initiatives. His career started as a BA, and he then moved into an SAP functional consultant role, which helped him understand the business domain better. He has travelled almost the entire globe as part of his 18-year journey.
He carries vast global experience across multiple tools and technologies, from the enterprise world (SAP, Oracle, IBM, MS, SAS) to newer technologies, both licensed and open source: DPE, cloud data pipelines, ML, Big Data environments and Cognitive Analytics.
How did you get into Data Analytics? What interested you in learning Data Analytics?
Ambuj Kathuria: Throughout my 18-year career I have always worked with data, initially in the form of master data and transaction data as part of my SAP career, later when I moved into the BI/EPM world, and then in Big Data & AI. My hunger for learning is still on.
Was there a specific “aha” moment when you realized the power of data?
Ambuj Kathuria: In my experience, data is the fuel that drives an organization to the next orbit. 90% of business-critical data sits outside organised data systems, but only 10% of it goes into decision making! That’s where the power of data comes in. On multiple occasions, when we presented hidden insights drawn from unstructured data, CXOs were wowed. It brings out many unseen, unnoticed problems and solutions. Business needs data-backed answers for smarter decisions.
What is your typical day-in-a-life in your current job? Where do you spend most of your time?
Ambuj Kathuria: I work in a typical IT services company where we need to support our customers 24x7, but I try to spend 30% of my day in the innovation lab with my architect/CoE team, and I always ensure at least one hour of self-study every day before going to bed. Sometimes I use a smart office app that converts a whole document into speech, and I listen while driving to keep myself updated.
How do you stay updated on the latest trends in Data Analytics? Which are the Data Analytics resources (i.e. blogs/websites/apps) you visit regularly?
Ambuj Kathuria: I am a big fan of KDnuggets, Safari, Kaggle and Techworld. I follow Gartner, Forrester and IDC to keep myself updated on new trends and how they are impacting the IT landscape and decision making.
Team, Skills and Tools
Which are your favourite Data Analytics Tools that you use to perform in your job, and what are the other tools used widely in your team?
Ambuj Kathuria: I mostly use Excel and Power BI for data analysis myself. Power BI is very intuitive, with natural language processing features, and works like an advanced extension of Excel. My team, however, practices machine learning heavily using R, Python, Azure ML and Spark ML, along with multiple data visualization tools like Tableau, Power BI, Qlik and Spotfire.
I am also a big fan of Alteryx; it saves a lot of time in data preparation and wrangling. Wrangling data is the most time-consuming and inefficient part of any data project, taking up over 80% of the time and resources. Successful analysis relies on accurate, well-structured data that has been formatted for the specific needs of the task at hand.
A data preparation tool combined with an analytics workbench (e.g. Cloudera workbench, AWS, Azure, Anaconda) increases Data Scientists’ productivity.
There is a strong need for self-service data preparation tools to enhance their productivity. Currently, Data Scientists and Analysts spend on average 70-80% of their time on data cleaning, mapping and preparation. Data wrangling means cleaning data, connecting tools and getting data into a usable format for further analysis. Original data sources can be messy, in different formats, from multiple applications (and so on), which makes running data/predictive analysis on them difficult and sometimes impossible.
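As a rough illustration of the wrangling steps described above (cleaning values, reconciling formats from different sources and producing a usable structure), here is a minimal Python sketch. The sample records, field names and the list of date layouts are all hypothetical, and a real project would more likely use a dedicated tool or library than hand-written code like this.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from different source systems.
RAW_ROWS = [
    {"customer": "  Acme Corp ", "order_date": "2018-03-01", "amount": "1,200.50"},
    {"customer": "acme corp", "order_date": "01/03/2018", "amount": "1200.50"},
    {"customer": "Globex", "order_date": "2018-03-05", "amount": ""},
    {"customer": "Initech", "order_date": "05-Mar-2018", "amount": "300"},
]

# Date layouts we assume the sources use (an assumption; extend as needed).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%b-%Y")


def parse_date(text):
    """Try each known layout and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Normalize one record; return None if a required field is unusable."""
    name = " ".join(row["customer"].split()).title()
    date = parse_date(row["order_date"])
    amount_text = row["amount"].replace(",", "").strip()
    if not name or date is None or not amount_text:
        return None
    try:
        amount = float(amount_text)
    except ValueError:
        return None
    return {"customer": name, "order_date": date, "amount": amount}


def wrangle(rows):
    """Clean every row, drop unusable ones and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        record = clean_row(row)
        if record is None:
            continue
        key = (record["customer"], record["order_date"], record["amount"])
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    for record in wrangle(RAW_ROWS):
        print(record)
```

In this sample, the first two rows describe the same order in different formats and collapse into one record, while the row with a missing amount is dropped.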
What are the different roles and skills within your data team?
Ambuj Kathuria: We have a fully functional innovation lab that covers the end-to-end value chain of skills from data to action, with roles like Data Architect, Data Platform Engineer, Data Scientist, ML Expert, Big Data Architect and BI/Visualization Expert.
Could you describe some examples of the kinds of problems your team is solving this year?
Ambuj Kathuria: This year the focus will be on providing valuable insight using data as a service and filling the gap in DevOps and production deployment. We will also focus on making our vertical offerings more robust and on exploring new technologies (Anaconda, H2O, DataRobot, deep learning, reinforcement learning, TensorFlow, API consumption). Below are a few areas where my team and I are focusing this year.
- Deployment of ML model
- DevOps for Data Science
- Automate ML
- Deep Learning / Reinforcement Learning
- Serving data through micro services
- Semantic layer and knowledge graph
- Credit Risk Analytics using Machine Learning for accurate PD and LGD calculation, significantly reducing the time to derive insights on defaulters while keeping results interpretable
- Augmenting insurance underwriting using ML: the derived insights can be used to augment the traditional underwriting process for faster processing, down to almost 5-10 days from the current 40-50 day cycle.
- Real-time asset performance optimization: descriptive analytics for diagnosis, forecasting of part failure, and improved efficiency, revenue and process optimization through predictive capabilities.
- Advanced unstructured text mining using rule- and NER-based entity extraction and ontology-based search on business terms
- How ML can help in optimizing and improving the supply chain: building an SCM use case library to cover the full supply chain cycle. Back order, Demand
- Text Analytics – Applying Machine Learning and Natural Language Processing to unstructured data to derive actionable insights through Named Entity Extraction and an ontology creation mechanism.
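To make the credit risk item in the list above a little more concrete: PD (probability of default) and LGD (loss given default) are commonly combined with exposure at default (EAD) into an expected loss figure, EL = PD × LGD × EAD. The sketch below is only an illustration of that arithmetic. EAD is not mentioned in the interview, the loan figures are invented, and in practice PD and LGD would be produced by trained models rather than typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    loan_id: str
    prob_default: float        # PD: probability of default, between 0 and 1
    loss_given_default: float  # LGD: fraction of exposure lost on default, 0 to 1
    exposure: float            # EAD: exposure at default (assumed input)


def expected_loss(loan):
    """Return PD x LGD x EAD for a single loan after basic range checks."""
    for name in ("prob_default", "loss_given_default"):
        value = getattr(loan, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if loan.exposure < 0:
        raise ValueError("exposure must not be negative")
    return loan.prob_default * loan.loss_given_default * loan.exposure


def portfolio_expected_loss(loans):
    """Sum the expected loss over all loans in a portfolio."""
    return sum(expected_loss(loan) for loan in loans)


if __name__ == "__main__":
    # Hypothetical portfolio used only to exercise the functions.
    portfolio = [
        Loan("L-1", prob_default=0.02, loss_given_default=0.45, exposure=100000.0),
        Loan("L-2", prob_default=0.10, loss_given_default=0.50, exposure=2000.0),
    ]
    print(round(portfolio_expected_loss(portfolio), 2))
```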
How do you measure the performance of your team?
Ambuj Kathuria: We follow a 360-degree feedback mechanism to fill learning gaps and set new bars. We also have a comprehensive appraisal cycle in which individuals are assessed on various KRAs.
Big Data Team, Skills and Tools
In the huge Big Data landscape, skills are changing swiftly. Which technology do you see dominating the ETL and real-time data space?
Ambuj Kathuria: There is a big shift from ETL to ELT, and now the introduction of the semantic layer is further disrupting business inhibitors. In the second generation, high-value data lakes must tie information together in the language of the business. There is a very big shift from the first-generation, HDFS-based data lake to the second-generation data lake, which works on a knowledge graph and automates ETL and ingestion. The semantic layer connects data with business meaning using a data catalog, while GraphMarts and the data layer load knowledge graphs into memory for layer preparation, BI and analytics.
When it comes to comparing Apache Hive and Apache Spark, I believe both are leaders in the market due to their ability to handle humongous data sets. It is not wise to compare Hive and Spark head to head, as many parameters drive the choice between these tools, such as speed, advanced analytical capabilities, fault tolerance and data volume, depending on the business requirement. For example, Spark leads the race when it comes to real-time processing with lightning-fast results, so businesses that need such results would prefer Spark ETL over other tools. This is because of Spark’s in-memory capabilities.
How do aspiring Data Engineers demonstrate their capability in handling the tools, technology, data and domain? Is a certificate (Cloudera/Hortonworks) a clear differentiator?
Ambuj Kathuria: Certification is just a learning stamp, in my opinion. What matters is how this learning is applied to solving real-world problems, whatever the technology, tool or data. Cutting horizontally across all of these is the domain, which is very important for better understanding the relevance of data, tools and technology in a business scenario.
Are analytical skills, statistics and machine learning must-have or good-to-have skills for Data Engineers?
Ambuj Kathuria: Yes, at least at a beginner level, so that Data Engineers are able to connect the dots better: understanding why, what and which data is relevant in which business scenario, and what problems it is going to solve.
Industry Readiness for Data Science
Are the industries looking to understand what they can do with data? Do they have the required data in place?
Ambuj Kathuria: In my experience, most organizations are struggling to get the right ROI from their Big Data platform investments. This is because of a lack of business case formulation before the start of Big Data projects. In most cases Big Data investments are not even relevant; organizations make them just because it is a trend in the market. Many business problems could be solved by engaging the right vendors in an Analytics as a Service engagement, where the organization can focus on solving its business problems rather than worrying about the data. The right data strategy and governance, aligned with business objectives, is the need of the hour. That’s why many organizations are opting for Chief Data Officers to achieve business relevance in all their data investments and the right ROI.
Which are the top 3 problems facing Data Science, based either on industry or on technology area?
- Data preparation skills: The biggest challenge for the Data Scientist is getting the right data at the best quality. Data is scattered across the organization with no single view. Moreover, many hidden insights lie in unstructured data or macroeconomic data which, when combined with enterprise data, can change results dramatically.
- Visualization of Data: Data Scientists are well versed in applying complex algorithms to predict insights. The challenge lies in presenting this information in a prescriptive and intuitive way. This is because of a lack of skill sets, or because organizations may be using visualization tools that are not compatible with machine learning results.
- Soft Skills: Many Data Scientists are extremely good at machine learning and statistics, but they lack the storytelling part: how to communicate their findings in a lucid, simple manner that the business can understand easily and act on quickly.
Industry Readiness for Big Data
Is Big Data becoming a reality in the industry beyond the social giants like Facebook, Google, Yahoo? If yes, which industries are actually moving towards the power of Big Data Analytics? If no, what is the outlook for adoption?
Ambuj Kathuria: As I mentioned above, many organizations are adopting Big Data platforms just because this is the new norm or trend. However, most of them are not able to clearly define the right business case or quantify the ROI of their investments.
Industries that are more consumer-centric, that is, those with a B2C or B2B2C model rather than B2B, would be the main beneficiaries of Big Data investments when it comes to mining social data. However, Big Data investments are quite fruitful in B2B scenarios as well. For example, manufacturing companies generate huge amounts of data through sensors installed in machines. When mined for insights, this data can predict machine failures well before they occur and can save millions. Hence, the right business case coupled with the right Data & Analytics strategy is the need of the hour.
Name 3 Industries and the kind of problems that they are solving using Big Data.
- Ecommerce: Identifying the purchase pattern of the Customer over a period of time to recommend relevant products
- Manufacturing: Predictive maintenance of machines; improving and optimizing the supply chain by integrating data silos and combining them with external factors.
- Banking: Data is the fuel for the digital transformation of banks. Cognitive Artificial Intelligence spending will rise nearly ten-fold from $984 million in 2015 to $9.3 billion in 2020.
- Insurance: Auto Insurance Underwriting based on Telematics & Machine Learning
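The e-commerce example above, identifying purchase patterns to recommend relevant products, can be sketched in its simplest form as counting which items are bought together. The following Python snippet is a toy illustration under that assumption. The order history is invented, and production recommenders typically use far richer signals and models than co-purchase counts.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical order history: each set is one customer's basket.
ORDERS = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "mouse", "bag"},
]


def build_co_purchase_counts(orders):
    """Count how often each pair of items appears in the same basket."""
    counts = defaultdict(Counter)
    for basket in orders:
        for first, second in combinations(sorted(basket), 2):
            counts[first][second] += 1
            counts[second][first] += 1
    return counts


def recommend(counts, purchased, top_n=2):
    """Rank items most often bought alongside what the customer already has."""
    scores = Counter()
    for item in purchased:
        for other, times in counts.get(item, {}).items():
            if other not in purchased:
                scores[other] += times
    # Sort by score (highest first), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    counts = build_co_purchase_counts(ORDERS)
    print(recommend(counts, {"laptop"}))
```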
Who in the Industry is your typical client for Big Data? Is it the CTO, CIO, CMO or special data leaders?
Ambuj Kathuria: It depends on what problem we are going to solve for the customer:
- Data Science – Mostly CMO, CPO or COO
- Data Engineering – Mostly CIO or CTO
- Data Strategy, Data Governance, Master Data Management – Mostly Chief Data Officer
Advice to Aspiring Data Scientists
According to you, what are the top skills, both technical and soft-skills that are needed for Data Analysts and Data Scientists?
Ambuj Kathuria: Data Analyst and Data Scientist are two different skill sets. A Data Analyst should have statistics/ML knowledge but, more importantly, should have a good understanding of the domain and the data. A Data Scientist, however, should be a pure mathematician who works as an outsider, sees the patterns and correlations in the data and makes recommendations based on them rather than being biased by domain knowledge.
Technical Skills: Excel, R, Python, Azure ML, Visualization tools, Statistics, any analytics workbench
Soft Skills: Domain, business and communication skills to connect the findings of data analysis with the business context.
How much focus should aspiring data practitioners put on working with messy, noisy data? What other areas must they build expertise in?
Ambuj Kathuria: In my experience, around 60-70% of a Data Analyst’s or Data Scientist’s time goes into data preparation because of noisy or messy data. There are many upcoming and already available data preparation tools in the market, like Alteryx and Talend, which should be leveraged so that more focus goes to deriving insights by applying the best data science algorithms.
What is your advice for newbies, Data Science students or practitioners who are looking at building a career in the Data Analytics industry?
- Programming and software skills – R, Python, SAS
- Visualization Tools – Power BI, Qlik Sense, Tableau, Spotfire
- Statistical foundation and applied knowledge – Regression, Correlation, Clustering, Probability, Normal Distribution etc.
- Machine Learning – Linear Regression, Neural Networks, Random Forest, XGBoost, SVM etc.
- Data Preparation Tools – Alteryx, Talend
What are the changing trends that you foresee in the field of Data Science and what do you recommend the current crop of data analysts do to keep pace?
Ambuj Kathuria: I would recommend keeping tabs on the latest trends like Artificial Intelligence, Deep Learning neural networks and new visualization tools. On top of that, stay constantly updated on changing business trends, in depth, in at least one domain.
Big Data Solution Space
What is the kind of structured and un-structured data companies have? What is the size that we are talking about?
- Structured Data: Data in Excel, ERP Systems or any other Data Platform across the Enterprise
- Unstructured Data: Emails, call logs, social data, macroeconomic data etc.
Are there legacy systems that are being replaced? If yes, which legacy skills are being replaced?
Yes, DWH offloading and legacy DB offloading are still the most talked-about use cases for Big Data. Some enterprise customers who have already done this are now talking about data monetization, data as a service, data grid, iPaaS, etc.
What is the size of clusters/environments that are being deployed for the clients? What are the production challenges?
Our customers run clusters of thousands of machines and manage terabytes to petabytes of data.
Some Production Challenges as per our experience:
- Scalability: Big Data platforms deployed on premise face issues when there are spikes in data
- Network: Network issues on Big Data platforms deployed on public cloud interrupt 24x7 availability
- Consistency of performance: Performance may not be the same in public cloud and on-premise deployments due to network, bandwidth and other issues.
Would you like to share a few words about the work we are doing at Digital Vidya in developing Data Analytics talent for the industry?
I personally feel Digital Vidya is doing a great job of bringing people with diverse experiences onto one platform to share ideas and learn from each other, and of course of creating the best Data Science skill pipeline.
To know more about Ambuj Kathuria, you can check out his LinkedIn profile.