Shweta Gupta – Vice President Technology at Digital Vidya
It is remarkable to see how Data Science, ML, Deep Learning and AI are reshaping processes in industries such as E-Commerce, BFSI, Manufacturing and Healthcare. At Digital Vidya, we are eager to understand the ins and outs of this phenomenon and bring relevant knowledge to our readers. We are constantly working to fill the knowledge gap that exists in India and to equip emerging and aspiring Data Scientists with the information they need to stay on top of the trends in this fast-changing domain.
We recently interviewed Lovekesh Vig, a Senior Scientist at TCS Research, who shared some striking insights. I took special note of several key points, and I hope you too find answers to your questions about Data Science in this interview.
Lovekesh Vig was first exposed to Machine Learning by his advisor, Douglas Fischer, during his Ph.D. at Vanderbilt University. Fischer was a very accomplished Machine Learning scientist who taught him how to think from the model's point of view. Another faculty member who had a big impact on Lovekesh was David Noelle, who specialized in computational neuroscience and artificial neural networks. Lovekesh was fascinated with the underlying mechanics of the brain that give rise to all of our thoughts, dreams and desires, and with our attempts to replicate those mechanics. After his Ph.D., he served as a faculty member at the School of Computational and Integrative Science, JNU, where he worked on problems in computational neuroscience and Machine Learning.
Deep Learning was an emerging field at the time, and his interest in neuroscience had natural synergies with the advances being made in artificial neural networks. Around the same time, Lovekesh started consulting as a scientist with TCS Research and initiated some Deep Learning research projects. The quality of people and the commitment to research at TCS provided a wonderful environment to explore new ideas, and after a year he decided to join them full time as a Senior Scientist. Since then, it has been a wonderful journey of discovery, and Lovekesh is thoroughly enjoying his role as Head of the Deep Learning/AI research area at TCS Research. Without further delay, let's get started and hear what Lovekesh has to say.
What interested you in learning Data Analytics?
Lovekesh Vig: I was interested in Machine Learning, and one of the big lessons in Machine Learning is that before one starts thinking about a predictive model, it is important to use analytics to ‘look’ at the data, as this almost always helps in devising more appropriate Machine Learning models. The different visualization techniques for viewing the same data from different angles, and the techniques for aggregating and summarizing different data types (time series, spatio-temporal), were especially interesting.
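As a rough illustration of "looking" at the data before modeling, the sketch below aggregates a small time series per hour and flags readings that sit far from the overall mean. It is not taken from the interview: the readings, the function names `summarize_by_hour` and `flag_outliers`, and the two-standard-deviation threshold are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (hour, reading) pairs; the values are made up for illustration.
readings = [
    (0, 20.1), (0, 20.3), (1, 21.0), (1, 20.8),
    (2, 35.2), (2, 20.9), (3, 21.1), (3, 21.3),
]

def summarize_by_hour(rows):
    """Group readings by hour and return count, mean, min and max per hour."""
    groups = defaultdict(list)
    for hour, value in rows:
        groups[hour].append(value)
    return {
        hour: {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for hour, values in sorted(groups.items())
    }

def flag_outliers(rows, threshold=2.0):
    """Return readings more than `threshold` population std devs from the overall mean."""
    values = [v for _, v in rows]
    mu, sigma = mean(values), pstdev(values)
    return [(h, v) for h, v in rows if sigma > 0 and abs(v - mu) > threshold * sigma]

summary = summarize_by_hour(readings)
outliers = flag_outliers(readings)
for hour, stats in summary.items():
    print(hour, stats)
print("possible outliers:", outliers)
```

A summary like this often shows, before any model is trained, whether a spike is a sensor glitch to be cleaned or a real event the model should learn from.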
What was the first dataset you remember working with? What did you do with it?
Lovekesh Vig: If I recall correctly, it was the Iris dataset from the UCI repository, and the task was to build a classifier. At the time, there were very few off-the-shelf libraries, so we had to develop even standard models like decision trees and neural networks from scratch, which was a great learning experience and a lot of fun.
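To give a flavour of what building a standard model from scratch can look like, here is a minimal sketch of a one-level decision tree (a decision stump) on a single feature. It is an illustration only, not the code described in the interview: the samples are made-up values in the spirit of Iris rather than real measurements, and `fit_stump` and `predict` are hypothetical names.

```python
# Made-up (petal_length_cm, label) pairs in the spirit of Iris; not real measurements.
samples = [
    (1.4, "A"), (1.5, "A"), (1.3, "A"), (1.6, "A"),
    (4.5, "B"), (4.7, "B"), (5.0, "B"), (4.1, "B"),
]

def fit_stump(data):
    """Find the threshold on a single feature that misclassifies the fewest samples.

    Returns (threshold, label_below, label_above, errors).
    """
    values = sorted({x for x, _ in data})
    labels = sorted({y for _, y in data})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    1 for x, y in data
                    if (below if x <= threshold else above) != y
                )
                if best is None or errors < best[3]:
                    best = (threshold, below, above, errors)
    return best

def predict(stump, x):
    """Classify a single value with a fitted stump."""
    threshold, below, above, _ = stump
    return below if x <= threshold else above

stump = fit_stump(samples)
print(stump)
print(predict(stump, 1.2), predict(stump, 4.9))
```

A full decision tree applies the same split search recursively to each side of the threshold, usually with an impurity measure such as Gini or entropy instead of a raw error count.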
Was there a specific “aha” moment when you realized the power of data?
Lovekesh Vig: As a student, when I finally understood how giants like Google and Amazon were leveraging data for their search and recommendation applications, it struck me for the first time how elegant and powerful their whole system was. Data was actually changing the world in front of me, from Maps to advertising to retail, and was having an impact on every aspect of life, generally for the better.
What does a typical day look like in your current job? Where do you spend most of your time?
Lovekesh Vig: My typical day is spent in the lab working on different research projects with my team, reading up on the latest developments in the field and identifying new research ideas of interest to TCS.
How do you stay updated on the latest trends in Data Analytics? Which are the Data Analytics resources (i.e. blogs/websites/apps) you visit regularly?
Lovekesh Vig: I regularly peruse KDnuggets and have found it to be a rich source of information on practical Data Science applications. I follow the FAIR and DeepMind blogs for the latest developments in Deep Learning. Christopher Olah’s and Andrej Karpathy’s blogs are also very illuminating. The OpenAI blog is another excellent resource.
Share the names of 3 people that you follow in the field of Data Science.
Lovekesh Vig: Geoffrey Hinton, Juergen Schmidhuber and Andrew Ng.
Team, Skills and Tools
Which are your favourite Data Analytics tools that you use in your job, and what other tools are widely used in your team?
Lovekesh Vig: We use Google TensorFlow, Theano, PyTorch, Keras, and standard open-source Machine Learning libraries (generally based on NumPy and SciPy) to code our models.
What are the different roles and skills within your data team?
Lovekesh Vig: We have a Deep Vision team that works on problems related to image processing; people here specialize in building complex deep models for enterprise-relevant vision tasks. We also have a team dedicated to working on sensor analytics for large-scale time series sensor data, and another team dedicated to applications in Natural Language Understanding. The roles all revolve around research applications in artificial intelligence and Deep Learning, from junior researchers to senior scientists.
Can you describe some examples of the kinds of problems your team is solving this year?
Lovekesh Vig: Machine Learning in the enterprise world presents a different set of challenges: data is noisy or missing, predictions must be interpretable, data is often highly sensitive with restricted access, incorporation of domain knowledge is often necessary, and a failsafe mechanism has to be in place to ensure safety and legal concerns are met. My team helps devise solutions for these problems across a variety of domains, ranging from healthcare, where we are designing systems for automatic karyotyping of chromosomes for medical diagnostics, to product recognition in large retail outlets, to robotic control in factory settings, to generation of natural dialogue for chatbots.
How do you measure the performance of your team?
Lovekesh Vig: We have a goal-setting mechanism whereby people are aware of what is expected of them in the coming year, and performance is generally benchmarked against this. I allow for some degree of flexibility depending on how risky or unpredictable the research project is, but generally, research output in terms of patents and publications in top venues is an important parameter for performance evaluation.
Industry Readiness for Data Science
Are the industries looking to understand what they can do with data? Do they have the required data in place?
Lovekesh Vig: The initial wave of Deep Learning was driven by giants like Google and Facebook, and it was no surprise that the problems generally revolved around end-consumer products, and that the impact did not immediately percolate to enterprise applications. This is changing now, and the second wave of Deep Learning applications is going to hit enterprise applications in every sector from manufacturing to healthcare. As far as being data ready is concerned, there is quite a bit of variation here, both within and across sectors.
For instance, in healthcare there is now a substantial amount of data for medical-imaging-based diagnostic models to be trained; however, the challenge of correlating diagnosis with patient medical records and incorporating domain knowledge and specific genetic factors is still quite steep. Some sectors like manufacturing have done a better job of delineating the Industry 4.0 standards for moving towards “Smart Factories” by augmenting machines with sensors for health monitoring and fault detection. Other sectors are still unsure about where data can take them and how to get there, but there is definitely a renewed desire to utilize data in all its forms, which includes images, unstructured text, sensor data and relational data. This is resulting in all sorts of innovations that will change the way enterprises function.
What are the top 3 problems in Data Science today, either by industry or by technology area?
Technology Area: Deep Learning
- Making interpretable predictions from Deep Models
- Incorporating domain knowledge into Deep Models
- Training deep models with limited data
Industry Readiness for Big Data
Is Big Data becoming a reality in the industry beyond the social giants like Facebook, Google, Yahoo? If yes, which industries are actually moving towards the power of Big Data Analytics? If no, what is the outlook for adoption?
Lovekesh Vig: The impact of big data on organizations in Retail, Manufacturing and Energy has been profound, and other sectors like Healthcare and Pharma are now catching up and redesigning their processes for better data capture, integration and prediction.
Name 3 Industries and the kind of problems that they are solving using Big Data.
Lovekesh Vig: Manufacturing and industries involving heavy machinery are certainly moving towards Big Data Analytics for machine health monitoring, fault detection and prognostics. Retail companies have profited enormously by leveraging big data to make recommendations to their customers. Financial firms have successfully used big data to predict stock prices and to optimize portfolios.
Who in the Industry is your typical client for Big Data? Is it the CTO, CIO, CMO or special data leaders?
Lovekesh Vig: Our typical client is the CIO.
Advice to Aspiring Data Scientists
According to you, what are the top skills, both technical and soft, that are needed for Data Analysts and Data Scientists?
Lovekesh Vig: Technically, the thing we look for is a deep understanding of the models being used. Because of the proliferation of Machine Learning APIs, it is easy to use a model without fully understanding how it works, get a prediction and be done. A good Data Scientist will know the math behind the model, understand the impact of the different parameters and thus be able to discern which models work best for which types of data. About 80% of a Data Scientist's time is spent cleaning and encoding the data in a suitable form for a Machine Learning algorithm to consume. A Data Scientist should, therefore, be equipped with the techniques to deal with missing and noisy data efficiently and be aware of the impact the data encoding has on the final results.
Soft skills include the ability to understand the business use case and to plan a project’s execution, which involves estimating the compute and manual resources and the available skill set. They also include the ability to communicate ideas effectively and to clearly articulate the strengths and limitations of an approach and what can and cannot be accomplished with the data. In general, the Data Scientist must be invested in the success of the project.
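The point above about missing data and encoding can be made concrete with a small sketch. It fills missing numeric values with the median and turns a categorical field into 0/1 columns, keeping missing categories as their own value. Everything here is an assumption for illustration: the records, the field names, the helper names `impute_median` and `one_hot`, and the choice of median imputation are not from the interview.

```python
from statistics import median

# Hypothetical records; None marks a missing value. All values are made up.
records = [
    {"temperature": 21.5, "status": "ok"},
    {"temperature": None, "status": "fault"},
    {"temperature": 23.0, "status": "ok"},
    {"temperature": 22.0, "status": None},
]

def impute_median(rows, field):
    """Replace missing numeric values with the median of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def one_hot(rows, field, missing_label="unknown"):
    """Replace a categorical field with one 0/1 column per category.

    Missing values get their own category rather than being dropped.
    """
    filled = [r[field] if r[field] is not None else missing_label for r in rows]
    categories = sorted(set(filled))
    encoded = []
    for row, value in zip(rows, filled):
        new_row = {k: v for k, v in row.items() if k != field}
        for category in categories:
            new_row[f"{field}={category}"] = 1 if value == category else 0
        encoded.append(new_row)
    return encoded

clean = one_hot(impute_median(records, "temperature"), "status")
for row in clean:
    print(row)
```

Choices such as median versus mean imputation, or whether missing categories get their own column, are the kind of encoding decisions that can change a model's final results.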
How much focus should aspiring data practitioners put on working with messy, noisy data? What are the other areas in which they must build their expertise?
Lovekesh Vig: A lot. As I mentioned, cleaning and encoding the data is critical to building useful applications, and a majority of a Data Scientist's time is spent on that. Other areas include predictive modeling, optimization of compute and storage infrastructure, and data visualization.
What is your advice for newbies, Data Science students or practitioners who are looking at building a career in the Data Analytics industry?
Programming and software skills – R, Python, SAS or Excel
A good programmer should be able to move across languages and tools. You can have a favorite language to work in (I would recommend Python), but that should not restrict you from moving across languages and tools if the situation demands.
Pick your favorite visualization tool and become a power user of the different visualization capabilities, and learn which visualizations are useful for different types of data. Visualization is essential before doing any sort of predictive modeling.
Statistical foundation and applied knowledge
These are aspects that are often ignored by young data scientists in favor of flashier technologies, but they are foundational to the understanding of Machine Learning and can be a great differentiator in a career down the road.
Doing a course in Machine Learning is not sufficient anymore; Machine Learning is an applied field. It is essential to back up your theoretical knowledge with real experience, possibly on data challenges posted on sites like Kaggle.
What are the changing trends that you foresee in the field of Data Science and what do you recommend the current crop of data analysts do to keep pace?
Lovekesh Vig: With the proliferation of Deep Learning and most Deep Learning models running on GPU clusters, programming in CUDA or OpenGL for running computations on GPUs has become a highly valued skill and is likely to be in demand for the foreseeable future. Also, deep models will soon become so common that every data scientist will have to be familiar with at least the basic types of deep models for applications involving image and sequential data. As AI penetrates the enterprise world, data scientists will have to be able to adapt to newly emerging application areas like smart conversational apps, intelligent device applications and augmented reality. The days of being able to rely on a fixed set of technology skills are fading fast, and the future belongs to people who are capable of quickly adapting to a new domain, solving relevant problems and moving on.
Big Data Solution Space
What kind of structured and unstructured data do companies have? What is the size that we are talking about?
Lovekesh Vig: Unstructured data can be in the form of financial reports, company blogs, emails, images etc., whereas structured data generally refers to the historical transactional or operational data stored in normalized data formats. The size varies from sector to sector; sensor data usually runs into hundreds of Terabytes a year for any large factory. Healthcare datasets are also getting larger as imaging techniques yield better and better resolution images, historical reports get more and more detailed, and genome sequencing gets cheaper. In general, we are in the Terabyte realm for most real-world applications.
Would you like to share a few words about the work we are doing at Digital Vidya in developing Data Analytics talent for the industry?
Lovekesh Vig: I think Digital Vidya has a very important role in getting the word out from the industry’s perspective. Too often we find a huge disconnect between the skills in demand in the industry and the skills that academic institutions and youngsters tend to focus on. I imagine a lot of that is due to a gap in communication from industry back to academia and to young people starting out their careers. Initiatives like Digital Vidya will hopefully help to plug that gap.
To know more about Lovekesh Vig, you can check out his LinkedIn profile.