Interview with Raghav Bali, Senior Data Scientist, UnitedHealth Group

Raghav is an avid technology fan whose interest in computers goes back a long way. He first got the opportunity to write code in the 5th class. His career started as a Software Engineer with Infosys, where he was lucky enough to begin on an ERP platform. This gave him the opportunity to understand software architecture and design patterns, the role of database design, and the challenges associated with big software projects. It also encouraged him to dig into the details and study software design and architecture further.

He then joined IIIT-Bangalore for his Masters with the aim of learning software design at scale. Partway through the programme, he took up a newly introduced course on Machine Learning. This subject changed the course of his career and marked his foray into Data Science and Machine Learning. He got to understand the theoretical concepts, the algorithms, and their applications.

After his Masters, he moved on to American Express, where he worked with financial data along with digital assets to improve adoption. This experience helped him understand a completely new domain and how data is leveraged in the real world to improve widely used products.

Moving to Intel provided him with the opportunity to work with experienced researchers and experts to develop enterprise-level data-driven solutions that leveraged Machine Learning models to solve problems. The depth of understanding and the immense passion to learn more is what amazed him while working with such experts. During his tenure at Intel, he had the privilege to publish a conference paper and author 4 books on Machine Learning and Deep Learning covering a wide variety of algorithms, use-cases, and real-world challenges. The books have been immensely successful and well-received.

In his current role at Optum (UnitedHealth Group), as a Senior Data Scientist, he is making an impact in the world of healthcare. His work involves research and development of enterprise-level solutions based on Machine Learning, Deep Learning, and Natural Language Processing for healthcare and insurance use cases.

Download Detailed Curriculum and Get Complimentary access to Orientation Session

Date: 08th Aug, 2020 (Saturday)
Time: 10:30 AM - 11:30 AM (IST/GMT +5:30)

What was the first data set you remember working with? What did you do with it?

Raghav Bali: The first real problem statement that I picked up for a Machine Learning use case was part of my project work at IIIT-B. The problem statement was to identify Indian languages from a short utterance. Today it might seem like a trivial problem, with all the Deep Learning frameworks and tonnes of public datasets at our disposal.

But back in 2013, we had no access to a dataset that would help us solve this problem. This led us to painstakingly prepare our own dataset. This exercise not only helped us complete our project and build an impressive classifier but also helped us understand the challenges and importance of preparing a dataset. We carefully prepared a strategy/charter on how to collect the data, identified a diverse set of speakers, and made sure we protected the privacy of our donors. We collected a decent-sized dataset for 4 Indian languages which was quite comprehensive in multiple aspects.

As I mentioned, we utilized this dataset to finally prepare an impressive ensemble of classifiers to detect Indian languages. The dataset helped us in preparing a classifier that was generic enough to work on new and noisy samples at runtime.
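To make the idea of an "ensemble of classifiers" concrete, here is a minimal sketch of majority voting, one common way to combine several models. It is not the project's actual code: the three toy classifiers, their feature thresholds, and the labels `lang_a`, `lang_b`, and `lang_c` are hypothetical stand-ins for models trained on real speech features.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical type: a classifier maps a feature vector to a language label.
Classifier = Callable[[Sequence[float]], str]


def majority_vote(classifiers: Sequence[Classifier], features: Sequence[float]) -> str:
    """Return the label most classifiers agree on (ties go to the earliest vote)."""
    votes = [clf(features) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Toy stand-ins for trained models; the thresholds are arbitrary assumptions.
def clf_one(x: Sequence[float]) -> str:
    return "lang_a" if x[0] > 0.5 else "lang_b"


def clf_two(x: Sequence[float]) -> str:
    return "lang_a" if x[1] > 0.3 else "lang_c"


def clf_three(x: Sequence[float]) -> str:
    return "lang_b" if x[2] > 1.0 else "lang_a"


ensemble = [clf_one, clf_two, clf_three]
prediction = majority_vote(ensemble, [0.9, 0.1, 0.5])
print(prediction)
```

Because each model sees the utterance differently, a single model's mistake on a noisy sample can be outvoted by the others, which is one reason ensembles tend to generalize better.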

Was there a specific “aha” moment when you realized the power of data?

Raghav Bali: I would say I have been lucky so far to have had the chance to work across so many different domains. Each of my past and current roles has given me highs, or “aha” moments as you say. Amongst all such moments, one in particular is important to my journey so far and to my thought process. This was during a project at Intel where we were supposed to develop a model to predict specific hardware failures.

Being diligent and thorough in our approach to solve this problem, we still overlooked one of the simple algorithms in favor of a complex and trending one (Deep Learning, anyone?). When we presented our well-performing solution to one of the experts for review, we were asked to go back and try out a simpler solution.

At first, we were disheartened at not being appreciated. But once we went back to the drawing board and tried a simpler set of algorithms, to our surprise, this solution worked elegantly and was far faster than the complex one we had built earlier.

This was a perfect example where the data itself led the way to determine which model would work best rather than what the trending ones are. Since then keeping it simple has been my mantra to solve problems.
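The "keep it simple" habit can be sketched as: before reaching for a complex model, measure how far trivial models get on held-out data. The example below is an illustration under stated assumptions, not the Intel project's method; the single "sensor reading" feature and all data values are made up.

```python
from collections import Counter


def accuracy(predict, rows):
    """Fraction of (features, label) rows that predict() labels correctly."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_majority(train):
    """Simplest possible model: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_threshold(train, feature=0):
    """One-feature rule: predict failure (1) when the feature exceeds a cut-off.

    The cut-off is the training value that gives the best training accuracy.
    """
    def rule(cut):
        return lambda x: int(x[feature] > cut)

    candidates = sorted({x[feature] for x, _ in train})
    best = max(candidates, key=lambda cut: accuracy(rule(cut), train))
    return rule(best)


# Made-up data: (sensor reading,) -> 1 if the part failed, 0 otherwise.
train = [((61,), 0), ((64,), 0), ((70,), 0), ((82,), 1), ((88,), 1), ((91,), 1)]
test = [((60,), 0), ((68,), 0), ((85,), 1), ((95,), 1)]

baseline = fit_majority(train)
simple = fit_threshold(train)
print("majority baseline:", accuracy(baseline, test))
print("threshold rule:   ", accuracy(simple, test))
```

If a rule this small already performs well on held-out data, a heavier model has to justify its extra cost in accuracy, speed, or maintainability.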

How do you stay updated on the latest trends in Data Science? Which are the Data Science resources (i.e. blogs/websites/apps) you visit regularly?

Raghav Bali: Well that’s a very good question and often asked. It is also very important to follow the right channels due to the sheer speed of improvements and new things coming up in the world of Data Science and Machine Learning.

I primarily use Twitter and Reddit to stay on top of the game. Handles/people like @ilyasut, @aureliengeron, @distillpub, @seb_ruder, etc. on Twitter keep sharing the latest from their own work as well as that of other researchers. arXiv.org and Hacker News, along with r/MachineLearning and Towards Data Science on Medium, are some of the other online channels that I keep tabs on.

Share the names of 3 people/publications/research that you follow in the field of Data Science or Big Data Analytics.

Raghav Bali: Machine Learning is a very diverse field with quality content all over the place (though there is a lot of noise too). Research work from the labs and associates of the Trinity (Hinton, Bengio, and LeCun) is a must to follow, along with work from the lab of Jürgen Schmidhuber.

Team, Skills, and Tools

Which are your favorite Data Analytics/Science tools that you use in your job, and what other tools are widely used in your team?

Raghav Bali: It’s an ever-evolving ecosystem mostly centered around Python (though I work with R and Java as well). The following are the most widely used:
(i) Jupyter, Pandas, PySpark, Matplotlib, Seaborn, Dask for data preparation and exploratory analysis
(ii) scikit-learn, TensorFlow, Keras, MLxtend, spaCy, Gensim, etc. for modeling, experimentation and so on
(iii) Flask, Gunicorn, Django, TensorFlow Serving, etc. for deployment

What are the different roles and skills within your data team?

Raghav Bali: Our team is a good mix of people from different backgrounds such as software engineering, mathematics, economics, and statistics. The roles and skillsets are focused on solving problems at scale. Primarily we have Data Scientist Managers who identify opportunities within the company where Machine Learning/Data Science can help.

There are experienced folks at Lead and Senior Data Scientist levels who work on a number of projects depending upon the area of expertise (NLP, Deep Learning, TimeSeries to name a few) with the help of Data Scientists and Machine Learning engineers to solve problems.

How do you measure the performance of your team?

Raghav Bali: Our team has a clear charter to assist the organization in achieving its goals. We are measured in terms of the number of projects in production, dollars saved/earned along with research contributions and innovations (in terms of publications and participation).

Industry Readiness for Data Science

Are the industries looking to understand what they can do with data? Do they have the required data in place?

Raghav Bali: I think the answer to this is a big yes, particularly for large organizations that have been in their respective domains for quite some time. Most large organizations have been collecting and storing huge volumes of data over the years. It is only recently that they have started leveraging that data to take them to the next level.

Startups and relatively newer organizations, though they do not have huge volumes of data, have the advantage of knowing the power of data. Hence most modern organizations are being built around data. Our home-grown startups like Flipkart, Swiggy, Zomato, Paytm, etc. are prime examples.

What are the top 3 challenges facing Data Science, either by industry or by technology area?

Raghav Bali: That’s a difficult one to answer. Each industry or technology area has its own unique set of problems. But on a very high level, most areas are facing problems related to (not limited to):

(i) Data Quality: Even when organizations have data, the quality is a big question mark
(ii) Awareness: Most organizations are just getting started on their journeys and hence the top leadership sometimes lacks the awareness and motivation to accept data-driven solutions
(iii) Managing Data Science Projects: Data Science projects are highly iterative. Typical software project management techniques do not work out of the box.

Advice to Aspiring Data Scientists

According to you, what are the top skills, both technical and soft-skills that are needed for Data Analysts and Data Scientists?

Raghav Bali: A Data Scientist/Analyst is seen as an amalgamation of multiple skills due to the nature of the role. A typical Data Scientist needs to be good at technical skills such as the fundamentals of the algorithms being used, software engineering skills to implement them efficiently, and soft skills like storytelling and visualization to get the idea across. Of course, a Data Scientist also needs to be self-critical and analyze every output/report/model he/she generates.

How much focus should aspiring data practitioners place on working with messy, noisy data? What are the other areas in which they must build expertise?

Raghav Bali: Even 100% would be an understatement. The real world is full of noisy, unclean, unstructured content, without the comfort of the clean datasets we see in academia or Data Science competitions. Data Scientists should prepare well to handle different datasets, both in terms of size and quality.

Apart from handling messy data, Data Scientists need to be well versed with different algorithms and their assumptions and read through seminal papers to understand where and how certain problems can be tackled in certain ways.
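As a small illustration of what "messy" means in practice, the sketch below cleans a handful of made-up records that show typical problems: inconsistent casing and whitespace, blank or non-numeric values, and duplicates. The field names, the records, and the choice of median imputation are assumptions for the example, not a recommended recipe for any particular dataset.

```python
from statistics import median

# Made-up raw records with typical real-world defects.
raw = [
    {"id": "1", "city": " Bangalore ", "age": "34"},
    {"id": "2", "city": "bangalore", "age": ""},      # missing age
    {"id": "3", "city": "MUMBAI", "age": "n/a"},      # non-numeric age
    {"id": "3", "city": "MUMBAI", "age": "n/a"},      # duplicate record
    {"id": "4", "city": "Mumbai", "age": "41"},
]


def parse_age(value):
    """Return the age as an int, or None when the field is blank or not a number."""
    value = value.strip()
    return int(value) if value.isdigit() else None


def clean(records):
    """Drop repeated ids, normalise text fields, and fill missing ages with the median."""
    seen, rows = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue  # keep only the first record for each id
        seen.add(rec["id"])
        rows.append({
            "id": int(rec["id"]),
            "city": rec["city"].strip().title(),
            "age": parse_age(rec["age"]),
        })
    known = [r["age"] for r in rows if r["age"] is not None]
    fill = median(known) if known else None
    for r in rows:
        if r["age"] is None:
            r["age"] = fill
    return rows


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Every choice here (which duplicate to keep, how to impute) encodes an assumption about the data, which is why understanding an algorithm's assumptions matters as much as the cleaning itself.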

What is your advice for newbies, Data Science students or practitioners who are looking to build a career in the Data Analytics/Science industry?

(a) Programming and software skills – R, Python, SAS or Excel
(b) Visualization Tools
(c) Statistical foundation and applied knowledge
(d) Machine Learning

Raghav Bali:

(a) Programming and software skills – R, Python, SAS or Excel

Python is the most widely used language for solving Data Science problems, but R, Julia, SAS, and even MATLAB are used depending upon the use case and organization. My suggestion would be to understand the fundamentals, become an expert in one language, and be open to exploring newer platforms if there is a need.

(b) Visualization Tools

Matplotlib, Seaborn, and Plotly are the most widely used. The world of R relies on ggplot2, with wrappers and extensions available in Python as well. For storyboarding and dashboards, tools like Tableau, Kibana, etc. are widely utilized. Visualization is an important part of any Data Science project’s lifecycle. Being able to express results visually helps win trust and acceptance from the business.

(c) Statistical Foundation & Applied Knowledge

R has a huge repository of amazing packages that cover most areas in statistics. The list is too long for me to share here, but rest assured, if there is a concept, you will find its implementation in R.

Python had a different story a few years back, but its ecosystem has since matured enough to become the go-to platform not only for Deep Learning and Machine Learning but also for statistical research.

To get to the basics and foundational concepts, I would recommend having a look at the amazing book “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman.

(d) Machine Learning

Machine Learning is growing so fast that it is difficult to keep track of the latest and the greatest. Yet, most research relies on the basics of calculus, linear algebra, and probability. Of course, there are a number of other important concepts, but you need to build a solid foundation first. Most machine learning and deep learning concepts and algorithms utilize these basics to build amazing learning systems.

Machine Learning algorithms categorized as supervised, unsupervised, semi-supervised and reinforcement learning have their own concepts that build on top of the basics. I would suggest getting started with online courses on ML from Coursera, Udemy, and other MOOCs.

What are the changing trends that you foresee in the field of Data Science and what do you recommend the current crop of data analysts do to keep pace?

Raghav Bali: As I mentioned earlier, the field of Data Science is changing at a break-neck speed. The amount of research and improvements we are seeing nowadays is not just incremental but phenomenal. Deep Learning is also seeing a massive amount of research and adoption. The best part about this current phase of Data Science/Machine Learning is the adoption. Today we have not just the concepts but the compute and other components to make it a reality.

I personally feel that trends are pointing towards Data Science getting more and more democratized. This would not only bring in more avenues of application but also more bright minds to the field. Another important trend that I see is the potential of transfer learning and optimization of deep learning architectures.

Now that we have the capability to build massive deep learning architectures, the next obvious step would be towards optimizing things. This would also help improve upon concepts such as AutoML, though AGI still seems a rather distant goal. My recommendation to people in Data Science/ML would be to stay on their toes, stay updated, and keep studying and contributing back. The journey has just started.

Are you inspired by the opportunity of Data Science? Start your journey by attending our upcoming orientation session on Data Science for Career & Business Growth. It’s online and Free :).
