Interview With Harshad Khadilkar, Scientist, Tata Consultancy Services

Harshad Khadilkar holds a BTech (2009) in Aerospace Engineering from IIT Bombay and an MS (2011) and PhD (2013) from the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology. His journey towards data science began in the latter stages of his bachelor’s degree when he became interested in aircraft control systems.

As in many other fields (such as operations research), formal work on aircraft controllers began during World War II, when aeroplanes began flying faster and higher than human muscle power alone could handle. This era also marked the advent of system identification, modelling, and empirical testing, all of which are essentially data science (from before the term was even imagined!).

During his graduate studies at MIT, he switched from aircraft controls to more generic control problems, eventually writing his thesis on the control of airport operations. A large part of his thesis uses approximate dynamic programming, a field from control theory closely related to what is known as reinforcement learning in the Artificial Intelligence community.

It was therefore a natural transition to move into reinforcement learning (RL) when he joined IBM Research after his PhD, and subsequently TCS Research & Innovation. His work is somewhat different from the bulk of data analytics because he does not work on prediction, forecasting, or pattern recognition problems. His team focusses on using data science for decision-making in industrial systems, in domains such as supply chain and transportation networks.

What was the first data set you remember working with? What did you do with it?

Harshad Khadilkar: The first data set of any consequence that I worked with was aircraft fuel consumption readouts from the onboard flight data recorder. I was working on my master's at the time, and we were trying to quantify the fuel consumption of aircraft during various phases of movement on the airport surface (before takeoff and after landing).

The most complex aspect of working with this data set was disambiguating the modes of movement (straight-line taxiing, turning, braking, stopped) by correlating the fuel readouts with the latitude/longitude readings from the GPS system.

Once this task was accomplished, we had to model the effect of independent variables (total time from gate to runway, number of braking and stop/start events, etc.) on fuel consumption.

Our results showed that while a first-order fuel consumption estimate could be obtained directly from the total movement time, stop/start events also produced a statistically significant effect. The paper based on this work is now widely cited and has encouraged researchers to develop optimization algorithms that compute ‘conflict-free’ trajectories for aircraft moving towards the runway for takeoff.
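Editorial illustration (not part of the interview): the kind of analysis described above can be sketched as an ordinary least-squares fit of fuel burn against taxi time and the number of stop/start events, followed by a t-statistic check on each coefficient. Everything in the sketch is an assumption made for illustration: the function name `fit_fuel_model`, the linear model form, the synthetic data, and the coefficient values are hypothetical and are not taken from the study.

```python
import numpy as np


def fit_fuel_model(taxi_time_s, n_stops, fuel_kg):
    """Fit fuel ~ b0 + b1 * taxi_time + b2 * n_stops by ordinary least squares.

    Returns the coefficients and their t-statistics.
    """
    # Design matrix: intercept column, taxi time, stop/start count.
    X = np.column_stack([np.ones(len(fuel_kg)), taxi_time_s, n_stops])
    coef, *_ = np.linalg.lstsq(X, fuel_kg, rcond=None)

    # Standard errors from the residual variance and (X^T X)^-1.
    residuals = fuel_kg - X @ coef
    dof = X.shape[0] - X.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, coef / std_err


# Synthetic data, invented purely for illustration (not the study's data).
rng = np.random.default_rng(0)
n = 500
taxi_time_s = rng.uniform(300, 1500, size=n)   # seconds from gate to runway
n_stops = rng.integers(0, 6, size=n)           # stop/start events per taxi
true_coef = np.array([20.0, 0.2, 8.0])         # hypothetical coefficients
fuel_kg = (
    true_coef[0]
    + true_coef[1] * taxi_time_s
    + true_coef[2] * n_stops
    + rng.normal(0.0, 5.0, size=n)             # measurement noise
)

coef, t_stats = fit_fuel_model(taxi_time_s, n_stops, fuel_kg)
for name, c, t in zip(["intercept", "taxi_time_s", "n_stops"], coef, t_stats):
    print(f"{name:12s} coef={c:8.3f}  t={t:8.2f}")
```

In a sketch like this, a large t-statistic on the `n_stops` term is what would indicate that stop/start events matter over and above total taxi time; a real analysis would also need to check the model assumptions against the recorded data.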

Was there a specific “aha” moment when you realized the power of data?

Harshad Khadilkar: Not really. I have been ‘data-driven’ from quite a young age, whether in my work or in my life. I tend to not accept traditional belief systems without substantial empirical evidence.

How do you stay updated on the latest trends in Data Analytics? Which are the Data Analytics resources (i.e. blogs/websites/apps) you visit regularly?

Harshad Khadilkar: My main sources of material on the state of the art are journal articles, conference proceedings, and arXiv papers.

Share the names of 3 people/publications/research that you follow in the field of Data Science or Big Data Analytics.

Harshad Khadilkar: The three principal groups that we keep an eye out for are:

1. Google DeepMind

2. OpenAI

3. UC Berkeley’s work on reinforcement learning

Team, Skills and Tools

Which Data Analytics tools do you most like to use in your job, and which other tools are widely used in your team?

Harshad Khadilkar: Since most of my team works on optimal control problems using data-driven techniques such as reinforcement learning, we use Python and its libraries as our primary implementation platform. Deep learning libraries such as TensorFlow and PyTorch are both used, depending on problem requirements (how complex the network is, how long the training runs are expected to be, and how soon it is expected to be deployed in production). While TensorFlow is very stable and can be used in client-server architectures, defining complex networks is easier in PyTorch.

What are the different roles and skills within your data team?

Harshad Khadilkar: My team consists mainly of people with PhDs or master's degrees in areas such as machine learning, computer science, operations research, and electrical engineering. A diverse set of domain knowledge keeps the team vibrant and allows us to attack problems in many different fields.

Can you describe some examples of the kinds of problems your team is solving this year?

Harshad Khadilkar: Some examples from our current work are:

Computing policies for operating in multi-agent competitive environments such as Pommerman, a multi-agent reinforcement learning task hosted annually at NeurIPS.

Using multi-agent reinforcement learning in practical problems such as replenishment decisions in retail supply chains. This problem contains a rich variety of data and decisions, for every node in the supply chain (warehouses, logistics providers, stores), and all of them have to coordinate in order to maintain high customer satisfaction with minimal operating costs.

Automation of parcel loading in postal sorting centres, where packages arrive on a conveyor belt in real time and we (the reinforcement learning algorithm) have to quickly decide where (and in what orientation) to place them in the container, so as to maximize space utilization. We are not building the robot hardware or manipulation system here; we are only developing the outer control loop (placement decisions).

How do you measure the performance of your team?

Harshad Khadilkar: The two traditional ways of measuring team performance in industrial research labs are intellectual property (publications & patents), and business impact (how much of the research was absorbed by the business arm of TCS). While we keep these metrics in mind, we prioritize learning of new skills and techniques as the primary metric.

The field of data science / AI is so fast-moving that it is vitally important to keep on top of new technological developments and to bring them into our system before demand is generated from the customer end. Only by keeping ahead of trends can we maintain enough space and time for true research and exploration – if we followed the trends after they had become well known, we would be racing to deliver projects to customers instead.

Big Data Team, Skills and Tools

Are analytical skills, statistics, and machine learning must-have or good-to-have skills for Data Engineers?

Harshad Khadilkar: This is a very important question and one on which I happen to have strong opinions. Specifically, I feel that there are no ‘must-have’ skills for data scientists or engineers. Instead, there is a ‘must-have’ mentality: as scientists and engineers, we must have

(i) the will to understand the underlying mathematics rather than simply import libraries and use them,

(ii) the initiative to hunt for solutions to coding-related problems online, and

(iii) the patience to ensure that algorithms are working correctly, rather than jumping to conclusions based on the final results only.

Industry Readiness for Data Science

Are the industries looking to understand what they can do with data? Do they have the required data in place?

Harshad Khadilkar: Industries are definitely waking up to the potential of data-driven decisions for operational improvements. They are aware that successful companies in the future will be the ones who can best leverage the latent value in their data. At the same time, there is less clarity on how exactly the value can be realized. This is where data science professionals and their offerings can help; once new clients are made aware of the concrete steps that they can take, they are very positive about trying new things.

What are the top 3 problems facing Data Science today, whether by industry or by technology area?

Harshad Khadilkar:

1. Analyzing existing (internal) data sources, and identifying holes or discrepancies

2. Incorporating external variables – weather, socio-economic factors, demographics, etc. – into internal prediction or forecasting algorithms

3. Building a clear, quantitative understanding of one’s own business operations (it is hard to believe that this is a problem, but you would be shocked by how little companies know about the breadth of their own operations).

Advice to Aspiring Data Scientists

How much should aspiring data practitioners focus on working with messy, noisy data? What are the other areas in which they must build their expertise?

Harshad Khadilkar: Every data practitioner MUST work with messy and noisy data. There is no getting around this step. In my experience, more than half of the challenges faced by today’s enterprises can be solved using very simple algorithms, if they are fed with clean data. Sophisticated analytics and machine learning / deep learning techniques are only secondary impact factors.

What is your advice for newbies, Data Science students or practitioners who are looking at building a career in the Data Analytics industry?

Harshad Khadilkar:

1. Programming & Software Skills (R, Python, SAS or Excel): Python! Excel is good only for management-type work.

2. Visualization Tools: If you use machine learning algorithms, TensorBoard is an excellent tool to peek inside your TensorFlow code.

3. Statistical foundation and applied knowledge: Be very, very thorough with linear algebra, probability, and statistics, before jumping into advanced analytics.

4. Machine Learning: Definitely, but only after completing step (3).

What are the changing trends that you foresee in the field of Data Science and what do you recommend the current crop of data analysts do to keep pace?

Harshad Khadilkar: Data Science is currently in a nascent, gung-ho phase in which throwing more data and more compute at a problem is considered to be the best solution. This is not sustainable from a long-term point of view, in terms of technical feasibility, cost, and ecological impact due to the huge level of energy consumption.

I would suggest scientists and engineers of the future focus on ‘lean’ data science techniques, where they create maximum impact with minimum resources. Sometimes, the marginal returns for moving from 99% to 99.9% accuracy are simply not worth the computational effort. Keep in mind that delivering net value to clients in reasonable time and at reasonable cost is the key factor that determines success.

Big Data Solution Space

What is the kind of structured and unstructured data companies have? What is the size that we are talking about?

Harshad Khadilkar: The move towards detailed data gathering within enterprises has been faster than the move towards reaping the rewards from the data. Most companies these days have a large amount of structured data – where applicable, through SCADA systems for example. However, many times the problem is that they use data-gathering systems from multiple vendors, which makes data fusion difficult.

Furthermore, if the companies don’t own the source (for example, they use hardware with closed communication protocols), there is very little hope of gathering the data centrally. Wherever possible, data engineers should emphasize to their clients how important it is to completely own their data, and that the benefits of open source technologies outweigh their problems.

Are there legacy systems that are being replaced? If yes, which legacy skills are being replaced?

Harshad Khadilkar: Legacy systems are certainly being replaced by companies in all fields, from airlines to manufacturing to banking & finance. The systems that are changing most rapidly are ones that function with a fixed set of rules of thumb and a small amount of wiggle room for individual judgment.

So far, the last part (individual judgment) has prevented the automation of these systems and has generated demand for people with the skills to understand their portion of the company’s operations and make fast, reasonably accurate judgment calls.

For example, retail store managers have typically decided when to re-stock their products, and in what quantities. Jobs such as these are quickly losing out to data-driven automation, the type that can use the same rules of thumb but also use machine learning to maximize the impact within the available scope.
