Professionally, Jaidev worked as a QA engineer, then as a software engineer, and then as a data scientist. His background in data science goes right back to his college days. He was a research assistant and used to help with signal and image processing course material – especially converting exercises and tutorials into MATLAB code to be distributed to students. That got him interested in the underlying mathematics. Machine learning was almost a natural corollary to this. Before he was employed as a data scientist, he set aside regular time to practice machine learning on his own, until he started doing it for a living.
How did you get into Data Analytics? What interested you in learning Data Analytics?
Jaidev: Zed Shaw once published a very controversial article on why programmers need to learn statistics. I don’t share Shaw’s aggression, but it’s not difficult to understand his point. No matter what you do, analytics is required everywhere.
For me, even when I was in QA, it started with something as simple as keeping track of software packages that fail frequently – this has an element of predictive modeling. There’s immense value in knowing when and how frequently some library might fail. I’ve seen many examples of how basic exploratory data analysis reveals a lot about your process, your business and even your personal life.
In fact, I’m often surprised by how often I don’t need advanced technology. So even if you’re not a professional data scientist, there’s plenty you can do with analytics.
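As a rough illustration of the failure-tracking idea mentioned above, here is a minimal Python sketch. The build log, the package names and the `failure_rates` helper are all hypothetical and made up for this example; they are not taken from Jaidev's actual QA work.

```python
from collections import Counter

# Hypothetical build log: (package name, whether the run passed).
build_log = [
    ("parser", True),
    ("parser", False),
    ("netlib", False),
    ("netlib", False),
    ("netlib", True),
    ("ui", True),
]


def failure_rates(log):
    """Return {package: fraction of its runs that failed}."""
    runs = Counter(name for name, _ in log)
    failures = Counter(name for name, passed in log if not passed)
    # Counter returns 0 for packages that never failed.
    return {name: failures[name] / runs[name] for name in runs}


rates = failure_rates(build_log)
# Packages ordered from most to least failure-prone.
ranked = sorted(rates, key=rates.get, reverse=True)
```

Even a simple count like this can show which packages deserve attention first, without any advanced tooling.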
What was the first data set you remember working with? What did you do with it?
Jaidev: It was a dataset that contained counts of different types of vehicles within specific time periods on the roads of Pune. We did a lot of clustering, decomposition and statistical modeling on it which allowed for concise modeling of traffic flow. It led to my first paper 🙂
What is your typical day-in-a-life in your current job? Where do you spend most of your time?
Jaidev: I work roughly eight hours at my day job. This usually starts with a quick call with colleagues, followed by some warm-up coding, which consists of easy-to-fix bugs and small issues. More than half my working time is spent on software engineering, and the remainder is core data science. That’s actually a good position to be in – it keeps you a good hacker. Apart from this, I usually set aside an hour or two in the mornings and evenings for my personal projects.
How do you stay updated on the latest trends in Data Analytics? Which are the Data Analytics resources (i.e. blogs/websites/apps) you visit regularly?
Jaidev: More than blogs or websites I follow people on Twitter and GitHub. Humans make the best recommendation systems. For staying up to date with the latest trends, too, I rely mostly on people. I’m fortunate enough to be a part of many vibrant and active communities. The best way to keep up is by simply talking to people in your community about what you and they are working on.
Share the names of 3 people that you follow in the field of Data Science or Big Data Analytics.
Jaidev: Randy Olson, Olivier Grisel and Sebastian Raschka.
Team, Skills and Tools
Which are your favorite data analytics tools for your job, and what other tools are widely used in your team?
Jaidev: Pandas and scikit-learn are my bread and butter. I also use a fair amount of MS Excel / Google Sheets for my daily computational needs.
What are the different roles and skills within your data team?
Could you describe some examples of the kinds of problems your team is solving this year?
Jaidev: Our focus is on storytelling with data. That entails problems like how to extract the most insight with the least effort. That implies reducing the amount of code people write and making things more and more reusable. Another perpetual problem we try to solve is how to standardize analysis itself. For example, if you could ask only five different questions of your data, what would they be? We also work a lot on the consumption side of analytics, and consequently there’s a lot of exciting work going on in data visualization, analysis and reporting, storytelling – and of course, automating all of the above.
How do you measure the performance of your team?
Jaidev: We have very well-defined goals for teams as well as for individuals. They have a revenue component, but we also focus a lot on productivity improvements. A lot of what we do gets measured by how much it makes someone else’s life easier – someone who may not be on your team, may not be in your company or may not even be an analyst. That helps us keep our feet on the ground.
Big Data Teams, Skills and Tools
In the huge Big Data landscape, skills are changing swiftly. Which technologies do you see dominating the ETL and real-time data space?
Jaidev: Tools like Spark, the ELK stack, Kafka, Redshift, etc. are clear winners in this space. Even if you ignore how easy they have become through managed services like Azure / AWS – they have clearly come a long way in terms of ease of use, APIs and management. The point is that they are no longer intimidating to someone who is not an expert in distributed computing. Any tech that grows along these lines – ease of use, management and extensibility – is tremendously useful.
How can aspiring data engineers demonstrate their ability to handle the tools, technology, data and domain? Is a certificate (Cloudera/Hortonworks) a clear differentiator?
Jaidev: I personally believe that certifications are not clear differentiators. They will get you past initial filtering, but many industry experts understand that it is easy to “game” a certification. The certification alone doesn’t make you more hirable. What would be a clear differentiator is a portfolio of work – this could even consist of relatively simple applications – which demonstrates that you are able to get things done.
Are analytical skills, statistics and machine learning must-have or good-to-have skills for data engineers?
Jaidev: Somewhere between good to have and must have – don’t get distracted by them, but don’t ignore them either.
Industry Readiness for Data Science
Are the industries looking to understand what they can do with data? Do they have the required data in place?
Jaidev: There are organizations and individuals on both sides of the spectrum. There are organizations that have a lot of data with no idea of what to do with it, and there are people who have grand plans without data. The bigger problem is awareness – AI / ML is exciting, but many industry problems are not solvable with data or analytics. On the other hand, having a lot of data doesn’t automatically make you a data-driven business. Knowing if and when to apply data science is a major problem, and solving it is the responsibility of both the practitioners and consumers of data science. The practitioners’ responsibility is to communicate their offering very clearly, and the consumers’ responsibility is to have realistic expectations from their data and processes.
What are the top 3 problems facing data science, either by industry or by technology area?
(i) Interpretability of machine learning – we usually apply ML to automate processes, but it can also be used to interpret your data. There must be a focus on how well your model explains your data. We are almost drowning ourselves in black box models, which means we get a few things done, but most often we miss the bigger picture and the hidden truths in our data.
(ii) Better Tooling – There’s a huge gap between the pace at which academic research progresses and the pace at which this research is adopted by the industry. One way to bridge this gap is to find better ways of figuring out if a particular piece of technology is viable in the long run. Realistically, data science hasn’t become so ubiquitous that the common man is able to reap its benefits. All of these problems will benefit a lot from better and easier to use tools.
(iii) Awareness – Unfortunately, the hype around data science is at an all-time high, and we don’t know if we have seen it peak yet. This has led consumers, practitioners and students alike to have massively unrealistic expectations from data science. It’s not a silver bullet. It’s not a one-size-fits-all solution. As a community, we must do everything we can to dispel myths.
Advice to Aspiring Data Scientists
In your view, what are the top skills, both technical and soft, that data analysts and data scientists need?
Jaidev: If you’re new to programming, learn one of Python, R or MATLAB. Usually learning the first programming language is difficult, but moving from one language to another is easier. Knowledge of probability, statistics and linear algebra is good to have – to the extent that it applies to machine learning. Whichever concept you learn, make sure you can write the corresponding code. As for soft skills, be a good communicator – read a lot, work on a lot of data, and speak or write about insights from that data.
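To make the advice about writing the corresponding code concrete, here is one possible sketch in Python: sample variance implemented from its definition and checked against the standard library. The `sample_variance` function and the numbers in `data` are made up for illustration and are not from the interview.

```python
import math
import statistics


def sample_variance(values):
    """Unbiased sample variance: sum of squared deviations / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Made-up numbers, used only to exercise the function.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

own = sample_variance(data)
reference = statistics.variance(data)  # standard-library implementation

# Comparing against a trusted implementation is a quick self-test.
matches = math.isclose(own, reference)
```

Writing the formula by hand and then comparing it with a trusted implementation is a simple way to confirm that the concept has actually been understood.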
How much focus should aspiring data practitioners put on working with messy, noisy data? What are the other areas in which they must build expertise?
Jaidev: As much as possible. No one is going to drop clean datasets into your pocket. It is not an uncommon requirement for data scientists to have a good handle on not only the analytics but also ETL processes. Be a good hacker. A good hacker is anyone who’s not afraid to dirty their hands with multiple tech stacks, data sources and algorithms. It often gets uncomfortable, but that’s where breakthroughs happen.
What is your advice for newbies, data science students or practitioners who are looking to build a career in the data analytics industry?
(i) Programming and software skills – R, Python, SAS or Excel. I highly recommend Python and Excel.
(ii) Visualization Tools – Matplotlib or ggplot – pick one and get good at it.
(iii) Statistical foundation and applied knowledge
(iv) Machine Learning
For both statistics and machine learning there is no dearth of learning material. Unfortunately, there is no answer to which resource is the best for a beginner. However, there is a simple approach one can use to deal with the overload of information. Pick any resource and study it. Keep testing yourself on the concepts you learn by coding them. If you feel overwhelmed, or that you are learning things in too much detail which is unnecessary at the time, drop that resource and pick something simpler. Come back to the original book/article/blog post when you feel more confident.
Would you like to share a few words about the work we are doing at Digital Vidya in developing data analytics talent for the industry?
Jaidev: Digital Vidya is doing a great job of imparting the most relevant and up-to-date knowledge when it comes to data science. They are providing a clear to aspiring data scientists.