Rajat Venkatesh has worked on Big Data platforms for more than 10 years. He has been part of two cycles of disruption in big data: first, building Massively Parallel Processing (MPP) databases on commodity hardware as an early member of the Vertica team, and second, architecting elastic MPP databases on cloud platforms like AWS and Azure. As part of this journey he has helped many companies build petabyte-scale data warehouses and democratize data for analysts and data scientists.
Is Big Data becoming a reality in the industry beyond social giants like Facebook, Google, and Yahoo? If yes, which industries are moving towards the power of big data analytics? If no, what is the outlook for adoption?
Rajat Venkatesh: Yes. Many more industries are using big data to improve decision-making. Some of the notable industries are:
- Health Care
- Life Sciences
- Natural Resources
Apart from the ones mentioned above, we have also noticed that the Ad-Tech and Media industries are usually early movers to big data analytics.
Name three industries and the kinds of problems they are solving using Big Data.
- Finance: Financial companies are using big data for stock analysis, fraud detection and to target financial products to the right customers.
- Life Sciences: Life Science companies are using big data for drug R & D, clinical trials, and sales & marketing.
- Transportation: Transportation companies are using data to track their fleets, improve operational efficiency, and improve user experience. These companies range from shipping companies to urban transportation companies like Uber and Ola.
I would also want to mention the retail, ad-tech/media and entertainment, and travel industries.
While the ad-tech/media industry is primarily into fraud detection and clickstream analytics, in the travel industry big data is being used to create profiles of travellers for segment based marketing and call centre performance analysis.
Who in the Industry is your typical client for Big Data? Is it the CTO, CIO, CMO or special data leaders?
Rajat Venkatesh: The technology leader depends on the use case.
- Data Engineering: Typically, the team reports to a CIO or VP of Engineering.
- Data Analytics or Data Science: Typically associated with a department like marketing or sales and therefore the client is a CMO or COO.
While industries are looking to understand how they can leverage data, do they have the required data in the first place?
Rajat Venkatesh: Although many organizations collect data, much of it resides as dark data and is not easily accessible. A recent report suggests that 90% of all sensor and machine data is dark data, i.e. never put to use.
While organizations see the value in a DataOps approach to designing and maintaining a cloud-native, distributed data architecture that prepares data and makes it available for ad hoc analytics and other downstream use cases, many lack the competency to build such a platform and find value in a turnkey solution such as Qubole ADP.
Additionally, different companies are at different stages of maturity in capturing and using data.
The different stages are:
- Aspirational: Know the data that is available and the possibilities
- Experiment: Capture data and generate useful insights for a specific use case.
- Expansion: Capture data and generate useful insights for multiple use cases across most departments or teams.
People and Skills
In the huge Big Data landscape, skills are swiftly changing. Which technology do you see dominating the ETL and data space?
Rajat Venkatesh: Apache Hive and Apache Spark are the dominant data engines that can handle terabytes per day. The current challenges for ETL engineers have now moved to other aspects of data engineering:
Manage ETL pipelines: ETL pipelines are complex and consist of many steps. Both commercial and open-source technologies, such as Apache Airflow, help manage them. Some of the unaddressed challenges are better alerting and monitoring, especially for SLAs and data quality.
Change management of ETL pipelines: ETL pipelines are typically brittle. They do not go through QA, nor do they have good DevOps tools to manage change like other software systems.
Consistency of data & metadata: Data is generated by multiple teams in a company. Data engineering teams collaborate with these teams to capture and manage the data. Since the data engineering team is the common factor, it is up to them to make sure that metadata is consistent across all teams. For example, the customer ID should have the same column name and format everywhere.
No existing technologies solve these issues for data engineers, so this is an exciting time for experts in the industry to create solutions to these next-generation problems.
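The pipeline-as-DAG idea above can be sketched in a few lines of plain Python. This is a toy illustration, not Airflow: the task names are hypothetical, and a real workflow manager adds scheduling, retries, and alerting on top of the same dependency structure.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical ETL steps: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "join_datasets": {"clean_orders", "extract_customers"},
    "load_warehouse": {"join_datasets"},
}

def run_order(dag):
    """Return one valid execution order for the pipeline DAG."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(pipeline)
print(order)  # extracts first, load_warehouse last
```

A workflow manager like Airflow essentially walks this same topological order, running independent tasks in parallel and raising alerts when a task misses its SLA.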
Hive appears to be taking the leadership position. What attributes make it the leader?
Rajat Venkatesh: Apache Hive is the premier engine for data engineering. The attributes that data engineers love are:
- Scale out: Apache Hive can scale out to the largest deployments; the biggest clusters are measured in many thousands of machines.
- Robust: At large scale, failures, especially network and hardware failures, are very common. Apache Hive, Hadoop, and HDFS can handle these failures without human intervention.
- Customizable: Apache Hive has a plugin system to read many types of data formats and implement user defined functions. This functionality is critical since at its foundation, data engineering involves bringing disparate data from many sources together.
More recently, Apache Spark has become a worthy option for data engineering for much the same reasons. Apache Spark improves on these attributes based on learnings from a decade of big data engineering in the Hadoop ecosystem.
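The "customizable" attribute above deserves a concrete picture. Hive UDFs are written in Java, but the same idea can be illustrated with SQLite's Python API: the engine lets you register a user-defined function that SQL queries can then call, so disparate data formats can be normalized inside the query itself. The table and function names here are hypothetical.

```python
import sqlite3

# Illustrative only: in Hive a UDF would be a Java class, but SQLite's
# create_function shows the same plugin idea in a runnable form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (raw_id TEXT)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [("cust-001",), ("CUST-002",), (" cust-003 ",)])

def normalize_id(raw):
    """A tiny 'UDF' that normalizes customer IDs to one canonical format."""
    return raw.strip().upper()

# Register the Python function so SQL can call it by name.
conn.create_function("normalize_id", 1, normalize_id)

rows = [r[0] for r in conn.execute("SELECT normalize_id(raw_id) FROM events")]
print(rows)  # ['CUST-001', 'CUST-002', 'CUST-003']
```

This is exactly the kind of hook that makes it practical to bring disparate data from many sources into one consistent shape.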
What skill set does a typical Data Engineering team consist of? What size and proportions?
- SQL: SQL is the lingua franca of data engines. Data engineers should be able to express business questions in SQL. Moreover, they should be able to understand how the data engine will interpret and execute SQL queries. They should be able to understand explain plans as well as strengths and weaknesses of the data engine.
- Data Modelling: Data engineers accept data generated by other teams and provide a master copy of all the data to the rest of the company. They should be able to understand the data, guide teams to model it so that the data is accessible to the rest of the company.
- Big Data Engines: ETL engineers should have expertise on at least one data engine like Apache Hive or Apache Spark. In depth knowledge is required if these technologies have to be used at terabyte or petabyte scale.
- Workflow management tools: ETL pipelines consist of complex DAGs (directed acyclic graphs). These DAGs are run by workflow managers like Apache Airflow. Data engineers should be able to use these tools to set up, run, monitor, and evolve ETL pipelines.
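On the SQL point above, "understanding explain plans" is a habit worth showing concretely. EXPLAIN output differs by engine (Hive and Spark each have their own `EXPLAIN`), but the workflow is the same everywhere: ask the engine how it will execute the query, then check whether it does what you expect. A toy illustration with SQLite, using hypothetical table and index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

def plan(sql):
    """Return the engine's query plan as one string."""
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"

before = plan(query)           # without an index: a full table scan
conn.execute("CREATE INDEX idx_cust ON orders(customer_id)")
after = plan(query)            # with an index: an index search

print(before)
print(after)
```

The same before/after comparison, run against a Hive or Spark `EXPLAIN`, is how data engineers verify that partitioning and file layout decisions actually pay off.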
How do aspiring Data Engineers demonstrate their ability to handle the tools, technology, data, and domain?
Rajat Venkatesh: The best qualification is real-world experience. The best data engineers have worked in teams with experts and data at scale. Training does provide basic knowledge, but it is hard to emulate issues that are only seen when data pipelines are managed at scale.
What is a typical day in the life of a Big Data Engineer? What are the different hats they wear?
Rajat Venkatesh: In a typical day, a big data engineer will do the following things –
- Check up on ETL pipelines in production.
- Make changes to existing pipelines for new features or bugs.
- Collaborate with other teams to accept new data.
- Train analysts and data scientists to use data in the data lake.
Are analytical skills, statistics, and machine learning must-have or good-to-have skills for Data Engineers?
Rajat Venkatesh: These skills are good to have. Some basic knowledge is required so that data engineers can help analysts and data scientists use the data. In small companies, data engineers may play the role of analyst as well; in such cases, these skills may be a requirement.
Big Data Solution Space
What kinds of structured and unstructured data do companies have? What sizes are we talking about?
Rajat Venkatesh: Sources of data obviously depend on the industry. "Unstructured data" is a misnomer: there is no unstructured data, all data has some structure. The real challenge is that a data lake can contain data of vastly different structures. Some of the data sources that customers use are GPS, DNA, social media feeds, CRM data, and machine telemetry.
Are there legacy systems that are being replaced? If yes, which legacy skills are being replaced?
Rajat Venkatesh: Qubole provides big data infrastructure on the cloud, so the legacy systems we replace are biased toward Qubole’s technology space. Customers are typically looking to replace databases not designed for big data, or big data systems not designed for public clouds.
What is the big ‘Aha’ moment that you folks have helped a client reach by moving to the Big Data landscape? How did the client assess their ROI?
Rajat Venkatesh: Qubole’s customers are typically experienced in Big Data. Customers are looking to move to public clouds. The Aha moment is when they realize the possibilities of using elasticity of public clouds to dramatically improve performance or reduce costs of their current workloads. The result typically is that the companies run more analysis within their budget and become more data driven.
What is the size of clusters/environments that are being deployed for the clients? What are the production challenges?
Rajat Venkatesh: Our largest customers are running clusters of a few thousand machines each. They are ingesting terabytes of data and managing petabytes in aggregate. Overall Qubole helps customers process close to 750 PB a month using 1 million machines.
The main production challenges are:
- Scale: As with other big data deployments, including on-prem ones, failure rates are higher at scale.
- Higher error rates on public cloud: Network and hardware are much more susceptible to failures than in data centres. This exacerbates the situation at scale, where failures are already high. The solution is to be able to recover from these failures automatically; building systems that do so requires expertise and patience.
- Performance: Network, hardware, and storage bandwidth are slower than in data centres. Customers are used to a specific level of performance, so fundamental architectural changes are required to match it while keeping costs low.