AIRFlow At Scale

Session Agenda

Introduction to Qubole and the Data Team.
Overview of the major types of data pipelines.
How we manage deploys and upgrades of data pipelines in a multi-tenant environment.
How we manage configuration for data pipelines in a multi-tenant environment.
Data quality issues we faced with data ingestion and transformation.
The approach we adopted using Apache Airflow Check operators.
Enhancements we had to make to the Check operators.
Integration of Apache Airflow Check operators with our ETLs.
Challenges faced in developing the alerting framework.
Lessons learned and best practices in using Apache Airflow for data quality checks.

The Data Team at Qubole collects usage and telemetry data from a million machines a month. We run many complex ETL workflows to process this data and provide reports, insights, and recommendations to customers, analysts and data scientists. We use the open-source distribution of Apache Airflow to orchestrate our ETL and process more than 1 terabyte of data daily.

In this talk, we will describe how we have extended Airflow to reduce the operational inefficiencies that arise when managing data pipelines in a multi-tenant environment. We will also explain how we have made the data pipelines more robust by adding data quality checks using Airflow's Check operators.
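
To make the idea of a data quality check concrete, here is a minimal sketch in plain Python of the rule that Airflow's Check operators apply: run a SQL query, take the first row, and fail if any value in it is falsy or if no row comes back. It does not use the Airflow API and is not Qubole's implementation. The names `run_check`, `DataQualityError`, and the `usage_events` table are hypothetical, and an in-memory SQLite database stands in for the real warehouse.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised when a data quality check fails (hypothetical name)."""


def run_check(conn, sql):
    """Run `sql` and fail unless every value in the first row is truthy."""
    row = conn.execute(sql).fetchone()
    if row is None:
        raise DataQualityError(f"query returned no rows: {sql}")
    if not all(bool(value) for value in row):
        raise DataQualityError(f"check failed: {sql} returned {row}")
    return row


# Hypothetical sample data standing in for an ingested telemetry table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (machine_id TEXT, cpu_hours REAL)")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?)",
    [("m-1", 4.0), ("m-2", 7.5), (None, 1.0)],
)

# Passes: the table is not empty, so COUNT(*) is non-zero.
run_check(conn, "SELECT COUNT(*) FROM usage_events")

# Fails: one row has a NULL machine_id, so the comparison evaluates to 0.
try:
    run_check(
        conn,
        "SELECT COUNT(*) = 0 FROM usage_events WHERE machine_id IS NULL",
    )
except DataQualityError as exc:
    print(f"alert: {exc}")
```

In a pipeline, a failing check like the second one would typically stop downstream tasks or trigger an alert, which is where an alerting framework of the kind listed in the agenda would come in.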
