Airflow at Scale

  • Happened on :
    Duration : 60 mins

    The Data Team at Qubole collects usage and telemetry data from a million machines a month. We run many complex ETL workflows to process this data and provide reports, insights, and recommendations to customers, analysts, and data scientists. We use the open source distribution of Apache Airflow to orchestrate our ETL and process more than 1 terabyte of data daily.
    In this talk, we discuss how we extended Airflow to manage the operational inefficiencies that arise when running data pipelines in a multi-tenant environment. We also cover how we made the data pipelines more robust by adding data quality checks using CheckOperators.

    Key Takeaways:

    • Introduction to Qubole and the Data Team.
    • Overview of the major types of data pipelines.
    • How we manage deploys and upgrades of data pipelines in a multi-tenant environment.
    • How we manage configuration for data pipelines in a multi-tenant environment.
    • Data quality issues we faced with data ingestion/transformation.
    • Approach we have adopted using Apache Airflow Check operators.
    • Enhancements we had to make to Check operators.
    • Integration of Apache Airflow Check operators with our ETLs.
    • Challenges faced in developing the alerting framework.
    • Lessons learned and best practices for using Apache Airflow for data quality checks.
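
    To make the idea of a data quality check concrete, here is a minimal sketch. It is not Qubole's implementation and does not use Airflow itself: it uses Python's built-in sqlite3 module to imitate the general contract of an Airflow check operator, where a SQL query returns a single row and the check fails if any value in that row is falsy. The table name, column names, and the run_check helper are hypothetical and exist only for illustration.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised when a data quality check fails."""


def run_check(conn, sql):
    """Run `sql`, which should return one row, and fail if any value is falsy.

    This imitates the pass/fail rule of a SQL check; it is a stand-in,
    not Airflow code.
    """
    row = conn.execute(sql).fetchone()
    if row is None:
        raise DataQualityError(f"Query returned no rows: {sql}")
    if not all(bool(value) for value in row):
        raise DataQualityError(f"Check failed: {sql} returned {row}")
    return row


# Hypothetical telemetry table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (machine_id TEXT, cpu_hours REAL)")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?)",
    [("m-1", 3.5), ("m-2", 1.0), ("m-3", 7.25)],
)

# Passes: the table is non-empty and no machine_id is missing.
passed = run_check(
    conn,
    "SELECT COUNT(*) > 0, SUM(machine_id IS NULL) = 0 FROM usage_events",
)
```

    In a pipeline, a check like this would typically run right after an ingestion or transformation step, so that a failure stops downstream tasks from consuming bad data.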

    Webinar Leaders

    Sreenath Kamath & Sakshi Bansal

    Data Engineers, Qubole