Join Free
Online Orientation Session on

Airflow at Scale

  • About the Webinar

    Duration

    60 mins

    Day

    July 11, 2018

    Time

    3:00 pm

    Who are the Speakers?

    Sreenath Kamath is an experienced Member of Technical Staff with a demonstrated history of working in the ETL industry, with skills in Hive, Spark, Big Data, Linux, Analytics, and C. Sreenath holds a Bachelor of Technology (B.Tech.) in Computer Science from the National Institute of Technology Calicut.

    Sakshi Bansal is a Computer Science Engineering graduate from BITS Pilani with an interest in data science, data structures, and algorithms that dates back to her college days. For the last two years she has worked with the data team at Qubole, where she has made significant contributions to building the company's data streaming platforms and data warehouse.

    Sreenath Kamath & Sakshi Bansal

    Key Takeaways
    • Introduction to Qubole and the Data Team.
    • Overview of the major types of data pipelines.
    • How we manage deploys and upgrades of data pipelines in a multi-tenant environment.
    • How we manage configuration for data pipelines in a multi-tenant environment.
    • Data quality issues we faced with data ingestion/transformation.
    • Approach we have adopted using Apache Airflow Check operators.
    • Enhancements we had to make to Check operators.
    • Integration of Apache Airflow Check operators with our ETLs.
    • Challenges faced in developing the alerting framework.
    • Lessons learned and best practices in using Apache Airflow for data quality checks.

    Session Agenda

    The Data Team at Qubole collects usage and telemetry data from a million machines a month. We run many complex ETL workflows to process this data and provide reports, insights, and recommendations to customers, analysts and data scientists. We use the open-source distribution of Apache Airflow to orchestrate our ETL and process more than 1 terabyte of data daily.

    In this talk, we will discuss how we have extended Airflow to address the operational inefficiencies that arise when managing data pipelines in a multi-tenant environment. We will also cover how we have made the data pipelines more robust by adding data quality checks using CheckOperators.
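
    For readers new to this style of data quality check, here is a minimal sketch of the rule that Airflow's CheckOperator documents: run a SQL query, take the first row, and fail the task if any value in that row is falsy. It is not Qubole's implementation and does not use Airflow itself. It imitates the rule with Python's built-in sqlite3 module, and the table `usage_events`, the helper `run_check`, and the exception `DataQualityError` are hypothetical names chosen for illustration.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised when a check query returns a falsy value (hypothetical name)."""


def run_check(conn, sql):
    """Run `sql`, take the first row, and fail if any value in it is falsy.

    This imitates the pass/fail rule of Airflow's CheckOperator;
    it is an illustration, not the operator itself.
    """
    row = conn.execute(sql).fetchone()
    if row is None:
        raise DataQualityError(f"Query returned no rows: {sql}")
    if not all(bool(value) for value in row):
        raise DataQualityError(f"Check failed: {sql} returned {row}")
    return row


# Hypothetical telemetry table held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (machine_id TEXT, cpu_hours REAL)")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?)",
    [("m-1", 3.5), ("m-2", 1.0), ("m-3", 7.25)],
)

# Passing check: the table is non-empty and no machine_id is NULL.
passing = run_check(
    conn,
    "SELECT COUNT(*) > 0, SUM(machine_id IS NULL) = 0 FROM usage_events",
)

# Failing check: the smallest cpu_hours value is not above 2.
try:
    run_check(conn, "SELECT MIN(cpu_hours) > 2 FROM usage_events")
except DataQualityError as err:
    print(f"Data quality alert: {err}")
```

    In an Airflow DAG, a check like this would typically run as its own task after the ingestion or transformation step, so a failed check stops downstream tasks from consuming bad data.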

    Who Should Attend?



    Students

    Computer Science Graduates

    Aspiring Machine Learning Engineers



    CTOs

    Aspiring Data Analysts

    Aspiring Data Scientists

    Software Engineers