Databricks Data + AI Summit – Data + AI visionaries and thought leaders – June 27-30, 2022

Building the modern data stack on the data lakehouse

The world’s largest data and AI conference returns live, June 27-30 in San Francisco and virtually in our new hybrid format. Four days packed with keynotes by industry visionaries, technical sessions, hands-on training and networking opportunities.

Breakout sessions

Data scientists, data engineers, analysts, developers, researchers and ML practitioners all attend Summit to learn from the world’s leading experts on topics like:

  • Best practices and use cases for Apache Spark™, Delta Lake, MLflow, PyTorch, TensorFlow, dbt™
  • Data engineering for scale, including streaming architectures
  • Advanced SQL analytics and BI using data warehouses and data lakes
  • Data science, including the Python ecosystem
  • Machine learning and deep learning applications, MLOps

Training

Data + AI Summit 2022 training will be held on June 27 and 30, with an expanded curriculum of half-day and full-day in-person and virtual classes. Most training classes include both lecture and hands-on exercises. New certification bundles, which include courses and exams, are also available.

Databricks Lakehouse Overview

Role: All audiences
Format: Virtual, Half Day
Labs: None
Price: FREE

In this course, you’ll discover how the Databricks Lakehouse Platform can help you compete in the world of big data and artificial intelligence. In the first half of the course, we’ll introduce you to foundational concepts in big data, explain key roles and abilities to look for when building data teams, and familiarize you with all parts of a complete data landscape. In the second half, we’ll review how the Databricks Lakehouse Platform can help your organization streamline workflows, break down silos, and make the most of your data.

By the end of the course you will be able to:

  • Explain the characteristics, benefits, and challenges of Big Data
  • Compare and contrast AI, machine learning, and deep learning
  • Summarize organizational challenges in working with Big Data
  • Explain the benefits of the Lakehouse and Delta Lake
  • Describe the functionality of the Unified Data Analytics Platform

Please note: This course provides a high-level overview of big data concepts and the
Databricks Lakehouse platform. It does not contain hands-on labs or technical deep
dives into Databricks functionality.

Prerequisites:

  • No programming experience required
  • No experience with Databricks required

Lakehouse with Delta Lake Deep Dive

Role: All audiences
Format: Virtual, Half Day
Labs: None
Price: FREE

In this course, we will provide a brief overview of data architecture concepts, an introduction to the Lakehouse paradigm, and an in-depth look at Delta Lake features and functionality. You will learn about applying software engineering principles with Databricks as we demonstrate how to build end-to-end OLAP data pipelines using Delta Lake for batch and streaming data. The course also discusses serving data to end users through aggregate tables and Databricks SQL Analytics. Throughout the course, emphasis will be placed on using data engineering best practices with Databricks (a minimal sketch of the batch-plus-streaming pattern appears after the list below).

By the end of the course, you will be able to:

  • Identify the core components of Delta Lake that make a Lakehouse possible.
  • Define commonly used optimizations available in Delta Engine.
  • Build end-to-end batch and streaming OLAP data pipelines using Delta Lake.
  • Make data available for consumption by downstream stakeholders using specified design patterns.
  • Document data at the table level to promote data discovery and cross-team communication.
  • Apply Databricks’ recommended best practices in engineering a single source of truth Delta architecture.
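
The sketch below illustrates the batch-plus-streaming Delta Lake pattern described above. It assumes a Databricks notebook (where `spark` is predefined); all paths and table names are hypothetical, so treat it as an illustration rather than course material.

    from pyspark.sql import functions as F

    # Batch: ingest raw files into a bronze Delta table (hypothetical path/table)
    (spark.read.format("json")
        .load("/mnt/raw/orders")
        .write.format("delta")
        .mode("append")
        .saveAsTable("bronze_orders"))

    # Streaming: incrementally aggregate bronze into a silver table
    (spark.readStream.table("bronze_orders")
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("daily_revenue"))
        .writeStream
        .outputMode("complete")
        .option("checkpointLocation", "/mnt/checkpoints/daily_revenue")
        .toTable("silver_daily_revenue"))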

Prerequisites:

  • Familiarity with data engineering concepts
  • Basic knowledge of Delta Lake core features and use cases

Introduction to Databricks SQL

Role: SQL analysts, Data analysts, Business analysts
Format: Virtual and In-person, Half Day
Labs: Yes
Price: Virtual ($200), In-person ($350)

Meet Databricks SQL and find out how you can achieve high performance while querying directly on your organization’s data lake. Using Databricks SQL, learners will practice writing and visualizing queries. Students will leave this course with the ability to use Databricks SQL to write a variety of queries, create various visualizations, and combine their visualizations into a dashboard that can be shared with others.

By the end of the course you will be able to:

  • Navigate Databricks SQL
  • Write queries in Databricks SQL
  • Visualize query output
  • Produce a dashboard that combines multiple visualizations

NOTE: The course “Data Analysis with Databricks SQL” covers these concepts with additional hands-on exercises and a broader introduction to Databricks, and is more suitable for students preparing to complete the Associate Data Analysis with Databricks certification exam.

Prerequisite:

  • Basic familiarity with ANSI SQL

Data Analysis with Databricks SQL

Role: SQL analysts, Data analysts, Business analysts
Format: Virtual and In-person, Full Day
Labs: Yes
Price: Virtual ($400), In-person ($700)

Meet Databricks SQL and find out how you can achieve high performance while querying directly on your organization’s data lake. Using Databricks SQL, learners will practice writing and visualizing queries. Students will leave this course having created a personal dashboard, complete with parameterized queries and automated alerts.
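
To give a flavor of the queries involved, here is a minimal sketch against a hypothetical `sales` table. In Databricks SQL you would type the SQL directly into the query editor (and parameterize it with {{ parameter }} placeholders); it is wrapped in PySpark below only to keep this document’s examples in a single language.

    # Hypothetical BI-style query; visualizations and alerts attach to its result
    monthly_revenue = spark.sql("""
        SELECT date_trunc('month', order_date) AS month,
               region,
               SUM(amount)                     AS revenue
        FROM   sales                            -- hypothetical table
        WHERE  order_date >= '2022-01-01'
        GROUP  BY 1, 2
        ORDER  BY month, region
    """)
    monthly_revenue.show()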

By the end of the course you will be able to use Databricks SQL to:

  • Write queries that answer specific BI questions
  • Visualize query output
  • Produce a dashboard that combines multiple visualizations
  • Use parameterized queries to customize query output
  • Create alerts

Prerequisites:

  • Basic familiarity with ANSI SQL

Apache Spark™ Programming with Databricks

Role: Data engineers, Data scientists, Machine learning engineers, Data architects
Format: Virtual and In-person, Two full days
Labs: Yes
Price: Virtual ($800), In-person ($1400)

This course uses a case study-driven approach to explore the fundamentals of Spark Programming with Databricks, including Spark architecture, the DataFrame API, query optimization, and Structured Streaming. First, you will become familiar with Databricks and Spark, recognize their major components, and explore datasets for the case study using the Databricks environment. After ingesting data from various file formats, you will process and analyze datasets by applying a variety of DataFrame transformations, column expressions, and built-in functions. Lastly, you will execute streaming queries to process streaming data and highlight the advantages of using Delta Lake.
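
For orientation, the sketch below shows the kind of DataFrame work the course practices: ingesting a file, then applying transformations, column expressions, and built-in functions. The file path and column names are hypothetical.

    from pyspark.sql import functions as F

    # Ingest a CSV dataset (hypothetical path and schema)
    events = (spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("/mnt/case-study/events.csv"))

    # Transform: filter, derive a column, aggregate with built-in functions
    top_users = (events
        .filter(F.col("event_type") == "purchase")
        .withColumn("revenue", F.col("price") * F.col("quantity"))
        .groupBy("user_id")
        .agg(F.sum("revenue").alias("total_revenue"))
        .orderBy(F.desc("total_revenue"))
        .limit(10))

    top_users.show()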

By the end of the course you will be able to:

  • Define the major components of Spark architecture and execution hierarchy
  • Describe how DataFrames are built, transformed, and evaluated in Spark
  • Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark
  • Apply the Structured Streaming API to perform analytics on streaming data
  • Navigate the Spark UI and describe how the Catalyst optimizer, partitioning, and caching affect Spark’s execution performance

Prerequisites:

  • Familiarity with basic SQL concepts (select, filter, group by, join, and others)
  • Beginner programming experience with Python (syntax, conditions, loops, functions)

Performance Tuning on Apache Spark

Role: Data Engineer, ML Engineer, Data Scientist
Format: Virtual and In-person, Full Day
Labs: Yes
Price: Virtual ($400), In-person ($700)

Complete guided challenges as you learn to diagnose and fix poorly performing queries. Using Python or Scala, participants will work through performance problems to uncover solutions and best practices they can apply to their own queries.
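
As an illustration of the diagnose-and-fix loop, the sketch below enables adaptive query execution, broadcasts a small dimension table to avoid a shuffle, and inspects the physical plan. Table names are hypothetical, and the specific fixes taught in class will vary.

    from pyspark.sql import functions as F

    # Adaptive query execution can coalesce shuffle partitions and handle skew
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    facts = spark.table("web_clicks")      # large fact table (hypothetical)
    dims = spark.table("page_metadata")    # small dimension table (hypothetical)

    # Broadcast the small side so the large table is not shuffled
    joined = facts.join(F.broadcast(dims), "page_id")
    joined.explain(mode="formatted")       # inspect the plan before running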

By the end of the course you will be able to:

  • Deconstruct the Spark UI to aid in performance analysis, application debugging, and tuning of Spark applications.
  • Summarize some of the most common performance problems associated with data ingestion and how to mitigate them.
  • Configure a Spark cluster given specific requirements and various factors.

Prerequisites:

  • 6+ months of experience working with the Spark DataFrame API is recommended
  • Intermediate programming experience in Python or Scala

Advanced Data Engineering with Databricks

Role: Data Engineers, BI Analysts, Analytic Engineers, Database Architects, Machine Learning Engineers
Format: Virtual and In-person, Two full days
Labs: Yes
Price: Virtual ($800), In-person ($1400)

In this course, students will build upon their existing knowledge of Apache Spark, Structured Streaming, and Delta Lake to unlock the full potential of the data lakehouse by utilizing the suite of tools provided by Databricks. This course places a heavy emphasis on designs favoring incremental data processing, enabling systems optimized to continuously ingest and analyze ever-growing data. By designing workloads that leverage built-in platform optimizations, data engineers can reduce the burden of code maintenance and on-call emergencies, and quickly adapt production code to new demands with minimal refactoring or downtime. The topics in this course should be mastered prior to attempting the Databricks Certified Data Engineering Professional exam.
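
Here is a minimal sketch of the incremental-ingest pattern the course emphasizes, using Databricks Auto Loader to pick up only newly arrived files. Paths and table names are hypothetical, and the run-once trigger lets the same code serve either scheduled batch or continuous jobs.

    # Incrementally ingest newly arrived files into a bronze Delta table
    (spark.readStream
        .format("cloudFiles")                                  # Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/iot")
        .load("/mnt/landing/iot")
        .writeStream
        .option("checkpointLocation", "/mnt/checkpoints/iot_bronze")
        .trigger(once=True)          # incremental batch; remove for continuous
        .toTable("iot_bronze"))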

By the end of the course you will be able to:

  • Design databases and pipelines optimized for the Databricks Lakehouse Platform.
  • Implement efficient incremental data processing to validate and enrich data driving business decisions and applications.
  • Leverage Databricks-native features for managing access to sensitive data and fulfilling right-to-be-forgotten requests.
  • Manage error troubleshooting, code promotion, task orchestration, and production job monitoring using Databricks tools.

Prerequisites:

  • Comfort using PySpark APIs to perform advanced data transformations
  • Familiarity implementing classes with Python
  • Experience using SQL in production data warehouse or data lake implementations
  • Experience working in Databricks notebooks and configuring clusters
  • Familiarity with creating and manipulating data in Delta Lake tables with SQL
  • Ability to use Spark Structured Streaming to incrementally read from a Delta table

Data Engineering with Databricks

Role: Data Engineers, BI Analysts, Analytic Engineers, Database Architects, Machine Learning Engineers
Format: Virtual and In-person, Two full days
Labs: Yes
Price: Virtual ($800), In-person ($1400)

Data professionals from all walks of life will benefit from this comprehensive introduction to the components of the Databricks Lakehouse Platform that directly support putting ETL pipelines into production. Lessons will familiarize students with the Databricks Data Engineering & Data Science Workspace, Databricks SQL, Delta Live Tables, Databricks Repos, Databricks Task Orchestration, and the Unity Catalog. Students will leverage SQL and Python to define and schedule pipelines that incrementally process new data from a variety of data sources to power analytic applications and dashboards in the Lakehouse.
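
For a flavor of Delta Live Tables, here is a minimal Python sketch (the source path and column names are hypothetical) declaring two incrementally processed tables with a data-quality expectation; DLT handles the orchestration between them.

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders, ingested incrementally")
    def orders_bronze():
        return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/orders"))    # hypothetical landing zone

    @dlt.table(comment="Validated orders for analytics")
    @dlt.expect_or_drop("valid_amount", "amount > 0")   # drop bad records
    def orders_silver():
        return (dlt.read_stream("orders_bronze")
            .withColumn("order_date", F.to_date("order_ts")))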

NOTE: The half-day courses “End-to-End ELT with Spark SQL” and “End-to-End ELT with PySpark” contain a subset of topics from this course, aimed at experienced data practitioners and with fewer hands-on exercises.

By the end of the course you will be able to:

  • Describe how Delta Lake transactional guarantees enable the Lakehouse architecture
  • Design and build databases, tables, and views in the Lakehouse
  • Ingest and enrich data for production applications, machine learning, and ad hoc analytic queries
  • Use Python and Spark SQL to build and deploy production data engineering pipelines
  • Leverage the Databricks platform for code development, workload orchestration, and analytic exploration and dashboarding

Prerequisites:

  • Beginner experience using Spark SQL
  • Beginner experience with Python (preferred)
  • Beginner knowledge of ETL, data warehousing, and data lakes
  • Beginner familiarity with the Databricks workspace

Databricks Platform Administration with Unity Catalog

Role: All audiences
Format: Virtual and In-person, Half Day
Labs: No
Price: Virtual ($200), In-person ($350)

The introduction of Unity Catalog simplifies the process of managing data permissions while empowering admins with new features for data governance, auditing, and sharing. This course instructs students in best practices for leveraging Unity Catalog to configure Databricks, whether you administer a single workspace or an enterprise deployment spanning many cloud regions. Basic platform administration tasks around IAM, ACLs, and workspace configuration will also be covered.
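
The sketch below shows the shape of Unity Catalog permission management (the catalog, schema, and group names are all hypothetical). These are SQL statements; they are wrapped in spark.sql only to keep this document’s examples in one language.

    # Create governance objects and grant a group read access
    spark.sql("CREATE CATALOG IF NOT EXISTS main")
    spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")

    spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-analysts`")
    spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data-analysts`")
    spark.sql("GRANT SELECT ON SCHEMA main.sales TO `data-analysts`")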

By the end of the course, you will be able to:

  • Describe how Unity Catalog fits into the Databricks platform architecture
  • Configure secure access to cloud object storage with Unity Catalog
  • Manage access to data and models with Unity Catalog
  • Configure groups and users in the Databricks workspace
  • Set permissions for groups on workspace assets

Prerequisites:

  • Basic familiarity with SQL
  • Beginner knowledge of concepts related to identity access management
  • Beginner knowledge of the Databricks workspace
  • Beginner familiarity with cloud computing concepts (virtual machines, object storage, etc.)

Advanced Machine Learning with Databricks

Role: Machine learning engineers, data scientists
Format: Virtual and In-person, Two full days
Labs: Yes
Price: Virtual ($800), In-person ($1400)

In this course, students will develop professional-level machine learning engineering skills for use with Databricks. In four separate modules, students will learn to apply the basics of the machine learning workflow, scale out and speed up machine learning pipelines, apply machine learning operations using MLflow, and organize, package, and test end-to-end machine learning applications. By the end of this course, students should be capable of organizing, scaling, and operationalizing machine learning applications using Databricks.

By the end of this course you will be able to:

  • Complete each step of the data science process and the machine learning workflow.
  • Improve the efficiency of machine learning pipelines to streamline machine learning solution development and production.
  • Organize, package, and test end-to-end machine learning applications to ensure their reproducibility and stability.
  • Apply machine learning operations best practices using MLflow.

Prerequisites:

  • Intermediate level experience with Apache Spark (familiarity with Spark architecture and Spark DataFrame API).
  • Intermediate-level experience with Python (familiarity with libraries, iteration, control flow, operators, and classes).
  • Beginning-level knowledge of machine learning (familiarity with definitions, supervised learning vs. unsupervised learning, regression vs. classification, and clustering).

Managing Machine Learning Models

Role: Machine learning engineers, data scientists
Format: Virtual and In-person, Half Day
Labs: Yes
Price: Virtual ($200), In-person ($350)

In this course, learners will begin by describing the basics of Databricks Machine Learning for model management and operations. Next, learners will track the development of machine learning models using MLflow Tracking and Databricks Autologging. Third, learners will manage the model lifecycle using the MLflow Model Registry UI. Finally, learners will close out the course by learning to automate the model lifecycle using MLflow Model Registry Webhooks and Databricks Jobs.
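
A minimal MLflow sketch of the track-then-register flow described above (the model and metric names are hypothetical, and scikit-learn stands in for whatever library the class uses):

    import mlflow
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, random_state=42)

    # Track a training run: parameters, metrics, and the model artifact
    with mlflow.start_run() as run:
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_param("max_iter", 200)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, artifact_path="model")

    # Promote the logged model into the Model Registry
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_classifier")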

By the end of this course you will be able to:

  • Describe the basics of Databricks Machine Learning for model management and operations.
  • Track machine learning model development with MLflow Tracking and Databricks Autologging.
  • Manage the model lifecycle using the MLflow Model Registry.
  • Automate the model lifecycle using MLflow Model Registry Webhooks and Databricks Jobs.

NOTE: The course “Machine Learning with Databricks” covers these concepts with additional hands-on exercises and a broader introduction to Databricks, and is more suitable for students preparing to complete the Databricks Certified Professional Data Scientist exam.

Prerequisites:

  • Intermediate-level experience with Python (familiarity with Python libraries and programming).
  • Beginning-level knowledge of machine learning (simple model development, etc.).
  • Beginning-level experience with Databricks Machine Learning.

Deploying Machine Learning Models

Role: Machine learning engineers, data scientists
Format: Virtual and In-person, Half Day
Labs: Yes
Price: Virtual ($200), In-person ($350)

In this course, learners will begin by comparing and contrasting machine learning model deployment strategies. Next, learners will deploy a machine learning model in a batch environment using MLflow and Spark UDFs. Third, students will deploy a machine learning model in an incrementally processed streaming environment using MLflow and Spark UDFs. Finally, learners will use MLflow Model Serving to deploy a machine learning pipeline for real-time scoring.
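
A minimal sketch of the batch-scoring pattern with MLflow and a Spark UDF (the model and table names are hypothetical); swapping spark.table for a spark.readStream source gives the streaming variant.

    import mlflow

    # Load a registered model as a Spark UDF for distributed scoring
    model_uri = "models:/churn_classifier/Production"
    predict = mlflow.pyfunc.spark_udf(spark, model_uri)

    features = spark.table("customer_features")       # hypothetical table
    feature_cols = [c for c in features.columns if c != "customer_id"]

    scored = features.withColumn("churn_score", predict(*feature_cols))
    scored.write.mode("overwrite").saveAsTable("customer_scores")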

By the end of this course you will be able to:

  • Compare and contrast machine learning deployment strategies.
  • Deploy a machine learning model in a batch environment using MLflow and Spark UDFs.
  • Deploy a machine learning model in an incrementally processed streaming environment using MLflow and Spark UDFs.
  • Deploy a machine learning pipeline in a real-time environment using MLflow Model Serving.

NOTE: The course “Machine Learning with Databricks” covers these concepts with additional hands-on exercises and a broader introduction to Databricks, and is more suitable for students preparing to complete the Databricks Certified Professional Data Scientist exam.

Prerequisites:

  • Intermediate-level experience with PySpark (familiarity with Python libraries and programming, Spark architecture and PySpark DataFrame API).
  • Beginning-level knowledge of and experience in machine learning operations (familiarity with MLflow Model Registry).

End-to-End ELT with Spark SQL

Role: SQL-based data engineers and analytics professionals
Format: Virtual and In-person, Half Day
Labs: Yes
Price: Virtual ($200), In-person ($350)

This course prepares SQL data professionals to leverage the Databricks Lakehouse Platform to productionize ETL pipelines. Students will use Delta Live Tables and Spark SQL to define and schedule pipelines that incrementally process new data from a variety of data sources into the Lakehouse. Students will also orchestrate tasks with Databricks Jobs and promote code with Databricks Repos.

NOTE: The course “Data Engineering with Databricks” covers these concepts with additional hands-on exercises and a broader introduction to Databricks, and is more suitable for students preparing to complete the Databricks Certified Associate Data Engineer exam.

By the end of the course you will be able to:

  • Ingest and enrich data for production applications
  • Use Python and Spark SQL to build and deploy production data engineering pipelines
  • Leverage the Databricks platform for code development and workload orchestration

Prerequisites:

  • Experience building and maintaining production ETL pipelines with SQL
  • Beginner familiarity with cloud computing concepts (virtual machines, object storage, etc.)
  • Production experience working with data warehouses and data lakes
  • Beginner knowledge of the Databricks workspace

End-to-End ELT with PySpark

Role: Data engineers
Format: Virtual and In-person, Half Day
Labs: Yes
Price: Virtual ($200), In-person ($350)

This course prepares Python data professionals to leverage the Databricks Lakehouse Platform to productionize ETL pipelines. Students will use Delta Live Tables and PySpark to define and schedule pipelines that incrementally process new data from a variety of data sources into the Lakehouse. Students will also orchestrate tasks with Databricks Jobs and promote code with Databricks Repos.

NOTE: The course “Data Engineering with Databricks” covers these concepts with additional hands-on exercises and a broader introduction to Databricks, and is more suitable for students preparing to complete the Databricks Certified Associate Data Engineer exam.

By the end of the course you will be able to:

  • Ingest and enrich data for production applications
  • Use PySpark to build and deploy production data engineering pipelines
  • Leverage the Databricks platform for code development and workload orchestration

Prerequisites:

  • Experience building and maintaining production ETL pipelines with PySpark
  • Beginner familiarity with cloud computing concepts (virtual machines, object storage, etc.)
  • Production experience working with data warehouses and data lakes
  • Beginner knowledge of the Databricks workspace

https://databricks.com/dataaisummit/north-america-2022/agenda

https://databricks.com/dataaisummit/north-america-2022

Registration –

https://register.dataaisummit.com/flow/db/nas2021/prodreg/login