Data Teams Unite!
@ItaiYaffe, @RTeveth
Take a look at your data pipeline...
… Now back to me
… Now back at your pipeline
… Now back to me
Sadly, your pipeline isn’t me...
But if you migrate your Airflow-based Spark jobs to Kubernetes...
You can boost your data pipeline like me:
● SAVE $10,000s/month
● GAIN visibility
● MAKE your systems robust
Migrating Airflow-based Spark jobs to K8s - the native way
Roi Teveth, Nielsen
Itai Yaffe, Imply
Introduction
● Itai Yaffe @ItaiYaffe
○ Principal Solutions Architect @ Imply
○ Prev. Big Data Tech Lead @ Nielsen
● Roi Teveth @RTeveth
○ Big Data developer @ Nielsen Identity
○ Kubernetes evangelist
What will you learn?
How to easily migrate your Spark jobs to K8s
to reduce costs, gain visibility and robustness
using Airflow as your workflow management platform
Nielsen Identity
● Data and Measurement company
● Media consumption
● Single source of truth of individuals and households
○ Unifies many proprietary datasets
○ Generates holistic view of a consumer
Nielsen Identity in numbers
● >10B events/day
● 60TB/day (S3)
● 6000s of nodes/day
● 10s of TB ingested/day (Druid)
The challenges
Scalability
Cost Efficiency
Fault-tolerance
Why do we need Airflow?
● Dozens of ETL workflows running around the clock
● Originally used AWS Data Pipeline for workflow management
● But we also wanted:
○ Better visibility of configuration and workflow
○ Better monitoring and statistics
○ Share common configuration/code between workflows
Why do we love Airflow?
● ~20 automatic DAG deployments/day
● ~1000 DAG runs/day
● ~2 years in production
● Met all requirements & more
● ~40 users across 4 groups
● 6 contributions to open-source
Common data pipeline pattern - Airflow DAG
Common data pipeline pattern - high-level architecture
1. Read input files: Data Lake → Data Processing
2. Write output files: Data Processing → Intermediate Storage
3. Ingest to DB: Intermediate Storage → OLAP
Spark clusters
● Available cluster managers
○ Mesos, YARN, Standalone and K8s
● Managed Spark on public clouds
○ AWS EMR, Databricks, GCP Dataproc, etc.
What is EMR?
● EMR is an AWS managed service to run Hadoop & Spark clusters
● Allows you to reduce costs by using Spot instances
● Charges management cost for each instance in a cluster
EMR pricing - example*
Cluster Cost = EC2 Cost + EMR Cost
$1000 = $650 + $350
* Based on current i3.8xlarge Spot pricing. This may vary depending on the region, instance type, etc.
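The split on the pricing slide can be reproduced with a few lines of arithmetic. This is only a sketch: `ClusterBill` is a hypothetical helper (not an AWS API), and the figures are the slide's example numbers, which vary by region and instance type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterBill:
    """A cluster bill split into its EC2 part and its EMR management fee."""

    ec2_cost: float  # what is paid for the (Spot) instances themselves
    emr_cost: float  # the per-instance EMR management fee on top of EC2

    @property
    def total(self) -> float:
        return self.ec2_cost + self.emr_cost

    @property
    def emr_share(self) -> float:
        """Fraction of the total bill that is EMR management fee."""
        return self.emr_cost / self.total


# Example figures taken from the slide: $1000 = $650 + $350
bill = ClusterBill(ec2_cost=650.0, emr_cost=350.0)
print(f"total=${bill.total:.0f}, EMR share={bill.emr_share:.0%}")
```

With these example numbers, roughly a third of the bill is the management fee, which is the part a cluster manager without a per-instance fee would not charge.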
Running Airflow-based Spark jobs on EMR
● EMR has official Airflow support
● Open-source, remember?
○ Allows us to fix existing components
■ EmrStepSensor fixes (AIRFLOW-3297)
○ … As well as add new components
■ AWS Athena Sensor (AIRFLOW-3403)
■ OpenFaaS hook (AIRFLOW-3411)
emr_create_job_flow_operator - creates a new EMR cluster
emr_add_steps_operator - adds a Spark step to the cluster
emr_step_sensor - checks if the step succeeded
This was great...
But we wanted MORE!
$$$ Visibility Robustness
Introducing - Spark-on-Kubernetes
Let’s explain what Kubernetes (a.k.a. K8s) is
● Open source platform for running and managing containerized workloads
● Includes
○ Built-in controllers to support various workloads (e.g. microservices)
○ Additional extensions (called “operators”) to support custom workloads
● Highly scalable
Basic Kubernetes terminology
● Cluster
○ Control plane
○ Worker nodes (EC2 in our case)
● Pods - group of one or more containers (such as Docker containers)
● kubectl - K8s CLI
● The term “operator” exists both in Airflow and in Kubernetes
● Airflow operator
○ Represents a single task
○ Operators determine what is actually executed when your DAG runs
○ Example:
■ bash-operator - executes a bash command
● Kubernetes operator
○ An additional extension to Kubernetes (a non-core Kubernetes controller)
○ Holds the knowledge of how to manage a specific application
○ Example:
■ postgres-operator - defines and manages a PostgreSQL cluster
● Airflow operator != Kubernetes operator
Kubernetes auto-scale
● Phase 1: no applications are running on the cluster
● Phase 2: application #1 starts running
● Phase 3: the cluster scales up as needed
● Phase 4: application #2 starts running
● Phase 5: the cluster scales up as needed
● Phase 6: applications finished running, cluster scales down
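The phases above can be illustrated with a deliberately simplified toy model. It is not how Kubernetes or any real autoscaler decides: real autoscaling schedules pod by pod and considers memory, affinity and many other constraints. The node size and the pod counts below are hypothetical.

```python
import math


def nodes_needed(pod_cpu_requests, cpus_per_node):
    """Toy estimate: fewest nodes whose combined CPU covers all pod requests."""
    return math.ceil(sum(pod_cpu_requests) / cpus_per_node)


CPUS_PER_NODE = 8   # hypothetical worker node size
app1 = [4] * 4      # hypothetical application #1: 4 pods x 4 CPUs
app2 = [4] * 6      # hypothetical application #2: 6 pods x 4 CPUs

# Pod CPU requests present on the cluster at each stage of the walkthrough
phases = {
    "no applications": [],
    "application #1 running": app1,
    "applications #1 and #2 running": app1 + app2,
    "all applications finished": [],
}

node_counts = {
    phase: nodes_needed(pods, CPUS_PER_NODE) for phase, pods in phases.items()
}
for phase, count in node_counts.items():
    print(f"{phase}: {count} worker node(s)")
```

The point of the sketch is only the shape of the curve: worker nodes are added while applications run and removed once they finish, so an idle cluster costs close to nothing beyond the control plane.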
Kubernetes in a nutshell
● A platform for running and managing containerized workloads
● Each cluster has
○ 1 control plane
○ 0..X worker nodes
○ 0..Y pods
○ 0..Z applications running concurrently
● Kubernetes operator != Airflow operator
● Automatically scales out and in
Cool, so… Back to Spark-on-Kubernetes?
Spark-On-Kubernetes overview
● From Spark 2.3.0, K8s is supported as a cluster manager
● No additional management cost per instance
○ You only pay a small fee for the K8s cluster itself (e.g. $60/month on AWS)
● This is still experimental, and some features are missing
○ E.g. Dynamic Resource Allocation and External Shuffle Service
Submitting a Spark application to Kubernetes - alternatives
1. Using spark-submit script
2. Using Spark-On-Kubernetes Operator
Spark-submit example - SparkPi
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///path/to/examples.jar
[Diagram: Kubernetes cluster with the Kubernetes control plane, the SparkPi driver and Executors 1-3]
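When the command is assembled from a script rather than typed by hand, building it as an argument list avoids quoting mistakes. The following is a minimal sketch: `build_spark_submit` is a hypothetical helper, and the API server address and image name are placeholders standing in for the `<...>` values on the slide.

```python
import shlex


def build_spark_submit(master_url, name, main_class, image, executors, app_jar):
    """Return the spark-submit argument list for a cluster-mode run on K8s."""
    conf = {
        "spark.executor.instances": str(executors),
        "spark.kubernetes.container.image": image,
    }
    cmd = [
        "./bin/spark-submit",
        "--master", master_url,
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
    ]
    # Each Spark property becomes one "--conf key=value" pair (no spaces around "=")
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd


cmd = build_spark_submit(
    master_url="k8s://https://k8s.example.com:6443",  # placeholder API server
    name="spark-pi",
    main_class="org.apache.spark.examples.SparkPi",
    image="example-registry/spark:latest",            # placeholder image
    executors=3,
    app_jar="local:///path/to/examples.jar",
)
print(shlex.join(cmd))
```

The list could then be handed to `subprocess.run(cmd, check=True)` from a machine that has a Spark distribution and access to the cluster.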
Spark-On-Kubernetes operator
● A Kubernetes operator
● Extends Kubernetes API to support Spark applications natively
● Built by GCP as an open-source project
github.com/GoogleCloudPlatform/spark-on-k8s-operator
Spark-On-Kubernetes operator example - SparkPi
Spark-pi.yaml:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "spark-pi"
  namespace: default
spec:
  ...
  driver:
    ...
  executor:
    ...
[Diagram: kubectl submits Spark-pi.yaml to the Kubernetes control plane; the Spark-on-K8s operator launches the SparkPi driver and Executors 1-3 on the Kubernetes cluster]
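A manifest like the one above can also be generated programmatically, for example to template it per job. The sketch below builds the object as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. `spark_application` is a hypothetical helper; the `spec` values are illustrative assumptions and do not reproduce the parts elided on the slide, and a real application may need further fields.

```python
import json


def spark_application(name, namespace, spec):
    """Wrap a caller-supplied spec in the SparkApplication envelope from the slide."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


# Illustrative spec only: values are assumptions, and the field set is not complete
example_spec = {
    "type": "Scala",
    "mode": "cluster",
    "image": "<spark-image>",
    "mainClass": "org.apache.spark.examples.SparkPi",
    "mainApplicationFile": "local:///path/to/examples.jar",
    "driver": {"cores": 1, "memory": "512m"},
    "executor": {"instances": 3, "cores": 1, "memory": "512m"},
}

manifest = spark_application("spark-pi", "default", example_spec)
print(json.dumps(manifest, indent=2))
```

Keeping the envelope separate from the spec means the same template can be submitted manually with kubectl or passed to a workflow tool.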
Submitting a Spark application to K8s
Topic | Spark-submit | Spark-On-K8s operator
Airflow built-in integration | V | X
Customize Spark-pods | X* | V
Easy access to Spark UI | X | V
Submit and view application from kubectl | X | V
Integrate it with Airflow
So… we decided to take the road less traveled
github.com/apache/airflow/pull/7163
A special thanks to Airflow committers
@CzerwonyElmo (Kamil Breguła)
@kaxil (Kaxil Naik)
@AshBerlin (Ash Berlin-Taylor)
@higrys (Jarek Potiuk)
Airflow SparkKubernetes integration
● KubernetesHook
● SparkKubernetes operator
● SparkKubernetes sensor
What have we gained by building this integration?
● Official built-in Airflow support
● Security
○ Save Kubernetes credentials inside Airflow connection mechanism
● Portability
○ Use templated Kubernetes object so the same app can be migrated easily to Airflow and also be run manually
● Kubernetes native
○ Communicate directly with the Kubernetes API
Common data pipeline pattern - revised
1. Read input files: Data Lake → Data Processing
2. Write output files: Data Processing → Intermediate Storage
3. Ingest to DB: Intermediate Storage → OLAP
What’s missing?
Connecting the dots… making it production-ready
Visibility
● Spark History Server
○ Each K8s namespace has a dedicated Spark History Server
● Metrics
○ Spark metrics are exposed via JmxSink (github.com/prometheus/jmx_exporter)
○ System metrics are collected using github.com/kubernetes/kube-state-metrics
● Dashboards
○ Aggregating both Spark and system metrics
● Logging
○ All logs are collected with Filebeat to Elasticsearch
● Alerting
○ Airflow callbacks emit metrics which trigger alerts when needed
Robustness
● Running a Spark job on multiple AZs
○ Can be beneficial when using Spot instances (depending on the amount of shuffling)
● AWS Node Termination Handler
○ Allows K8s to gracefully handle events such as EC2 Spot interruptions
○ Open source (github.com/aws/aws-node-termination-handler)
Benefits from migrating to Kubernetes
● ~30% cost reduction
○ No additional cost per instance
● Better visibility
● Robustness
Airflow integration current status
● Will be available in Airflow 2.0
● Can’t wait? Check out the backport package for Airflow 1.10.12
tinyurl.com/y6xb7s3h
So with minimal changes...
You can boost your data pipeline like me:
● SAVE $10,000s/month
● GAIN visibility
● MAKE your systems robust
Want to know more?
● Women in Big Data
○ A world-wide program that aims:
■ To inspire, connect, grow, and champion success of women in the Big Data & analytics field
○ 30+ chapters and 17,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you - https://www.womeninbigdata.org/wibd-structure/
● Our Tech Blog - medium.com/nmc-techblog
○ Spark Dynamic Partition Inserts Part 1 - https://tinyurl.com/yd94ztz5
○ Spark Dynamic Partition Inserts Part 2 - https://tinyurl.com/y8uembml
QUESTIONS
THANK YOU
Roi Teveth
Itai Yaffe
Feedback
Your feedback is important to us. Don’t forget to rate and review the sessions.
