Spark and Cassandra
Anti-Patterns
© DataStax, All Rights Reserved.
Russell Spitzer

Russell (left) and Cara (right)
• Software Engineer
• Spark-Cassandra Integration since Spark 0.9
• Cassandra since Cassandra 1.2
• 3 Year Scala Convert
• Still not comfortable talking about Monads in public
Avoiding the Sharp Edges
•Out of Memory Errors
•RPC Failures
•"It is Slow"
•Serialization
•Understanding what Catalyst does
After working with customers for several years,
most problems boil down to a few common scenarios.
Most Common Performance Pitfall
val there = rdd.map(doStuff).collect()
val backAgain = there.map(otherStuff)
val thereAgain = sc.parallelize(backAgain)
The Hobbit (1977)
OOM, Slow, RPC Failures
There and Back Again:

Don't Collect and Parallelize
Don't Do
val there = rdd.map(doStuff).collect()
val backAgain = there.map(otherStuff)
val thereAgain = sc.parallelize(backAgain)
Instead
val there = rdd
  .map(doStuff)
  .map(otherStuff)
The Hobbit (1977)
Why Not?
1. You are using Spark for a Reason
Lord of the Rings, 2001-2003
Driver JVM: dependable, easy to work with, easy to understand.
But it is not very big, and there is only one.
Your Cluster: the entire reason behind using Spark.
Parallelize
Collect
OOM
Why Not?
2. Moving data between machines is slow
Jim Gray, 

http://loci.cs.utk.edu/dsi/netstore99/docs/presentations/keynote/sld023.htm
The Lord of the Rings, 1978
Why Not?
3. Parallelize sends data in task metadata
parallelize(): List[Dwarves] -> RDD[Dwarves]
ENIAC Programmers, 1946, University of Pennsylvania
Minimum of one Dwarf per Partition
RPC warns on task metadata over 100 KB
scala> val treasure = 1 to 100 map (_ => "x" * 1024)
scala> sc.parallelize(Seq(treasure)).count
WARN 2018-05-21 14:13:08,035 org.apache.spark.scheduler.TaskSetManager:
Stage 0 contains a task of very large size (105 KB). The maximum recommended task size is 100 KB.
res0: Long = 1
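The warning above is easy to reproduce in plain Scala, no Spark required. A rough sketch: the 105 KB the TaskSetManager reports is the raw payload below plus serialization overhead.

```scala
// Recreate the "treasure" payload from the REPL session above:
// 100 strings of 1024 characters each.
val treasure = (1 to 100).map(_ => "x" * 1024)

// The raw character payload alone is already 100 KB. parallelize ships
// the data inside the task description, so with serialization overhead
// the task crosses the 100 KB recommended size and triggers the warning.
val payloadBytes = treasure.map(_.length).sum
println(s"raw payload: ${payloadBytes / 1024} KB")
```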
State accumulated into a single object keeps growing until you run into heap problems.
J.R.R. Tolkien, “Conversation with Smaug” (The Hobbit, 1937)
Keep the work Distributed
Don't Do
val there = rdd.map(doStuff).collect()
val backAgain = there.map(otherStuff)
val thereAgain = sc.parallelize(backAgain)



1. We won't be doing distributed work
2. We end up sending things over the wire
3. Parallelize doesn't handle large objects well
4. We don't need to!
Everyday
val there = rdd
  .map(doStuff)
  .map(otherStuff)
The Hobbit (1977)
Start Distributed if Possible
Other alternatives to Parallelize
Start Data out Distributed (Cassandra, HDFS, S3, …)
The Hobbit (1977)
Predicate Pushdowns Failing!
SELECT * FROM escapes WHERE time = 11-27-1977
No
Pushdown
Slow
What have I got in my pocket?

Make your literals' types explicit!
SELECT * FROM escapes WHERE time = 11-27-1977
Catalyst
No precious
predicate pushdowns
Catalyst Transforms SQL
into Distributed Work
Distributed Work
?
?
?
SO MYSTERY
MUCH MAGIC
Catalyst
SELECT * FROM escapes WHERE time = 11-27-1977
'Project [*]
'Filter ('time = 1977-11-27)
'UnresolvedRelation `test`.`escapes`
Logical Plan Describes
What Needs to Happen
It is transformed
Into a Physical Plan which defines
How it will be accomplished
SELECT * FROM escapes WHERE time = 11-27-1977
*Filter (cast(time#5 as string) = 1977-11-27)
*Scan CassandraSourceRelation test.escapes[time#5,method#6]
ReadSchema: struct<time:date,method:string>
What happened to predicate pushdown?
Catalyst Needs to make
Types Match
'1977-11-27': this is a string?
time#5: this is a date?
Cast(time#5 as String)
MAKE THEM BOTH STRINGS
Functions Cannot be Pushed to Datasources
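A toy model of that decision in plain Scala (these are not Catalyst's real classes, just a sketch of the rule): a datasource can serve "bare column = literal", but once a function such as a cast wraps the column, the filter must stay Spark-side.

```scala
// Toy expression tree modeling the pushdown rule described above.
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: String) extends Expr
case class Cast(child: Expr, to: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

def canPushDown(filter: Expr): Boolean = filter match {
  case EqualTo(Col(_), Lit(_)) => true   // bare column vs literal: pushable
  case _                       => false  // anything wrapping the column: Spark-side filter
}

// Implicit compare: Catalyst casts the *column* to match the string literal.
val implicitCompare = EqualTo(Cast(Col("time"), "string"), Lit("1977-11-27"))
// Explicit compare: the literal was cast up front, so the column stays bare.
val explicitCompare = EqualTo(Col("time"), Lit("1977-11-27"))

println(canPushDown(implicitCompare))  // the cast blocks the pushdown
println(canPushDown(explicitCompare))  // the bare column can be pushed
```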
Let's try again with
explicitly typed literals
SELECT * FROM test.escapes
WHERE time = cast('1977-11-27' as date)
Catalyst
Hmmmm….
Distributed Work
?
?
?
SO MYSTERY
MUCH MAGIC
Catalyst
SELECT * FROM test.escapes WHERE time = cast('1977-11-27' as date)
Catalyst Transforms SQL
into Distributed Work
Same Logical Plan
SELECT * FROM escapes WHERE time = 11-27-1977
'Project [*]
'Filter ('time = 1977-11-27)
'UnresolvedRelation `test`.`escapes`
Transform
Different Physical Plan
SELECT * FROM escapes WHERE time = 11-27-1977
*Scan CassandraSourceRelation test.escapes[time#5,method#6]
PushedFilters: [*EqualTo(time,1977-11-27)],
ReadSchema: struct<time:date,method:string>
1. PushedFilters is populated
2. There is no Spark Side Filter at all
*Means that the Filter is Handled By the Datasource and not Catalyst
Successful Pushdown
SELECT * FROM test.escapes
WHERE time = cast('1977-11-27' as date)
Catalyst
+----------+-----------------------------+
|time      |method                       |
+----------+-----------------------------+
|1977-11-27|Ask a totally not fair riddle|
+----------+-----------------------------+
Writing to X is Slow
Slow
Bad Resource
Utilization
RDD.foreach(x => SlowIO(x))
You shall not pass!

Concurrency in Spark
Functions are applied to iterators
Iterator[Balrog]
.map( balrog => moveAcrossBridge(balrog))
No other elements will have a
function applied to them until
the current element is done
One Item is Processed at a Time
Native Spark Parallelism is
Based on Cores
Core 1: Iterator[Balrog].map(balrog => moveAcrossBridge(balrog))
Core 2: Iterator[Balrog].map(balrog => moveAcrossBridge(balrog))
Core 3: Iterator[Balrog].map(balrog => moveAcrossBridge(balrog))
Max Number of Balrogs
crossing in parallel
is limited by the number
of cores
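A quick stdlib check of that claim (no Spark needed): Iterator stages pull one element at a time, so work on element n+1 never starts before element n has finished.

```scala
// Record the processing order to show that Iterator.map is sequential:
// each element is fully processed before the next one is even touched.
val order = scala.collection.mutable.ArrayBuffer.empty[String]

Iterator("balrog1", "balrog2", "balrog3")
  .map { b => order += s"start-$b"; order += s"end-$b"; b }
  .foreach(_ => ())  // drain the iterator

println(order.mkString(", "))
```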
Increase Parallelism
without Increasing Cores
Iterator[Balrog]
  .map(balrog => moveAcrossBridge(balrog))      // Slow, Bottleneck
  .foreach(balrog => balrog.eat(nearestHobbit)) // Fast
Grouping or Futures
Iterator[Balrog]
Process in groups:
grouped.map(balrogGroup => moveGroup(balrogGroup))
Slow elements will slow down the group
Return Futures:
map(balrog => asyncMove(balrog))
Iterator[Future[MovedBalrog]]
Still need to draw multiple elements at a time (if not foreach)
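A minimal stdlib sketch of the Futures variant, with a hypothetical `slowIO` standing in for a blocking datastore call: start the calls concurrently, then block once for the whole batch instead of once per element.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for a slow I/O call (e.g. a datastore write).
def slowIO(x: Int): Int = { Thread.sleep(50); x * 2 }

// Map every element to a Future so many calls are in flight at once,
// then gather the group with Future.sequence and wait a single time.
val results = Await.result(
  Future.sequence((1 to 8).map(x => Future(slowIO(x)))),
  30.seconds)

println(results.mkString(", "))
```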
DSE Spark Connector's Sliding Iterator
Buffer Futures
  /** Prefetches a batchSize of elements at a time **/
  protected def slidingPrefetchIterator[T](it: Iterator[Future[T]], batchSize: Int): Iterator[T] = {
    val (firstElements, lastElement) = it
      .grouped(batchSize)     // Group
      .sliding(2)             // Sliding(2): keep one buffered group ahead
      .span(_ => it.hasNext)
    (firstElements.map(_.head) ++ lastElement.flatten).flatten.map(_.get)  // Flatten, get
  }
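A stdlib adaptation of the same prefetch pattern, as a sketch: the connector's version calls `.get` on its own future type, so here we `Await` on Scala Futures instead. Because the futures are created lazily from the source iterator, the consumer stays roughly two batches ahead of itself.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Prefetches a batchSize of elements at a time (stdlib-Future variant).
def slidingPrefetchIterator[T](it: Iterator[Future[T]], batchSize: Int): Iterator[T] = {
  val (firstElements, lastElement) = it
    .grouped(batchSize)     // group futures into batches
    .sliding(2)             // keep a window of two groups buffered
    .span(_ => it.hasNext)  // split off the final window
  (firstElements.map(_.head) ++ lastElement.flatten)
    .flatten
    .map(f => Await.result(f, Duration.Inf))
}

// Usage: five "async requests", prefetched two at a time, results in order.
val requests = (1 to 5).iterator.map(x => Future { x * 10 })
val results = slidingPrefetchIterator(requests, batchSize = 2).toSeq

println(results.mkString(", "))
```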
Slow Transformations
Slow
Bad Resource
Utilization
rdd.cache.map.cache.map.cache.map
My Precious!

Don't Cache without Reuse
The Hobbit, 1966
Cache is not Free
scala> time(sc.cassandraTable("ks", "test").map(r => r).count)
Elapsed time: 35.773478836
res54: Long = 15436998
scala> time(sc.cassandraTable("ks", "test").map(r => r).cache.count)
Elapsed time: 58.657585144
res55: Long = 15436998
When does Caching for Resilience Make Sense?
Let's MATH
Let's assume our Shuffle/Read partially fails 1/10 times

Cache costs c seconds
Normal run costs r seconds
Failures happen at a rate of f

If (c + r < r + r * f)
If (c < r * f)
If (c / r < f)

Caching helps us out
For our example, caching is worth it only
if (f > 0.6),
since (c / r < f) is when caching helps us out
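The slide's 0.6 can be checked against the earlier benchmark timings: c is the extra cost the `.cache` run added, r is the plain run.

```scala
// Break-even failure rate for caching, from the timings shown earlier:
// without a cache, a failure costs a re-run (expected cost r + f * r);
// with a cache we pay c up front. Caching wins when c + r < r + f * r,
// i.e. when f > c / r.
val plainRun  = 35.77 // seconds: count without cache
val cachedRun = 58.66 // seconds: count with .cache
val r = plainRun
val c = cachedRun - plainRun

val breakEven = c / r
println(f"caching pays off only when the failure rate f > $breakEven%.2f")
```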
My Precious!

Why is Caching so Expensive?
1. Serialize Everything
2. Hold all the data at once
3. Expensive disk access
But it's so pretty
Cache only when:
1. Your pre-cache computation is very, very, very expensive
2. You are re-using the data
What have we learned?
Don't Do
Parallelize and Collect
Rely on Spark to infer the types of our literals
Do slow blocking actions in map/foreach
Cache all the time
Instead
Keep work in distributed actions
Specify our types
Do concurrent actions when it makes sense
Cache only when we re-use data
The Hobbit (1977)
DSE 6
Thank you
© DataStax, All Rights Reserved.
http://www.russellspitzer.com/

@RussSpitzer
Come chat with us at DataStax Academy: 

https://academy.datastax.com/slack

Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatterns with Russell Spitzer