1 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Apache	Spark	and	Object	Stores	
—What	you	need	to	know
Steve	Loughran
stevel@hortonworks.com	
@steveloughran
October	2016
Steve Loughran,
Hadoop committer, PMC member, …
Chris Nauroth,
Apache Hadoop committer & PMC
ASF member
Rajesh Balamohan
Tez Committer, PMC Member
Elastic ETL
[Diagram: inbound and external datasets, HDFS, ORC and Parquet datasets]
Notebooks
[Diagram: external datasets, library]
Streaming
A Filesystem: Directories, Files → Data
/
  work
    pending
      part-00
      part-01
    complete
      part-01
[Diagram: three copies each of data blocks 00 and 01]
rename("/work/pending/part-01", "/work/complete")
Object Store: hash(name)->blob
[Diagram: blobs 00 and 01 spread across servers s01, s02, s03, s04]
hash("/work/pending/part-00")
["s01", "s02", "s04"]
hash("/work/pending/part-01")
["s02", "s03", "s04"]
copy("/work/pending/part-01", "/work/complete/part-01")
delete("/work/pending/part-01")
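The copy() + delete() pattern above can be sketched with a toy in-memory store. This is only an illustration, not how any real connector is implemented: `ToyObjectStore`, `bytesCopied` and `ObjectStoreRenameDemo` are hypothetical names, and the blob sizes are made up. It shows that "renaming a directory" touches every object under the prefix and copies every byte.

```scala
import scala.collection.mutable

// Toy in-memory object store: a flat map from full object name to blob.
// There are no directories, only names that happen to share a prefix.
final class ToyObjectStore {
  private val blobs = mutable.Map[String, Array[Byte]]()
  var bytesCopied: Long = 0L // total bytes moved by copy() calls

  def put(name: String, data: Array[Byte]): Unit = blobs.update(name, data)

  def exists(name: String): Boolean = blobs.contains(name)

  // Listing is a prefix scan over all names, materialized before any mutation.
  def list(prefix: String): Seq[String] =
    blobs.keysIterator.filter(_.startsWith(prefix)).toList.sorted

  def copy(src: String, dest: String): Unit = {
    val data = blobs(src)
    blobs.update(dest, data.clone())
    bytesCopied += data.length
  }

  def delete(name: String): Unit = { blobs.remove(name); () }

  // "rename" of a directory: one copy and one delete per object under the prefix.
  // Nothing here is atomic; a failure midway leaves both prefixes populated.
  def rename(srcPrefix: String, destPrefix: String): Unit =
    list(srcPrefix).foreach { name =>
      copy(name, destPrefix + name.stripPrefix(srcPrefix))
      delete(name)
    }
}

object ObjectStoreRenameDemo {
  def run(): ToyObjectStore = {
    val store = new ToyObjectStore
    store.put("/work/pending/part-00", new Array[Byte](1024)) // sizes are arbitrary
    store.put("/work/pending/part-01", new Array[Byte](2048))
    store.rename("/work/pending/", "/work/complete/")
    store
  }

  def main(args: Array[String]): Unit = {
    val store = run()
    println(store.list("/work/"))
    println(s"bytes copied: ${store.bytesCopied}")
  }
}
```

In a filesystem the same rename is a single metadata operation; here the cost grows with the number of objects and their total size.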
REST APIs
[Diagram: blobs 00 and 01 spread across servers s01, s02, s03, s04]
HEAD /work/complete/part-01
PUT /work/complete/part-01
x-amz-copy-source: /work/pending/part-01
DELETE /work/pending/part-01
PUT /work/pending/part-01
... DATA ...
GET /work/pending/part-01
Range: bytes=1-8192
GET /?prefix=/work&delimiter=/
Often: Eventually Consistent
[Diagram: blobs 00 and 01 spread across servers s01, s02, s03, s04]
DELETE /work/pending/part-00
200
GET /work/pending/part-00
200
GET /work/pending/part-00
200
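A GET that still returns 200 after a DELETE can be reproduced with a toy replicated store. This is a simplified model under stated assumptions, not a description of how S3 replicates data: `ToyReplicatedStore` and `propagate()` are hypothetical, and the rule "replica 0 sees the delete first" is invented for the demonstration.

```scala
import scala.collection.mutable

// Toy model: every replica has its own copy of the name -> blob map.
// A DELETE reaches replica 0 at once and the others only when propagate() runs.
final class ToyReplicatedStore(replicaCount: Int) {
  require(replicaCount >= 1, "need at least one replica")

  private val replicas = Vector.fill(replicaCount)(mutable.Map[String, String]())
  private val pendingDeletes = mutable.Queue[(Int, String)]()

  def put(name: String, data: String): Unit =
    replicas.foreach(r => r.update(name, data))

  def delete(name: String): Unit = {
    replicas.head.remove(name)
    for (i <- 1 until replicaCount) pendingDeletes.enqueue((i, name))
  }

  // Returns an HTTP-like status code for a read served by the given replica.
  def get(replica: Int, name: String): Int =
    if (replicas(replica).contains(name)) 200 else 404

  // Apply all queued deletes: the store has now converged.
  def propagate(): Unit =
    while (pendingDeletes.nonEmpty) {
      val (i, name) = pendingDeletes.dequeue()
      replicas(i).remove(name)
    }
}

object EventualConsistencyDemo {
  def main(args: Array[String]): Unit = {
    val store = new ToyReplicatedStore(3)
    val name = "/work/pending/part-00"
    store.put(name, "data")
    store.delete(name)
    println(store.get(1, name)) // stale replica still serves the object
    store.propagate()
    println(store.get(1, name)) // converged
  }
}
```

Code that lists or reads a path straight after changing it has to tolerate the stale answer.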
org.apache.hadoop.fs.FileSystem
hdfs s3a wasb adl swift gs
History of Object Storage Support
[Timeline: 2006, 2008, 2013, 2014, 2015, 2016]
s3:// “inode on S3”
s3n:// “Native” S3
s3a:// Replaces s3n
swift:// OpenStack
wasb:// Azure WASB
s3a:// Stabilize
oss:// Aliyun
gs:// Google Cloud
s3a:// Speed and consistency
adl:// Azure Data Lake
s3:// Amazon EMR S3
Cloud Storage Connectors
Azure
  WASB
    ● Strongly consistent
    ● Good performance
    ● Well-tested on applications (incl. HBase)
  ADL
    ● Strongly consistent
    ● Tuned for big data analytics workloads
Amazon Web Services
  S3A
    ● Eventually consistent - consistency work in progress by Hortonworks
    ● Performance improvements in progress
    ● Active development in Apache
  EMRFS
    ● Proprietary connector used in EMR
    ● Optional strong consistency for a cost
Google Cloud Platform
  GCS
    ● Multiple configurable consistency policies
    ● Currently Google open source
    ● Good performance
    ● Could improve test coverage
Four	Challenges
1. Classpath
2. Credentials
3. Code
4. Commitment
Let's look at S3 and Azure
Use	S3A	to	work	with	S3	
(EMR: use	Amazon's	s3://	)
Classpath:	fix	“No	FileSystem for	scheme:	s3a”
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
See SPARK-7481
Get Spark with
Hadoop 2.7+ JARs
Credentials
core-site.xml or spark-defaults.conf
spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY
spark-submit automatically propagates environment variables
export AWS_ACCESS_KEY_ID=MY_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=MY_SECRET_KEY
NEVER: share, check in to SCM, paste in bug reports…
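A minimal sketch of keeping the keys out of source code: read them from the environment and turn them into the `spark.hadoop.fs.s3a.*` options shown above. It assumes the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; `S3ACredentials`, `fromEnvironment` and `redacted` are hypothetical helpers, not part of Spark or Hadoop.

```scala
object S3ACredentials {
  // Maps the AWS environment variables onto the spark.hadoop.fs.s3a.* options.
  // Returns an empty map unless both variables are present.
  def fromEnvironment(env: Map[String, String]): Map[String, String] = {
    val props = for {
      access <- env.get("AWS_ACCESS_KEY_ID")
      secret <- env.get("AWS_SECRET_ACCESS_KEY")
    } yield Map(
      "spark.hadoop.fs.s3a.access.key" -> access,
      "spark.hadoop.fs.s3a.secret.key" -> secret)
    props.getOrElse(Map.empty[String, String])
  }

  // Redacted view, safe to print or paste into a bug report.
  def redacted(props: Map[String, String]): Map[String, String] =
    props.map { case (key, _) => key -> "****" }

  def main(args: Array[String]): Unit = {
    val props = fromEnvironment(sys.env)
    // With Spark on the classpath the options could be applied with:
    //   props.foreach { case (k, v) => sparkConf.set(k, v) }
    println(redacted(props))
  }
}
```

Only the redacted map is ever printed; the real values stay in memory.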
Authentication	Failure:	403
com.amazonaws.services.s3.model.AmazonS3Exception:
The request signature we calculated does not match
the signature you provided.
Check your key and signing method.
1. Check joda-time.jar & JVM version
2. Credentials wrong
3. Credentials not propagating
4. Local system clock (more likely on VMs)
Code:	Basic	IO
// Read in public dataset
val lines = sc.textFile("s3a://landsat-pds/scene_list.gz")
val lineCount = lines.count()
// generate and write data
val numbers = sc.parallelize(1 to 10000)
numbers.saveAsTextFile("s3a://hwdev-stevel-demo/counts")
All you need is the URL
Code:	just	use	the	URL	of	the	object	store
val csvData = spark.read.options(Map(
"header" -> "true",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
...read time O(distance)
DataFrames
val landsat = "s3a://stevel-demo/landsat"
csvData.write.parquet(landsat)
val landsatOrc = "s3a://stevel-demo/landsatOrc"
csvData.write.orc(landsatOrc)
val df = spark.read.parquet(landsat)
val orcDf = spark.read.orc(landsatOrc)
Finding	dirty	data	with	Spark	SQL	
val sqlDF = spark.sql(
"SELECT id, acquisitionDate, cloudCover"
+ s" FROM parquet.`${landsat}`")
val negativeClouds = sqlDF.filter("cloudCover < 0")
negativeClouds.show()
* filter columns and data early
* whether/when to cache()?
* copy popular data to HDFS
spark-defaults.conf
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
Notebooks? Classpath & Credentials
The	Commitment	Problem
⬢ rename() used	for	atomic	commitment	transaction
⬢ time	to	copy()	+	delete()	proportional	to	data	*	files
⬢ S3:	6+	MB/s	
⬢ Azure:	a	lot	faster	—usually
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
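A back-of-the-envelope sketch of why rename-based commits hurt on S3. The 6 MB/s figure is the one quoted above; the additive cost model and the per-file overhead parameter are simplifying assumptions for illustration, and `CommitCostEstimate` is a hypothetical helper.

```scala
object CommitCostEstimate {
  // Copy throughput in MB/s, taken from the "S3: 6+ MB/s" figure above.
  val S3CopyMBPerSecond: Double = 6.0

  // Rough time for a copy() + delete() "rename" of job output.
  // perFileOverheadSeconds is an assumed fixed request cost per object.
  def renameSeconds(
      totalMB: Double,
      files: Int,
      perFileOverheadSeconds: Double = 0.0,
      mbPerSecond: Double = S3CopyMBPerSecond): Double =
    totalMB / mbPerSecond + files * perFileOverheadSeconds

  def main(args: Array[String]): Unit = {
    // 6000 MB of output in 100 files, ignoring per-file overhead
    println(renameSeconds(totalMB = 6000, files = 100))
    // same output, assuming half a second of request overhead per file
    println(renameSeconds(totalMB = 6000, files = 100, perFileOverheadSeconds = 0.5))
  }
}
```

At these rates a few gigabytes of output add many minutes to a job, all spent after the computation itself has finished.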
What about Direct Output Committers?
Recent	S3A	Performance	(Hadoop	2.8,	HDP	2.5,	CDH	5.9	(?))
// forward seek by skipping stream
spark.hadoop.fs.s3a.readahead.range 157810688
// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random
// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true
Azure	Storage:	wasb://	
A	full	substitute	for	HDFS
Classpath:	fix	“No	FileSystem for	scheme:	wasb”
wasb:// :	Consistent,	with	very	fast	rename	(hence:	commits)
hadoop-azure-2.7.x.jar
azure-storage-2.2.0.jar
+ (jackson-core; http-components, hadoop-common)
Credentials: core-site.xml / spark-defaults.conf
<property>
<name>fs.azure.account.key.example.blob.core.windows.net</name>
<value>0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c</value>
</property>
spark.hadoop.fs.azure.account.key.example.blob.core.windows.net
0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c
wasb://demo@example.blob.core.windows.net
Example:	Azure	Storage	and	Streaming
val streaming = new StreamingContext(sparkConf,Seconds(10))
val azure = "wasb://demo@example.blob.core.windows.net/in"
val lines = streaming.textFileStream(azure)
val matches = lines.map(line => {
println(line)
line
})
matches.print()
streaming.start()
* PUT into the streaming directory
* keep the dir clean
* size window for slow scans
Not	Covered
⬢ Partitioning/directory	layout
⬢ Infrastructure	Throttling
⬢ Optimal	path	names
⬢ Error	handling
⬢ Metrics
Summary
⬢ Object	Stores	look	just	like	any	other	URL
⬢ …but	do	need	classpath	and	configuration
⬢ Issues:	performance,	commitment
⬢ Use	Hadoop	2.7+	JARs
⬢ Tune	to	reduce	I/O
⬢ Keep	those	credentials	secret!
Spark Summit EU talk by Steve Loughran
Backup	Slides
Dependencies	in	Hadoop	2.8
hadoop-aws-2.8.x.jar
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
hadoop-azure-2.8.x.jar
azure-storage-4.2.0.jar
S3	Server-Side	Encryption
⬢ Encryption	of	data	at	rest	at	S3
⬢ Supports	the	SSE-S3	option:	each	object	encrypted	by	a	unique	key	
using	AES-256	cipher
⬢ Now	covered	in	S3A	automated	test	suites
⬢ Support	for	additional	options	under	development	(SSE-KMS	and	SSE-C)
Advanced	authentication
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
com.amazonaws.auth.InstanceProfileCredentialsProvider,
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
</value>
</property>
+ encrypted credentials in JCEKS files on HDFS
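The provider list above is, as far as I understand it, tried in order until one provider supplies credentials. The idea can be sketched in plain Scala; this is a toy model rather than the S3A implementation, and `Credentials`, `Provider`, `resolve` and the three sample providers are all hypothetical stand-ins for the real classes.

```scala
object CredentialChainSketch {
  final case class Credentials(accessKey: String, secretKey: String)

  // A provider either yields credentials or declines with None.
  type Provider = () => Option[Credentials]

  // Stand-in for a provider that reads keys from the Hadoop configuration.
  def simple(conf: Map[String, String]): Provider = () =>
    for {
      access <- conf.get("fs.s3a.access.key")
      secret <- conf.get("fs.s3a.secret.key")
    } yield Credentials(access, secret)

  // Stand-in for a provider that reads the AWS environment variables.
  def environment(env: Map[String, String]): Provider = () =>
    for {
      access <- env.get("AWS_ACCESS_KEY_ID")
      secret <- env.get("AWS_SECRET_ACCESS_KEY")
    } yield Credentials(access, secret)

  // Stand-in for anonymous access: always succeeds, with empty keys.
  val anonymous: Provider = () => Some(Credentials("", ""))

  // Walk the chain lazily and stop at the first provider that answers.
  def resolve(chain: Seq[(String, Provider)]): Option[(String, Credentials)] =
    chain.iterator
      .map { case (name, provider) => provider().map(found => (name, found)) }
      .collectFirst { case Some(hit) => hit }

  def main(args: Array[String]): Unit = {
    val chain = Seq(
      "simple" -> simple(Map.empty),
      "environment" -> environment(sys.env),
      "anonymous" -> anonymous)
    // Print only which provider answered, never the keys themselves.
    println(resolve(chain).map(_._1))
  }
}
```

Order matters: putting the anonymous provider last means it is used only when nothing else supplies keys.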
What Next? Performance and integration
Next	Steps	for	all Object	Stores
⬢ Output	Committers
– Logical	commit	operation	decoupled	from	rename	(non-atomic	and	costly	in	object	stores)
⬢ Object	Store	Abstraction	Layer
– Avoid	impedance	mismatch	with	FileSystem API
– Provide	specific	APIs	for	better	integration	with	object	stores:	saving,	listing,	copying
⬢ Ongoing	Performance	Improvement
⬢ Consistency
