Matei Zaharia
@matei_zaharia
Simplifying Big Data in
Apache Spark 2.0
A Great Year for Apache Spark
Meetup members: 66K in 2015, 225K in 2016
Developers contributing: 600 in 2015, 1100 in 2016
New major version: 2.0
About Spark 2.0
Remains highly compatible with 1.x
Builds on key lessons and simplifies the API
2000 patches from 280 contributors
What’s Hard About Big Data?
Complex combination of processing tasks, storage systems & modes
• ETL, aggregation, machine learning, streaming, etc.
Hard to get both productivity and performance
Apache Spark’s Approach
Unified engine
• Express entire workflow in one API
• Connect existing libraries & storage
High-level APIs with space to optimize
• RDDs, DataFrames, ML pipelines
Libraries: SQL, Streaming, ML, Graph, …
New in 2.0
Structured API improvements
(DataFrame, Dataset, SQL)
Whole-stage code generation
Structured Streaming
Simpler setup (SparkSession)
SQL 2003 support
MLlib model persistence
MLlib R bindings
SparkR user-defined functions
…
Original Spark API
Arbitrary Java functions on Java objects
+ Can organize your app using functions, classes and types
– Difficult for the engine to optimize
• Inefficient in-memory format
• Hard to do cross-operator optimizations
val lines = sc.textFile("s3://...")
val points = lines.map(line => new Point(line))
Structured APIs
New APIs for data with a fixed schema (table-like)
• Efficient storage taking advantage of schema (e.g. columnar)
• Operators take expressions in a special DSL that Spark can optimize
DataFrames (untyped), Datasets (typed), and SQL
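To make the storage point concrete, here is a minimal plain-Python sketch (not Spark code) that contrasts a list of arbitrary objects with a column-oriented layout for the same fixed-schema records. The `Event` class, its field names, and the sample values are assumptions made up for illustration.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical record type; the field names are illustrative only.
    uid: int
    status: str
    duration: float


# Row-oriented: one Python object per record, opaque to any engine.
rows = [
    Event(1, "OK", 12.0),
    Event(2, "ERR", 30.0),
    Event(1, "ERR", 18.0),
]

# Column-oriented: one compact array per field, possible because the schema is fixed.
columns = {
    "uid": array("q", (e.uid for e in rows)),
    "status": [e.status for e in rows],
    "duration": array("d", (e.duration for e in rows)),
}


def avg_duration_rows(events):
    # Visits every object even though only one field is needed.
    return sum(e.duration for e in events) / len(events)


def avg_duration_columns(cols):
    # Reads a single contiguous column and ignores the others.
    durations = cols["duration"]
    return sum(durations) / len(durations)


row_avg = avg_duration_rows(rows)
col_avg = avg_duration_columns(columns)
```

Both functions return the same average; the columnar version simply never looks at `uid` or `status`, which is the kind of saving a known schema makes possible.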
Structured API Example
DataFrame API:
events = sc.read.json("/logs")
stats = events.join(users).groupBy("loc", "status").avg("duration")
errors = stats.where(stats.status == "ERR")
Optimized Plan:
SCAN logs, SCAN users → JOIN → AGG → FILTER
Specialized Code:
while (logs.hasNext) {
  e = logs.next
  if (e.status == "ERR") {
    u = users.get(e.uid)
    key = (u.loc, e.status)
    sum(key) += e.duration
    count(key) += 1
  }
}
...
Structured API Example
DataFrame API:
events = sc.read.json("/logs")
stats = events.join(users).groupBy("loc", "status").avg("duration")
errors = stats.where(stats.status == "ERR")
Optimized Plan:
FILTERED SCAN logs, SCAN users → JOIN → AGG
Specialized Code:
while (logs.hasNext) {
  e = logs.next
  if (e.status == "ERR") {
    u = users.get(e.uid)
    key = (u.loc, e.status)
    sum(key) += e.duration
    count(key) += 1
  }
}
...
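The plan rewrite shown across the two slides above can be sketched in plain Python. This is not Spark code and not what Spark actually generates; the sample `logs` and `users` records and the function names are assumptions. The sketch only illustrates that filtering on `status` before the join and aggregation gives the same answer as filtering afterwards (valid here because `status` is one of the grouping keys), using a single loop shaped like the specialized code on the slide.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the "logs" and "users" tables.
logs = [
    {"uid": 1, "status": "ERR", "duration": 10.0},
    {"uid": 2, "status": "OK", "duration": 5.0},
    {"uid": 1, "status": "ERR", "duration": 30.0},
    {"uid": 2, "status": "ERR", "duration": 8.0},
]
users = {1: {"loc": "NL"}, 2: {"loc": "BE"}}


def naive_plan(logs, users):
    """SCAN -> JOIN -> AGG -> FILTER, one operator at a time."""
    joined = [(users[e["uid"]]["loc"], e["status"], e["duration"]) for e in logs]
    groups = defaultdict(list)
    for loc, status, duration in joined:
        groups[(loc, status)].append(duration)
    stats = {key: sum(v) / len(v) for key, v in groups.items()}
    return {key: avg for key, avg in stats.items() if key[1] == "ERR"}


def fused_plan(logs, users):
    """FILTERED SCAN -> JOIN -> AGG fused into one loop, as on the slide."""
    total = defaultdict(float)
    count = defaultdict(int)
    for e in logs:
        if e["status"] == "ERR":      # filter pushed down to the scan
            u = users[e["uid"]]       # join as a hash lookup
            key = (u["loc"], e["status"])
            total[key] += e["duration"]
            count[key] += 1
    return {key: total[key] / count[key] for key in total}


naive_result = naive_plan(logs, users)
fused_result = fused_plan(logs, users)
```

The fused version never joins or aggregates the `OK` row, which is the saving the optimizer is after.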
New in 2.0
Whole-stage code generation
• Fuse across multiple operators
• Optimized Parquet I/O
Throughput: 14M rows/s in Spark 1.6, 125M rows/s in Spark 2.0
Parquet: 11M rows/s in 1.6, 90M rows/s in 2.0
Merging DataFrame & Dataset
• DataFrame = Dataset[Row]
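As a rough analogy for "fuse across multiple operators", the plain-Python sketch below runs the same filter, project, and sum query twice: once as a chain of per-operator iterators and once as a single hand-fused loop. Spark generates JVM code rather than Python, so this is only an illustration; the operator names and the sample data are assumptions.

```python
def scan(rows):
    # Operator 1: produce rows one at a time.
    for row in rows:
        yield row


def filter_op(child, predicate):
    # Operator 2: pass through rows that satisfy the predicate.
    for row in child:
        if predicate(row):
            yield row


def project_op(child, fn):
    # Operator 3: transform each row.
    for row in child:
        yield fn(row)


def run_operator_chain(rows):
    # Every surviving row crosses three iterator boundaries before it is summed.
    pipeline = project_op(
        filter_op(scan(rows), lambda x: x % 2 == 0),
        lambda x: x * 10,
    )
    return sum(pipeline)


def run_fused(rows):
    # The same query collapsed into one loop with no per-operator calls.
    total = 0
    for x in rows:
        if x % 2 == 0:
            total += x * 10
    return total


data = range(1_000)
chain_total = run_operator_chain(data)
fused_total = run_fused(data)
```

Both versions compute the same total; the fused loop avoids the per-row hand-offs between operators, which is the overhead whole-stage code generation targets.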
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Beyond Batch & Interactive:
Higher-Level API for Streaming
What’s Hard In Using Streaming?
Complex semantics
• What possible results can the program give?
• What happens if a node runs slowly? If one fails?
Integration into a complete application
• Serve real-time queries on result of stream
• Give consistent results with batch jobs
Structured Streaming
High-level streaming API based on DataFrames / Datasets
• Same semantics & results as batch APIs
• Event time, windowing, sessions, transactional I/O
Rich integration with complete Apache Spark apps
• Memory sink for ad-hoc queries
• Joins with static data
• Change queries at runtime
Not just streaming, but
“continuous applications”
Structured Streaming API
Incrementalize an existing DataFrame/Dataset/SQL query
Example batch job:
logs = ctx.read.format("json").load("hdfs://logs")
logs.groupBy("userid", "hour").avg("latency")
    .write.format("parquet")
    .save("s3://...")
Structured Streaming API
Incrementalize an existing DataFrame/Dataset/SQL query
Example as streaming:
logs = ctx.readStream.format("json").load("hdfs://logs")
logs.groupBy("userid", "hour").avg("latency")
    .writeStream.format("parquet")
    .start("s3://...")
Results are always the same as a batch job on a prefix of the data
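The prefix guarantee can be illustrated without Spark. The plain-Python sketch below keeps a running sum and count per `(userid, hour)` key and checks, after each chunk of input, that the incremental answer equals a batch average over everything received so far. The record values, the chunk size, and the class and function names are assumptions for illustration; this is not how Spark implements the query.

```python
from collections import defaultdict


def batch_avg(records):
    """Batch job: average latency per (userid, hour) over all records."""
    total, count = defaultdict(float), defaultdict(int)
    for userid, hour, latency in records:
        total[(userid, hour)] += latency
        count[(userid, hour)] += 1
    return {k: total[k] / count[k] for k in total}


class IncrementalAvg:
    """Streaming version: keeps running (sum, count) state per key."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, new_records):
        for userid, hour, latency in new_records:
            self.total[(userid, hour)] += latency
            self.count[(userid, hour)] += 1
        return {k: self.total[k] / self.count[k] for k in self.total}


# Hypothetical log records: (userid, hour, latency), in arrival order.
stream = [("a", 1, 10.0), ("b", 1, 4.0), ("a", 1, 20.0), ("a", 2, 7.0), ("b", 1, 6.0)]

query = IncrementalAvg()
prefix_matches = []
for i in range(0, len(stream), 2):          # data arrives in chunks of up to two
    incremental = query.update(stream[i:i + 2])
    prefix_matches.append(incremental == batch_avg(stream[:i + 2]))
final_result = incremental
```

Every entry in `prefix_matches` is `True`: at each point the streaming state gives exactly what the batch function would return on the data seen so far.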
Under the Hood
Batch Plan: Scan Files → Aggregate → Write to S3
Continuous Plan: Scan New Files → Stateful Aggregate → Update S3
The batch plan is automatically transformed into the continuous plan.
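A toy version of the continuous plan, in plain Python, may help show what each of its three steps does. Here a dict of file names stands in for the input directory and another dict stands in for the S3 output; the `ContinuousPlan` class, its `trigger` method, and the sample data are hypothetical names for this sketch and do not correspond to Spark internals.

```python
class ContinuousPlan:
    """Toy continuous plan: scan new files, stateful aggregate, update sink."""

    def __init__(self, source, sink):
        self.source = source      # dict: file name -> list of (key, value)
        self.sink = sink          # dict standing in for the S3 output
        self.seen = set()         # files already processed
        self.state = {}           # key -> [sum, count]

    def trigger(self):
        # Scan New Files: only files not processed by an earlier trigger.
        new_files = sorted(set(self.source) - self.seen)
        for name in new_files:
            # Stateful Aggregate: fold the new rows into the running state.
            for key, value in self.source[name]:
                acc = self.state.setdefault(key, [0.0, 0])
                acc[0] += value
                acc[1] += 1
            self.seen.add(name)
        # Update sink: overwrite the output with the current averages.
        self.sink.clear()
        self.sink.update({k: s / c for k, (s, c) in self.state.items()})
        return new_files


source = {"part-0": [("u1", 10.0), ("u2", 4.0)]}
sink = {}
plan = ContinuousPlan(source, sink)

first = plan.trigger()
after_first = dict(sink)
source["part-1"] = [("u1", 30.0)]      # a new file lands in the input
second = plan.trigger()
after_second = dict(sink)
third = plan.trigger()                 # nothing new, so the output is unchanged
```

Each trigger reads only the files it has not seen before, yet the sink always holds the averages over all input so far, because the sums and counts are carried forward as state.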
End Goal: Full Continuous Apps
Pure Streaming System: Input Stream → Streaming Computation → Output Sink
Continuous Application: Input Stream → Continuous Application → Output Sink, also serving Ad-hoc Queries, using Static Data, and consistent with the Batch Job
Development Status
2.0.1: supports ETL workloads from file systems and S3
2.0.2: Kafka input source, monitoring metrics
2.1.0: event-time aggregation workloads & watermarks
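For readers new to event-time aggregation and watermarks, the plain-Python sketch below counts events per tumbling event-time window and drops events that arrive later than an allowed lateness. It is a deliberately simplified model and not Spark's exact semantics: the watermark here advances after every event, and the function name, window size, lateness, and sample events are all assumptions.

```python
from collections import defaultdict


def windowed_counts(events, window=10, allowed_lateness=5):
    """Count events per tumbling event-time window, dropping very late data.

    events: iterable of (event_time, key) pairs in arrival order.
    Simplification: the watermark is recomputed after every single event.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time, key in events:
        # Watermark = latest event time seen so far minus the allowed lateness.
        watermark = None if max_event_time is None else max_event_time - allowed_lateness
        if watermark is not None and event_time < watermark:
            dropped.append((event_time, key))          # older than the watermark
        else:
            window_start = (event_time // window) * window
            counts[(window_start, key)] += 1
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return dict(counts), dropped


# Hypothetical events as (event_time_seconds, key), listed in arrival order.
arrivals = [(1, "a"), (4, "a"), (12, "a"), (9, "a"), (25, "a"), (3, "a")]
counts, dropped = windowed_counts(arrivals)
```

The event at time 9 arrives out of order but is still within the lateness bound, so it is counted in the first window; the event at time 3 arrives after the watermark has passed 20 and is dropped.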
Greg Owen
Demo
