Real-time Spark
From interactive queries to streaming
Michael Armbrust- @michaelarmbrust
Strata + Hadoop World 2016
Real-time Analytics
Goal: freshest answer, as fast as possible
2
Challenges
• Implementing the analysis
• Making sure it runs efficiently
• Keeping the answer up to date
Develop Productively
with powerful, simple APIs in Apache Spark
3
Write Less Code: Compute an Average
private IntWritable one =
  new IntWritable(1);
private IntWritable output =
  new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
4
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
Using DataFrames
sqlCtx.table("people") \
  .groupBy("name") \
  .agg("name", avg("age")) \
  .map(lambda …) \
  .collect()
Full API Docs
• Python
• Scala
• Java
• R
5
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
Dataset
noun – [dey-tuh-set]
6
1. A distributed collection of data with a known
schema.
2. A high-level abstraction for selecting, filtering,
mapping, reducing, aggregating and plotting
structured data (cf. Hadoop, RDDs, R, Pandas).
3. Related: DataFrame – a distributed collection of
generic row objects (i.e. the result of a SQL query)
Standards based:
Supports most popular
constructs in HiveQL,
ANSI SQL
Compatible: Use JDBC
to connect with popular
BI tools
7
Datasets with SQL
SELECT name, avg(age)
FROM people
GROUP BY name
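A minimal sketch of running this query from code, assuming a Spark 2.0 SparkSession named spark and an illustrative people.json file:

// Hypothetical setup: register a DataFrame as a view, then query it with SQL.
val people = spark.read.json("people.json")   // illustrative path
people.createOrReplaceTempView("people")

spark.sql("SELECT name, avg(age) FROM people GROUP BY name").show()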
Concise: great for ad-
hoc interactive analysis
Interoperable: based
on R / pandas, easy
to go back and forth
8
Dynamic Datasets (DataFrames)
sqlCtx.table("people") \
  .groupBy("name") \
  .agg("name", avg("age")) \
  .map(lambda …) \
  .collect()
No boilerplate:
automatically convert
to/from domain objects
Safe: compile time
checks for correctness
9
Static Datasets
val df = ctx.read.json("people.json")
// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)
// Compute histogram of age by name.
val hist = ds.groupBy(_.name).mapGroups {
case (name, people: Iterator[Person]) =>
val buckets = new Array[Int](10)
people.map(_.age).foreach { a =>
buckets(a / 10) += 1
}
(name, buckets)
}
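The slide predates the final 2.0 API; below is a minimal sketch of the same histogram with the method names as released, where typed grouping is spelled groupByKey and mapGroups receives an Iterator:

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("histogram").getOrCreate()
import spark.implicits._   // encoders for case classes and tuples

case class Person(name: String, age: Int)

val ds: Dataset[Person] = spark.read.json("people.json").as[Person]

// Histogram of age by name, one bucket per decade.
val hist = ds.groupByKey(_.name).mapGroups { (name, people) =>
  val buckets = new Array[Int](10)
  people.foreach(p => buckets(p.age / 10) += 1)
  (name, buckets)
}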
Unified, Structured APIs in Spark 2.0
10
                 SQL       DataFrames     Datasets
Syntax Errors    Runtime   Compile Time   Compile Time
Analysis Errors  Runtime   Runtime        Compile Time
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
11
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
read and write
functions create
new builders for
doing I/O
12
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
13
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or
saveAsTable(…) to
finish the I/O
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
14
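The same builders work for any registered format. A hedged Scala sketch (paths are illustrative; the csv source is built in from Spark 2.0):

// Read CSV with options, then write Parquet partitioned by year.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/events.csv")

df.write
  .format("parquet")
  .mode("overwrite")              // how to handle existing data
  .partitionBy("year")
  .save("/data/events_parquet")   // save(...) to a path instead of saveAsTable(...)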
Unified: Data Source API
Spark SQL’s Data Source API can read and write DataFrames
using a variety of formats.
15
Built-In: { JSON }, JDBC, ORC, plain text*, and more…
External: find more sources at http://spark-packages.org/
Bridge Objects with Data Sources
16
{
"name": "Michael",
"zip": "94709",
"languages": ["scala"]
}
case class Person(
name: String,
languages: Seq[String],
zip: Int)
Automatically map
columns to fields by
name
Execute Efficiently
using the Catalyst optimizer & Tungsten engine in Apache Spark
17
Shared Optimization & Execution
18
SQL AST / DataFrame / Dataset
→ Unresolved Logical Plan
→ Analysis (using the Catalog)
→ Logical Plan
→ Logical Optimization
→ Optimized Logical Plan
→ Physical Planning → Physical Plans
→ Cost Model → Selected Physical Plan
→ Code Generation
→ RDDs
DataFrames, Datasets and SQL
share the same optimization/execution pipeline
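Each stage of this pipeline can be inspected for any query; a small sketch (the exact plan text varies by Spark version):

val query = spark.table("people").groupBy("name").avg("age")

// Prints the parsed, analyzed and optimized logical plans plus the physical plan.
query.explain(true)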
Not Just Less Code, Faster Too!
19
Time to aggregate 10 million int pairs (secs), comparing RDD Scala,
RDD Python, DataFrame Scala, DataFrame Python, DataFrame R and
DataFrame SQL (bar chart; the DataFrame variants perform comparably
across languages, while the Python RDD API is the slowest).
• 100+ native functions with
optimized codegen
implementations
– String manipulation – concat,
format_string, lower, lpad
– Date/Time – current_timestamp,
date_format, date_add, …
– Math – sqrt, randn, …
– Other –
monotonicallyIncreasingId,
sparkPartitionId, …
20
Complex Columns With Functions
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Operate Directly On Serialized Data
21
df.where(df("year") > 2015)
GreaterThan(year#234, Literal(2015))
boolean filter(Object baseObject) {
  long offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}
DataFrame Code / SQL
→ Catalyst Expressions
→ Low-level bytecode
JVM intrinsic JIT-ed to pointer arithmetic:
Platform.getInt(baseObject, offset);
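To see the generated code itself, Spark 2.x ships a developer-facing debug package; a hedged sketch assuming it (table name is illustrative):

import org.apache.spark.sql.execution.debug._   // developer API
import spark.implicits._

val filtered = spark.table("events").where($"year" > 2015)
filtered.debugCodegen()   // dumps the Java source Catalyst generated for this plan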
The overheads of JVM objects
“abcd”
22
• Native: 4 bytes with UTF-8 encoding
• Java: 48 bytes
java.lang.String object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) ...
4 4 (object header) ...
8 4 (object header) ...
12 4 char[] String.value []
16 4 int String.hash 0
20 4 int String.hash32 0
Instance size: 24 bytes (reported by Instrumentation API)
12 byte object header
8 byte hashcode
20 bytes data + overhead
Tungsten’s Compact Encoding
23
(123, “data”, “bricks”) encodes as a single row:
0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”
null bitmap | int value | offset to data | offset to data | field lengths precede variable-length data
Encoders
24
JVM Object: MyClass(123, “data”, “bricks”)
Internal Representation: 0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”
Encoders translate between domain
objects and Spark's internal format
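A minimal sketch of an encoder at work, assuming a SparkSession named spark:

import org.apache.spark.sql.{Encoder, Encoders}

case class MyClass(id: Int, a: String, b: String)

// The encoder carries both the class shape and the internal row schema.
val enc: Encoder[MyClass] = Encoders.product[MyClass]
enc.schema.printTreeString()

// Round-trip through Spark's internal format via a Dataset.
import spark.implicits._
val ds = spark.createDataset(Seq(MyClass(123, "data", "bricks")))
ds.show()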
Space Efficiency
25
Serialization performance
26
Update Automatically
using Structured Streaming in Apache Spark
27
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: Static DataFrames
Spark 2.0: Infinite DataFrames
Single API
logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
.write.format("jdbc")
.save("jdbc:mysql://...")
Example: Batch Aggregation
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
.write.format("jdbc")
.startStream("jdbc:mysql://...")
Example: Continuous Aggregation
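These snippets use the pre-release builder names shown at the talk (stream, startStream); in Spark 2.0 as released, the entry points are readStream and writeStream, and file sources need an explicit schema. A hedged sketch of the same continuous aggregation (the schema and path are assumptions, and since a built-in JDBC streaming sink never shipped, this writes to the console):

import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types._

val logSchema = new StructType()        // assumed log layout
  .add("user_id", StringType)
  .add("time", LongType)

val logs = spark.readStream
  .schema(logSchema)
  .format("json")
  .load("s3://logs")                    // illustrative path

val query = logs.groupBy("user_id").agg(sum("time"))
  .writeStream
  .outputMode("complete")               // streaming aggregations need complete/update mode
  .format("console")
  .start()

query.awaitTermination()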
Structured Streaming in
• High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
• Build and apply ML models
Model (trigger: every 1 sec): at each trigger 1, 2, 3, the Input is all
data up to processing time PT 1, 2, 3; the Query runs over that input to
produce the Result; the Output at each trigger is the complete output for
the data seen so far.
Model (trigger: every 1 sec): the same input/query/result cycle, but the
Output at each trigger is the delta output — only the output for the data
that arrived since the previous trigger.
Integration End-to-End
Streaming
engine
Stream
(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .
What could go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
MySQL
Page Minute Visits
home 10:09 21
pricing 10:10 30
... ... ...
Rest of Spark will follow
• Interactive queries should just work
• Spark’s data source API will be updated to support seamless
streaming integration
• Exactly once semantics end-to-end
• Different output modes (complete, delta, update-in-place)
• ML algorithms will be updated too
What can we do with this that’s hard with
other engines?
• Ad-hoc, interactive queries
• Dynamic changing queries
• Benefits of Spark: elastic scaling, straggler mitigation, etc.
38
Demo
Running in Databricks
• Hosted Spark in the cloud
• Notebooks with integrated visualization
• Scheduled production jobs
Community Edition is free for everyone!
http://go.databricks.com/databricks-community-edition-beta-waitlist
39
Simple and fast real-time analytics
• Develop Productively
• Execute Efficiently
• Update Automatically
Questions?
Learn More
Up next: TD
Today, 1:50-2:30 AMA
@michaelarmbrust