Real-time Spark
From interactive queries to streaming
Michael Armbrust- @michaelarmbrust
Strata + Hadoop World 2016
Real-time Analytics
Goal: freshest answer, as fast as possible
2
Challenges
• Implementing the analysis
• Making sure it runs efficiently
• Keeping the answer up to date
Develop Productively
with powerful, simple APIs in Apache Spark
3
Write Less Code: Compute an Average
private IntWritable one =
  new IntWritable(1);
private IntWritable output =
  new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
4
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
Using DataFrames
sqlCtx.table("people") \
  .groupBy("name") \
  .agg("name", avg("age")) \
  .map(lambda …) \
  .collect()
Full API Docs
• Python
• Scala
• Java
• R
5
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
Dataset
noun – [dey-tuh-set]
6
1. A distributed collection of data with a known
schema.
2. A high-level abstraction for selecting, filtering,
mapping, reducing, aggregating and plotting
structured data (cf. Hadoop, RDDs, R, Pandas).
3. Related: DataFrame – a distributed collection of
generic row objects (i.e. the result of a SQL query)
Standards based:
Supports most popular
constructs in HiveQL,
ANSI SQL
Compatible: Use JDBC
to connect with popular
BI tools
7
Datasets with SQL
SELECT name, avg(age)
FROM people
GROUP BY name
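A minimal sketch of running this query from code, assuming a Spark 2.0 SparkSession named spark and an illustrative people.json file:

// Hypothetical setup: register a DataFrame as a view, then query it with SQL.
val people = spark.read.json("people.json")   // illustrative path
people.createOrReplaceTempView("people")

spark.sql("SELECT name, avg(age) FROM people GROUP BY name").show()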
Concise: great for ad-
hoc interactive analysis
Interoperable: based
on R / pandas, easy
to go back and forth
8
Dynamic Datasets (DataFrames)
sqlCtx.table("people") \
  .groupBy("name") \
  .agg("name", avg("age")) \
  .map(lambda …) \
  .collect()
No boilerplate:
automatically convert
to/from domain objects
Safe: compile time
checks for correctness
9
Static Datasets
val df = ctx.read.json("people.json")
// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)
// Compute histogram of age by name.
val hist = ds.groupBy(_.name).mapGroups {
case (name, people: Iterator[Person]) =>
val buckets = new Array[Int](10)
people.map(_.age).foreach { a =>
buckets(a / 10) += 1
}
(name, buckets)
}
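The slide predates the final 2.0 API; below is a minimal sketch of the same histogram with the method names as released, where typed grouping is spelled groupByKey and mapGroups receives an Iterator:

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("histogram").getOrCreate()
import spark.implicits._   // encoders for case classes and tuples

case class Person(name: String, age: Int)

val ds: Dataset[Person] = spark.read.json("people.json").as[Person]

// Histogram of age by name, one bucket per decade.
val hist = ds.groupByKey(_.name).mapGroups { (name, people) =>
  val buckets = new Array[Int](10)
  people.foreach(p => buckets(p.age / 10) += 1)
  (name, buckets)
}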
Unified, Structured APIs in Spark 2.0
10
                 SQL       DataFrames     Datasets
Syntax Errors    Runtime   Compile Time   Compile Time
Analysis Errors  Runtime   Runtime        Compile Time
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
11
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
read and write
functions create
new builders for
doing I/O
12
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
13
Unified: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or
saveAsTable(…) to
finish the I/O
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
14
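The same builders work for any registered format. A hedged Scala sketch (paths are illustrative; the csv source is built in from Spark 2.0):

// Read CSV with options, then write Parquet partitioned by year.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/events.csv")

df.write
  .format("parquet")
  .mode("overwrite")              // how to handle existing data
  .partitionBy("year")
  .save("/data/events_parquet")   // save(...) to a path instead of saveAsTable(...)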
Unified: Data Source API
Spark SQL’s Data Source API can read and write DataFrames
using a variety of formats.
15
Built-In: { JSON }, JDBC, ORC, plain text*, and more…
External: find more sources at http://spark-packages.org/
Bridge Objects with Data Sources
16
{
"name": "Michael",
"zip": "94709",
"languages": ["scala"]
}
case class Person(
name: String,
languages: Seq[String],
zip: Int)
Automatically map
columns to fields by
name
Execute Efficiently
using the Catalyst optimizer & Tungsten engine in Apache Spark
17
Shared Optimization & Execution
18
SQL AST / DataFrame / Dataset
→ Unresolved Logical Plan
→ Analysis (using the Catalog)
→ Logical Plan
→ Logical Optimization
→ Optimized Logical Plan
→ Physical Planning → Physical Plans
→ Cost Model → Selected Physical Plan
→ Code Generation
→ RDDs
DataFrames, Datasets and SQL
share the same optimization/execution pipeline
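Each stage of this pipeline can be inspected for any query; a small sketch (the exact plan text varies by Spark version):

val query = spark.table("people").groupBy("name").avg("age")

// Prints the parsed, analyzed and optimized logical plans plus the physical plan.
query.explain(true)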
Not Just Less Code, Faster Too!
19
Time to aggregate 10 million int pairs (secs), comparing RDD Scala,
RDD Python, DataFrame Scala, DataFrame Python, DataFrame R and
DataFrame SQL (bar chart; the DataFrame variants perform comparably
across languages, while the Python RDD API is the slowest).
• 100+ native functions with
optimized codegen
implementations
– String manipulation – concat,
format_string, lower, lpad
– Date/Time – current_timestamp,
date_format, date_add, …
– Math – sqrt, randn, …
– Other –
monotonicallyIncreasingId,
sparkPartitionId, …
20
Complex Columns With Functions
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Operate Directly On Serialized Data
21
df.where(df("year") > 2015)
GreaterThan(year#234, Literal(2015))
boolean filter(Object baseObject) {
  long offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}
DataFrame Code / SQL
→ Catalyst Expressions
→ Low-level bytecode
JVM intrinsic JIT-ed to pointer arithmetic:
Platform.getInt(baseObject, offset);
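To see the generated code itself, Spark 2.x ships a developer-facing debug package; a hedged sketch assuming it (table name is illustrative):

import org.apache.spark.sql.execution.debug._   // developer API
import spark.implicits._

val filtered = spark.table("events").where($"year" > 2015)
filtered.debugCodegen()   // dumps the Java source Catalyst generated for this plan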
The overheads of JVM objects
“abcd”
22
• Native: 4 bytes with UTF-8 encoding
• Java: 48 bytes
java.lang.String object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) ...
4 4 (object header) ...
8 4 (object header) ...
12 4 char[] String.value []
16 4 int String.hash 0
20 4 int String.hash32 0
Instance size: 24 bytes (reported by Instrumentation API)
12 byte object header
8 byte hashcode
20 bytes data + overhead
Tungsten’s Compact Encoding
23
(123, “data”, “bricks”) encodes as a single row:
0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”
null bitmap | int value | offset to data | offset to data | field lengths precede variable-length data
Encoders
24
JVM Object: MyClass(123, “data”, “bricks”)
Internal Representation: 0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”
Encoders translate between domain
objects and Spark's internal format
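A minimal sketch of an encoder at work, assuming a SparkSession named spark:

import org.apache.spark.sql.{Encoder, Encoders}

case class MyClass(id: Int, a: String, b: String)

// The encoder carries both the class shape and the internal row schema.
val enc: Encoder[MyClass] = Encoders.product[MyClass]
enc.schema.printTreeString()

// Round-trip through Spark's internal format via a Dataset.
import spark.implicits._
val ds = spark.createDataset(Seq(MyClass(123, "data", "bricks")))
ds.show()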
Space Efficiency
25
Serialization performance
26
Update Automatically
using Structured Streaming in Apache Spark
27
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: Static DataFrames
Spark 2.0: Infinite DataFrames
Single API
logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
.write.format("jdbc")
.save("jdbc:mysql://...")
Example: Batch Aggregation
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
.write.format("jdbc")
.startStream("jdbc:mysql://...")
Example: Continuous Aggregation
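These snippets use the pre-release builder names shown at the talk (stream, startStream); in Spark 2.0 as released, the entry points are readStream and writeStream, and file sources need an explicit schema. A hedged sketch of the same continuous aggregation (the schema and path are assumptions, and since a built-in JDBC streaming sink never shipped, this writes to the console):

import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types._

val logSchema = new StructType()        // assumed log layout
  .add("user_id", StringType)
  .add("time", LongType)

val logs = spark.readStream
  .schema(logSchema)
  .format("json")
  .load("s3://logs")                    // illustrative path

val query = logs.groupBy("user_id").agg(sum("time"))
  .writeStream
  .outputMode("complete")               // streaming aggregations need complete/update mode
  .format("console")
  .start()

query.awaitTermination()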
Structured Streaming in
• High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
• Build and apply ML models
Model (trigger: every 1 sec): at each trigger 1, 2, 3, the Input is all
data up to processing time PT 1, 2, 3; the Query runs over that input to
produce the Result; the Output at each trigger is the complete output for
the data seen so far.
Model (trigger: every 1 sec): the same input/query/result cycle, but the
Output at each trigger is the delta output — only the output for the data
that arrived since the previous trigger.
Integration End-to-End
Streaming
engine
Stream
(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .
What could go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
MySQL
Page Minute Visits
home 10:09 21
pricing 10:10 30
... ... ...
Rest of Spark will follow
• Interactive queries should just work
• Spark’s data source API will be updated to support seamless
streaming integration
• Exactly once semantics end-to-end
• Different output modes (complete, delta, update-in-place)
• ML algorithms will be updated too
What can we do with this that’s hard with
other engines?
• Ad-hoc, interactive queries
• Dynamic changing queries
• Benefits of Spark: elastic scaling, straggler mitigation, etc.
38
Demo
Running in Databricks
• Hosted Spark in the cloud
• Notebooks with integrated visualization
• Scheduled production jobs
Community Edition is free for everyone!
http://go.databricks.com/databricks-community-edition-beta-waitlist
39
Simple and fast real-time analytics
• Develop Productively
• Execute Efficiently
• Update Automatically
Questions?
Learn More
Up next: TD
Today, 1:50-2:30 AMA
@michaelarmbrust