Running Spark in Production in the Cloud is not easy
Nayur Khan
#SAISEnt12
Who am I?
Nayur Khan
Head of Platform Engineering
Born and proven in F1, where the smallest margins are the difference between winning and losing.
Data has emerged as a fundamental element of competitive advantage.
QuantumBlack
“Exploit Data, Analytics and Design to help our clients be the best they can be”
Not just Formula One…
Advanced Industries, Financial Services, Healthcare, Infrastructure, Telecoms, Natural Resources, Sport, Consumer
1 – Background
2 – Observations of using Cloud Object Storage
3 – Design To Operate
4 – Pipelines!
Background
1
Some of the Platforms we use
Nerve Live
Typical Flow for production

Raw Data → Prepared Data (Cleaned, Enriched and Linked) → Advanced analytics → User modules, all running on an Analytics Platform over shared Infrastructure / Hosting, towards better & more consistent decision making.

1. Data Ingested into a Raw storage area
2. Data is Prepared (leverage unconnected, unstructured data by preparing and linking data and creating features)
   • Cleansed
   • Enriched
   • Linked
   • etc
3. Models are run (apply Advanced Analytics and ML to features)
4. User Tools consume outputs
   • Rich custom built tools
   • Self service BI tools
Example of physical Architecture of Nerve Live in AWS

The Nerve Live Analytics Cluster (AWS) consists of Master Servers each running Zookeeper and Mesos / Marathon, and Slave Servers running Mesos Slave / Chronos that host the Analytic Modules and Spark Analytics, alongside Data Buckets, Published Data Storage, Services, Logs and a Metastore.

1. Mesos based Spark cluster
   • Open Source versions
2. Data stored mainly in S3
   • Raw
   • Prepared
   • Features
   • Models
3. Data published at end of Pipeline (RDS)
4. Spark Analytic workloads
   • Scheduled via Marathon or Chronos
Example of how we package Spark Jobs

Image layers: O/S Image → Java-Base Image (+ Config) → Spark-Base-2.3.1 Image (Dependencies + Config) → App Image (Spark Code + Config)

• Use Docker
• Provide a Spark-Base-2.x.x Image to teams to package code into
• Teams package up their code into the provided image
• Can tell Spark to tell Mesos which Image to use for executors:
  spark.mesos.executor.docker.image
• Bonus – We can run different versions of Spark workloads in the cluster at the same time!
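As an illustration, a minimal sketch of what a team's App Image Dockerfile might look like on top of the provided base; the registry, image names and jar path are all hypothetical:

# Hypothetical App Image layered on the provided Spark base image
FROM registry.example.com/spark-base:2.3.1

# Add the team's assembled Spark application and its config
COPY target/my-pipeline-assembly.jar /opt/app/app.jar
COPY conf/ /opt/app/conf/

The job would then be submitted with spark.mesos.executor.docker.image pointing at the pushed image tag.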
Observations of using Cloud
Object Storage
2
Background – File Storage

(Diagram: a directory tree, e.g. /foo/a.csv, /bar/b.csv, /bar/c.avro, /temp/…, and its mapping from Directory to Physical storage)

File Storage
• Organize (file) data using a hierarchical structure (folders & files)
• Allows
  + Traversing a path
  + Quick to "list" files in a directory
  + Quick to "rename" files
Background – Object Storage

(Diagram: Keys such as /foo/a.csv, /bar/b.csv, /bar/c.avro mapped directly to Physical storage)

Object Storage
• No concept of organisation: just Keys → Data
• Unlike File Storage
  - Slow to "list" items in a "directory"
  - Slow to "rename" files
  - Accessed via REST calls
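To make the Keys → Data model concrete, a minimal sketch of listing a "directory" with the AWS Java SDK: the scan below is a paginated REST call filtered by key prefix, which is why listing (and rename, which is copy + delete) is slow. Bucket and prefix names are hypothetical.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

// A "directory" listing is really a key scan filtered by prefix
val s3 = AmazonS3ClientBuilder.defaultClient()
val req = new ListObjectsV2Request()
  .withBucketName("my-bucket")   // hypothetical
  .withPrefix("foo/")            // the "directory" is just a key prefix
s3.listObjectsV2(req).getObjectSummaries.asScala.foreach(o => println(o.getKey))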
First Observation – The "Intent" to Read Data

// Define a dataframe
val df = spark.read.format("csv")
  .schema(csvSchema).load("s3a://...")
// Do some work...

# Custom log4j properties
...
log4j.logger.com.amazonaws.request=DEBUG
...
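The csvSchema above is an explicit schema supplied up front, so Spark does not have to inspect the files to work one out. A minimal sketch of how it might be defined (the column names are hypothetical):

import org.apache.spark.sql.types._

// Hypothetical column layout; supplying this avoids schema inference
val csvSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))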
With the AWS request logging above enabled, you can watch the load( path ) call travel from the Spark Code through the Datasource API and the Hadoop Libs to the AWS Libs, which issue HTTPS HEAD / GET REST calls against keys in S3, each taking milliseconds to seconds. How many calls are issued depends on the number of files/partitions.
First Observation – The "Intent" to Read Data – How many HTTPS calls?

With Schema

// Define a dataframe
val df = spark.read.format("csv")
  .schema(csvSchema).load("s3a://...")
// Do some work

HTTPS calls per Spark version, with an explicit schema:

Path to 1 x CSV File
        2.1  2.2  2.3
TOTAL     7    6    4
HEAD      6    5    4
GET       1    1    0

Path to 200 x CSV Files
        2.1  2.2  2.3
TOTAL    15   11   14
HEAD      9    7    9
GET       6    4    5

vs. Infer Schema

// Define a dataframe
val df = spark.read.format("csv")
  .option("inferSchema", true)
  .load("s3a://...")
// Do some work

HTTPS calls per Spark version, inferring the schema:

Path to 1 x CSV File
        2.1  2.2  2.3
TOTAL    13   14   14
HEAD     10   10   10
GET       3    4    4

Path to 200 x CSV Files
        2.1   2.2   2.3
TOTAL   817   613  1082
HEAD    610   408   643
GET     207   205   439

Spark Versions
Spark 2.1 - spark-2.1.0-hadoop2.7
Spark 2.2 - spark-2.2.1-hadoop2.7
Spark 2.3 - spark-2.3.1-hadoop2.7
Second Observation – Writing Data – why does it take so long?

// Define a dataframe
df.write.format("parquet").save("s3a://...")
// Do some work...

# Custom log4j properties
...
log4j.logger.com.amazonaws.request=DEBUG
...
The same request logging shows save( path ) travelling from the Spark Code through the Datasource API and the Hadoop Libs to the AWS Libs, which this time issue HTTPS PUT / HEAD / GET / DELETE REST calls against keys in S3. Again, the number of calls depends on the number of files/partitions.
Second Observation – Writing Data – How many HTTPS calls?

Default

// Define a dataframe
df.write.format("parquet").save("s3a://...")
// Do some work...

HTTPS calls per Spark version, default committer:

1 Partition
        2.1  2.2  2.3
TOTAL   165  151  152
HEAD    103   95   96
GET      53   47   47
PUT       7    7    7
DELETE    2    2    2

10 Partitions
        2.1   2.2   2.3
TOTAL  1092  1078   381
HEAD    697   689   244
GET     323   317   114
PUT      52    52    17
DELETE   20    20     6

Go Faster

spark.conf.set(
  "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
df.write.format("parquet").save("s3a://...")

HTTPS calls per Spark version, with commit algorithm version 2:

1 Partition
        2.1  2.2  2.3
TOTAL   126  112  113
HEAD     79   71   72
GET      40   34   34
PUT       5    5    5
DELETE    2    2    2

10 Partitions
        2.1   2.2   2.3
TOTAL   738   724   272
HEAD    475   467   176
GET     220   214    81
PUT      32    32    11
DELETE   11    11     4

Spark Versions
Spark 2.1 - spark-2.1.0-hadoop2.7
Spark 2.2 - spark-2.2.1-hadoop2.7
Spark 2.3 - spark-2.3.1-hadoop2.7
Conclusion

• Cloud object storage has tradeoffs
  • Gives incredible storage capacity and scalability
  • However, do not expect the same performance or characteristics as a file system
• There are lots of configuration settings that can affect your performance when reading/writing to cloud object storage
  Note: Some are Hadoop version specific!

spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.connection.maximum=xxx
spark.hadoop.fs.s3a.multipart.size=xxx
spark.hadoop.fs.s3a.attempts.maximum=xxx
spark.hadoop.fs.s3a.multipart.threshold=xxx
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
etc
...
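As an illustration, a minimal sketch of applying settings like these when building the session; the numeric values are placeholders, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuned-job")  // hypothetical
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .config("spark.hadoop.fs.s3a.connection.maximum", "100")          // placeholder
  .config("spark.hadoop.fs.s3a.multipart.size", "104857600")        // placeholder
  .config("spark.hadoop.fs.s3a.attempts.maximum", "10")             // placeholder
  .config("spark.hadoop.fs.s3a.multipart.threshold", "2147483647")  // placeholder
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()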
Design to Operate
3
Design to Operate – Choices

• Our teams have faced choices for Loading / Saving data
  • What URI scheme to use?
    ─ e.g. s3://… s3n://… s3a://…, jdbc:// etc
  • What file format?
    ─ e.g. Parquet, Avro, CSV, etc
  • What compression to use?
    ─ e.g. Snappy, LZO, Gzip, etc
  • What additional options…
    ─ e.g. inferSchema, mergeSchema, pushDownPredicate, etc
  • Even what path to load/save data
• Choices end up being "embedded" into code:

val path = "s3n://my-bucket/csv-data/people"
val peopleDF =
  spark.read.format("csv")
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(path)
...
Design to Operate – Challenges

• Different teams working in the same cluster
  • Working on different use cases
  • Likely that Spark Jobs are running with different settings in the same cluster
• Those that build may not be the ones that end up running/operating Spark Jobs
  • Support Ops
  • Or Client IT teams
• Difficult for those operating to understand
  • What configuration settings are in play
  • What does it depend on? i.e. Data it needs
  • What depends on it? i.e. Data it writes
  • How can I change things quickly?
Design to Operate – Our Approach

• Borrow ideas from the DevOps world
  • "Strict separation of config from code"
• Custom library and config for loading and saving
• The config captures the Loading / Saving choices:
  • What URI scheme to use (s3, jdbc, etc)
  • What format (Parquet, Avro, CSV, etc)
  • What compression (Snappy, LZO, Gzip, etc)
  • What additional options (inferSchema, mergeSchema, overwrite or append, etc)
  • Even what path to load/save data

Spark Code USES the Nerve Load/Save Library, which USES Config such as:

"system": "bank",
"table": "customers",
"type": "input",
"description": "A Table with banking customer details.",
"format": {
  "type": "csv",
  "compression": "snappy",
  "partitions": "20",
  "mode": "overwrite"
},
"location": {
  "type": "s3",
  "properties": { ... }
},
"schema": [
  {
    "name": "id",
    "type": "int",
    "doc": "User Id"
  },
  ...

// Load a dataframe
val df = nerve.load("system", "table")

// Save a dataframe
nerve.save(df, "system", "table")

• Engineers use this library in their code
• Allows Engineers to focus on business logic and not worry about concerns such as:
  • Data formats
  • Compressions
  • Schemas
  • Partitions
  • Weird optimisation settings
• Allows teams that focus on operating/running to focus on:
  • Consistency
  • Runtime optimisations
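The Nerve library itself is internal, so as a purely hypothetical sketch, a config-driven load/save wrapper of this shape could look like the following (all names and the catalog structure are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Resolved from config like the JSON above: format, path, options, save mode
case class TableConfig(format: String, path: String,
                       options: Map[String, String], mode: String)

class NerveIO(spark: SparkSession,
              catalog: Map[(String, String), TableConfig]) {

  // Look up (system, table) in config and read accordingly
  def load(system: String, table: String): DataFrame = {
    val cfg = catalog((system, table))
    spark.read.format(cfg.format).options(cfg.options).load(cfg.path)
  }

  // Same lookup on the write path, so jobs never hard-code storage choices
  def save(df: DataFrame, system: String, table: String): Unit = {
    val cfg = catalog((system, table))
    df.write.format(cfg.format).options(cfg.options).mode(cfg.mode).save(cfg.path)
  }
}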
Pipelines!
4
Pipelines

• Pipelines mean different things to different people
• Ultimately, it's about the Flow of data:
  Ingest → Prepare → Analytics → Publish
• We need to
  • Monitor
  • Understand
  • React
• It looks simple, but in fact it's not
Pipelines – Example

(Diagram, built up across several slides: Events flow from the Sources into Raw, then Prepared, then Published)
Pipelines – Example – Agreed Schedule

Sources → Raw (0500) → Prepared (0700) → Published (0900)

• Agree a schedule with the client for when to send data
• Configure the schedule to kick off Spark Jobs
• Subsequent Spark Jobs kick off when previous ones have completed
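For instance, with Chronos (already used in the cluster for scheduling), the 0700 kick-off could be a scheduled job and the downstream step a dependent job. A minimal sketch, with hypothetical names, dates and commands, and fields such as the job owner omitted:

{
  "name": "prepare-data",
  "schedule": "R/2018-10-03T07:00:00Z/P1D",
  "command": "spark-submit --master mesos://... /opt/app/app.jar prepare"
}

{
  "name": "publish-data",
  "parents": ["prepare-data"],
  "command": "spark-submit --master mesos://... /opt/app/app.jar publish"
}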
Pipelines – Example – Challenge 1 – Untimely Data

• Data failed to be sent up on time
• Impact to the rest of the Pipeline
• Impact to the end user!
Pipelines – Example – Challenge 2 – Bad Data

• Bad/Unexpected Data
  • Schema change
  • Change in Business rules
  • Data Quality Issue
• Impact to Pipeline
• Impact to end user!
Pipelines – Example – Challenge 3 – Jobs Taking Longer

• Spark Jobs taking longer and longer
• Impact to Pipeline
• Impact to end user!
Pipelines – Example – Challenges

• We need to
  • Monitor
  • Understand
  • React
• It looks simple, but in fact it's not
Pipelines – Example – Our Approach

Monitoring and Metadata services sit alongside the Sources → Raw → Prepared → Published flow.

• Monitor all Raw sources for timely data
• Monitor quality of data
  • Config driven
  • Agreed business rules
• Monitor Spark Metrics
  • At least how long things take
  • Additional Metrics including:
    ─ Intent to load data
    ─ Saving data
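For the Spark Metrics piece, a minimal sketch of capturing at least "how long" via a SparkListener; how the measurements reach the monitoring system is omitted here:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.concurrent.TrieMap

// Records job start times and reports durations when jobs finish
class JobDurationListener extends SparkListener {
  private val starts = TrieMap.empty[Int, Long]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    starts.put(jobStart.jobId, jobStart.time)

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    starts.remove(jobEnd.jobId).foreach { startedAt =>
      println(s"Job ${jobEnd.jobId} took ${jobEnd.time - startedAt} ms")
    }
}

// Registered once per application:
// spark.sparkContext.addSparkListener(new JobDurationListener)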
Recap

1 – Observations of using Cloud Object Storage
2 – Design To Operate
3 – Pipelines!

Let's Talk!