Running Spark in Production in the Cloud is not easy
Nayur Khan
#SAISEnt12
Who am I?
Nayur Khan
Head of Platform Engineering
Born and proven in F1, where the smallest margins are the difference between winning and losing.
Data has emerged as a fundamental element of competitive advantage.
QuantumBlack
“Exploit Data, Analytics and Design to help our clients be the best they can be”
Not just Formula One…
Advanced Industries, Financial Services, Healthcare, Infrastructure, Telecoms, Natural Resources, Sport, Consumer
1 – Background
2 – Observations of using Cloud Object Storage
3 – Design To Operate
4 – Pipelines!
Background
1
Some of the Platforms we use
Nerve Live
Typical Flow for production

Raw Data → Prepared Data (Cleaned, Enriched and Linked) → Advanced analytics → User modules, all running on an Analytics Platform over shared Infrastructure / Hosting, towards better & more consistent decision making.

1. Data Ingested into a Raw storage area
2. Data is Prepared (leverage unconnected, unstructured data by preparing and linking data and creating features)
   • Cleansed
   • Enriched
   • Linked
   • etc
3. Models are run (apply Advanced Analytics and ML to features)
4. User Tools consume outputs
   • Rich custom built tools
   • Self service BI tools
Example of physical Architecture of Nerve Live in AWS

The Nerve Live Analytics Cluster (AWS) consists of Master Servers each running Zookeeper and Mesos / Marathon, and Slave Servers running Mesos Slave / Chronos that host the Analytic Modules and Spark Analytics, alongside Data Buckets, Published Data Storage, Services, Logs and a Metastore.

1. Mesos based Spark cluster
   • Open Source versions
2. Data stored mainly in S3
   • Raw
   • Prepared
   • Features
   • Models
3. Data published at end of Pipeline (RDS)
4. Spark Analytic workloads
   • Scheduled via Marathon or Chronos
Example of how we package Spark Jobs

Image layers: O/S Image → Java-Base Image (+ Config) → Spark-Base-2.3.1 Image (Dependencies + Config) → App Image (Spark Code + Config)

• Use Docker
• Provide a Spark-Base-2.x.x Image to teams to package code into
• Teams package up their code into the provided image
• Can tell Spark to tell Mesos which Image to use for executors:
  spark.mesos.executor.docker.image
• Bonus – We can run different versions of Spark workloads in the cluster at the same time!
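As an illustration, a minimal sketch of what a team's App Image Dockerfile might look like on top of the provided base; the registry, image names and jar path are all hypothetical:

# Hypothetical App Image layered on the provided Spark base image
FROM registry.example.com/spark-base:2.3.1

# Add the team's assembled Spark application and its config
COPY target/my-pipeline-assembly.jar /opt/app/app.jar
COPY conf/ /opt/app/conf/

The job would then be submitted with spark.mesos.executor.docker.image pointing at the pushed image tag.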
Observations of using Cloud
Object Storage
2
Background – File Storage

(Diagram: a directory tree, e.g. /foo/a.csv, /bar/b.csv, /bar/c.avro, /temp/…, and its mapping from Directory to Physical storage)

File Storage
• Organize (file) data using a hierarchical structure (folders & files)
• Allows
  + Traversing a path
  + Quick to "list" files in a directory
  + Quick to "rename" files
Background – Object Storage

(Diagram: Keys such as /foo/a.csv, /bar/b.csv, /bar/c.avro mapped directly to Physical storage)

Object Storage
• No concept of organisation: just Keys → Data
• Unlike File Storage
  - Slow to "list" items in a "directory"
  - Slow to "rename" files
  - Accessed via REST calls
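To make the Keys → Data model concrete, a minimal sketch of listing a "directory" with the AWS Java SDK: the scan below is a paginated REST call filtered by key prefix, which is why listing (and rename, which is copy + delete) is slow. Bucket and prefix names are hypothetical.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

// A "directory" listing is really a key scan filtered by prefix
val s3 = AmazonS3ClientBuilder.defaultClient()
val req = new ListObjectsV2Request()
  .withBucketName("my-bucket")   // hypothetical
  .withPrefix("foo/")            // the "directory" is just a key prefix
s3.listObjectsV2(req).getObjectSummaries.asScala.foreach(o => println(o.getKey))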
First Observation – The "Intent" to Read Data

// Define a dataframe
val df = spark.read.format("csv")
  .schema(csvSchema).load("s3a://...")
// Do some work...

# Custom log4j properties
...
log4j.logger.com.amazonaws.request=DEBUG
...
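The csvSchema above is an explicit schema supplied up front, so Spark does not have to inspect the files to work one out. A minimal sketch of how it might be defined (the column names are hypothetical):

import org.apache.spark.sql.types._

// Hypothetical column layout; supplying this avoids schema inference
val csvSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))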
With the AWS request logging above enabled, you can watch the load( path ) call travel from the Spark Code through the Datasource API and the Hadoop Libs to the AWS Libs, which issue HTTPS HEAD / GET REST calls against keys in S3, each taking milliseconds to seconds. How many calls are issued depends on the number of files/partitions.
First Observation – The "Intent" to Read Data – How many HTTPS calls?

With Schema

// Define a dataframe
val df = spark.read.format("csv")
  .schema(csvSchema).load("s3a://...")
// Do some work

HTTPS calls per Spark version, with an explicit schema:

Path to 1 x CSV File
        2.1  2.2  2.3
TOTAL     7    6    4
HEAD      6    5    4
GET       1    1    0

Path to 200 x CSV Files
        2.1  2.2  2.3
TOTAL    15   11   14
HEAD      9    7    9
GET       6    4    5

vs. Infer Schema

// Define a dataframe
val df = spark.read.format("csv")
  .option("inferSchema", true)
  .load("s3a://...")
// Do some work

HTTPS calls per Spark version, inferring the schema:

Path to 1 x CSV File
        2.1  2.2  2.3
TOTAL    13   14   14
HEAD     10   10   10
GET       3    4    4

Path to 200 x CSV Files
        2.1   2.2   2.3
TOTAL   817   613  1082
HEAD    610   408   643
GET     207   205   439

Spark Versions
Spark 2.1 - spark-2.1.0-hadoop2.7
Spark 2.2 - spark-2.2.1-hadoop2.7
Spark 2.3 - spark-2.3.1-hadoop2.7
Second Observation – Writing Data – why does it take so long?

// Define a dataframe
df.write.format("parquet").save("s3a://...")
// Do some work...

# Custom log4j properties
...
log4j.logger.com.amazonaws.request=DEBUG
...
The same request logging shows save( path ) travelling from the Spark Code through the Datasource API and the Hadoop Libs to the AWS Libs, which this time issue HTTPS PUT / HEAD / GET / DELETE REST calls against keys in S3. Again, the number of calls depends on the number of files/partitions.
Second Observation – Writing Data – How many HTTPS calls?

Default

// Define a dataframe
df.write.format("parquet").save("s3a://...")
// Do some work...

HTTPS calls per Spark version, default committer:

1 Partition
        2.1  2.2  2.3
TOTAL   165  151  152
HEAD    103   95   96
GET      53   47   47
PUT       7    7    7
DELETE    2    2    2

10 Partitions
        2.1   2.2   2.3
TOTAL  1092  1078   381
HEAD    697   689   244
GET     323   317   114
PUT      52    52    17
DELETE   20    20     6

Go Faster

spark.conf.set(
  "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
df.write.format("parquet").save("s3a://...")

HTTPS calls per Spark version, with commit algorithm version 2:

1 Partition
        2.1  2.2  2.3
TOTAL   126  112  113
HEAD     79   71   72
GET      40   34   34
PUT       5    5    5
DELETE    2    2    2

10 Partitions
        2.1   2.2   2.3
TOTAL   738   724   272
HEAD    475   467   176
GET     220   214    81
PUT      32    32    11
DELETE   11    11     4

Spark Versions
Spark 2.1 - spark-2.1.0-hadoop2.7
Spark 2.2 - spark-2.2.1-hadoop2.7
Spark 2.3 - spark-2.3.1-hadoop2.7
Conclusion

• Cloud object storage has tradeoffs
  • Gives incredible storage capacity and scalability
  • However, do not expect the same performance or characteristics as a file system
• There are lots of configuration settings that can affect your performance when reading/writing to cloud object storage
  Note: Some are Hadoop version specific!

spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.connection.maximum=xxx
spark.hadoop.fs.s3a.multipart.size=xxx
spark.hadoop.fs.s3a.attempts.maximum=xxx
spark.hadoop.fs.s3a.multipart.threshold=xxx
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
etc
...
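As an illustration, a minimal sketch of applying settings like these when building the session; the numeric values are placeholders, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuned-job")  // hypothetical
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .config("spark.hadoop.fs.s3a.connection.maximum", "100")          // placeholder
  .config("spark.hadoop.fs.s3a.multipart.size", "104857600")        // placeholder
  .config("spark.hadoop.fs.s3a.attempts.maximum", "10")             // placeholder
  .config("spark.hadoop.fs.s3a.multipart.threshold", "2147483647")  // placeholder
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()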
Design to Operate
3
Design to Operate – Choices

• Our teams have faced choices for Loading / Saving data
  • What URI scheme to use?
    ─ e.g. s3://… s3n://… s3a://…, jdbc:// etc
  • What file format?
    ─ e.g. Parquet, Avro, CSV, etc
  • What compression to use?
    ─ e.g. Snappy, LZO, Gzip, etc
  • What additional options…
    ─ e.g. inferSchema, mergeSchema, pushDownPredicate, etc
  • Even what path to load/save data
• Choices end up being "embedded" into code:

val path = "s3n://my-bucket/csv-data/people"
val peopleDF =
  spark.read.format("csv")
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(path)
...
Design to Operate – Challenges

• Different teams working in the same cluster
  • Working on different use cases
  • Likely that Spark Jobs are running with different settings in the same cluster
• Those that build may not be the ones that end up running/operating Spark Jobs
  • Support Ops
  • Or Client IT teams
• Difficult for those operating to understand
  • What configuration settings are in play
  • What does it depend on? i.e. Data it needs
  • What depends on it? i.e. Data it writes
  • How can I change things quickly?
Design to Operate – Our Approach

• Borrow ideas from the DevOps world
  • "Strict separation of config from code"
• Custom library and config for loading and saving
• The config captures the Loading / Saving choices:
  • What URI scheme to use (s3, jdbc, etc)
  • What format (Parquet, Avro, CSV, etc)
  • What compression (Snappy, LZO, Gzip, etc)
  • What additional options (inferSchema, mergeSchema, overwrite or append, etc)
  • Even what path to load/save data

Spark Code USES the Nerve Load/Save Library, which USES Config such as:

"system": "bank",
"table": "customers",
"type": "input",
"description": "A Table with banking customer details.",
"format": {
  "type": "csv",
  "compression": "snappy",
  "partitions": "20",
  "mode": "overwrite"
},
"location": {
  "type": "s3",
  "properties": { ... }
},
"schema": [
  {
    "name": "id",
    "type": "int",
    "doc": "User Id"
  },
  ...

// Load a dataframe
val df = nerve.load("system", "table")

// Save a dataframe
nerve.save(df, "system", "table")

• Engineers use this library in their code
• Allows Engineers to focus on business logic and not worry about concerns such as:
  • Data formats
  • Compressions
  • Schemas
  • Partitions
  • Weird optimisation settings
• Allows teams that focus on operating/running to focus on:
  • Consistency
  • Runtime optimisations
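The Nerve library itself is internal, so as a purely hypothetical sketch, a config-driven load/save wrapper of this shape could look like the following (all names and the catalog structure are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Resolved from config like the JSON above: format, path, options, save mode
case class TableConfig(format: String, path: String,
                       options: Map[String, String], mode: String)

class NerveIO(spark: SparkSession,
              catalog: Map[(String, String), TableConfig]) {

  // Look up (system, table) in config and read accordingly
  def load(system: String, table: String): DataFrame = {
    val cfg = catalog((system, table))
    spark.read.format(cfg.format).options(cfg.options).load(cfg.path)
  }

  // Same lookup on the write path, so jobs never hard-code storage choices
  def save(df: DataFrame, system: String, table: String): Unit = {
    val cfg = catalog((system, table))
    df.write.format(cfg.format).options(cfg.options).mode(cfg.mode).save(cfg.path)
  }
}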
Pipelines!
4
Pipelines

• Pipelines mean different things to different people
• Ultimately, it's about the Flow of data:
  Ingest → Prepare → Analytics → Publish
• We need to
  • Monitor
  • Understand
  • React
• It looks simple, but in fact it's not
Pipelines – Example

(Diagram, built up across several slides: Events flow from the Sources into Raw, then Prepared, then Published)
Pipelines – Example – Agreed Schedule

Sources → Raw (0500) → Prepared (0700) → Published (0900)

• Agree a schedule with the client for when to send data
• Configure the schedule to kick off Spark Jobs
• Subsequent Spark Jobs kick off when previous ones have completed
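For instance, with Chronos (already used in the cluster for scheduling), the 0700 kick-off could be a scheduled job and the downstream step a dependent job. A minimal sketch, with hypothetical names, dates and commands, and fields such as the job owner omitted:

{
  "name": "prepare-data",
  "schedule": "R/2018-10-03T07:00:00Z/P1D",
  "command": "spark-submit --master mesos://... /opt/app/app.jar prepare"
}

{
  "name": "publish-data",
  "parents": ["prepare-data"],
  "command": "spark-submit --master mesos://... /opt/app/app.jar publish"
}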
Pipelines – Example – Challenge 1 – Untimely Data

• Data failed to be sent up on time
• Impact to the rest of the Pipeline
• Impact to the end user!
Pipelines – Example – Challenge 2 – Bad Data

• Bad/Unexpected Data
  • Schema change
  • Change in Business rules
  • Data Quality Issue
• Impact to Pipeline
• Impact to end user!
Pipelines – Example – Challenge 3 – Jobs Taking Longer

• Spark Jobs taking longer and longer
• Impact to Pipeline
• Impact to end user!
Pipelines – Example – Challenges

• We need to
  • Monitor
  • Understand
  • React
• It looks simple, but in fact it's not
Pipelines – Example – Our Approach

Monitoring and Metadata services sit alongside the Sources → Raw → Prepared → Published flow.

• Monitor all Raw sources for timely data
• Monitor quality of data
  • Config driven
  • Agreed business rules
• Monitor Spark Metrics
  • At least how long things take
  • Additional Metrics including:
    ─ Intent to load data
    ─ Saving data
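For the Spark Metrics piece, a minimal sketch of capturing at least "how long" via a SparkListener; how the measurements reach the monitoring system is omitted here:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.concurrent.TrieMap

// Records job start times and reports durations when jobs finish
class JobDurationListener extends SparkListener {
  private val starts = TrieMap.empty[Int, Long]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    starts.put(jobStart.jobId, jobStart.time)

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    starts.remove(jobEnd.jobId).foreach { startedAt =>
      println(s"Job ${jobEnd.jobId} took ${jobEnd.time - startedAt} ms")
    }
}

// Registered once per application:
// spark.sparkContext.addSparkListener(new JobDurationListener)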
Recap

1 – Observations of using Cloud Object Storage
2 – Design To Operate
3 – Pipelines!

Let's Talk!