Spark Summit EU talk by Steve Loughran

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark and Object Stores: What you need to know
Steve Loughran
stevel@hortonworks.com
@steveloughran
October 2016

Steve Loughran,
Hadoop committer, PMC member, …
Chris Nauroth,
Apache Hadoop committer & PMC
ASF member
Rajesh Balamohan
Tez Committer, PMC Member

Elastic ETL
[Figure labels: inbound, HDFS, external, ORC and Parquet datasets]

Notebooks
[Figure labels: external datasets, library]

Streaming

A Filesystem: Directories, Files → Data
/
  work/
    pending/
      part-00
      part-01
    complete/
      part-01
[Figure: each file is backed by data blocks labelled 00 and 01]
rename("/work/pending/part-01", "/work/complete")
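For contrast with what follows, here is a minimal sketch of the same commit on a local filesystem, using `java.nio.file` rather than the Hadoop FileSystem API. The object name `LocalRenameSketch` and the temporary directory layout are assumptions made for illustration only.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object LocalRenameSketch {
  /** Moves `source` into the directory `destDir` with a single rename. */
  def commitByRename(source: Path, destDir: Path): Path = {
    Files.createDirectories(destDir)
    val target = destDir.resolve(source.getFileName)
    // ATOMIC_MOVE: the move happens completely or not at all
    Files.move(source, target, StandardCopyOption.ATOMIC_MOVE)
  }

  /** Builds a temporary work/pending/part-01 layout and commits it.
    * Returns (source still exists, committed file exists). */
  def demo(): (Boolean, Boolean) = {
    val work = Files.createTempDirectory("work")
    val pending = Files.createDirectories(work.resolve("pending"))
    val part = Files.write(pending.resolve("part-01"), "data".getBytes("UTF-8"))
    val committed = commitByRename(part, work.resolve("complete"))
    (Files.exists(part), Files.exists(committed))
  }
}
```

No data is copied here: the rename only changes directory metadata, which is why filesystems can use it as a cheap commit step.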

Object Store: hash(name)->blob
hash("/work/pending/part-01")
["s02", "s03", "s04"]
copy("/work/pending/part-01",
"/work/complete/part-01")
delete("/work/pending/part-01")
hash("/work/pending/part-00")
["s01", "s02", "s04"]
[Figure: blobs 00 and 01 spread across servers s01, s02, s03, s04]
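The copy-then-delete pattern above can be sketched with a toy in-memory store. This is an illustrative model only, not how any real connector is implemented; `ToyStore`, `renameDir` and the counters are hypothetical names introduced here.

```scala
import scala.collection.mutable

object ObjectStoreRenameSketch {
  /** A toy object store: a flat map from full object name to its bytes. */
  final class ToyStore {
    private val blobs = mutable.Map.empty[String, Array[Byte]]
    var copies = 0   // number of copy operations issued
    var deletes = 0  // number of delete operations issued

    def put(name: String, data: Array[Byte]): Unit = blobs.update(name, data)
    def exists(name: String): Boolean = blobs.contains(name)
    def copy(src: String, dest: String): Unit = {
      blobs.update(dest, blobs(src).clone()) // the data really is duplicated
      copies += 1
    }
    def delete(name: String): Unit = {
      blobs.remove(name)
      deletes += 1
    }
    /** Names of all objects starting with the prefix, sorted. */
    def list(prefix: String): Seq[String] =
      blobs.keys.filter(_.startsWith(prefix)).toSeq.sorted

    /** "Renames" a directory by copying, then deleting, every object under it. */
    def renameDir(srcDir: String, destDir: String): Unit =
      list(srcDir + "/").foreach { name =>
        copy(name, destDir + name.stripPrefix(srcDir))
        delete(name)
      }
  }
}
```

In this model the cost of a directory rename grows with the number of objects and the bytes copied, and a failure part-way through leaves some objects under each name.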

REST APIs
HEAD /work/complete/part-01
PUT /work/complete/part-01
x-amz-copy-source: /work/pending/part-01
DELETE /work/pending/part-01
PUT /work/pending/part-01
... DATA ...
GET /work/pending/part-01
Range: bytes=1-8192
GET /?prefix=/work&delimiter=/
[Figure: blobs 00 and 01 spread across servers s01, s02, s03, s04]
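The last request is how a "directory listing" is emulated over a flat namespace. A minimal sketch of the prefix and delimiter logic follows, in plain Scala with no network calls. `ListingSketch`, `Listing` and the sample keys are hypothetical, and the keys are written without a leading slash for readability.

```scala
object ListingSketch {
  /** Objects directly under the prefix, plus the "subdirectory" prefixes below it. */
  final case class Listing(objects: Seq[String], commonPrefixes: Seq[String])

  def list(keys: Seq[String], prefix: String, delimiter: String): Listing = {
    val matching = keys.filter(_.startsWith(prefix))
    // a key is "nested" if the delimiter occurs again after the prefix
    val (nested, direct) =
      matching.partition(_.drop(prefix.length).contains(delimiter))
    val prefixes = nested.map { key =>
      val rest = key.drop(prefix.length)
      // keep everything up to and including the first delimiter
      prefix + rest.take(rest.indexOf(delimiter) + delimiter.length)
    }.distinct.sorted
    Listing(direct.sorted, prefixes)
  }
}
```

Listing `work/` with delimiter `/` over keys such as `work/pending/part-00` and `work/complete/part-01` yields the two prefixes `work/complete/` and `work/pending/`; the directories exist only as a by-product of the key names.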

Often: Eventually Consistent
DELETE /work/pending/part-00
Responses shown in the figure: 200, 200, 200
[Figure: blobs 00 and 01 spread across servers s01, s02, s03, s04]
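One way to picture this is a toy model in which a delete reaches one replica before the others. This is purely an illustration of the visible symptom, not a description of how S3 or any other store works internally; `ReplicatedStore` and `propagate` are hypothetical names.

```scala
import scala.collection.mutable

object EventualConsistencySketch {
  /** Toy store with several replicas; a delete reaches replica 0 first. */
  final class ReplicatedStore(replicaCount: Int) {
    private val replicas = Array.fill(replicaCount)(mutable.Set.empty[String])
    private var pendingDeletes = List.empty[String]

    def put(name: String): Unit = replicas.foreach(_ += name)

    /** Applies the delete to replica 0 only; the rest catch up on propagate(). */
    def delete(name: String): Unit = {
      replicas(0) -= name
      pendingDeletes = name :: pendingDeletes
    }

    /** HTTP-style status of a HEAD request served by the given replica. */
    def head(name: String, replica: Int): Int =
      if (replicas(replica).contains(name)) 200 else 404

    /** Brings every replica up to date with the deletes issued so far. */
    def propagate(): Unit = {
      pendingDeletes.foreach(n => replicas.foreach(_ -= n))
      pendingDeletes = Nil
    }
  }
}
```

Until `propagate()` runs, a client that happens to reach replica 1 or 2 still sees the deleted object, which is the kind of stale answer that code written for a filesystem does not expect.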

org.apache.hadoop.fs.FileSystem
hdfs s3a wasb adl swift gs

History of Object Storage Support
Timeline axis: 2006, 2008, 2013, 2014, 2015, 2016
s3://: "inode on S3"
s3n://: "Native" S3
s3a://: Replaces s3n
swift://: OpenStack
wasb://: Azure WASB
s3a://: Stabilize
oss://: Aliyun
gs://: Google Cloud
s3a://: Speed and consistency
adl://: Azure Data Lake
s3://: Amazon EMR S3

Cloud Storage Connectors
Azure
  WASB
    ● Strongly consistent
    ● Good performance
    ● Well-tested on applications (incl. HBase)
  ADL
    ● Strongly consistent
    ● Tuned for big data analytics workloads
Amazon Web Services
  S3A
    ● Eventually consistent; consistency work in progress by Hortonworks
    ● Performance improvements in progress
    ● Active development in Apache
  EMRFS
    ● Proprietary connector used in EMR
    ● Optional strong consistency for a cost
Google Cloud Platform
  GCS
    ● Multiple configurable consistency policies
    ● Currently Google open source
    ● Good performance
    ● Could improve test coverage

Four Challenges
1. Classpath
2. Credentials
3. Code
4. Commitment
Let's look at S3 and Azure

Use S3A to work with S3
(EMR: use Amazon's s3://)

Classpath: fix “No FileSystem for scheme: s3a”
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
See SPARK-7481
Get Spark with
Hadoop 2.7+ JARs

Credentials
core-site.xml or spark-defaults.conf
spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY
spark-submit automatically propagates environment variables
export AWS_ACCESS_KEY_ID=MY_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=MY_SECRET_KEY
NEVER: share, check in to SCM, paste in bug reports…
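One way to keep secrets out of source files is to build the S3A properties from the environment at launch time. The sketch below is plain Scala; `S3ACredentialsSketch`, `fromEnv` and `redact` are hypothetical helpers, and the environment variable names are an assumption, so substitute whatever your deployment exports.

```scala
object S3ACredentialsSketch {
  /** Builds Spark properties for S3A from an environment map; fails fast if a key is missing. */
  def fromEnv(env: Map[String, String]): Map[String, String] = {
    def required(name: String): String =
      env.getOrElse(name, throw new IllegalStateException(s"$name is not set"))
    Map(
      "spark.hadoop.fs.s3a.access.key" -> required("AWS_ACCESS_KEY_ID"),
      "spark.hadoop.fs.s3a.secret.key" -> required("AWS_SECRET_ACCESS_KEY"))
  }

  /** Masks the secret so the properties can be printed or logged safely. */
  def redact(props: Map[String, String]): Map[String, String] =
    props.map { case (k, v) =>
      if (k.endsWith("secret.key")) k -> "****" else k -> v
    }
}
```

In an application, the result of `S3ACredentialsSketch.fromEnv(sys.env)` could be applied to a `SparkConf` one entry at a time with `sparkConf.set(key, value)`, and only the output of `redact` should ever be printed.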

Authentication Failure: 403
com.amazonaws.services.s3.model.AmazonS3Exception:
The request signature we calculated does not match
the signature you provided.
Check your key and signing method.
1. Check joda-time.jar & JVM version
2. Credentials wrong
3. Credentials not propagating
4. Local system clock (more likely on VMs)

Code: Basic IO
// Read in public dataset
val lines = sc.textFile("s3a://landsat-pds/scene_list.gz")
val lineCount = lines.count()
// generate and write data
val numbers = sc.parallelize(1 to 10000)
numbers.saveAsTextFile("s3a://hwdev-stevel-demo/counts")
All you need is the URL

Code: just use the URL of the object store
val csvData = spark.read.options(Map(
"header" -> "true",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
...read time O(distance)

DataFrames
val landsat = "s3a://stevel-demo/landsat"
csvData.write.parquet(landsat)
val landsatOrc = "s3a://stevel-demo/landsatOrc"
csvData.write.orc(landsatOrc)
val df = spark.read.parquet(landsat)
val orcDf = spark.read.orc(landsatOrc)

Finding dirty data with Spark SQL
val sqlDF = spark.sql(
"SELECT id, acquisitionDate, cloudCover"
+ s" FROM parquet.`${landsat}`")
val negativeClouds = sqlDF.filter("cloudCover < 0")
negativeClouds.show()
* filter columns and data early
* whether/when to cache()?
* copy popular data to HDFS

spark-defaults.conf
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true

Notebooks? Classpath & Credentials

The Commitment Problem
⬢ rename() used for atomic commitment transaction
⬢ time to copy() + delete() proportional to data * files
⬢ S3: 6+ MB/s
⬢ Azure: a lot faster —usually
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
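To get a feel for the cost, here is a back-of-the-envelope sketch that treats a rename as a pure copy at a fixed rate. It assumes 1 MB = 1024 * 1024 bytes and ignores per-file request overhead, so it is a lower bound at best; `CommitCostSketch` and `copySeconds` are hypothetical names.

```scala
object CommitCostSketch {
  /** Seconds needed to copy `bytes` of output at `mbPerSecond`. */
  def copySeconds(bytes: Long, mbPerSecond: Double): Double =
    bytes.toDouble / (mbPerSecond * 1024 * 1024)
}
```

At the 6 MB/s quoted above, `CommitCostSketch.copySeconds(1024L * 1024 * 1024, 6.0)` comes to roughly 170 seconds for a single gigabyte of output, time during which the job appears finished but the commit is still running.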

What about Direct Output Committers?

Recent S3A Performance (Hadoop 2.8, HDP 2.5, CDH 5.9 (?))
// forward seek by skipping stream
spark.hadoop.fs.s3a.readahead.range 157810688
// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random
// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true

Azure Storage: wasb://
A full substitute for HDFS

Classpath: fix “No FileSystem for scheme: wasb”
wasb:// : Consistent, with very fast rename (hence: commits)
hadoop-azure-2.7.x.jar
azure-storage-2.2.0.jar
+ (jackson-core; http-components, hadoop-common)

Credentials: core-site.xml / spark-defaults.conf
<property>
<name>fs.azure.account.key.example.blob.core.windows.net</name>
<value>0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c</value>
</property>
spark.hadoop.fs.azure.account.key.example.blob.core.windows.net 0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c
wasb://demo@example.blob.core.windows.net
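The property name and the URL both embed the storage account name, so it can help to derive them from one place. A small sketch in plain Scala follows; `WasbSketch` and its methods are hypothetical helpers, and the `blob.core.windows.net` suffix is taken from the example above.

```scala
object WasbSketch {
  private val Suffix = "blob.core.windows.net"

  /** Hadoop property that holds the key of the given storage account. */
  def accountKeyProperty(account: String): String =
    s"fs.azure.account.key.$account.$Suffix"

  /** wasb:// URL for a path inside a container of a storage account. */
  def url(container: String, account: String, path: String = ""): String = {
    val base = s"wasb://$container@$account.$Suffix"
    if (path.isEmpty) base else base + "/" + path.stripPrefix("/")
  }
}
```

With container `demo` and account `example`, these produce the property name and URL shown on this slide.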

Example: Azure Storage and Streaming
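import org.apache.spark.streaming.{Seconds, StreamingContext}

// sparkConf is an existing SparkConf; the batch interval is 10 seconds
val streaming = new StreamingContext(sparkConf, Seconds(10))
val azure = "wasb://demo@example.blob.core.windows.net/in"
// pick up new files as they appear in the directory
val lines = streaming.textFileStream(azure)
val matches = lines.map(line => {
  println(line)
  line
})
matches.print()
streaming.start()
// block until the streaming context is stopped
streaming.awaitTermination()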
* PUT into the streaming directory
* keep the dir clean
* size window for slow scans

Not Covered
⬢ Partitioning/directory layout
⬢ Infrastructure Throttling
⬢ Optimal path names
⬢ Error handling
⬢ Metrics

Summary
⬢ Object Stores look just like any other URL
⬢ …but do need classpath and configuration
⬢ Issues: performance, commitment
⬢ Use Hadoop 2.7+ JARs
⬢ Tune to reduce I/O
⬢ Keep those credentials secret!

Backup Slides

Dependencies in Hadoop 2.8
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
azure-storage-4.2.0.jar

S3 Server-Side Encryption
⬢ Encryption of data at rest at S3
⬢ Supports the SSE-S3 option: each object encrypted by a unique key using AES-256 cipher
⬢ Now covered in S3A automated test suites
⬢ Support for additional options under development (SSE-KMS and SSE-C)

Advanced authentication
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
com.amazonaws.auth.InstanceProfileCredentialsProvider,
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
</value>
</property>
+ encrypted credentials in JCEKS files on HDFS

What Next? Performance and integration

Next Steps for all Object Stores
⬢ Output Committers
– Logical commit operation decoupled from rename (non-atomic and costly in object stores)
⬢ Object Store Abstraction Layer
– Avoid impedance mismatch with FileSystem API
– Provide specific APIs for better integration with object stores: saving, listing, copying
⬢ Ongoing Performance Improvement
⬢ Consistency
