On-premise Spark as a Service with YARN

On-Premise Spark-as-a-Service on YARN
Jim Dowling
Associate Prof @ KTH, Stockholm
Senior Researcher, SICS Swedish ICT
CEO, Logical Clocks AB
Twitter: @jim_dowling

Spark-as-a-Service in Sweden
• SICS ICE: datacenter research and test environment
• Hopsworks: Spark/Kafka/Flink/Hadoop-as-a-service
– Built on Hops Hadoop (www.hops.io)
– Over 100 active users
– Spark the platform of choice

HopsFS Architecture
[Figure labels: HDFS Client, NameNodes, Leader, NDB, DataNodes]

Hops-YARN Architecture
[Figure labels: YARN Client, ResourceMgrs, Scheduler, Resource Trackers, NDB, NodeManagers; Heartbeats (70-95%), AM Reqs (5-30%)]

Pluggable DB: Data Abstraction Layer
• NameNode (Apache v2)
• DAL API (Apache v2)
• NDB-DAL-Impl (GPL v2)
• Other DB (Other License)
• Jars: hops-2.7.3.jar, dal-ndb-2.7.3-7.5.4.jar
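
The layering on this slide can be illustrated with a minimal Java sketch. It does not reproduce the real DAL API: the `MetadataStore` interface, both implementations, and the `DalDemo.load` factory are hypothetical names chosen for illustration. The only point is that the file-system code programs against an interface, while a database-specific implementation (which could ship in its own jar under its own license) is selected at runtime.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a DAL API: callers see only this interface.
interface MetadataStore {
    void put(String path, long inodeId);
    Long get(String path);   // null when the path is unknown
    String backend();
}

// Hypothetical backend; a real one would talk to a database such as NDB.
class NdbLikeStore implements MetadataStore {
    private final Map<String, Long> rows = new HashMap<>();
    public void put(String path, long inodeId) { rows.put(path, inodeId); }
    public Long get(String path) { return rows.get(path); }
    public String backend() { return "ndb-like"; }
}

// A second hypothetical backend, to show that callers do not change.
class OtherDbStore implements MetadataStore {
    private final Map<String, Long> rows = new HashMap<>();
    public void put(String path, long inodeId) { rows.put(path, inodeId); }
    public Long get(String path) { return rows.get(path); }
    public String backend() { return "other-db"; }
}

class DalDemo {
    // Registry of available backends, keyed by a configuration name.
    private static final Map<String, Supplier<MetadataStore>> BACKENDS = new HashMap<>();
    static {
        BACKENDS.put("ndb-like", NdbLikeStore::new);
        BACKENDS.put("other-db", OtherDbStore::new);
    }

    // Picks an implementation by name; callers never reference it directly.
    static MetadataStore load(String name) {
        Supplier<MetadataStore> factory = BACKENDS.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("unknown backend: " + name);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        MetadataStore store = load("ndb-like");
        store.put("/projects/demo", 42L);
        System.out.println(store.backend() + " -> " + store.get("/projects/demo"));
    }
}
```

Swapping the string passed to `load` changes the backend without touching the calling code, which is the property the DAL layering is meant to provide.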

HopsFS Throughput vs Apache HDFS
NDB setup: nodes with Xeon E5-2620 2.40GHz processors and 10GbE.
NameNodes: machines with Xeon E5-2620 2.40GHz processors and 10GbE.

Project-Based Multi-Tenancy
• A project is a collection of
– Users with Roles
– HDFS DataSets
– Kafka Topics
– Notebooks, Jobs
• Per-Project quotas
– Storage in HDFS
– CPU in YARN
• Uber-style Pricing
• Sharing across Projects
– Datasets/Topics
[Figure: a project containing dataset 1 … dataset N (HDFS) and Topic 1 … Topic N (Kafka)]

Dynamic Roles for Hadoop/Kafka
[Figure labels: Alice@gmail.com; Authenticate; NSA__Alice; Users__Alice; Projects; Secure Impersonation; HopsFS; HopsYARN; Kafka; X.509 Certificates]

Look Ma, No Kerberos!
• For each project, a user is issued an X.509 certificate containing the project-specific userID.
• Inspired by Netflix’s BLESS system.
• Services are also issued X.509 certificates.
– Both user and service certs are signed by the same CA.
– Services extract the userID from RPCs to identify the caller.
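
A minimal Java sketch of how a service might recover the project-specific userID from a caller's certificate. The slides do not say which certificate field carries the userID, so this assumes it is the subject's CN, and that it follows the `Project__User` pattern seen in names such as NSA__Alice. The `ProjectUserId` class and its methods are hypothetical, not part of Hops.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical helper; assumes the userID is the CN of the certificate subject.
class ProjectUserId {
    final String project;
    final String user;

    ProjectUserId(String project, String user) {
        this.project = project;
        this.user = user;
    }

    // Splits "Project__User" at the first double underscore.
    static ProjectUserId parse(String id) {
        int sep = id.indexOf("__");
        if (sep <= 0 || sep + 2 >= id.length()) {
            throw new IllegalArgumentException("not a project-specific userID: " + id);
        }
        return new ProjectUserId(id.substring(0, sep), id.substring(sep + 2));
    }

    // Finds the CN in a subject distinguished name and parses it.
    static ProjectUserId fromSubjectDn(String subjectDn) {
        try {
            for (Rdn rdn : new LdapName(subjectDn).getRdns()) {
                if ("CN".equalsIgnoreCase(rdn.getType())) {
                    return parse(rdn.getValue().toString());
                }
            }
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("malformed subject DN: " + subjectDn, e);
        }
        throw new IllegalArgumentException("no CN in subject DN: " + subjectDn);
    }

    public static void main(String[] args) {
        // The organisation and country values are placeholders.
        ProjectUserId id = fromSubjectDn("CN=NSA__Alice,O=Example,C=SE");
        System.out.println(id.project + " / " + id.user);
    }
}
```

In a running service the subject string would typically come from the peer certificate, for example via `X509Certificate.getSubjectX500Principal().getName()`.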

[Figure: issuing project-user certificates. Labels: Alice@gmail.com; Add/Del Users; Project Mgr; Insert/Remove Certs; Distributed Database; Cert Signing Requests; Root CA; Project-User Certificates; Services (Hadoop, Spark, Kafka, etc.)]

Spark Streaming on YARN with Hopsworks
1. Launch Spark Job
2. Get certs, service endpoints
3. YARN Job, config
4. Materialize certs
5. Read Certs
6. Get Schema
7. Consume/Produce
8. Authenticate
[Figure components: Alice@gmail.com, Hopsworks, Distributed Database, YARN Private LocalResources, Spark Streaming App, KafkaUtil]

Spark Stream Producer in Secure Kafka
SparkConf sparkConf = …
JavaSparkContext jsc = …
1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: producer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely
// write to Kafka
[Slide annotations: Developer, Operations]
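
Step 2 above (a Kafka properties file with certs and broker details) can be sketched with plain `java.util.Properties`. The keys are standard Kafka client SSL settings; the broker address, file paths, and passwords are placeholders, and the `SecureKafkaProps` class is hypothetical. The slides do not show how Hopsworks itself assembles these values.

```java
import java.util.Properties;

// Hypothetical helper that assembles client properties for a TLS-secured Kafka cluster.
class SecureKafkaProps {

    static Properties build(String brokers,
                            String keystorePath, String keystorePassword,
                            String truststorePath, String truststorePassword) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", brokers);
        // Encrypt traffic and authenticate with the client certificate.
        p.setProperty("security.protocol", "SSL");
        // Keystore: the client's own certificate and private key.
        p.setProperty("ssl.keystore.location", keystorePath);
        p.setProperty("ssl.keystore.password", keystorePassword);
        // Simplifying assumption: the key uses the same password as the keystore.
        p.setProperty("ssl.key.password", keystorePassword);
        // Truststore: the CA certificate used to verify the brokers.
        p.setProperty("ssl.truststore.location", truststorePath);
        p.setProperty("ssl.truststore.password", truststorePassword);
        return p;
    }

    public static void main(String[] args) {
        // All values below are placeholders.
        Properties p = build("broker1.example.com:9092",
                "/tmp/keystore.jks", "changeit",
                "/tmp/truststore.jks", "changeit");
        // Print only non-secret settings.
        System.out.println(p.getProperty("bootstrap.servers")
                + " via " + p.getProperty("security.protocol"));
    }
}
```

The resulting `Properties` object is what a Kafka producer or consumer constructor would receive, after serializer settings are added.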

Spark Streaming Producer in Hopsworks
List<String> topics = KafkaUtil.getTopics();
…
SparkProducer sparkProducer =
KafkaUtil.getSparkProducer(topic);
…
Map<String, String> message = …
sparkProducer.produce(message);
…
sparkProducer.close();
https://github.com/hopshadoop/hops-kafka-examples

Spark Streaming Consumer in Hopsworks
JavaStreamingContext jssc = …
List<String> topics = KafkaUtil.getTopics();
…
SparkConsumer consumer = KafkaUtil.getSparkConsumer(jssc, topics);
…
// Avro schema downloaded by framework here
GenericRecord genericRecord = KafkaUtil.getRecordInjections()
    .get(topic);
…
jssc.start();
jssc.awaitTermination();
https://github.com/hopshadoop/hops-kafka-examples

Zeppelin Support for Spark/Livy

Livy to launch Spark 2.0 Jobs
[Image from: http://gethue.com]

Debugging Spark with DrElephant
• Project-specific view of performance/correctness issues for completed Spark Jobs
• Customizable heuristics
• Doesn’t show killed jobs

Karamel/Chef for Automated Installation
[Figure labels: Google Compute Engine, BareMetal]

Summary
• Hopsworks provides first-class support for Spark-as-a-Service
– Streaming or Batch Jobs
– Zeppelin Notebooks
• Hopsworks simplifies writing secure Spark Streaming applications with Kafka
http://github.com/hopshadoop
Hops
[Hadoop For Humans]

Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Konstantin Popov, Antonios Kouzoupis, Ermias Gebremeskel.
Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders,
Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente,
Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias,
Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh,
Mariano Valles, Ying Lieu.
