On Premise Spark-as-a-Service
on YARN
Jim Dowling
Associate Prof @ KTH, Stockholm
Senior Researcher, SICS Swedish ICT
CEO, Logical Clocks AB
Twitter: @jim_dowling
Spark-as-a-Service in Sweden
• SICS ICE: datacenter research and test environment
• Hopsworks: Spark/Kafka/Flink/Hadoop-as-a-service
– Built on Hops Hadoop (www.hops.io)
– Over 100 active users
– Spark the platform of choice
HopsFS Architecture
NameNodes
NDB
Leader
HDFS Client
DataNodes
Hops-YARN Architecture
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers
Heartbeats
(70-95%)
AM Reqs
(5-30%)
Pluggable DB: Data Abstraction Layer
NameNode
(Apache v2)
DAL API
(Apache v2)
NDB-DAL-Impl
(GPL v2)
Other DB
(Other License)
hops-2.7.3.jar dal-ndb-2.7.3-7.5.4.jar
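The layering on this slide, where the NameNode codes against a DAL API and a database-specific implementation ships in its own jar, might look roughly like the sketch below. It is only an illustration under stated assumptions: `InodeDataAccess`, `InMemoryInodeDataAccess` and `MetadataService` are hypothetical names, not the Hops DAL API, and an in-memory map stands in for NDB.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a pluggable data abstraction layer.
// None of these names come from the Hops code base.
class DalSketch {

    /** Storage-agnostic interface (the "DAL API" layer on the slide). */
    interface InodeDataAccess {
        void put(long inodeId, String path);
        Optional<String> findPath(long inodeId);
    }

    /** Stand-in for a database-specific implementation such as the NDB one. */
    static class InMemoryInodeDataAccess implements InodeDataAccess {
        private final Map<Long, String> rows = new HashMap<>();

        @Override
        public void put(long inodeId, String path) {
            rows.put(inodeId, path);
        }

        @Override
        public Optional<String> findPath(long inodeId) {
            return Optional.ofNullable(rows.get(inodeId));
        }
    }

    /** The metadata layer depends only on the interface, so the backend can be swapped. */
    static class MetadataService {
        private final InodeDataAccess dal;

        MetadataService(InodeDataAccess dal) {
            this.dal = dal;
        }

        void createFile(long inodeId, String path) {
            dal.put(inodeId, path);
        }

        String lookup(long inodeId) {
            return dal.findPath(inodeId).orElse("<missing>");
        }
    }

    public static void main(String[] args) {
        MetadataService service = new MetadataService(new InMemoryInodeDataAccess());
        service.createFile(1L, "/projects/demo/data.csv");
        System.out.println(service.lookup(1L));
        System.out.println(service.lookup(2L));
    }
}
```

Because `MetadataService` never names a concrete backend, an implementation under a different license can live in a separate jar, which is the split the slide shows between the Apache v2 layers and the GPL v2 NDB implementation.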
HopsFS Throughput vs Apache HDFS
NDB setup: nodes with Xeon E5-2620 2.40 GHz processors and 10GbE.
NameNodes: machines with Xeon E5-2620 2.40 GHz processors and 10GbE.
HOPSWORKS
Project-Based Multi-Tenancy
• A project is a collection of
– Users with Roles
– HDFS DataSets
– Kafka Topics
– Notebooks, Jobs
• Per-Project quotas
– Storage in HDFS
– CPU in YARN
• Uber-style Pricing
• Sharing across Projects
– Datasets/Topics
project
dataset 1
dataset N
Topic 1
Topic N
Kafka
HDFS
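A project as described on this slide (members with roles, datasets, a storage quota, sharing across projects) could be modeled along these lines. This is a minimal sketch under assumptions: the class, the `Role` values and the `project::dataset` naming are hypothetical and are not the Hopsworks schema. The project names `NSA` and `Users` are borrowed from the next slide's diagram.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model for illustration; not the Hopsworks schema or API.
class ProjectSketch {
    enum Role { OWNER, MEMBER }

    private final String name;
    private final long storageQuotaBytes;
    private long storageUsedBytes = 0;
    private final Map<String, Role> members = new HashMap<>();
    private final Set<String> ownDatasets = new TreeSet<>();
    private final Set<String> sharedDatasets = new TreeSet<>();

    ProjectSketch(String name, long storageQuotaBytes) {
        this.name = name;
        this.storageQuotaBytes = storageQuotaBytes;
    }

    void addMember(String email, Role role) {
        members.put(email, role);
    }

    /** Adds a dataset only if it fits within the per-project storage quota. */
    boolean addDataset(String dataset, long sizeBytes) {
        if (sizeBytes < 0 || storageUsedBytes + sizeBytes > storageQuotaBytes) {
            return false;
        }
        storageUsedBytes += sizeBytes;
        ownDatasets.add(dataset);
        return true;
    }

    /** Makes one of this project's datasets visible to another project. */
    boolean shareDataset(String dataset, ProjectSketch other) {
        if (!ownDatasets.contains(dataset)) {
            return false;
        }
        other.sharedDatasets.add(name + "::" + dataset);
        return true;
    }

    /** Own datasets plus datasets shared in from other projects, sorted. */
    Set<String> visibleDatasets() {
        Set<String> all = new TreeSet<>(ownDatasets);
        all.addAll(sharedDatasets);
        return all;
    }

    public static void main(String[] args) {
        ProjectSketch nsa = new ProjectSketch("NSA", 100);
        ProjectSketch users = new ProjectSketch("Users", 100);
        nsa.addMember("Alice@gmail.com", Role.OWNER);
        System.out.println(nsa.addDataset("logs", 80));   // fits within the quota
        System.out.println(nsa.addDataset("images", 50)); // rejected: would exceed the quota
        nsa.shareDataset("logs", users);
        System.out.println(users.visibleDatasets());
    }
}
```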
Alice@gmail.com
NSA__Alice
Authenticate
Users__Alice
HopsFS
HopsYARN
Projects
Secure
Impersonation
Kafka
X.509
Certificates
Dynamic Roles for Hadoop/Kafka
Look Ma, No Kerberos!
• For each project, a user is issued an X.509 certificate containing the project-specific userID.
• Inspired by Netflix's BLESS system.
• Services are also issued with X.509 certificates.
– Both user and service certs are signed with the same CA.
– Services extract the userID from RPCs to identify the caller.
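How a service might recover the caller's identity from a certificate can be sketched as follows. Two assumptions are made here that the slides do not state: that the project-specific userID is carried in the subject's CN, and that it has the `project__user` form seen in the diagram labels (for example `NSA__Alice`). The class and method names are hypothetical. In a real service the subject string would come from `X509Certificate.getSubjectX500Principal().getName()`.

```java
import java.util.List;
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical sketch: assumes the project-specific userID is the subject CN.
class CertIdentitySketch {

    /** Returns the CN value from an X.500 subject distinguished name. */
    static String commonName(String subjectDn) {
        List<Rdn> rdns;
        try {
            rdns = new LdapName(subjectDn).getRdns();
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("Malformed subject: " + subjectDn, e);
        }
        for (Rdn rdn : rdns) {
            if ("CN".equalsIgnoreCase(rdn.getType())) {
                return rdn.getValue().toString();
            }
        }
        throw new IllegalArgumentException("No CN in subject: " + subjectDn);
    }

    /** Splits an assumed "project__user" identifier into its two parts. */
    static String[] splitProjectUser(String projectUser) {
        int separator = projectUser.indexOf("__");
        if (separator <= 0 || separator + 2 >= projectUser.length()) {
            throw new IllegalArgumentException("Not a project__user id: " + projectUser);
        }
        return new String[] {
            projectUser.substring(0, separator),
            projectUser.substring(separator + 2)
        };
    }

    public static void main(String[] args) {
        String cn = commonName("CN=NSA__Alice,O=ExampleOrg,C=SE");
        String[] parts = splitProjectUser(cn);
        System.out.println("project=" + parts[0] + " user=" + parts[1]);
    }
}
```

Since user and service certificates are signed by the same CA, a service that has validated the peer's certificate chain can treat the extracted identifier as authenticated without a separate Kerberos ticket.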
Alice@gmail.com
Add/Del
Users
Distributed
Database
Insert/Remove Certs
Project Mgr
Root
CA
Services
Hadoop
Spark
Kafka
etc
Cert Signing
Requests
Project-User Certificates
Alice@gmail.com
1. Launch Spark Job
Distributed
Database
2. Get certs,
service endpoints
YARN Private
LocalResources
Spark Streaming App
4. Materialize certs
3. YARN Job, config
6. Get Schema
7. Consume
Produce
5. Read Certs
Hopsworks
KafkaUtil
Spark Streaming on YARN with Hopsworks
8. Authenticate
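Steps 4 and 5 in this flow (certificates materialized as YARN private LocalResources, then read by the application) could look like the sketch below from the application's side. It assumes the certificates arrive as JKS keystore files in the container's working directory, which the slides do not specify. The class and method names are hypothetical, and `writeEmptyKeyStore` exists only to make the example self-contained.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.KeyStore;

// Hypothetical sketch: assumes certs are localized as JKS keystore files.
class CertMaterializationSketch {

    /** Loads a keystore file localized into the container and returns its entry count. */
    static int entryCount(Path keyStoreFile, char[] password) {
        try (InputStream in = Files.newInputStream(keyStoreFile)) {
            KeyStore keyStore = KeyStore.getInstance("JKS");
            keyStore.load(in, password);
            return keyStore.size();
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("Cannot read keystore " + keyStoreFile, e);
        }
    }

    /** Demo helper: writes an empty keystore to a temp file, standing in for a localized cert. */
    static Path writeEmptyKeyStore(char[] password) {
        try {
            Path file = Files.createTempFile("k_certificate", ".jks");
            KeyStore keyStore = KeyStore.getInstance("JKS");
            keyStore.load(null, password); // null stream initializes an empty keystore
            try (OutputStream out = Files.newOutputStream(file)) {
                keyStore.store(out, password);
            }
            return file;
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("Cannot write keystore", e);
        }
    }

    public static void main(String[] args) throws IOException {
        char[] password = "changeit".toCharArray();
        Path file = writeEmptyKeyStore(password);
        System.out.println("entries: " + entryCount(file, password));
        Files.deleteIfExists(file); // remove the material once it is no longer needed
    }
}
```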
Spark Stream Producer in Secure Kafka
SparkConf sparkConf = …
JavaSparkContext jsc = …
1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: producer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Clean up securely
// write to Kafka
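Step 2 above, done by hand, amounts to assembling the standard Kafka client SSL settings. The sketch below builds such a `Properties` object using only the JDK. The property keys are regular Kafka client configuration names, while the class name, broker address, file paths and password are placeholders chosen for illustration.

```java
import java.util.Properties;

// Hypothetical helper; host, paths and password below are placeholders.
class SecureKafkaPropsSketch {

    /** Builds producer properties for a TLS-secured broker with client authentication. */
    static Properties sslProducerProps(String brokers, String keyStorePath,
                                       String trustStorePath, String password) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers);
        props.setProperty("security.protocol", "SSL");
        // Client certificate and key, presented to the broker.
        props.setProperty("ssl.keystore.location", keyStorePath);
        props.setProperty("ssl.keystore.password", password);
        props.setProperty("ssl.key.password", password);
        // CA certificates used to verify the broker.
        props.setProperty("ssl.truststore.location", trustStorePath);
        props.setProperty("ssl.truststore.password", password);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = sslProducerProps(
                "broker.example.com:9093", "k_certificate.jks", "t_certificate.jks", "changeit");
        props.stringPropertyNames().stream().sorted()
                .forEach(key -> System.out.println(key + "=" + props.getProperty(key)));
    }
}
```

With the Kafka client library on the classpath, passing these properties to `new KafkaProducer<>(props)` would cover step 3. Distributing the keystore files to every host and cleaning them up afterwards (steps 5 and 6) remains the operational burden this slide is pointing at.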
Developer
Operations
Spark Streaming Producer in Hopsworks
List<String> topics = KafkaUtil.getTopics();
…
SparkProducer sparkProducer = KafkaUtil.getSparkProducer(topic);
…
Map<String, String> message = …
sparkProducer.produce(message);
…
sparkProducer.close();
https://github.com/hopshadoop/hops-kafka-examples
Spark Streaming Consumer in Hopsworks
JavaStreamingContext jssc = …
List<String> topics = KafkaUtil.getTopics();
…
SparkConsumer consumer = KafkaUtil.getSparkConsumer(jssc, topics);
…
// Avro schema downloaded by framework here
GenericRecord genericRecord = KafkaUtil.getRecordInjections().get(topic);
…
jssc.start();
jssc.awaitTermination();
https://github.com/hopshadoop/hops-kafka-examples
Zeppelin Support for Spark/Livy
Livy to launch Spark 2.0 Jobs
[Image from: http://gethue.com]
Debugging Spark with Dr. Elephant
• Project-specific view of performance/correctness issues for completed Spark jobs
• Customizable heuristics
• Doesn’t show killed jobs
Karamel/Chef for Automated Installation
Google Compute Engine BareMetal
Demo
Summary
• Hopsworks provides first-class support for Spark-as-a-Service
– Streaming or Batch Jobs
– Zeppelin Notebooks
• Hopsworks simplifies writing secure Spark Streaming applications with Kafka
http://github.com/hopshadoop
Hops
[Hadoop For Humans]
Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Konstantin Popov, Antonios Kouzoupis, Ermias Gebremeskel.
Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders,
Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente,
Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias,
Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh,
Mariano Valles, Ying Lieu.
THANK YOU.
www.hops.io

Editor's Notes

  • #8: Nobody gets a cluster… everybody gets a project!
  • #9: Privileges: upload/download data, run analysis jobs. Similar to an RBAC solution. All access is via Hopsworks.
  • #14:
    public class HopsKafkaUtil implements Serializable {
      KAFKA_BROKERADDR_ENV_VAR = "kafka.brokeraddress";
      KAFKA_RESTENDPOINT = "kafka.restendpoint";
      KAFKA_SESSIONID_ENV_VAR = "kafka.sessionid";
      KAFKA_PROJECTID_ENV_VAR = "kafka.projectid";
      KAFKA_K_CERTIFICATE_ENV_VAR = "kafka_k_certificate";
      KAFKA_T_CERTIFICATE_ENV_VAR = "kafka_t_certificate";
      String getHopsConsumer(String topic) {…}
      String getHopsProducer(String topic) {…}
      String getHopsSparkKafkaConsumer(String topic) {…}
      String getHopsSparkKafkaProducer(String topic) {…}
      String getSchema(String topicName, int versionId) {…}
      Map<String, String> getKafkaProps(String propsStr) {…}
    }
  • #15: HopsKafkaProperties.defaultProps()
  • #16: HopsKafkaProperties.defaultProps()
  • #17: https://gist.github.com/rawkintrevo/ad206879753733f5a536
  • #19:
    – Netty dependency conflict with our app in blocking mode
    – Impacts: application size, main class run on our multi-tenant application (System.exit()), logs are written locally
    – No accumulator results or exceptions from the ExecutionEnvironment.execute() call
    – Can only kill the YARN job, not the Spark session (cleanup issues)
    Spark Dispatcher
    – The client directly starts the job in YARN, rather than bootstrapping a cluster and then submitting the job to that cluster. The client can hence disconnect immediately after the job is submitted.
    – All user code libraries and config files are directly in the application classpath, rather than in the dynamic user code class loader
    – Containers are requested as needed and are released when no longer used
    – The “as needed” allocation of containers allows different container profiles (CPU / memory) to be used for different operators