Standalone Spark Deployment
For Stability and Performance
Totango
❖ Leading Customer Success Platform
❖ Helps companies retain and grow their customer base
❖ Advanced actionable analytics for subscription and recurring
revenue
❖ Founded in 2010
❖ Infrastructure on AWS cloud
❖ Spark for batch processing
❖ ElasticSearch for serving layer
About Me
Romi Kuntsman
Senior Big Data Engineer @ Totango
Working with Apache Spark since v1.0
Working with AWS Cloud since 2008
Spark on AWS - first attempts
❖ We tried Amazon EMR (Elastic MapReduce) to install Spark on YARN
➢ Performance hit per application (starts a Spark instance for each)
➢ Performance hit per server (running services we don't use, like HDFS)
➢ Slow and unstable cluster resizing (often stuck, needing a recreate)
❖ We tried the spark-ec2 script to install Spark Standalone on AWS EC2 machines
➢ Serial (not parallel) initialization of multiple servers - slow!
➢ Scripts unmaintained since Spark became available on EMR (see above)
➢ Doesn't integrate with our existing systems
Spark on AWS - road to success
❖ We decided to write our own scripts to integrate and control everything
❖ Understood all Spark components and configuration settings
❖ Deployment based on Chef, like we do on all servers
❖ Integrated monitoring and logging, like we have in all our systems
❖ Full server utilization - running exactly what we need and nothing more
❖ Cluster hanging or crashing no longer happens
❖ Seamless cluster resize without hurting any existing jobs
❖ Able to upgrade to any version of Spark (not dependent on a third party)
What we'll discuss
❖Separation of Spark Components
❖Centralized Managed Logging
❖Monitoring Cluster Utilization
❖Auto Scaling Groups
❖Termination Protection
❖Upstart Mechanism
❖NewRelic Integration
❖Chef-based Instantiation
Data w/ Romi
Ops w/ Alon
Separation of Components
❖ Spark Master Server (single)
➢ Master Process - accepts requests to start applications
➢ History Process - serves history data of completed applications
❖ Spark Slave Server (multiple)
➢ Worker Process - handles workload of applications on the server
➢ External Shuffle Service - handles data exchange between workers
➢ Executor Process (one per core, for running apps) - runs the actual code
Configuration - Deploy Spread Out
❖spark.deploy.spreadOut (SPARK_MASTER_OPTS)
➢true = use cores spread across all workers
➢false = fill up all worker cores before getting more
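As a minimal sketch, this setting can be passed to the master through SPARK_MASTER_OPTS in spark-env.sh (the value shown is illustrative; true is Spark's default):

```shell
# spark-env.sh on the master (illustrative value; true is the default)
export SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=true"
```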
Configuration - Cleanup
❖spark.worker.cleanup.* (SPARK_WORKER_OPTS)
➢.enabled = true (turn on mechanism to clean up app folders)
➢.interval = 1800 (run every 1800 seconds, or 30 minutes)
➢.appDataTtl = 1800 (remove finished applications after 30 minutes)
❖ We have 100s of applications per day, each with its jars and logs
❖ Rapid cleanup is essential to avoid filling up disk space
❖ We collect the logs before cleanup - details in the following slides ;-)
❖ Only cleans up files of completed applications
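A sketch of how these cleanup values can be passed to each worker via SPARK_WORKER_OPTS in spark-env.sh, using the numbers from the slide above:

```shell
# spark-env.sh on each worker, with the cleanup values from this slide
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=1800"
```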
External Shuffle Service
❖ Preserves shuffle files written by executors
❖ Serves shuffle files to other executors that want to fetch them
❖ If (when) one executor crashes (OOM etc.), others can still access its shuffle files
❖ We run the shuffle service itself in a separate process from the executor
❖ To enable: spark.shuffle.service.enabled=true
❖ Config: spark.shuffle.io.* (see documentation)
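A minimal sketch of the corresponding spark-defaults.conf entries (written to the current directory here for illustration; the real file lives under $SPARK_HOME/conf, and the maxRetries value is just an example of a spark.shuffle.io.* tunable):

```shell
# Sketch: enable the external shuffle service in spark-defaults.conf
# (path and maxRetries value are illustrative)
cat > ./spark-defaults.conf <<'EOF'
spark.shuffle.service.enabled  true
spark.shuffle.io.maxRetries    5
EOF
```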
Logging - components
❖ Master Log (/logs/spark-runner-org.apache.spark.deploy.master.Master-*)
➢ Application registration, worker coordination
❖ History Log (/logs/spark-runner-org.apache.spark.deploy.history.HistoryServer-*)
➢ Access to history, read errors (e.g. I/O from S3, not found)
❖ Worker Log (/logs/spark-runner-org.apache.spark.deploy.worker.Worker-*)
➢ Executor management (launch, kill, ACLs)
❖ Shuffle Log (/logs/org.apache.spark.deploy.ExternalShuffleService-*)
➢ External executor registrations
Logging - applications
❖Application Logs (/mnt/spark-work/app-12345/execid/stderr)
➢ All output from executor process, including your own code
❖Using LogStash to gather logs from all applications together
input {
file {
path => "/mnt/spark-work/app-*/*/std*"
start_position => "beginning"
}
}
filter {
grok {
match => [ "path", "/mnt/spark-work/%{NOTSPACE:application}/.+/%{NOTSPACE:logtype}" ]
}
}
output {
file {
path => "/logs/applications.log"
message_format => "%{application} %{logtype} %{message}"
}
}
Monitoring Cluster Utilization
❖ Spark Reports Metrics (Codahale) through Graphite
➢Master metrics - running applications and their status
➢Worker metrics - used cores, free cores
➢JVM metrics - memory allocation, GC
❖We use Anodot to view and track
metrics trends and anomalies
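A sketch of a metrics.properties that turns on the Graphite sink (GraphiteSink is Spark's built-in sink class; the host, port and period are illustrative, and the file is written locally here instead of $SPARK_HOME/conf):

```shell
# Sketch of metrics.properties enabling Spark's Graphite sink
# (host, port and period are illustrative values)
cat > ./metrics.properties <<'EOF'
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.internal
*.sink.graphite.port=2003
*.sink.graphite.period=10
EOF
```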
And now, to the Ops side...
Alon Torres
DevOps Engineer @ Totango
Auto Scaling Group Components
❖Auto Scaling Group
➢ Scale your group up or down flexibly
➢ Supports health checks and load balancing
❖Launch Configuration
➢ Template used by the ASG to launch instances
➢ User Data script for post-launch configuration
❖User Data
➢ Install prerequisites and fetch instance info
➢ Install and start Chef client
➢ Sanity checks throughout
[Diagram: a Launch Configuration (with its User Data script) feeds the Auto Scaling Group, which launches the EC2 instances]
Auto Scaling Group resizing in AWS
❖ Scheduled
➢ Set the desired size according to a specified schedule
➢ Good for scenarios with predictable, cyclic workloads.
❖Alert-Based
➢ Set specific alerts that trigger a cluster action
➢ Alerts can monitor instance health properties (resource usage)
❖Remote-triggered
➢ Using the AWS API/CLI, resize the cluster however you want
Resizing the ASG with Jenkins
❖We use schedule-based Jenkins jobs that utilize the AWS CLI
➢ Each job sets the desired Spark cluster size
➢ Makes it easy for our Data team to make changes to the
schedule
➢ Desired size can be manually overridden if needed
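A sketch of the AWS CLI call such a Jenkins job might make (the group name and size are illustrative; the snippet only prints the command rather than executing it):

```shell
# Sketch of the resize call a scheduled Jenkins job might run
# (ASG name and desired size are illustrative)
ASG_NAME="spark-workers"
DESIRED=20
CMD="aws autoscaling set-desired-capacity --auto-scaling-group-name $ASG_NAME --desired-capacity $DESIRED"
echo "$CMD"  # a real job would execute the command instead of printing it
```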
Termination Protection
❖When scaling down, ASG treats all nodes as equal
termination candidates
❖We want to avoid killing instances with currently running jobs
❖To achieve this, we used a built-in feature of ASG -
termination protection
❖Any instance in the ASG can be set as protected, thus
preventing termination when scaling down the cluster.
if [ $(ps -ef | grep '[e]xecutor' | grep spark | wc -l) -ne 0 ]; then
  aws autoscaling set-instance-protection --protected-from-scale-in …
fi
Upstart Jobs for Spark
❖ Every Spark component has an Upstart job that does the following:
➢ Set Spark Niceness (Process priority in CPU resource
distribution)
➢ Start the required Spark component and ensure it stays running
■ The default spark daemon script runs in the background
■ For Upstart, we modified the script to run in the foreground
❖ nohup nice -n "$SPARK_NICENESS"…&
vs
❖ nice -n "$SPARK_NICENESS" ...
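A minimal sketch of such an Upstart job for the worker (the niceness, install path and master URL are illustrative, and the file is written locally here; the real job file would live in /etc/init/):

```shell
# Sketch of an Upstart job keeping the worker in the foreground
# (niceness, paths and master URL are illustrative)
cat > ./spark-worker.conf <<'EOF'
description "Spark standalone worker"
start on runlevel [2345]
respawn
exec nice -n 5 /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
EOF
```

Because the component runs in the foreground under `exec`, Upstart can track the process directly and `respawn` restarts it if it dies.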
NewRelic Monitoring
❖ Cloud-based Application and Server monitoring
❖Supports multiple alert policies for different needs
➢ Who to alert, and what triggers the alerts
❖Newly created instances are auto-assigned the default alert policy
Policy Assignment using AWS Lambda
❖Spark instances have their own policy in NewRelic
❖Each instance has to ask NewRelic to be reassigned to the
new policy
➢Parallel reassignment requests may collide and overwrite each other
❖Solution - during provisioning and shutdown, we do the
following:
➢Put a record in an AWS Kinesis stream that contains their
hostname and their desired NewRelic policy ID
➢The record triggers an AWS Lambda script that uses the
NewRelic API to reassign the hostname given to the policy
ID given
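A sketch of the record a node could put on the stream during provisioning (the hostname, policy ID and stream name are illustrative; the snippet builds and prints the payload, with the actual put-record call shown as a comment):

```shell
# Sketch: the payload a node could put on the Kinesis stream at provisioning
# (hostname, policy ID and stream name are illustrative)
HOST="spark-worker-01"
POLICY_ID="123456"
PAYLOAD=$(printf '{"hostname":"%s","policy_id":"%s"}' "$HOST" "$POLICY_ID")
echo "$PAYLOAD"
# a real node would then run:
#   aws kinesis put-record --stream-name newrelic-policy --partition-key "$HOST" --data "$PAYLOAD"
```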
Chef
❖Configuration Management Tool, can provision and configure
instances
➢Describe an instance's state as code, let Chef handle the rest
➢Typically works in server/client mode - the client updates every 30 minutes
➢Besides provisioning, also prevents configuration drift
❖A vast number of plugins and cookbooks - the sky's the limit!
❖Configures all the instances in our DC
Spark Instance Provisioning
❖ Setup Spark
➢ Set up prerequisites - users, directories, symlinks and jars
➢ Download and extract the Spark package from S3
❖Configure termination protection cron script
❖Configure upstart conf files
❖Place spark config files
❖Assign NewRelic policy
❖Add shutdown scripts
➢ Delete instance from chef database
➢ Remove from NewRelic monitoring policy
Questions?
❖ Alon Torres, DevOps
https://il.linkedin.com/in/alontorres
❖Romi Kuntsman, Senior Big Data Engineer
https://il.linkedin.com/in/romik
❖Stay in touch!
Totango Engineering Technical Blog
http://labs.totango.com/