Spark tutorial: developing locally and deploying on EMR
Use cases (my biased opinion)
• Interactive and expressive data analysis
  • If you feel limited when trying to express yourself in “group by”, “join” and “where”
  • Only if it is not possible to work with the datasets locally
• Entering the danger zone:
  • Spark SQL engine, like Impala/Hive
  • Speed up ETLs if your data can fit in memory (speculation)
  • Machine learning
  • Graph analytics
  • Streaming (not mature yet)
Possible working styles
• Develop in an IDE
• Develop as you go in the Spark shell

IDE:
• Easier to work with objects, inheritance and package management
• Requires some hacking to get programs to run on both Windows and production environments

Spark shell:
• Easier to debug code with production-scale data
• Will only run on Windows if you have correct line endings in the spark-shell launcher scripts, or if you use Cygwin
IntelliJ IDEA
• Basic set up: https://gitz.adform.com/dspr/audience-extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spark-skeleton
Hacks
• 99% chance that on Windows you won’t be able to use the `saveAsTextFile()` function
• Download the winutils.exe binary from http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
• Place it in a bin folder somewhere on your PC (C:\somewhere\bin\winutils.exe) and point Hadoop at that folder in your code before using the save function:
System.setProperty("hadoop.home.dir", "C:\\somewhere")
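Putting it together, a minimal sketch of a local job that needs this workaround (the object name, output path and data are illustrative, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

object SaveExample {
  def main(args: Array[String]): Unit = {
    // Must point at the folder that contains bin\winutils.exe, before any save is attempted
    System.setProperty("hadoop.home.dir", "C:\\somewhere")

    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("SaveExample"))
    sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("out") // fails on Windows without winutils.exe
    sc.stop()
  }
}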
When you are done with your code…
• It is time to package everything into a fat jar with sbt assembly
• Add “provided” to the Spark library dependencies, since the Spark libs are already on the classpath if you run the job on EMR with Spark already set up
• Find more info in the Audience Extension project’s Spark branch build.sbt file.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
Running on EMR
• build.sbt can be configured (the S3 package) to upload the fat jar to S3 when assembly finishes; if you don’t have that, just upload it manually
• Run the bootstrap action s3://support.elasticmapreduce/spark/install-spark with arguments -v 1.2.0.a -x -g (some documentation at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark)
• Also install Ganglia for monitoring cluster load (run this before the Spark bootstrap step)
• If you don’t install Ganglia, SSH tunnels to the Spark UI won’t work.
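For illustration, launching such a cluster with the AWS CLI could look roughly like this (the cluster name, AMI version, instance settings, key name and the Ganglia bootstrap path are assumptions, not from the slides; note Ganglia is listed before the Spark install, per the advice above):

aws emr create-cluster --name "spark-test" \
  --ami-version 3.3.1 \
  --instance-type r3.8xlarge --instance-count 2 \
  --ec2-attributes KeyName=my-key \
  --bootstrap-actions \
    Path=s3://elasticmapreduce/bootstrap-actions/install-ganglia,Name=InstallGanglia \
    Path=s3://support.elasticmapreduce/spark/install-spark,Name=InstallSpark,Args=[-v,1.2.0.a,-x,-g]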
Start with local mode first
Use only one instance in the cluster, and submit your jar with this:

/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master local[16] \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  SimilarityJob.jar \
  --remote \
  --input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* \
  --output s3://dev-adform-data-engineers/tmp/spark/2days \
  --similarity-threshold 300
Run on multiple machines with yarn master

/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master yarn \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 116736M \
  --executor-cores 16 \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4 \
  SimilarityJob.jar \
  --remote \
  … … …

(--deploy-mode can be client or cluster.) The executor parameters are optional: the bootstrap script will automatically try to maximize the Spark configuration options. Note that the script is not aware of the tasks you are running; it only reads the EMR cluster specification.
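As a sanity check on these numbers: 7 executors × 16 cores = 112 concurrent task slots, so spark.default.parallelism=112 schedules exactly one task per core per wave.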
Spark UI
• You need to set up an SSH tunnel to access it from your PC
• An alternative is to use the command-line browser lynx
• When you submit the app with the local master, the UI will be at ip:4040
• When you submit with the yarn master, go to the Hadoop UI on port 9026; it will show the running Spark task. Click ApplicationMaster in the Tracking UI column, or get the UI URL from the command line when you submit the task
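A plain local port forward is one way to set up the tunnel (the key path and master hostname are placeholders):

ssh -i ~/my-key.pem -N -L 4040:localhost:4040 hadoop@<master-public-dns>
# then open http://localhost:4040 in your local browser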
Spark UI
For Spark 1.2.0 the Executors tab is wrong and the Storage tab is always empty; the only useful tabs are Jobs, Stages and Environment.
Some useful settings
• spark.hadoop.validateOutputSpecs: useful when developing; set it to false so you can overwrite output files
• spark.default.parallelism (number of output files / number of cores): configured automatically when you run the bootstrap action with the -x option
• spark.shuffle.consolidateFiles (default false)
• spark.rdd.compress (default false)
• spark.akka.timeout, spark.akka.frameSize, spark.speculation, …
• http://spark.apache.org/docs/1.2.0/configuration.html
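Besides passing --conf on the command line, these can be set programmatically; a minimal sketch (the values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SettingsExample")
  .set("spark.hadoop.validateOutputSpecs", "false") // allow overwriting existing output while developing
  .set("spark.default.parallelism", "112")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)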
Spark shell
/home/hadoop/spark/bin/spark-shell \
  --master <yarn|local[*]> \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 4G \
  --executor-cores 16 \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4
Spark shell
• In the Spark shell you don’t need to instantiate a Spark context; it is already instantiated as sc, but you can create another one if you like
• Type Scala expressions and see what is happening
• Note the lazy evaluation: to force expression evaluation for debugging, use action functions like [expression].take(n) or [expression].count to see if your statements are OK
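For example, in the shell (the S3 path is a placeholder):

val lines = sc.textFile("s3://some-bucket/logs/*")   // lazy, nothing is read yet
val errors = lines.filter(_.contains("ERROR"))       // still lazy
errors.take(5)                                       // action: computes just enough to return 5 elements
errors.count                                         // action: runs the full computation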
Summary
• Spark is better suited for development on Linux
• Don’t trust the Amazon bootstrap scripts; check whether your application is utilizing resources with Ganglia
• Try to write Scala code in a way that makes it possible to run parts of it in spark-shell; otherwise it is hard to debug problems that occur only at production dataset scale.