Apache Spark Fundamentals

Eren Avşaroğulları

Data Science and Engineering Club Meetup
Dublin - December 9, 2017

Agenda

• What is Apache Spark?
• Spark Ecosystem & Terminology
• RDDs & Operation Types (Transformations & Actions)
• RDD Lineage
• Job Lifecycle
• RDD Evolution (DataFrames and DataSets)
• Persistency
• Clustering / Spark on YARN

shows code samples

Bio

• B.Sc. & M.Sc. in Electronics & Control Engineering
• Apache Spark Contributor since v2.0.0
• Sr. Software Engineer @
• Currently working on Data Analytics: Data Transformations / Cleaning

erenavsarogullari

What is Apache Spark?

• Distributed Compute Engine
• Project started in 2009 at UC Berkeley
• First version (v0.5) released in June 2012
• Moved to the Apache Software Foundation in 2013
• Supported Languages: Java, Scala, Python and R
• 1100+ contributors / 14K+ forks on GitHub
• spark-packages.org => ~380 extensions

Spark Ecosystem

(Stack diagram: Spark SQL, Spark Streaming, MLlib and GraphX on top of the Spark Core Engine;
deployment on Standalone, YARN or Mesos in Cluster Mode, or Local in Local Mode.)

Terminology

• RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
• DAG: Directed Acyclic Graph. An execution plan of a job (a.k.a. RDD dependency graph)
• Application: An instance of SparkContext. Single per JVM.
• Job: An action operator triggering computation.
• Driver: The program/process running the job over the Spark Engine
• Executor: The process executing a task
• Worker: The node running executors.

How to create RDD?

• Collection Parallelize
• By Loading file
• Transformations
• Let's see the sample => Application-1 (a minimal sketch follows below)

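The Application-1 source is not reproduced here; the following is a minimal, script-style sketch (as one might type into spark-shell) of the three creation paths above. The file path data/words.txt and all variable names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession; any master URL works the same way.
val spark = SparkSession.builder().appName("rdd-creation").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// 1. Collection parallelize
val numbersRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. By loading a file (hypothetical path)
val linesRdd = sc.textFile("data/words.txt")

// 3. Transformations derive new RDDs from existing ones
val wordsRdd = linesRdd.flatMap(_.split("\\s+"))

println(numbersRdd.count())
```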
RDD Operation Types

Two types of Spark operations on RDD:

• Transformations: lazily evaluated (not computed immediately)
• Actions: trigger the computation and return a value

(Diagram: Data -> RDD -> Transformations -> RDD -> Actions -> Value)

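A small sketch of that contrast, under the same local SparkSession assumption as before (the numbers are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-vs-action").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Transformations only build the execution plan; nothing runs yet.
val doubled = sc.parallelize(1 to 1000000)
  .map(_ * 2)          // transformation: lazy
  .filter(_ % 3 == 0)  // transformation: lazy

// The action triggers the actual computation and returns a value to the driver.
val total = doubled.count()
println(total)
```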
Transformations

• map(func)
• flatMap(func)
• filter(func)
• union(dataset)
• join(dataset, usingColumns: Seq[String])
• intersect(dataset)
• coalesce(numPartitions)
• repartition(numPartitions)

Full List:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

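A sketch chaining several of the transformations listed above (all lazy until an action runs). The collections are illustrative; note that on the RDD API the set-intersection method is spelled intersection.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("transformations").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val a = sc.parallelize(Seq("spark", "flink", "kafka", "spark"))
val b = sc.parallelize(Seq("spark", "hadoop"))

val upper    = a.map(_.toUpperCase)        // map
val letters  = a.flatMap(_.toSeq)          // flatMap
val filtered = a.filter(_.startsWith("s")) // filter
val merged   = a.union(b)                  // union
val common   = a.intersection(b)           // intersection on RDDs
val fewer    = merged.coalesce(1)          // coalesce: shrink partition count without a full shuffle
val more     = merged.repartition(4)       // repartition: full shuffle into 4 partitions
```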
Actions

• first()
• take(n)
• collect()
• count()
• saveAsTextFile(path)

Full List:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Let's see the sample => Application-2 (a minimal sketch follows below)

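A minimal sketch of the listed actions (data and the output path are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("actions").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "rdd", "action", "value"))

println(words.first())          // first(): "spark"
println(words.take(2).toList)   // take(n): first 2 elements as a local collection
println(words.collect().toList) // collect(): all elements to the driver (small data only)
println(words.count())          // count(): 4

// Writes one part file per partition (hypothetical output path; must not already exist).
words.saveAsTextFile("output/words")
```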
RDD Dependencies (Lineage)

(Lineage diagram: RDDs 1-7 spread across Stage 0, Stage 1 and Stage 3, connected by map, union, sort and join operators.
Narrow transformations keep data within a stage; wide transformations cause shuffles and mark stage boundaries.)

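The lineage of any RDD can be inspected at runtime. A sketch, with illustrative data: the join is a wide dependency, so it appears as a shuffle (stage) boundary in the printed graph.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lineage").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val left   = sc.parallelize(Seq((1, "a"), (2, "b"))).mapValues(_.toUpperCase) // narrow
val right  = sc.parallelize(Seq((1, "x"), (3, "y")))
val joined = left.join(right)                                                 // wide: requires a shuffle

// toDebugString prints the RDD dependency graph (its lineage),
// with indentation marking shuffle / stage boundaries.
println(joined.toDebugString)
```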
Job Lifecycle

RDD Evolution

RDD (V1.0, 2011) -> DataFrame (V1.3, 2013) -> DataSet (V1.6, 2015)

RDD:
• Low-level data structure
• Java objects
• To work with unstructured data

DataFrame:
• Untyped API
• Schema based - tabular
• SQL support
• To work with semi-structured (csv, json) / structured data (jdbc)

DataSet:
• Typed API: [T]
• Tabular

Two-tier optimizations: Project Tungsten and the Catalyst Optimizer

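To make the untyped vs. typed contrast concrete, a small sketch (field names and data are illustrative): DataFrame columns are resolved by name at runtime, while Dataset operations on [T] are checked at compile time.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-evolution").master("local[*]").getOrCreate()
import spark.implicits._

case class User(id: Int, name: String)
val ds = Seq(User(1, "Ada"), User(2, "Grace")).toDS() // Dataset[User]: typed API
val df = ds.toDF()                                    // DataFrame = Dataset[Row]: untyped API

// Untyped: the column is referenced by name and validated only at runtime.
df.select("name").show()

// Typed: fields are accessed on T, so mistakes fail at compile time.
ds.map(_.name.toUpperCase).show()

// SQL support on the tabular form.
df.createOrReplaceTempView("users")
spark.sql("SELECT id, name FROM users WHERE id > 1").show()
```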
How to create the DataFrame?

• By loading file (spark.read.format("csv").load())
• SparkSession.createDataFrame(RDD, schema)

Let's see the code - Application-3 (a minimal sketch follows below)

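A sketch of both creation paths; the CSV path, column names and the schema are illustrative assumptions.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("dataframe-creation").master("local[*]").getOrCreate()

// 1. By loading a file (hypothetical path; header/inferSchema options are optional)
val csvDf = spark.read.format("csv").option("header", "true").load("data/users.csv")

// 2. From an RDD[Row] plus an explicit schema
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val rowsRdd = spark.sparkContext.parallelize(Seq(Row(1, "Ada"), Row(2, "Grace")))
val df = spark.createDataFrame(rowsRdd, schema)

df.printSchema()
df.show()
```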
How to create the DataSet?

• By loading file (spark.read.format("csv").load())
• SparkSession.createDataset(collection or RDD)

Let's see the code - Application-4-1 / Application-4-2 (a minimal sketch follows below)

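A script-style sketch of the creation paths above; the case class, file path and contents are illustrative, and the CSV variant assumes the columns match the case class fields.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("dataset-creation").master("local[*]").getOrCreate()
import spark.implicits._ // brings in the encoders needed for createDataset / as[T]

case class User(id: Int, name: String)

// 1. From a local collection
val fromSeq: Dataset[User] = spark.createDataset(Seq(User(1, "Ada"), User(2, "Grace")))

// 2. From an RDD
val rdd = spark.sparkContext.parallelize(Seq(User(3, "Alan")))
val fromRdd: Dataset[User] = spark.createDataset(rdd)

// 3. By loading a file, then converting the untyped DataFrame into a typed Dataset
val fromCsv: Dataset[User] = spark.read
  .format("csv").option("header", "true").option("inferSchema", "true")
  .load("data/users.csv")
  .as[User]

fromSeq.show()
```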
Persistency

Storage Modes / Details:
• MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are spilled to disk
• MEMORY_ONLY_SER: Store RDD as serialized Java objects (the Kryo serializer can be considered)
• MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk
• DISK_ONLY: Store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but replicate each partition on two cluster nodes

• RDD / DF.persist(newStorageLevel: StorageLevel)
• RDD.unpersist() => unpersists the RDD from memory and disk

Unpersist long-lived cached data explicitly to use executor memory efficiently.
Note: when cached data exceeds storage memory, Spark evicts partitions using a Least Recently Used (LRU) policy by default.

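A sketch of persist/unpersist in practice (the input path and filter predicates are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persistency").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val logs   = sc.textFile("data/app.log")        // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))

// Cache the filtered RDD so repeated actions reuse it instead of re-reading the file.
errors.persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count())                                // first action: computes and caches
println(errors.filter(_.contains("timeout")).count()) // reuses the cached partitions

// Release executor memory/disk once the cached data is no longer needed.
errors.unpersist()
```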
Clustering / Spark on YARN

(Diagram: YARN Client Mode)

Q & A

Thanks

References

• https://spark.apache.org/docs/latest/
• https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark
• https://stackoverflow.com/questions/36215672/spark-yarn-architecture
• High Performance Spark by Holden Karau & Rachel Warren
