Chapter 06: Advanced Spark Programming
Learning Spark
by Holden Karau et al.
Overview: Advanced Spark Programming
 Introduction
 Accumulators
 Accumulators and Fault Tolerance
 Custom Accumulators
 Broadcast Variables
 Optimizing Broadcasts
 Working on a Per-Partition Basis
 Piping to External Programs
 Numeric RDD Operations
 Conclusion
6.1 Introduction
 This chapter introduces a variety of advanced Spark
programming features that we didn’t get to cover in
the previous chapters.
 We introduce two types of shared variables:
 Accumulators to aggregate information
 Broadcast variables to efficiently distribute large values.
 Throughout this chapter we build an example using
ham radio operators’ call logs as the input.
 The chapter also shows how to use Spark’s language-
agnostic pipe() method to interact with other
programs through standard input and output.
6.2 Accumulators
 When we normally pass functions to Spark, such as a
map() function or a condition for filter(), they can use
variables defined outside them in the driver program, but
each task running on the cluster gets a new copy of each
variable, and updates from these copies are not
propagated back to the driver. Spark’s shared variables,
accumulators and broadcast variables, relax this
restriction for two common types of communication
patterns: aggregation of results and broadcasts.
 Out of the box, Spark supports accumulators of type
Double, Long, and Float.
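The driver-merge behavior described above can be illustrated without Spark at all. The following is a plain-Python sketch (not the Spark accumulator API) of the communication pattern: each simulated task starts from the zero value, updates only its own local copy, and the driver combines the results afterward.

```python
# Plain-Python sketch of accumulator semantics (hypothetical data, not the
# Spark API): each simulated task adds to its own local copy, and only the
# driver sees the merged total after all tasks finish.

def run_task(lines):
    """Simulated task: count blank lines in its partition."""
    local_blank_count = 0          # each task starts from the zero value
    for line in lines:
        if line == "":
            local_blank_count += 1
    return local_blank_count       # returned to the driver, never shared

partitions = [["a", "", "b"], ["", "", "c"]]
# The driver merges the per-task updates, analogous to reading acc.value
# only after an action has run.
blank_lines = sum(run_task(p) for p in partitions)
print(blank_lines)  # 3
```

In real Spark code the same pattern is `sc.accumulator(0)`, `acc += 1` inside the task, and `acc.value` on the driver after an action.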
Edx and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
6.3 Broadcast Variables
 Spark’s second type of shared variable, broadcast
variables, allows the program to efficiently send a large,
read-only value to all the worker nodes for use in one or
more Spark operations. They come in handy, for
example, if your application needs to send a large, read-
only lookup table to all the nodes, or even a large feature
vector in a machine learning algorithm.
 When we are broadcasting large values, it is important to
choose a data serialization format that is both fast and
compact, because the time to send the value over the
network can quickly become a bottleneck if it takes a long
time to either serialize a value or to send the serialized
value over the network.
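The lookup-table use case above can be sketched in plain Python (hypothetical call-sign data, not the Spark broadcast API): the point is that every task only *reads* from one shared, read-only table instead of receiving its own serialized copy.

```python
# Plain-Python sketch of the broadcast pattern: ship the read-only lookup
# table once, then have every task read from that single shared copy.
# The prefix table below is hypothetical example data.

call_sign_prefixes = {"W": "USA", "VE": "Canada", "G": "UK"}  # read-only

def lookup_country(sign, table):
    # Tasks only read the table; a broadcast value must never be mutated.
    prefix = sign[:2] if sign[:2] in table else sign[:1]
    return table.get(prefix, "unknown")

signs = ["W1AW", "VE2ABC", "G0XYZ"]
countries = [lookup_country(s, call_sign_prefixes) for s in signs]
print(countries)  # ['USA', 'Canada', 'UK']
```

In Spark the table would be wrapped with `sc.broadcast(...)` and read inside tasks via its `.value` attribute.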
6.4 Working on a Per-Partition Basis
 Working with data on a per-partition basis allows us
to avoid redoing setup work for each data item.
Operations like opening a database connection or
creating a random-number generator are examples
of setup steps that we wish to avoid doing for each
element.
 Spark has per-partition versions of map and foreach
(mapPartitions() and foreachPartition()) to help reduce
the cost of these operations by letting you run code
only once for each partition of an RDD.
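The setup-once-per-partition idea can be sketched in plain Python (not the Spark API): the expensive step, here compiling a regex, runs once per partition, and every element in that partition reuses it.

```python
# Plain-Python sketch of the per-partition pattern behind mapPartitions():
# do expensive setup once per partition rather than once per element.
import re

def process_partition(lines):
    pattern = re.compile(r"\d+")      # setup runs once per partition
    for line in lines:                # every element then reuses it
        match = pattern.search(line)
        yield match.group() if match else None

partitions = [["id 42", "no digits"], ["port 8080"]]
results = [out for part in partitions for out in process_partition(part)]
print(results)  # ['42', None, '8080']
```

In Spark, `process_partition` is exactly the shape of function you pass to `mapPartitions()`: it receives an iterator over one partition's elements and yields results.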
6.5 Piping to External Programs
 With three language bindings to choose from out of the
box, you may have all the options you need for writing
Spark applications. However, if none of Scala, Java, or
Python does what you need, then Spark provides a
general mechanism to pipe data to programs in other
languages, like R scripts.
 Spark provides a pipe() method on RDDs. Spark’s pipe()
lets us write parts of jobs using any language we want as
long as it can read and write to Unix standard streams.
 With pipe(), you can write a transformation of an RDD
that reads each RDD element from standard input as a
String, manipulates that String however you like, and
then writes the result(s) as Strings to standard output.
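The stdin/stdout contract above can be demonstrated with the standard library alone. This is a minimal sketch of the pipe() idea, not Spark's pipe() itself: each element is written to an external program's standard input, and the transformed lines are read back from its standard output. To keep the example portable, the "external program" is a small Python one-liner; Spark's pipe() works the same way with any executable, such as an R script.

```python
# Sketch of the pipe() mechanism: stream elements to an external process
# over stdin and collect its stdout lines as the transformed elements.
import subprocess
import sys

elements = ["spark", "pipe", "rdd"]
proc = subprocess.run(
    [sys.executable, "-c",
     "import sys\nfor line in sys.stdin: print(line.strip().upper())"],
    input="\n".join(elements),
    capture_output=True, text=True, check=True,
)
transformed = proc.stdout.splitlines()
print(transformed)  # ['SPARK', 'PIPE', 'RDD']
```

In Spark, the equivalent is `rdd.pipe("./some-script.R")`, with the script reading elements from stdin and printing results to stdout.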
6.6 Numeric RDD Operations
 Spark’s numeric operations are
implemented with a streaming
algorithm that allows for building
up our model one element at a
time. The descriptive statistics
are all computed in a single pass
over the data and returned as a
StatsCounter object by calling
stats().
 If you want to compute only one
of these statistics, you can also
call the corresponding method
directly on an RDD, for example
rdd.mean() or rdd.sum().
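The streaming, single-pass algorithm behind StatsCounter can be sketched in plain Python using Welford's method (a standard online technique; the class name here is my own, not Spark's): each element is folded into a running count, mean, and sum of squared deviations, so count, mean, and variance all come out of one pass over the data.

```python
# Sketch of single-pass (streaming) statistics in the style of StatsCounter:
# fold each element into running state so no second pass is needed.
class RunningStats:
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0        # running sum of squared deviations

    def merge(self, x):
        """Fold one element into the running statistics (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance, matching Spark's StatsCounter.variance().
        return self.m2 / self.count if self.count else float("nan")

stats = RunningStats()
for x in [1.0, 2.0, 3.0, 4.0]:
    stats.merge(x)
print(stats.count, stats.mean, stats.variance())  # 4 2.5 1.25
```

Because each update uses only the running state, per-partition results can also be merged, which is how Spark computes these statistics in parallel.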
6.7 Conclusion
 In this chapter, you have been introduced to some of
the more advanced Spark programming features that
you can use to make your programs more efficient or
expressive. Subsequent chapters cover deploying and
tuning Spark applications, as well as built-in libraries
for SQL and streaming and machine learning. We’ll
also start seeing more complex and more complete
sample applications that make use of much of the
functionality described so far, and that should help
guide and inspire your own usage of Spark.