Programming in Spark using PySpark
Mostafa Elzoghbi
Sr. Technical Evangelist – Microsoft
@MostafaElzoghbi
http://mostafa.rocks
Session Objectives & Takeaways
• Programming Spark
• Spark Program Structure
• Working with RDDs
• Transformations versus Actions
• Lambdas and shared variables (broadcast variables vs. accumulators)
• Visualizing big data in Spark
• Spark in the cloud (Azure)
• Working with cluster types, notebooks, and scaling
Python Spark (pySpark)
• We are using the Python programming interface to Spark (pySpark)
• pySpark provides an easy-to-use programming abstraction and parallel runtime:
“Here’s an operation, run it on all of the data”
• RDDs are the key concept
Apache Spark Driver and Workers
• A Spark program is two programs:
• A driver program and worker programs
• Worker programs run on cluster nodes or in local threads
• RDDs (Resilient Distributed Datasets) are distributed across workers
Spark Essentials: Master
• The master parameter for a SparkContext determines which type and size of cluster to use
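For example, a minimal sketch of common master values (the app name and thread count are illustrative):

from pyspark import SparkContext

# "local"              run Spark locally with a single thread
# "local[K]"           run locally with K worker threads
# "spark://HOST:PORT"  connect to a Spark standalone cluster master
# "yarn"               connect to a YARN cluster
sc = SparkContext(master="local[4]", appName="MyApp")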
Spark Context
• A Spark program first creates a SparkContext object
» Tells Spark how and where to access a cluster
» pySpark shell and Databricks cloud automatically create the sc variable
» IPython and standalone programs must use a constructor to create a new SparkContext
• Use SparkContext to create RDDs
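A minimal sketch of creating a SparkContext in a standalone program (the configuration values are illustrative; in the pySpark shell, sc already exists):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("MyApp")
sc = SparkContext(conf=conf)      # tells Spark how and where to access a cluster
rdd = sc.parallelize([1, 2, 3])   # use the context to create RDDs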
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collections of elements in parallel
• You construct RDDs
» by parallelizing existing Python collections (lists)
» by transforming an existing RDD
» from files in HDFS or any other storage system
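A short sketch of all three construction routes (assuming an existing SparkContext sc; the HDFS path is illustrative):

data = [1, 2, 3, 4, 5]
rdd1 = sc.parallelize(data)                   # from an existing Python list
rdd2 = rdd1.map(lambda x: x * 2)              # by transforming an existing RDD
rdd3 = sc.textFile("hdfs:///data/input.txt")  # from a file in HDFS or another storage system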
RDDs
• Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in
parallel.
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• Persist (cache) RDDs in memory or disk
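A brief sketch of laziness and caching (assuming an existing SparkContext sc):

rdd = sc.parallelize(range(1000))
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: nothing is computed yet
evens.cache()                             # ask Spark to keep the result in memory
print(evens.count())                      # action: triggers the computation
print(evens.count())                      # served from the cache, no recomputation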
Creating an RDD
• Create RDDs from Python collections (lists)
• From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, any other Hadoop InputFormat, or a directory or glob wildcard: /data/201404*
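For instance (the hosts, paths, and bucket names are illustrative):

logs = sc.textFile("/data/201404*")                      # directory or glob wildcard
hdfs = sc.textFile("hdfs://namenode:8020/data/file.txt") # file in HDFS
s3   = sc.textFile("s3n://my-bucket/my-key")             # object in Amazon S3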
Working with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count
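Putting the three steps together, a small sketch (assuming an existing SparkContext sc):

rdd = sc.parallelize([1, 2, 3, 4, 5])   # create from a data source (a list)
squares = rdd.map(lambda x: x * x)      # transformation: map
big = squares.filter(lambda x: x > 5)   # transformation: filter
print(big.collect())                    # action: collect -> [9, 16, 25]
print(big.count())                      # action: count -> 3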
Spark Transformations
• Create new datasets from an existing one
• Use lazy evaluation: results are not computed right away; instead, Spark remembers the set of transformations applied to the base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers
• Think of this as a recipe for creating the result
Python lambda Functions
• Small anonymous functions (not bound to a name)
lambda a, b: a+b
» returns the sum of its two arguments
• Can use lambda functions wherever function objects are required
• Restricted to a single expression
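For example, lambdas passed to RDD operations (assuming an existing SparkContext sc):

add = lambda a, b: a + b                    # returns the sum of its two arguments
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 10).collect())  # [10, 20, 30, 40]
print(rdd.reduce(add))                      # 10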
Spark Actions
• Cause Spark to execute the recipe that transforms the source data
• Mechanism for getting results out of Spark
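A short sketch of a few common actions (the output path is illustrative; saveAsTextFile fails if the directory already exists):

rdd = sc.parallelize([5, 3, 1, 4, 2])
print(rdd.count())                      # 5
print(rdd.take(2))                      # [5, 3] : first two elements
print(rdd.reduce(lambda a, b: a + b))   # 15
rdd.saveAsTextFile("/tmp/output")       # write the results out to storage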
Spark Program Lifecycle
1. Create RDDs from external data or parallelize a collection in your driver
program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse -- IMPORTANT
4. Perform actions to execute parallel computation and produce results
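The whole lifecycle as a sketch (the log path and filter condition are illustrative):

lines = sc.textFile("hdfs:///logs/*.log")      # 1. create an RDD from external data
errors = lines.filter(lambda l: "ERROR" in l)  # 2. lazily transform it into a new RDD
errors.cache()                                 # 3. cache it for reuse
print(errors.count())                          # 4. action: runs the parallel computation
print(errors.take(5))                          # further actions reuse the cached RDD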
pySpark Shared Variables
• Broadcast Variables
» Efficiently send a large, read-only value to all workers
» Saved at workers for use in one or more Spark operations
» Like sending a large, read-only lookup table to all the nodes
At the driver: broadcastVar = sc.broadcast([1, 2, 3])
At a worker: broadcastVar.value
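A short sketch of the lookup-table pattern (the table contents are illustrative):

lookup = sc.broadcast({"US": "United States", "DE": "Germany"})
codes = sc.parallelize(["US", "DE", "US"])
names = codes.map(lambda c: lookup.value.get(c, "unknown"))
print(names.collect())   # ['United States', 'Germany', 'United States']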
• Accumulators
» Aggregate values from workers back to driver
» Only driver can access value of accumulator
» For tasks, accumulators are write-only
» Use to count errors seen in RDD across workers
>>> accum = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(x):
...     global accum
...     accum += x
...
>>> rdd.foreach(f)
>>> accum.value
10
Visualizing Big Data in the browser
• Challenges:
• Manipulating large data can take a long time
Memory: caching -> scale clusters
CPU: parallelism -> scale clusters
• We have more data points than available pixels
> Summarize: aggregation, pivoting (more data than pixels)
> Model: clustering, classification, dimensionality reduction, etc.
> Sample: approximate (faster) and exact sampling
• Internal tools: Matplotlib, ggplot, D3, SVC, and more.
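A minimal sketch of the sampling approach, assuming matplotlib is available on the driver and sc is an existing SparkContext:

import matplotlib.pyplot as plt

rdd = sc.parallelize(range(1000000))
points = rdd.sample(False, 0.001).collect()  # ~1,000 points instead of 1,000,000
plt.hist(points, bins=50)                    # plot the small sample on the driver
plt.show()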
Spark Kernels and MAGIC keywords
• The PySpark kernel supports a set of %%MAGIC keywords
• It supports IPython's built-in magics, including %%sh.
• Auto visualization
• Magic keywords:
• %%sql: run Spark SQL queries
• %%lsmagic: list all supported magic keywords (important)
• %env: set an environment variable
• %run: execute Python code
• %who: list all variables in the global scope
• Run code from a different kernel in a notebook.
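A hedged sketch of a magic cell in an HDInsight PySpark notebook (the table name is illustrative; available magics vary by kernel version):

%%sql
SELECT * FROM hivesampletable LIMIT 10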
Spark in Azure
Hadoop clusters in Azure are packaged under the “HDInsight” service
Spark in Azure
• Create clusters in a few clicks
• Apache Spark is available only on Linux
• Multiple HDP versions
• Comes preloaded with SSH, Hive, Oozie, DLS, and VNets
• Multiple storage options:
• Azure Storage
• ADL Store
• External metadata store in a SQL Server database for Hive and Oozie
• All notebooks are stored in the storage account associated with the Spark cluster
• Zeppelin notebook is available on certain Spark versions, but not all
Programming Spark Apps in HDInsight
• Jupyter in HDInsight Spark clusters in Azure supports four kernels
DEMO
Spark Apps using Jupyter
References
• Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html
• edx.org: Free Apache Spark courses
• Visualizations for Databricks
https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/15%20Visualizations/0%20Visualizations%20Overview.html
• SPARKHub by Databricks
https://sparkhub.databricks.com/resources/
Thank you
• Check out the big data articles on my blog: http://mostafa.rocks
• Follow me on Twitter: @MostafaElzoghbi
• Want some help building cloud solutions? Contact me to learn more.
Editor's Notes
  • #2: Ref.: https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/ Apache Spark leverages a common execution model for doing multiple tasks like ETL, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real-time like fraud detection, click stream analysis, financial alerts, telemetry from connected sensors and devices (Internet of Things, IoT), social analytics, always-on ETL pipelines, and network monitoring.
  • #3: A) Main concepts to cover for Data Science: Regression, Classification -- FOCUS, Clustering, Recommendation B) Building programmable components in Azure ML experiments C) Working with Azure ML studio
  • #5: Source: https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Week2Lec4.pdf Spark standalone running on two nodes with two workers: A client process submits an app to the master. The master instructs one of its workers to launch a driver. The worker spawns a driver JVM. The master instructs both workers to launch executors for the app. The workers spawn executor JVMs. The driver and executors communicate independently of the cluster’s processes.
  • #6: Running Spark: Standalone cluster: Spark standalone comes out of the box. It has its own web UI (to monitor and run apps/jobs) and consists of a master and workers (also called slaves). Mesos and YARN are also supported in Spark. YARN is the only cluster manager on which Spark can access HDFS secured with Kerberos. YARN is the new generation of Hadoop’s MapReduce execution engine and can run MapReduce, Spark, and other types of programs.
  • #16: For that reason, cache is said to 'break the lineage' as it creates a checkpoint that can be reused for further processing. Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
  • #17: Keep a read-only variable cached on workers » Ship it to each worker only once instead of with each task • Example: efficiently give every worker a large dataset • Usually distributed using efficient broadcast algorithms
  • #19: Extensively used in statistics. Spark offers native support for: • Approximate and exact sampling • Approximate and exact stratified sampling. Approximate sampling is faster and is good enough in most cases.
  • #20: 1) Jupyter notebook kernels with Apache Spark clusters in HDInsight https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-notebook-kernels 2) IPython built-in magics https://ipython.org/ipython-doc/3/interactive/magics.html#cell-magics Source for tips and magic keywords: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
  • #21: Url: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
  • #25: Url: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
  • #26: Spark 2.0 announcements: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html