Machine Learning with
SparkR
OLGUN AYDIN
SENIOR DATA SCIENTIST
olgun_aydin@epam.com
info@olgunaydin.com
About me
 BSc and MSc degrees in Statistics
 Data Scientist with 6 years of experience
 6 years of experience with R
 Love to use R, SparkR and Shiny
 Organizer of PyData Istanbul
 Co-organizer of Istanbul Spark Meetup
 Co-organizer of Trójmiasto Spark Meetup
github.com/olgnaydn/R
www.linkedin.com/in/olgun-aydin/
twitter.com/olgunaydinn
https://www.packtpub.com/books/info/authors/olgun-aydin
Outline
 Introduction to Machine Learning
 SparkR
 Getting Data
 DataFrames
 Applications
Introduction to Machine Learning
 Machine learning is a field of computer science that uses statistical
techniques to give computer systems the ability to "learn" (e.g.,
progressively improve performance on a specific task) with data, without
being explicitly programmed. (Wikipedia)
 Machine learning is closely related to (and often overlaps with)
computational statistics, which also focuses on prediction-making through
the use of computers. It has strong ties to mathematical optimization,
which delivers methods, theory and application domains to the field.
Introduction to Machine Learning
 DeepMind developed an agent that surpassed human-level
performance at 49 Atari games, receiving only the pixels and game
score as inputs.
 Soon after, in 2016, DeepMind surpassed its own achievement by
releasing a new state-of-the-art gameplay method called A3C.
 Meanwhile, AlphaGo defeated one of the best human players at
Go—an extraordinary achievement in a game dominated by humans
for two decades after machines first conquered chess.
Introduction to Machine Learning
Introduction to Machine Learning
Examples for Real Life Applications
Internet Search
• Google, Bing, Yahoo, Ask
• Better results with data science algorithms
Recommendation
Systems
• Netflix, Amazon, Alibaba
Prediction
Systems
• Image recognition, speech recognition
• Fraud and risk detection, self-driving cars, robots
Examples for Real Life Applications
Power of
 Fast
 Powerful
 Scalable
Power of
 Effective
 Number of Packages
 One of the most preferred languages
for statistical analysis
 Effective
 Powerful
 Statistical Power
 Fast
 Scalable
+
 SparkR provides a frontend to Apache Spark and uses Spark’s distributed
computation engine to enable large scale data analysis from the R Shell.
 Data analysis using R is limited by the amount of memory available on a
single machine; further, as R is single-threaded, it is often impractical to
use R on large datasets.
 SparkR is an R frontend for Apache Spark, a widely deployed cluster
computing engine. It lets R programs scale across a number of workloads
while remaining easy to use and deploy. There are a number of benefits to
designing an R frontend that is tightly integrated with Spark.
 SparkR requires no changes to R. The central component of SparkR is a
distributed data frame that enables structured data processing with a
syntax familiar to R users.
 To improve performance over large datasets, SparkR performs lazy
evaluation on data frame operations and uses Spark’s relational query
optimizer to optimize execution.
 SparkR was initially developed at the AMPLab, UC Berkeley, and has since
become part of the Apache Spark project.
 The central component of SparkR is a distributed data frame implemented
on top of Spark.
 SparkR DataFrames have an API similar to dplyr or local R data frames, but
scale to large datasets using Spark’s execution engine and relational query
optimizer.
 SparkR’s read.df method integrates with Spark’s data source API, which
enables users to load data from systems like HBase, Cassandra, etc. Having
loaded the data, users can then use a familiar syntax for performing
relational operations like selections, projections, aggregations and joins.
 Further, SparkR supports more than 100 pre-defined functions on
DataFrames including string manipulation methods, statistical functions
and date-time operations. Users can also execute SQL queries directly on
SparkR DataFrames using the sql command. SparkR also makes it easy for
users to chain commands using existing R libraries.
 Finally, SparkR DataFrames can be converted to a local R data frame using
the collect operator; this is useful for the "big data, small learning"
scenarios described earlier.
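The collect pattern can be sketched as follows; df and its group column are hypothetical, standing in for any SparkR DataFrame already in the session:

```r
# Run the aggregation on the cluster, then pull only the small
# result back into local R memory for plotting or modelling.
local_df <- collect(summarize(groupBy(df, df$group),
                              count = n(df$group)))
# local_df is now an ordinary R data.frame, usable with base R
hist(local_df$count)
```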
Machine Learning with SparkR
 SparkR’s architecture consists of two main components: an R-to-JVM
binding on the driver that allows R programs to submit jobs to a Spark
cluster, and support for running R on the Spark executors.
Installation and Creating a SparkContext
 Step 1: Download Spark
 http://spark.apache.org/
Installation and Creating a SparkContext
 Step 1: Download Spark
http://spark.apache.org/
 Step 2: Run in Command Prompt
Now start your favorite command shell and change directory to your Spark folder
 Step 3: Run in RStudio
Set the system environment. Once you have opened RStudio, you need to set the
system environment first: you have to point your R session to the installed
version of SparkR. Use the code shown in Figure 11, but replace
the SPARK_HOME variable with the path to your Spark folder, e.g.
"C:/Apache/Spark-1.4.1".
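A sketch of Step 3 in R, assuming a Spark 1.4.x installation at the path above (adjust for your machine):

```r
# Point the R session at the installed Spark/SparkR
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# Create a SparkContext and an SQLContext (Spark 1.x API)
sc <- sparkR.init(master = "local[*]", appName = "SparkR-demo")
sqlContext <- sparkRSQL.init(sc)
```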
Getting Data
 From local data frames
 The simplest way to create a DataFrame is to convert a local R data frame
into a SparkR DataFrame. Specifically, we can use createDataFrame and
pass in the local R data frame to create a SparkR DataFrame. As an
example, the following creates a DataFrame based on the faithful
dataset from R.
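A minimal sketch, assuming the sqlContext created during installation:

```r
# Convert the built-in faithful data set into a SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)
head(df)         # first rows of the two columns: eruptions, waiting
printSchema(df)  # both columns inferred as double
```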
Getting Data
 From Data Sources
 SparkR supports operating on a variety of data sources through the DataFrame
interface. This section describes the general methods for loading and saving
data using Data Sources. You can check the Spark SQL programming guide for
more specific options that are available for the built-in data sources.
 The general method for creating DataFrames from data sources is read.df.
 This method takes in the SQLContext, the path for the file to load and the type
of data source.
 SparkR supports reading JSON and Parquet files natively and through Spark
Packages you can find data source connectors for popular file formats like CSV
and Avro.
Getting Data
 We can see how to use data sources with an example JSON input file.
Note that the file used here is not a typical JSON file: each line in
the file must contain a separate, self-contained valid JSON object.
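For example, with the people.json file shipped in the examples directory of the Spark distribution (each line is one JSON object):

```r
# read.df(sqlContext, path, source) -- Spark 1.x signature
people <- read.df(sqlContext,
                  "examples/src/main/resources/people.json",
                  "json")
printSchema(people)  # schema is inferred from the JSON objects
```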
Getting Data
 From Hive tables
 You can also create SparkR DataFrames from Hive tables. To do this, we need to create
a HiveContext, which can access tables in the Hive MetaStore. Note that Spark must have
been built with Hive support; more details on the difference between SQLContext and
HiveContext can be found in the SQL programming guide.
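A sketch along the lines of the SparkR programming guide, assuming Spark was built with Hive support and the kv1.txt sample file from the Spark distribution is available:

```r
hiveContext <- sparkRHive.init(sc)
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext,
    "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries can be expressed in HiveQL; the result is a SparkR DataFrame
results <- sql(hiveContext, "FROM src SELECT key, value")
head(results)
```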
SQL queries in SparkR
 A SparkR DataFrame can also be registered as a temporary table in Spark SQL;
registering a DataFrame as a table allows you to run SQL queries over its data. The sql
function enables applications to run SQL queries programmatically and returns the result
as a DataFrame.
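Sketched with the faithful data and the Spark 1.x API:

```r
df <- createDataFrame(sqlContext, faithful)
registerTempTable(df, "faithful")   # expose the DataFrame to Spark SQL

# The result of sql() is itself a SparkR DataFrame
long_waits <- sql(sqlContext, "SELECT * FROM faithful WHERE waiting > 70")
head(long_waits)
```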
DataFrames
 SparkR DataFrames support a number of functions for structured data processing.
Here we include some basic examples; a complete list can be found in the API docs.
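For instance, selection and filtering on the faithful DataFrame (assuming the sqlContext from the installation step):

```r
df <- createDataFrame(sqlContext, faithful)
head(select(df, df$eruptions))     # project a single column
head(filter(df, df$waiting < 50))  # keep only rows with short waits
```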
DataFrames
 SparkR data frames support a number of commonly used functions to aggregate data
after grouping. For example, we can compute a histogram of the waiting time in the
faithful dataset, as shown below.
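A sketch of that aggregation, with df the faithful DataFrame from the earlier examples:

```r
# Count observations per distinct waiting time, then sort descending
waiting_counts <- summarize(groupBy(df, df$waiting),
                            count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
```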
DataFrames
 SparkR also provides a number of functions that can be directly applied to columns for
data processing and during aggregation. The example below shows the use of basic
arithmetic functions.
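For example, deriving a new column with ordinary arithmetic (df again being the faithful DataFrame):

```r
# Column expressions distribute the arithmetic across the cluster
df$waiting_secs <- df$waiting * 60   # waiting time in seconds
head(df)
```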
Applications
Correlation Analysis
K-Means
Decision Trees
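As a hint at the application slides, here is a sketch of correlation analysis and k-means on the faithful data. Note this assumes Spark 2.x (sparkR.session and spark.kmeans were introduced in Spark 2.0, whereas the rest of the deck uses the 1.x API):

```r
library(SparkR)
sparkR.session(appName = "sparkr-ml-demo")

df <- createDataFrame(faithful)   # no sqlContext argument in Spark 2.x

# Correlation between eruption length and waiting time
corr(df, "eruptions", "waiting")

# K-means with two clusters over both numeric columns
model <- spark.kmeans(df, ~ eruptions + waiting, k = 2)
summary(model)
```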