Machine Learning with
SparkR
OLGUN AYDIN
SENIOR DATA SCIENTIST
olgun_aydin@epam.com
info@olgunaydin.com
About me
 BSc and MSc degrees in Statistics
 Data Scientist with 6 years of experience
 6 years of experience with R
 Love to use R, SparkR and Shiny
 Organizer of PyData Istanbul
 Co-organizer of Istanbul Spark Meetup
 Co-organizer of Trójmiasto Spark Meetup
github.com/olgnaydn/R
www.linkedin.com/in/olgun-aydin/
twitter.com/olgunaydinn
https://www.packtpub.com/books/info/authors/olgun-aydin
Outline
 Introduction to Machine Learning
 SparkR
 Getting Data
 DataFrames
 Applications
Introduction to Machine Learning
 Machine learning is a field of computer science that uses statistical
techniques to give computer systems the ability to "learn" (e.g.,
progressively improve performance on a specific task) with data, without
being explicitly programmed. (Wikipedia)
 Machine learning is closely related to (and often overlaps with)
computational statistics, which also focuses on prediction-making through
the use of computers. It has strong ties to mathematical optimization,
which delivers methods, theory and application domains to the field.
Introduction to Machine Learning
 DeepMind developed an agent that surpassed human-level
performance at 49 Atari games, receiving only the pixels and game
score as inputs.
 Soon after, in 2016, DeepMind surpassed its own achievement by
releasing a new state-of-the-art gameplay method called A3C.
 Meanwhile, AlphaGo defeated one of the best human players at
Go—an extraordinary achievement in a game dominated by humans
for two decades after machines first conquered chess.
Introduction to Machine Learning
Introduction to Machine Learning
Examples for Real Life Applications
Internet Search
• Google, Bing, Yahoo, Ask
• Better results with data science algorithms
Recommendation
Systems
• Netflix, Amazon, Alibaba
Prediction
Systems
• Image recognition, speech recognition
• Fraud and risk detection, self-driving cars, robots
Examples for Real Life Applications
Power of
 Fast
 Powerful
 Scalable
Power of
 Effective
 Number of Packages
 One of the most preferred languages
for statistical analysis
 Effective
 Powerful
 Statistical Power
 Fast
 Scalable
+
 SparkR provides a frontend to Apache Spark and uses Spark’s distributed
computation engine to enable large scale data analysis from the R Shell.
 Data analysis using R is limited by the amount of memory available on a
single machine; further, as R is single-threaded, it is often impractical to
use R on large datasets.
 SparkR is an R frontend for Apache Spark, a widely deployed cluster
computing engine. It lets R programs scale across a number of workloads
while remaining easy to use and deploy. There are a number of benefits to
designing an R frontend that is tightly integrated with Spark.
 SparkR requires no changes to R. The central component of SparkR is a
distributed data frame that enables structured data processing with a
syntax familiar to R users.
 To improve performance over large datasets, SparkR performs lazy
evaluation on data frame operations and uses Spark’s relational query
optimizer to optimize execution.
 SparkR was initially developed at the AMPLab, UC Berkeley, and has since
become part of the Apache Spark project.
 The central component of SparkR is a distributed data frame implemented
on top of Spark.
 SparkR DataFrames have an API similar to dplyr or local R data frames, but
scale to large datasets using Spark’s execution engine and relational query
optimizer.
 SparkR’s read.df method integrates with Spark’s data source API, which
enables users to load data from systems like HBase, Cassandra, etc. Having
loaded the data, users can then use a familiar syntax for performing
relational operations like selections, projections, aggregations and joins.
 Further, SparkR supports more than 100 pre-defined functions on
DataFrames including string manipulation methods, statistical functions
and date-time operations. Users can also execute SQL queries directly on
SparkR DataFrames using the sql command. SparkR also makes it easy for
users to chain commands using existing R libraries.
 Finally, SparkR DataFrames can be converted to a local R data frame using
the collect operator; this is useful for the "big data, small learning"
scenarios described earlier.
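The collect pattern can be sketched as follows; df and its group column are hypothetical, standing in for any SparkR DataFrame already in the session:

```r
# Run the aggregation on the cluster, then pull only the small
# result back into local R memory for plotting or modelling.
local_df <- collect(summarize(groupBy(df, df$group),
                              count = n(df$group)))
# local_df is now an ordinary R data.frame, usable with base R
hist(local_df$count)
```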
Machine Learning with SparkR
 SparkR’s architecture consists of two main components: an R-to-JVM
binding on the driver that allows R programs to submit jobs to a Spark
cluster, and support for running R on the Spark executors.
Installation and Creating a SparkContext
 Step 1: Download Spark
 http://spark.apache.org/
Installation and Creating a SparkContext
 Step 1: Download Spark
http://spark.apache.org/
 Step 2: Run in Command Prompt
Now start your favorite command shell and change directory to your Spark folder
 Step 3: Run in RStudio
Set the system environment. Once you have opened RStudio, you need to set the
system environment first: you have to point your R session to the installed
version of SparkR. Use the code shown in Figure 11, but replace
the SPARK_HOME variable with the path to your Spark folder, e.g.
"C:/Apache/Spark-1.4.1".
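A sketch of Step 3 in R, assuming a Spark 1.4.x installation at the path above (adjust for your machine):

```r
# Point the R session at the installed Spark/SparkR
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# Create a SparkContext and an SQLContext (Spark 1.x API)
sc <- sparkR.init(master = "local[*]", appName = "SparkR-demo")
sqlContext <- sparkRSQL.init(sc)
```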
Getting Data
 From local data frames
 The simplest way to create a DataFrame is to convert a local R data frame
into a SparkR DataFrame. Specifically, we can use createDataFrame and
pass in the local R data frame to create a SparkR DataFrame. As an
example, the following creates a DataFrame based on the faithful
dataset from R.
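A minimal sketch, assuming the sqlContext created during installation:

```r
# Convert the built-in faithful data set into a SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)
head(df)         # first rows of the two columns: eruptions, waiting
printSchema(df)  # both columns inferred as double
```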
Getting Data
 From Data Sources
 SparkR supports operating on a variety of data sources through the DataFrame
interface. This section describes the general methods for loading and saving
data using Data Sources. You can check the Spark SQL programming guide for
more specific options that are available for the built-in data sources.
 The general method for creating DataFrames from data sources is read.df.
 This method takes in the SQLContext, the path for the file to load and the type
of data source.
 SparkR supports reading JSON and Parquet files natively and through Spark
Packages you can find data source connectors for popular file formats like CSV
and Avro.
Getting Data
 We can see how to use data sources with an example JSON input file.
Note that the file used here is not a typical JSON file: each line in
the file must contain a separate, self-contained valid JSON object.
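For example, with the people.json file shipped in the examples directory of the Spark distribution (each line is one JSON object):

```r
# read.df(sqlContext, path, source) -- Spark 1.x signature
people <- read.df(sqlContext,
                  "examples/src/main/resources/people.json",
                  "json")
printSchema(people)  # schema is inferred from the JSON objects
```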
Getting Data
 From Hive tables
 You can also create SparkR DataFrames from Hive tables. To do this, we need to create
a HiveContext, which can access tables in the Hive MetaStore. Note that Spark must have
been built with Hive support; more details on the difference between SQLContext and
HiveContext can be found in the SQL programming guide.
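A sketch along the lines of the SparkR programming guide, assuming Spark was built with Hive support and the kv1.txt sample file from the Spark distribution is available:

```r
hiveContext <- sparkRHive.init(sc)
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext,
    "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries can be expressed in HiveQL; the result is a SparkR DataFrame
results <- sql(hiveContext, "FROM src SELECT key, value")
head(results)
```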
SQL queries in SparkR
 A SparkR DataFrame can also be registered as a temporary table in Spark SQL;
registering a DataFrame as a table allows you to run SQL queries over its data. The sql
function enables applications to run SQL queries programmatically and returns the result
as a DataFrame.
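Sketched with the faithful data and the Spark 1.x API:

```r
df <- createDataFrame(sqlContext, faithful)
registerTempTable(df, "faithful")   # expose the DataFrame to Spark SQL

# The result of sql() is itself a SparkR DataFrame
long_waits <- sql(sqlContext, "SELECT * FROM faithful WHERE waiting > 70")
head(long_waits)
```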
DataFrames
 SparkR DataFrames support a number of functions for structured data processing.
Here we include some basic examples; a complete list can be found in the API docs.
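For instance, selection and filtering on the faithful DataFrame (assuming the sqlContext from the installation step):

```r
df <- createDataFrame(sqlContext, faithful)
head(select(df, df$eruptions))     # project a single column
head(filter(df, df$waiting < 50))  # keep only rows with short waits
```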
DataFrames
 SparkR data frames support a number of commonly used functions to aggregate data
after grouping. For example, we can compute a histogram of the waiting time in the
faithful dataset, as shown below.
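A sketch of that aggregation, with df the faithful DataFrame from the earlier examples:

```r
# Count observations per distinct waiting time, then sort descending
waiting_counts <- summarize(groupBy(df, df$waiting),
                            count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
```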
DataFrames
 SparkR also provides a number of functions that can be directly applied to columns for
data processing and during aggregation. The example below shows the use of basic
arithmetic functions.
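For example, deriving a new column with ordinary arithmetic (df again being the faithful DataFrame):

```r
# Column expressions distribute the arithmetic across the cluster
df$waiting_secs <- df$waiting * 60   # waiting time in seconds
head(df)
```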
Applications
Correlation Analysis
K-Means
Decision Trees
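As a hint at the application slides, here is a sketch of correlation analysis and k-means on the faithful data. Note this assumes Spark 2.x (sparkR.session and spark.kmeans were introduced in Spark 2.0, whereas the rest of the deck uses the 1.x API):

```r
library(SparkR)
sparkR.session(appName = "sparkr-ml-demo")

df <- createDataFrame(faithful)   # no sqlContext argument in Spark 2.x

# Correlation between eruption length and waiting time
corr(df, "eruptions", "waiting")

# K-means with two clusters over both numeric columns
model <- spark.kmeans(df, ~ eruptions + waiting, k = 2)
summary(model)
```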