ANALYZING BIG DATA IN R AND SCALA USING
APACHE SPARK
17-07-2019
By: Ahmed Elsayed
M.Sc. Information Systems.
Director of Software Applications Department - IT
Alexandria Petroleum Maintenance Co. "Petromaint"
AhmedElsayeddb@gmail.com
 DATA SCIENCE
 BIG DATA
 HADOOP
o INSTALL & LEARN
 R
o INSTALL & LEARN
 SPARK & SPARKR
o INSTALL & LEARN
 CASE STUDIES
o FIRST CASE: TITANIC TEST EXAMPLE
o SECOND CASE: CRIMES EXAMPLE
 SPARKR ON R USING RSTUDIO
o THIRD CASE: DELAY PREDICTION
 SCALA ON ZEPPELIN
 LEARN AND BE IBM CERTIFIED
MOTIVATION
• Make predictions from a dataset to know not only what
happened or why it happened, but also what will happen.
• Make the machine give the most accurate answer (class
value or group) for new data entered by a user, learning
from a bunch of historical data that is too big to be
handled by a human or even by a single machine's memory.
Data Science
DATA SCIENCE
Machine learning (Big Data Analytics).
Choosing a dataset (in progress; targeting a dataset file of at
least 5 GB).
Pre-processing the dataset to remove or refill missing values.
Implementing classification and clustering.
DATA SCIENCE
Data Science is a combination of:
Studies of managing, storing, and analyzing data.
Mathematics, statistics, and programming.
Ways of capturing data that might not have been captured until now.
The ability to look at things 'differently'.
The activity of cleansing, preparing, and aligning data.
DATA SCIENCE
Machine learning
A branch of Computer Science.
Low-level algorithms discover patterns implicit in the
data.
The more data, the more effective the learning, which is
why machine learning and big data are intricately tied
together.
DATA SCIENCE
Big Data
BIG DATA Why Big Data
 Used to process, analyze, and store large amounts of data.
 Structured and unstructured.
 (Computers, mobile devices, satellites, cameras, images, etc.).
 Exceeds the processing capacity of a traditional DBMS.
 Over 90% of the world's data was generated in the last two years.
 Scales up from single servers to thousands of machines.
 Big data's value for an organization falls into two categories:
o Predicting new products based on product data history.
o Handling data sizes from terabytes to many petabytes in a single data set.
 Hadoop is an open source framework which does all of the above.
Big data changes our entire way of thinking about predictive analytics,
knowledge extraction, and interpretation.
The trial-and-error approach to analysis becomes impossible when
datasets are large and heterogeneous.
Very few tools allow processing of large datasets in a reasonable
amount of time.
Traditional statistical solutions typically focus on static analytics
limited to the analysis of samples frozen in time, which often leads to
outdated and unreliable conclusions.
BIG DATA Big Data Analytics
Hadoop
HADOOP
 An open source software framework developed in Java.
 Processes and queries huge amounts of data.
 Runs on large clusters of commodity hardware.
 Divides massive data into smaller chunks.
 Spreads them out over many machines.
 Each machine processes its chunks in parallel.
 So results can be obtained extremely fast.
 Apache Hadoop has two main components:
 HDFS.
 MapReduce.
Hadoop Distributed File System (HDFS)
 Derived from the concept of the Google File System (GFS).
 A data storage layer based on the UNIX file system.
 Creates multiple replicas of each data block.
 Distributes them on computers throughout a cluster.
 Enables reliable and rapid access.
 Suitable for applications that have large data sets.
HADOOP
MapReduce
 The core processing component of Hadoop.
 Processes Big Data distributed over thousands of nodes.
 Processes chunks in parallel.
 Individual results are later combined to get the final result.
 The whole processing is done in two phases: Map and Reduce.
HADOOP
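The two phases can be illustrated with a toy word count in plain R — a local simulation only, since real MapReduce distributes the map and reduce work across a cluster:

```r
# Toy simulation of the two MapReduce phases in plain R (runs locally, no Hadoop).
lines <- c("big data needs big clusters", "hadoop splits big data")

# Map phase: each line is split into words; every word is emitted as a key.
mapped <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))

# Shuffle + Reduce phase: identical keys are grouped and their counts summed.
counts <- sort(table(mapped), decreasing = TRUE)
print(counts)  # 'big' occurs 3 times, 'data' twice, the rest once
```

On a real cluster the map tasks run on the DataNodes holding each chunk, and only the partial counts travel over the network to the reducers.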
HADOOP YARN
 YARN is Hadoop's cluster resource management platform.
 Responsible for managing computing resources in clusters
and using them for scheduling users' applications.
 Resource manager (one per cluster).
 Node managers run on all nodes in the cluster
to launch and monitor containers.
 MasterNode: stores data (HDFS) and runs parallel computations (MapReduce).
 Slave/Worker Node: machines that do all the work assigned to them by the
MasterNode.
 NameNode: the master of the HDFS system; maintains all directories and files
and manages the blocks present on the DataNodes.
 DataNode: the machine with the actual storage; the slaves of HDFS, responsible
for serving read-write requests from clients.
 JobTracker: does parallel processing of data using MapReduce; this process
interacts with client applications.
 TaskTracker: a process that executes tasks assigned to it by the JobTracker,
such as Map, Reduce, and Shuffle.
HADOOP Master-Slave Architecture
 Load balancing, node failures, cluster expansion, highly fault-tolerant.
 Typical block size is 128 MB, and three copies (replicas) of each block are maintained:
 One on the same node.
 One on the same rack but on a different node.
 One on another rack on a different node.
 Information about all these copies is maintained on the NameNode.
 Clients access data directly from the DataNodes.
 Allows moving processing to the data, giving high throughput.
 Suitable for applications with large data sets.
 Streaming access to file system data.
 Can be built out of commodity hardware.
HADOOP Multi-Node Cluster
HADOOP Ecosystems
HADOOP Install & Learn
HADOOP MULTI NODE CLUSTER ON UBUNTU IN 30 MINUTES
HADOOP 2.7.0 MULTI NODE CLUSTER SETUP ON UBUNTU 15.04
HADOOP TUTORIAL FOR BIG DATA ENTHUSIASTS
R
 R is becoming the most popular language for data science.
 R is data analysis software: statistical analysis, data visualization,
and predictive modeling.
 R is a programming language: an object-oriented language.
 R is an open-source software project: it integrates with other
applications and systems.
 R is a community: thousands of contributors have created add-on
packages, and with two million users R boasts a vibrant online
community.
R Why R
 The cloud is for bigger data, and R is one of the most popular ways to analyze it.
 R is one of the fastest growing languages in the world.
 R has some of the best visualization in analytics software.
 It is open source and free, with 8000+ packages available.
 Supported by Google, Oracle, Microsoft, SAP, SAS Institute, IBM, etc. With
GUI packages it is easy to start analyzing data in R.
 Community ecosystem (conferences, help groups, books, startups, experienced
companies).
 The RStudio IDE helps business users with faster project execution and an
easier transition to the R platform.
R Why Should Cloud Users Learn More About R?
R RHadoop Example Code
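The example-code slide is an image and is not reproduced here. As a rough sketch only — assuming the RHadoop `rmr2` package on top of a configured Hadoop installation — a word count in RHadoop looks roughly like this:

```r
# Hedged sketch of an RHadoop word count with the rmr2 package.
# Assumes Hadoop and rmr2 are installed and configured; not runnable standalone.
library(rmr2)

wordcount <- function(input) {
  mapreduce(
    input  = input,
    map    = function(k, v) {
      words <- unlist(strsplit(v, " "))
      keyval(words, 1)            # map phase: emit (word, 1) pairs
    },
    reduce = function(word, counts) {
      keyval(word, sum(counts))   # reduce phase: sum the counts per word
    }
  )
}
```

The `mapreduce()` call submits the map and reduce functions as a Hadoop job; `keyval()` is rmr2's way of emitting key-value pairs from each phase.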
R Install & Learn
INSTALL R, R STUDIO AND R PACKAGES IN SIMPLE STEPS
R TUTORIAL – OUTSTANDING INTRODUCTION TO R PROGRAMMING
FOR DATA SCIENCE!
SPARK & SPARKR
• What is Spark?
• A unified, open source,
parallel, data processing
framework for Big Data
Analytics
SPARK What is Spark?
• http://spark.apache.org/
• Speed
• Ease of use
• Generality
• Integrated with Hadoop
• Scalability
SPARK Motivation to Use Spark
• Apache Spark is an open source cluster computing
framework
• Originally developed at the University of California,
Berkeley's AMPLab
SPARK Origin
SPARK RDD (Resilient Distributed Dataset)
SPARK Iterative Operations on MapReduce
SPARK Iterative Operations on Spark RDD
Fast!
Scalable
Flexible
Statistical!
Interactive
Packages
SPARKR
SPARKR How does a SparkR cluster work?
[Diagram] SparkR cluster: an R process talks through the RBackend to the
Spark Driver JVM, which coordinates worker JVMs (each spawning R processes)
that read from the data sources.
SPARKR SparkR Architecture (since 2.0)
IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect
SPARKR Overview of the SparkR API
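A hedged sketch that touches several of these API groups in one short session (assumes a local Spark installation; the built-in R dataset `faithful` is used purely for illustration):

```r
# Hedged sketch exercising the SparkR API groups above (needs a Spark install).
library(SparkR)
sparkR.session()                          # start a local Spark session

df <- createDataFrame(faithful)           # IO: local R data.frame -> Spark DataFrame
cache(df)                                 # Caching: keep it in memory
createOrReplaceTempView(df, "faithful")   # SQL: register as a temp view

long <- sql("SELECT * FROM faithful WHERE waiting > 70")
head(select(long, "eruptions"))           # DataFrame API: select / head

sparkR.stop()                             # shut the session down
```

Each line maps to one of the API groups listed above, which is most of what the crime and flight-delay cases later need.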
SPARKR RStudio
SPARKR Apache Zeppelin
Install & Learn
INSTALL APACHE SPARK ON MULTI-NODE CLUSTER
SPARK TUTORIAL – LEARN SPARK PROGRAMMING
INSTALLING SPARKR
SPARKR AND R – DATAFRAME AND DATA.FRAME
INSTALLING R ON HADOOP CLUSTER TO RUN SPARKR
INSTALL R, R STUDIO AND R PACKAGES IN SIMPLE STEPS
SPARKR
BUILDING ZEPPELIN-WITH-R ON SPARK AND ZEPPELIN
CASE STUDIES
FIRST CASE
TITANIC TEST EXAMPLE
http://amunategui.github.io/databricks-spark-bayes/
TITANIC TEST EXAMPLE — My Work
SECOND CASE
CRIMES EXAMPLE
Cluster specs
CRIMES EXAMPLE
6 Machines specifications
Hdmaster:
Processor: AMD Phenom(tm) 8600B
Cores: 3
Memory: 8 GB
Hard disk: 120 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
Hdslave1, Hdslave2, Hdslave3, Hdslave4 and Hdslave5:
Processor: Intel Core 2 Duo CPU E8400 3.00GHz
Cores: 2
Memory: 4 GB
Hard disk: 40 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
6 machines connected together on one Gigabit switch; speed approximately 600 Mbit/s.
CRIMES EXAMPLE Hadoop Cluster: 1 master and 5 slaves
CRIMES EXAMPLE Spark Standalone Cluster: 1 driver and 6 workers
CRIMES EXAMPLE Dataset
https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
Dataset downloaded 19-11-2016
SPARKR ON R USING RSTUDIO
CRIMES EXAMPLE Preprocessing: Initiating SparkR
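The initiation slide is a screenshot; a hedged sketch of how a SparkR session against the cluster described above might be started from RStudio (the SPARK_HOME path, master URL, port, and memory setting are assumptions):

```r
# Hedged sketch: initiating SparkR from RStudio (paths and URLs are assumptions).
Sys.setenv(SPARK_HOME = "/usr/local/spark")   # assumed Spark install location
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sparkR.session(
  master      = "spark://hdmaster:7077",      # standalone master (assumed port)
  appName     = "CrimesExample",
  sparkConfig = list(spark.executor.memory = "2g")  # assumed executor memory
)
```

Once the session is up, `read.df` can pull the CSV straight from HDFS as shown in the next step.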
Splitting Dataset
Splitting the dataset to capture only specific columns in a new dataset
(Crimes2001topresent.csv, 1.5 GB, 6,208,265 rows).
> path1 <- file.path("hdfs://hdmaster:9000/user/ahmed/data/crimes/Crimes2001topresent.csv")
> system.time(path <- (cache(read.df(path = path1, source = "com.databricks.spark.csv", inferSchema = "true", header = "true"))))
   user  system elapsed
  0.076   0.064  35.736
Splitting Dataset
> createOrReplaceTempView(path, "path")
> dssql <- sql("SELECT PrimaryType, LocationDescription, Arrest, Domestic, District, Beat, Year FROM path")
> system.time(write.df(repartition(dssql, 1), "hdfs://hdmaster:9000/user/ahmed/data/crimes/Crimes2", source = "csv", mode = "overwrite"))
Downloading the output file
Spark dataframe for the new file crimessplited.csv
Null values
Preparing the main Spark dataframe from SQL
Naïve Bayes Algorithm
CRIMES EXAMPLE Learning phase
Splitting the dataset into test and train sets.
The Naïve Bayes and prediction algorithms from Spark MLlib.
A priori probabilities and samples of class weights for each column.
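The learning-phase slides are screenshots; a hedged sketch of the split/train/predict flow with SparkR's MLlib wrapper (`crimes` stands for the preprocessed Spark DataFrame built earlier, the 70/30 split ratio and seed are assumptions, and the column names follow the SQL selection above):

```r
# Hedged sketch of the learning phase (requires an active SparkR session;
# 'crimes' is assumed to be the preprocessed Spark DataFrame from earlier).
splits <- randomSplit(crimes, c(0.7, 0.3), seed = 42)
train  <- splits[[1]]
test   <- splits[[2]]

# Train naïve Bayes from Spark MLlib; the R formula doubles as feature selection.
model <- spark.naiveBayes(train, Arrest ~ PrimaryType + LocationDescription +
                                  District + Year)
summary(model)                  # a-priori and conditional probabilities

pred <- predict(model, test)    # prediction phase
head(select(pred, "Arrest", "prediction"))
```

`summary(model)` is where the a-priori probabilities and per-column class weights mentioned above come from.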
CRIMES EXAMPLE Prediction phase
Convert the Spark dataframe to a local R dataframe for confusion-matrix purposes.
Confusion matrix
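A hedged sketch of that conversion and a base-R confusion matrix (`pred` is a hypothetical prediction DataFrame from the learning phase; `caret::confusionMatrix` is an alternative if the caret package is installed):

```r
# Hedged sketch: collect predictions to the driver and cross-tabulate.
# 'pred' is assumed to be a Spark DataFrame of actuals and predictions.
local_pred <- collect(select(pred, "Arrest", "prediction"))  # Spark -> local data.frame

# Base-R confusion matrix: actual class vs predicted class.
cm <- table(actual = local_pred$Arrest, predicted = local_pred$prediction)
print(cm)
accuracy <- sum(diag(cm)) / sum(cm)  # overall accuracy from the diagonal
```

`collect()` pulls the (already small) prediction columns to the driver, since a confusion matrix is cheap to compute locally.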
Predicting & Suggesting
Predicting
Suggesting
Visualization
CRIMES EXAMPLE Graph
Convert the Spark dataframe to a local R dataframe for ggplot2 graphing purposes.
Transform prediction and arrest to Boolean instead of nominal for plotting purposes.
THIRD CASE
DELAY PREDICTION
 The dataset is made up of records of all USA domestic flights of
major carriers.
 "Airline on-time performance," downloaded as a CSV file.
 Details of the arrival and departure of all commercial flights in
the US from October 1987 to April 2008.
 A total of nearly 123 million records in 12 gigabytes.
DELAY PREDICTION Dataset
• Year: 1987-2008,
• Month: 1-12,
• DayofMonth: 1-31,
• DayOfWeek: 1 (Monday) - 7 (Sunday),
• DepTime: actual departure time,
• CRSDepTime: scheduled departure time,
• ArrTime: actual arrival time,
• CRSArrTime: scheduled arrival time,
• UniqueCarrier: unique carrier code,
• FlightNum: flight number,
• TailNum: plane tail number,
• ActualElapsedTime: in minutes,
• CRSElapsedTime: in minutes,
• AirTime: in minutes,
• ArrDelay: arrival delay, in minutes,
• DepDelay: departure delay, in minutes,
• Origin: origin IATA airport code,
• Dest: destination IATA airport code,
• Distance: in miles,
• TaxiIn: taxi-in time, in minutes,
• TaxiOut: taxi-out time, in minutes,
• Cancelled: was the flight cancelled?,
• CancellationCode: reason for cancellation
(A = carrier, B = weather, C = NAS, D = security),
• Diverted: 1 = yes, 0 = no,
• CarrierDelay: in minutes,
• WeatherDelay: in minutes,
• NASDelay: in minutes,
• SecurityDelay: in minutes,
• LateAircraftDelay: in minutes.
Variable descriptions (29 variables):
Class
• The class was built following U.S. Department of
Transportation / Federal Aviation Administration (FAA) criteria.
• On-time binary class: if the departure delay is < 15 minutes then 'yes';
if the delay is > 15 minutes or the flight is canceled then 'no'.
• Criteria: Jan-2004; instances selected: 583.9K rows.
• 70% for training (407.7K rows) and 30% for testing
(176.2K rows).
DELAY PREDICTION Classification Algorithms Comparison
Performance classification comparison
This answers the first question, "What is the best classification
algorithm to use from SparkR MLlib?", as shown in table (4).
DELAY PREDICTION Classification Algorithms Comparison
DELAY PREDICTION Binary Class Test
Spark Cluster over the Hadoop Cluster
DELAY PREDICTION Hadoop Cluster
Actual cluster: the photograph shows the physical cluster machines.
DELAY PREDICTION Hadoop Cluster
Hadoop Cluster Specs: 6 machine specifications
Hdmaster:
Processor: AMD Phenom(tm) 8600B
Cores: 3
Memory: 8 GB
Hard disk: 120 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
Hdslave (1, 2, 3, 4, and 5):
Processor: Intel Core 2 Duo CPU E8400 3.00GHz
Cores: 2
Memory: 4 GB
Hard disk: 40 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
6 machines connected together.
Hadoop version 2.6
DELAY PREDICTION Hadoop Cluster
 If the departure delay is < 15 minutes, then on-time is 'True'.
 If it is > 15 minutes or the flight is canceled, then on-time is 'False'.
DELAY PREDICTION Binary Class Test
Dividing the Dataset
 The selected range of data is 15 years, with 91,449,659
instances.
 The full dataset is separated into:
 70% as a training set with 64,020,457 instances.
 30% as a testing set with 27,429,202 instances.
 The split and validation were done using the Holdout Validation
technique.
 The training and test sets are cached as Spark dataframes on the
cluster.
DELAY PREDICTION Binary Class Test
The Result
The test covers prediction of both departure and arrival delays.
DELAY PREDICTION Binary Class Test
Predicting the Departure and Arrival Flight Delays in One Process
DELAY PREDICTION Multinomial Class
Preprocessing Using SparkR SQL
 Ten attributes (columns) were pruned due to their lack of data
or because they were empty:
 AirTime, TailNum, TaxiIn, TaxiOut, CancellationCode,
CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and
LateAircraftDelay.
 The rest of the columns were selected.
 The selected range of data is 15 years, with 91,449,659
instances.
DELAY PREDICTION Multinomial Class
Proposed multinomial class (On-time)
When DepDelay < 15 and ArrDelay < 15, then 'Both Ontime'.
When DepDelay > 15 and ArrDelay > 15, then 'Both Delayed'.
When DepDelay > 15 and ArrDelay < 15, then 'Origin Delay'.
When DepDelay < 15 and ArrDelay > 15, then 'Destination Delay'.
When Cancelled is true, then 'Both Delayed'.
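These rules translate naturally into a SparkR SQL CASE expression. A hedged sketch (the view name `flights` is an assumption, as is treating a delay of exactly 15 minutes as delayed, since the slides leave the boundary unspecified):

```r
# Hedged sketch: building the multinomial On-time class with SparkR SQL.
# Assumes the flights DataFrame is registered as a temp view named 'flights';
# the >= 15 boundary is an assumption (the rules only state < 15 and > 15).
labeled <- sql("
  SELECT *,
    CASE
      WHEN Cancelled = 1                     THEN 'Both Delayed'
      WHEN DepDelay >= 15 AND ArrDelay >= 15 THEN 'Both Delayed'
      WHEN DepDelay >= 15 AND ArrDelay <  15 THEN 'Origin Delay'
      WHEN DepDelay <  15 AND ArrDelay >= 15 THEN 'Destination Delay'
      ELSE 'Both Ontime'
    END AS Ontime
  FROM flights")
```

The `Ontime` column then serves as the label for the multinomial classifier.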
Features Selector (RFormula)
RFormula is used with the rest of the selected columns.
DELAY PREDICTION Multinomial Class
Dataset Splitting
The full dataset, after the feature-selection process, is separated
into:
 70% as a training set with 64,020,457 instances.
 30% as a testing set with 27,429,202 instances.
The split and validation were done using the Holdout Validation
technique.
DELAY PREDICTION Multinomial Class
Learning phase: predicting the departure and arrival flight delays
in one process.
DELAY PREDICTION Multinomial Class
Predicting phase: predicting the departure and arrival flight delays
in one process.
DELAY PREDICTION Multinomial Class
Prediction & Validation Metrics
Prediction instances and accuracy
Prediction metrics
DELAY PREDICTION Multinomial Class
Prediction & Validation Metrics
Prediction confusion matrix
DELAY PREDICTION Multinomial Class
Shiny Web Page
DPDAD model interface
DELAY PREDICTION Multinomial Class
Example output: there are no delays in the origin and destination
airports; they are Both Ontime (95.4%).
Suggesting the top ten carriers and their probabilities:
 Run the prediction with the stored ML model, using the whole
dataset from Hadoop as a test set.
 Use Spark SQL to select the top ten carriers with the highest
probabilities and a prediction class equal to "yes".
DELAY PREDICTION Multinomial Class
SCALA ON ZEPPELIN
DELAY PREDICTION Scala on Zeppelin
Figure (1). The code for reading the dataset file from Hadoop.
DELAY PREDICTION Scala on Zeppelin
Figure (2). The code for using Spark SQL to clean the dataset,
handling:
 Missing data.
 Corrupted data.
 Time value ranges.
 Building the multinomial class.
It also uses RFormula as a feature selector.
DELAY PREDICTION Scala on Zeppelin
Figure (3). An output sample for figure (2).
DELAY PREDICTION Scala on Zeppelin
Figure (4). The code for splitting the dataset using the Holdout Validation
technique and caching it in Spark storage.
Figure (5). The counts for the training and testing data.
DELAY PREDICTION Scala on Zeppelin
Figure (6). Running the Naïve Bayes algorithm as the learning phase.
Figure (7). The prediction phase and a sample of the output.
DELAY PREDICTION Scala on Zeppelin
Figure (8). The counts of the actual and predicted classes.
Figure (9). Calculating the confusion matrix and metrics.
CERTIFICATES IBM Badges
IBM - BIG DATA 101
IBM - HADOOP 101
IBM - SPARK FUNDAMENTALS I
Thank You
Editor's Notes
  • #6: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #7: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #8: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #9: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #11: REF. [3]
  • #12: REF. [3]
  • #13: REF. [3]
  • #14: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #15: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #16: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #18: REF. [4]
  • #19: REF. [4]
  • #20: REF. [4]
  • #21: REF. [4]
  • #22: REF. [4]
  • #23: REF. [5]
  • #24: REF. [5]
  • #25: REF. [4]
  • #26: REF. [5]
  • #27: REF. [5]
  • #28: REF. [5]
  • #29: REF. [5]
  • #30: REF. [5]
  • #31: REF. [5]
  • #32: REF. [5]
  • #33: REF. [5]
  • #35: REF. [6]
  • #36: REF. [6]
  • #37: REF. [4]
  • #38: REF. [4]
  • #51: However, there’s one drawback: Traditionally, the R internal is single-threaded. It is unclear how R programs can be effectively and concisely written to run on multiple machines. So, what if we can combine these two worlds? This is where SparkR comes in: it is a language binding that lets users write R programs that are equipped with nice statistics packages, and have them run on top of Spark.
  • #53: Worker refers to Worker machine Mention that all Spark data sources work
  • #58: REF. [4]
  • #137: REF. [5]