ANALYZING BIG DATA IN R AND SCALA USING
APACHE SPARK
17-07-2019
By: Ahmed Elsayed
M.Sc. Information Systems.
Director of Software Applications Department - IT
Alexandria Petroleum Maintenance Co. "Petromaint"
AhmedElsayeddb@gmail.com
 DATA SCIENCE
 BIG DATA
 HADOOP
o INSTALL & LEARN
 R
o INSTALL & LEARN
 SPARK & SPARKR
o INSTALL & LEARN
 CASE STUDIES
o FIRST CASE: TITANIC TEST EXAMPLE
o SECOND CASE: CRIMES EXAMPLE
 SPARKR ON R USING RSTUDIO
o THIRD CASE: DELAY PREDICTION
 SCALA ON ZEPPELIN
 LEARN AND BE IBM CERTIFIED
MOTIVATION
• Make predictions from a dataset to know not only what
happened or why it happened, but also what will happen.
• Make the machine give the most accurate answer (class
value or group) for new data entered by a user, learning
from a bunch of historical data that is too big to be
handled by a human or even by a single machine's memory.
Data Science
DATA SCIENCE
Machine learning (Big Data Analytics).
Choosing a dataset (in progress; targeting a dataset file of at
least 5 GB).
Pre-processing the dataset to remove or refill missing values.
Implementing classification and clustering.
DATA SCIENCE
Data Science is a combination of:
Studies of managing, storing, and analyzing data.
Mathematics, statistics, and programming.
Ways of capturing data that might not have been captured until now.
The ability to look at things 'differently'.
The activity of cleansing, preparing, and aligning data.
DATA SCIENCE
Machine learning
A branch of Computer Science.
Low-level algorithms discover patterns implicit in the
data.
The more data, the more effective the learning, which is
why machine learning and big data are intricately tied
together.
DATA SCIENCE
Big Data
BIG DATA Why Big Data
 Used to process, analyze, and store large amounts of data.
 Structured and unstructured.
 (Computers, mobile devices, satellites, cameras, images, etc.).
 Exceeds the processing capacity of a traditional DBMS.
 Over 90% of the world's data was generated in the last two years.
 Scales up from single servers to thousands of machines.
 Big data's value for an organization falls into two categories:
o Predicting new products based on product data history.
o Handling data sizes from terabytes to many petabytes in a single data set.
 Hadoop is an open source framework which does all of the above.
Big data changes our entire way of thinking about predictive analytics,
knowledge extraction, and interpretation.
The trial-and-error approach to analysis becomes impossible when
datasets are large and heterogeneous.
Very few tools allow processing of large datasets in a reasonable
amount of time.
Traditional statistical solutions typically focus on static analytics
limited to the analysis of samples frozen in time, which often leads to
outdated and unreliable conclusions.
BIG DATA Big Data Analytics
Hadoop
HADOOP
 An open source software framework developed in Java.
 Processes and queries huge amounts of data.
 Runs on large clusters of commodity hardware.
 Divides massive data into smaller chunks.
 Spreads them out over many machines.
 Each machine processes its chunks in parallel.
 So results can be obtained extremely fast.
 Apache Hadoop has two main components:
 HDFS.
 MapReduce.
Hadoop Distributed File System (HDFS)
 Derived from the concept of the Google File System (GFS).
 A data storage layer based on the UNIX file system.
 Creates multiple replicas of each data block.
 Distributes them on computers throughout a cluster.
 Enables reliable and rapid access.
 Suitable for applications that have large data sets.
HADOOP
MapReduce
 The core processing component of Hadoop.
 Processes Big Data distributed over thousands of nodes.
 Processes chunks in parallel.
 Individual results are later combined to get the final result.
 The whole processing is done in two phases: Map and Reduce.
HADOOP
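The two phases can be illustrated with a toy word count in plain R — a local simulation only, since real MapReduce distributes the map and reduce work across a cluster:

```r
# Toy simulation of the two MapReduce phases in plain R (runs locally, no Hadoop).
lines <- c("big data needs big clusters", "hadoop splits big data")

# Map phase: each line is split into words; every word is emitted as a key.
mapped <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))

# Shuffle + Reduce phase: identical keys are grouped and their counts summed.
counts <- sort(table(mapped), decreasing = TRUE)
print(counts)  # 'big' occurs 3 times, 'data' twice, the rest once
```

On a real cluster the map tasks run on the DataNodes holding each chunk, and only the partial counts travel over the network to the reducers.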
HADOOP YARN
 YARN is Hadoop's cluster resource management platform.
 Responsible for managing computing resources in clusters
and using them for scheduling users' applications.
 Resource manager (one per cluster).
 Node managers run on all nodes in the cluster
to launch and monitor containers.
 MasterNode: stores data (HDFS) and runs parallel computations (MapReduce).
 Slave/Worker Node: machines that do all the work assigned to them by the
MasterNode.
 NameNode: the master of the HDFS system; maintains all directories and files
and manages the blocks present on the DataNodes.
 DataNode: the machine with the actual storage; the slaves of HDFS, responsible
for serving read-write requests from clients.
 JobTracker: does parallel processing of data using MapReduce; this process
interacts with client applications.
 TaskTracker: a process that executes tasks assigned to it by the JobTracker,
such as Map, Reduce, and Shuffle.
HADOOP Master-Slave Architecture
 Load balancing, node failures, cluster expansion, highly fault-tolerant.
 Typical block size is 128 MB, and three copies (replicas) of each block are maintained:
 One on the same node.
 One on the same rack but on a different node.
 One on another rack on a different node.
 Information about all these copies is maintained on the NameNode.
 Clients access data directly from the DataNodes.
 Allows moving processing to the data, giving high throughput.
 Suitable for applications with large data sets.
 Streaming access to file system data.
 Can be built out of commodity hardware.
HADOOP Multi-Node Cluster
HADOOP Ecosystems
HADOOP Install & Learn
HADOOP MULTI NODE CLUSTER ON UBUNTU IN 30 MINUTES
HADOOP 2.7.0 MULTI NODE CLUSTER SETUP ON UBUNTU 15.04
HADOOP TUTORIAL FOR BIG DATA ENTHUSIASTS
R
 R is becoming the most popular language for data science.
 R is data analysis software: statistical analysis, data visualization,
and predictive modeling.
 R is a programming language: an object-oriented language.
 R is an open-source software project: it integrates with other
applications and systems.
 R is a community: thousands of contributors have created add-on
packages, and with two million users R boasts a vibrant online
community.
R Why R
 The cloud is for bigger data, and R is one of the most popular ways to analyze it.
 R is one of the fastest growing languages in the world.
 R has some of the best visualization in analytics software.
 It is open source and free, with 8000+ packages available.
 Supported by Google, Oracle, Microsoft, SAP, SAS Institute, IBM, etc. With
GUI packages it is easy to start analyzing data in R.
 Community ecosystem (conferences, help groups, books, startups, experienced
companies).
 The RStudio IDE helps business users with faster project execution and an
easier transition to the R platform.
R Why Should Cloud Users Learn More About R?
R RHadoop Example Code
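The example-code slide is an image and is not reproduced here. As a rough sketch only — assuming the RHadoop `rmr2` package on top of a configured Hadoop installation — a word count in RHadoop looks roughly like this:

```r
# Hedged sketch of an RHadoop word count with the rmr2 package.
# Assumes Hadoop and rmr2 are installed and configured; not runnable standalone.
library(rmr2)

wordcount <- function(input) {
  mapreduce(
    input  = input,
    map    = function(k, v) {
      words <- unlist(strsplit(v, " "))
      keyval(words, 1)            # map phase: emit (word, 1) pairs
    },
    reduce = function(word, counts) {
      keyval(word, sum(counts))   # reduce phase: sum the counts per word
    }
  )
}
```

The `mapreduce()` call submits the map and reduce functions as a Hadoop job; `keyval()` is rmr2's way of emitting key-value pairs from each phase.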
R Install & Learn
INSTALL R, R STUDIO AND R PACKAGES IN SIMPLE STEPS
R TUTORIAL – OUTSTANDING INTRODUCTION TO R PROGRAMMING
FOR DATA SCIENCE!
SPARK & SPARKR
• What is Spark?
• A unified, open source,
parallel, data processing
framework for Big Data
Analytics
SPARK What is Spark?
• http://spark.apache.org/
• Speed
• Ease of use
• Generality
• Integrated with Hadoop
• Scalability
SPARK Motivation to Use Spark
• Apache Spark is an open source cluster computing
framework
• Originally developed at the University of California,
Berkeley's AMPLab
SPARK Origin
SPARK RDD (Resilient Distributed Dataset)
SPARK Iterative Operations on MapReduce
SPARK Iterative Operations on Spark RDD
Fast!
Scalable
Flexible
Statistical!
Interactive
Packages
SPARKR
SPARKR How does a SparkR cluster work?
[Diagram] SparkR cluster: an R process talks through the RBackend to the
Spark Driver JVM, which coordinates worker JVMs (each spawning R processes)
that read from the data sources.
SPARKR SparkR Architecture (since 2.0)
IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect
SPARKR Overview of the SparkR API
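A hedged sketch that touches several of these API groups in one short session (assumes a local Spark installation; the built-in R dataset `faithful` is used purely for illustration):

```r
# Hedged sketch exercising the SparkR API groups above (needs a Spark install).
library(SparkR)
sparkR.session()                          # start a local Spark session

df <- createDataFrame(faithful)           # IO: local R data.frame -> Spark DataFrame
cache(df)                                 # Caching: keep it in memory
createOrReplaceTempView(df, "faithful")   # SQL: register as a temp view

long <- sql("SELECT * FROM faithful WHERE waiting > 70")
head(select(long, "eruptions"))           # DataFrame API: select / head

sparkR.stop()                             # shut the session down
```

Each line maps to one of the API groups listed above, which is most of what the crime and flight-delay cases later need.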
SPARKR RStudio
SPARKR Apache Zeppelin
Install & Learn
INSTALL APACHE SPARK ON MULTI-NODE CLUSTER
SPARK TUTORIAL – LEARN SPARK PROGRAMMING
INSTALLING SPARKR
SPARKR AND R – DATAFRAME AND DATA.FRAME
INSTALLING R ON HADOOP CLUSTER TO RUN SPARKR
INSTALL R, R STUDIO AND R PACKAGES IN SIMPLE STEPS
SPARKR
BUILDING ZEPPELIN-WITH-R ON SPARK AND ZEPPELIN
CASE STUDIES
FIRST CASE
TITANIC TEST EXAMPLE
http://amunategui.github.io/databricks-spark-bayes/
TITANIC TEST EXAMPLE — My Work
SECOND CASE
CRIMES EXAMPLE
Cluster specs
CRIMES EXAMPLE
6 Machines specifications
Hdmaster:
Processor: AMD Phenom(tm) 8600B
Cores: 3
Memory: 8 GB
Hard disk: 120 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
Hdslave1, Hdslave2, Hdslave3, Hdslave4 and Hdslave5:
Processor: Intel Core 2 Duo CPU E8400 3.00GHz
Cores: 2
Memory: 4 GB
Hard disk: 40 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
6 machines connected together on one Gigabit switch; speed approximately 600 Mbit/s.
CRIMES EXAMPLE Hadoop Cluster: 1 master and 5 slaves
CRIMES EXAMPLE Spark Standalone Cluster: 1 driver and 6 workers
CRIMES EXAMPLE Dataset
https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
Dataset downloaded 19-11-2016
SPARKR ON R USING RSTUDIO
CRIMES EXAMPLE Preprocessing: Initiating SparkR
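The initiation slide is a screenshot; a hedged sketch of how a SparkR session against the cluster described above might be started from RStudio (the SPARK_HOME path, master URL, port, and memory setting are assumptions):

```r
# Hedged sketch: initiating SparkR from RStudio (paths and URLs are assumptions).
Sys.setenv(SPARK_HOME = "/usr/local/spark")   # assumed Spark install location
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sparkR.session(
  master      = "spark://hdmaster:7077",      # standalone master (assumed port)
  appName     = "CrimesExample",
  sparkConfig = list(spark.executor.memory = "2g")  # assumed executor memory
)
```

Once the session is up, `read.df` can pull the CSV straight from HDFS as shown in the next step.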
Splitting Dataset
Splitting the dataset to capture only specific columns in a new dataset
(Crimes2001topresent.csv, 1.5 GB, 6,208,265 rows).
> path1 <- file.path("hdfs://hdmaster:9000/user/ahmed/data/crimes/Crimes2001topresent.csv")
> system.time(path <- (cache(read.df(path = path1, source = "com.databricks.spark.csv", inferSchema = "true", header = "true"))))
   user  system elapsed
  0.076   0.064  35.736
Splitting Dataset
> createOrReplaceTempView(path, "path")
> dssql <- sql("SELECT PrimaryType, LocationDescription, Arrest, Domestic, District, Beat, Year FROM path")
> system.time(write.df(repartition(dssql, 1), "hdfs://hdmaster:9000/user/ahmed/data/crimes/Crimes2", source = "csv", mode = "overwrite"))
Downloading the output file
Spark dataframe for the new file crimessplited.csv
Null values
Preparing the main Spark dataframe from SQL
Naïve Bayes Algorithm
CRIMES EXAMPLE Learning phase
Splitting the dataset into test and train sets.
The Naïve Bayes and prediction algorithms from Spark MLlib.
A priori probabilities and samples of class weights for each column.
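The learning-phase slides are screenshots; a hedged sketch of the split/train/predict flow with SparkR's MLlib wrapper (`crimes` stands for the preprocessed Spark DataFrame built earlier, the 70/30 split ratio and seed are assumptions, and the column names follow the SQL selection above):

```r
# Hedged sketch of the learning phase (requires an active SparkR session;
# 'crimes' is assumed to be the preprocessed Spark DataFrame from earlier).
splits <- randomSplit(crimes, c(0.7, 0.3), seed = 42)
train  <- splits[[1]]
test   <- splits[[2]]

# Train naïve Bayes from Spark MLlib; the R formula doubles as feature selection.
model <- spark.naiveBayes(train, Arrest ~ PrimaryType + LocationDescription +
                                  District + Year)
summary(model)                  # a-priori and conditional probabilities

pred <- predict(model, test)    # prediction phase
head(select(pred, "Arrest", "prediction"))
```

`summary(model)` is where the a-priori probabilities and per-column class weights mentioned above come from.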
CRIMES EXAMPLE Prediction phase
Convert the Spark dataframe to a local R dataframe for confusion-matrix purposes.
Confusion matrix
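A hedged sketch of that conversion and a base-R confusion matrix (`pred` is a hypothetical prediction DataFrame from the learning phase; `caret::confusionMatrix` is an alternative if the caret package is installed):

```r
# Hedged sketch: collect predictions to the driver and cross-tabulate.
# 'pred' is assumed to be a Spark DataFrame of actuals and predictions.
local_pred <- collect(select(pred, "Arrest", "prediction"))  # Spark -> local data.frame

# Base-R confusion matrix: actual class vs predicted class.
cm <- table(actual = local_pred$Arrest, predicted = local_pred$prediction)
print(cm)
accuracy <- sum(diag(cm)) / sum(cm)  # overall accuracy from the diagonal
```

`collect()` pulls the (already small) prediction columns to the driver, since a confusion matrix is cheap to compute locally.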
Predicting & Suggesting
Predicting
Suggesting
Visualization
CRIMES EXAMPLE Graph
Convert the Spark dataframe to a local R dataframe for ggplot2 graphing purposes.
Transform prediction and arrest to Boolean instead of nominal for plotting purposes.
THIRD CASE
DELAY PREDICTION
 The dataset is made up of records of all USA domestic flights of
major carriers.
 "Airline on-time performance," downloaded as a CSV file.
 Details of the arrival and departure of all commercial flights in
the US from October 1987 to April 2008.
 A total of nearly 123 million records in 12 gigabytes.
DELAY PREDICTION Dataset
• Year: 1987-2008,
• Month: 1-12,
• DayofMonth: 1-31,
• DayOfWeek: 1 (Monday) - 7 (Sunday),
• DepTime: actual departure time,
• CRSDepTime: scheduled departure time,
• ArrTime: actual arrival time,
• CRSArrTime: scheduled arrival time,
• UniqueCarrier: unique carrier code,
• FlightNum: flight number,
• TailNum: plane tail number,
• ActualElapsedTime: in minutes,
• CRSElapsedTime: in minutes,
• AirTime: in minutes,
• ArrDelay: arrival delay, in minutes,
• DepDelay: departure delay, in minutes,
• Origin: origin IATA airport code,
• Dest: destination IATA airport code,
• Distance: in miles,
• TaxiIn: taxi-in time, in minutes,
• TaxiOut: taxi-out time, in minutes,
• Cancelled: was the flight cancelled?,
• CancellationCode: reason for cancellation
(A = carrier, B = weather, C = NAS, D = security),
• Diverted: 1 = yes, 0 = no,
• CarrierDelay: in minutes,
• WeatherDelay: in minutes,
• NASDelay: in minutes,
• SecurityDelay: in minutes,
• LateAircraftDelay: in minutes.
Variable descriptions (29 variables):
Class
• The class was built following U.S. Department of
Transportation / Federal Aviation Administration (FAA) criteria.
• On-time binary class: if the departure delay is < 15 minutes then 'yes';
if the delay is > 15 minutes or the flight is canceled then 'no'.
• Criteria: Jan-2004; instances selected: 583.9K rows.
• 70% for training (407.7K rows) and 30% for testing
(176.2K rows).
DELAY PREDICTION Classification Algorithms Comparison
Performance classification comparison
This answers the first question, "What is the best classification
algorithm to use from SparkR MLlib?", as shown in table (4).
DELAY PREDICTION Classification Algorithms Comparison
DELAY PREDICTION Binary Class Test
Spark Cluster over the Hadoop Cluster
DELAY PREDICTION Hadoop Cluster
Actual cluster: the photograph shows the physical cluster machines.
DELAY PREDICTION Hadoop Cluster
Hadoop Cluster Specs: 6 machine specifications
Hdmaster:
Processor: AMD Phenom(tm) 8600B
Cores: 3
Memory: 8 GB
Hard disk: 120 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
Hdslave (1, 2, 3, 4, and 5):
Processor: Intel Core 2 Duo CPU E8400 3.00GHz
Cores: 2
Memory: 4 GB
Hard disk: 40 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
6 machines connected together.
Hadoop version 2.6
DELAY PREDICTION Hadoop Cluster
 If the departure delay is < 15 minutes, then on-time is 'True'.
 If it is > 15 minutes or the flight is canceled, then on-time is 'False'.
DELAY PREDICTION Binary Class Test
Dividing the Dataset
 The selected range of data is 15 years, with 91,449,659
instances.
 The full dataset is separated into:
 70% as a training set with 64,020,457 instances.
 30% as a testing set with 27,429,202 instances.
 The split and validation were done using the Holdout Validation
technique.
 The training and test sets are cached as Spark dataframes on the
cluster.
DELAY PREDICTION Binary Class Test
The Result
The test covers prediction of both departure and arrival delays.
DELAY PREDICTION Binary Class Test
Predicting the Departure and Arrival Flight Delays in One Process
DELAY PREDICTION Multinomial Class
Preprocessing Using SparkR SQL
 Ten attributes (columns) were pruned due to their lack of data
or because they were empty:
 AirTime, TailNum, TaxiIn, TaxiOut, CancellationCode,
CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and
LateAircraftDelay.
 The rest of the columns were selected.
 The selected range of data is 15 years, with 91,449,659
instances.
DELAY PREDICTION Multinomial Class
Proposed multinomial class (On-time)
When DepDelay < 15 and ArrDelay < 15, then 'Both Ontime'.
When DepDelay > 15 and ArrDelay > 15, then 'Both Delayed'.
When DepDelay > 15 and ArrDelay < 15, then 'Origin Delay'.
When DepDelay < 15 and ArrDelay > 15, then 'Destination Delay'.
When Cancelled is true, then 'Both Delayed'.
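These rules translate naturally into a SparkR SQL CASE expression. A hedged sketch (the view name `flights` is an assumption, as is treating a delay of exactly 15 minutes as delayed, since the slides leave the boundary unspecified):

```r
# Hedged sketch: building the multinomial On-time class with SparkR SQL.
# Assumes the flights DataFrame is registered as a temp view named 'flights';
# the >= 15 boundary is an assumption (the rules only state < 15 and > 15).
labeled <- sql("
  SELECT *,
    CASE
      WHEN Cancelled = 1                     THEN 'Both Delayed'
      WHEN DepDelay >= 15 AND ArrDelay >= 15 THEN 'Both Delayed'
      WHEN DepDelay >= 15 AND ArrDelay <  15 THEN 'Origin Delay'
      WHEN DepDelay <  15 AND ArrDelay >= 15 THEN 'Destination Delay'
      ELSE 'Both Ontime'
    END AS Ontime
  FROM flights")
```

The `Ontime` column then serves as the label for the multinomial classifier.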
Features Selector (RFormula)
RFormula is used with the rest of the selected columns.
DELAY PREDICTION Multinomial Class
Dataset Splitting
The full dataset, after the feature-selection process, is separated
into:
 70% as a training set with 64,020,457 instances.
 30% as a testing set with 27,429,202 instances.
The split and validation were done using the Holdout Validation
technique.
DELAY PREDICTION Multinomial Class
Learning phase: predicting the departure and arrival flight delays
in one process.
DELAY PREDICTION Multinomial Class
Predicting phase: predicting the departure and arrival flight delays
in one process.
DELAY PREDICTION Multinomial Class
Prediction & Validation Metrics
Prediction instances and accuracy
Prediction metrics
DELAY PREDICTION Multinomial Class
Prediction & Validation Metrics
Prediction confusion matrix
DELAY PREDICTION Multinomial Class
Shiny Web Page
DPDAD model interface
DELAY PREDICTION Multinomial Class
Example output: there are no delays in the origin and destination
airports; they are Both Ontime (95.4%).
Suggesting the top ten carriers and their probabilities:
 Run the prediction with the stored ML model, using the whole
dataset from Hadoop as a test set.
 Use Spark SQL to select the top ten carriers with the highest
probabilities and a prediction class equal to "yes".
DELAY PREDICTION Multinomial Class
SCALA ON ZEPPELIN
DELAY PREDICTION Scala on Zeppelin
Figure (1). The code for reading the dataset file from Hadoop.
DELAY PREDICTION Scala on Zeppelin
Figure (2). The code for using Spark SQL to clean the dataset,
handling:
 Missing data.
 Corrupted data.
 Time value ranges.
 Building the multinomial class.
It also uses RFormula as a feature selector.
DELAY PREDICTION Scala on Zeppelin
Figure (3). An output sample for figure (2).
DELAY PREDICTION Scala on Zeppelin
Figure (4). The code for splitting the dataset using the Holdout Validation
technique and caching it in Spark storage.
Figure (5). The counts for the training and testing data.
DELAY PREDICTION Scala on Zeppelin
Figure (6). Running the Naïve Bayes algorithm as the learning phase.
Figure (7). The prediction phase and a sample of the output.
DELAY PREDICTION Scala on Zeppelin
Figure (8). The counts of the actual and predicted classes.
Figure (9). Calculating the confusion matrix and metrics.
CERTIFICATES IBM Badges
IBM - BIG DATA 101
IBM - HADOOP 101
IBM - SPARK FUNDAMENTALS I
Thank You
Editor's Notes
  • #6: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #7: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #8: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #9: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #11: REF. [3]
  • #12: REF. [3]
  • #13: REF. [3]
  • #14: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #15: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #16: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #18: REF. [4]
  • #19: REF. [4]
  • #20: REF. [4]
  • #21: REF. [4]
  • #22: REF. [4]
  • #23: REF. [5]
  • #24: REF. [5]
  • #25: REF. [4]
  • #26: REF. [5]
  • #27: REF. [5]
  • #28: REF. [5]
  • #29: REF. [5]
  • #30: REF. [5]
  • #31: REF. [5]
  • #32: REF. [5]
  • #33: REF. [5]
  • #35: REF. [6]
  • #36: REF. [6]
  • #37: REF. [4]
  • #38: REF. [4]
  • #51: However, there’s one drawback: Traditionally, the R internal is single-threaded. It is unclear how R programs can be effectively and concisely written to run on multiple machines. So, what if we can combine these two worlds? This is where SparkR comes in: it is a language binding that lets users write R programs that are equipped with nice statistics packages, and have them run on top of Spark.
  • #53: Worker refers to Worker machine Mention that all Spark data sources work
  • #58: REF. [4]
  • #137: REF. [5]