Sawmill: Some Lessons Learned Running R in Large Data Clouds
Robert Grossman
Open Data Group

What Do You Do if Your Data is Too Big for a Database?
  • Give up and invoke sampling.
  • Buy a proprietary system and ask for a raise.
  • Begin to build a custom system and explain why it is not yet done.
  • Use Hadoop.
  • Use an alternative large data cloud (e.g. Sector).

Basic Idea
  • Turn it into a pleasantly parallel problem.
  • Use a large data cloud to manage and prepare the data.
  • Use a Map/Bucket function to split the job.
  • Run R on each piece using Reduce/UDF or streams.
  • Use PMML multiple models to glue the pieces together.
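
The basic idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the function names (`bucket`, `fit_segment`, `build_models`) and the toy "model" (a per-segment mean) are hypothetical stand-ins, and a real deployment would run R on each segment inside the data cloud rather than in a single process.

```python
from collections import defaultdict
from statistics import mean


def bucket(records, key_fn):
    """Map/Bucket step: split the records into segments by key."""
    segments = defaultdict(list)
    for rec in records:
        segments[key_fn(rec)].append(rec)
    return dict(segments)


def fit_segment(rows):
    """Stand-in for 'run R on each piece': a toy model that stores a mean."""
    return {"n": len(rows), "mean": mean(r["value"] for r in rows)}


def build_models(records, key_fn):
    """Fit one model per segment and gather the models into one dict."""
    return {key: fit_segment(rows) for key, rows in bucket(records, key_fn).items()}


# Hypothetical input: one record per observation, segmented by web site.
records = [
    {"site": "a.example", "value": 1.0},
    {"site": "b.example", "value": 10.0},
    {"site": "a.example", "value": 3.0},
]
models = build_models(records, key_fn=lambda r: r["site"])
```

Because each segment is fitted independently, the per-segment calls could be distributed across reducers or UDFs without changing the gathering step.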

Why Listen?
  • This approach allows you to scale R relatively easily to hundreds of TB to PB.
  • The approach is easy. (A plus: it may look hard to your colleagues, boss or clients.)
  • There is at least an order of magnitude of performance to be gained with the right design.

Part 1. Stacks for Big Data

The Google Data Stack
  • The Google File System (2003)
  • MapReduce: Simplified Data Processing… (2004)
  • BigTable: A Distributed Storage System… (2006)

Map-Reduce Example
  • Input is a file with one document per record.
  • User specifies the map function:
      key = document URL
      value = terms that document contains
  • map: (“doc cdickens”, “it was the best of times”) → (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1)

Example (cont’d)
  • MapReduce library gathers together all pairs with the same key value (shuffle/sort phase).
  • The user-defined reduce function combines all the values associated with the same key.
  • reduce:
      key = “it”, values = 1, 1 → (“it”, 2)
      key = “was”, values = 1, 1 → (“was”, 2)
      key = “best”, values = 1 → (“best”, 1)
      key = “worst”, values = 1 → (“worst”, 1)
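
The word-count example above can be reproduced in memory with a short Python sketch. The function names and the second input document are assumptions made for illustration (the slides show only one document); a real MapReduce library would run the map, shuffle and reduce phases across many machines.

```python
from collections import defaultdict


def map_fn(url, text):
    """Emit (term, 1) for every term in the document."""
    for term in text.split():
        yield term, 1


def shuffle(pairs):
    """Gather all values that share a key (the shuffle/sort phase)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Combine all the values associated with one key."""
    return key, sum(values)


# Two hypothetical documents, keyed by a made-up identifier.
docs = {
    "doc1": "it was the best of times",
    "doc2": "it was the worst of times",
}
pairs = [pair for url, text in docs.items() for pair in map_fn(url, text)]
counts = dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())
```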

Applying MapReduce to the Data in Storage Cloud
[Figure labels: map, shuffle/reduce]

Google’s Large Data Cloud (Google’s Stack)
  • Applications
  • Compute Services: Google’s MapReduce
  • Data Services: Google’s BigTable
  • Storage Services: Google File System (GFS)

Hadoop’s Large Data Cloud (Hadoop’s Stack)
  • Applications
  • Compute Services: Hadoop’s MapReduce
  • Data Services: NoSQL Databases
  • Storage Services: Hadoop Distributed File System (HDFS)

Amazon Style Data Cloud
[Diagram labels: Load Balancer, Simple Queue Service, SDB, EC2 Instances, S3 Storage Services]

Sector’s Large Data Cloud (Sector’s Stack)
  • Applications
  • Compute Services: Sphere’s UDFs
  • Data Services
  • Storage Services: Sector’s Distributed File System (SDFS)
  • Routing & Transport Services: UDP-based Data Transport Protocol (UDT)

Apply User Defined Functions (UDF) to Files in Storage Cloud
[Figure labels: map, shuffle/reduce, UDF, UDF]

Folklore
  • MapReduce is great.
  • But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
  • And often it is easier to use Hadoop streams, Sector streams, etc.
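
To illustrate why streams are often the easier route, here is a minimal sketch of a streaming-style reducer in Python. It assumes the usual streaming convention of tab-separated `key<TAB>value` lines that arrive already sorted by key; the names `parse` and `stream_reduce` and the sample input are hypothetical, and in practice the per-segment function would hand the values to an R process.

```python
import io
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line into a (key, value) pair."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def stream_reduce(lines, process_segment):
    """Call process_segment once per run of consecutive lines sharing a key."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield process_segment(key, [value for _, value in group])


# Stand-in for sys.stdin: input already sorted by key, as a reducer would see it.
sorted_input = io.StringIO("siteA\t3\nsiteA\t5\nsiteB\t7\n")
results = list(
    stream_reduce(sorted_input, lambda key, values: (key, sum(map(int, values))))
)
```

Because the reducer only reads lines and groups them, the same script can be tested locally on a sorted text file before it is run in the data cloud.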

Sphere UDF vs. MapReduce

Terasort Benchmark
Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication on Phase 2 of Open Cloud Testbed with co-located racks.

MalStone
[Figure labels: entities, sites; time axis marked dk-2, dk-1, dk]

MalStone
Sector/Sphere 1.20, Hadoop 0.18.3 with no replication on Phase 1 of Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node (about 1 TB in total).

Part 2. Predictive Model Markup Language

Problems Deploying Models
  • Models are deployed in proprietary formats.
  • Models are application dependent.
  • Models are system dependent.
  • Models are architecture dependent.
  • Time required to deploy models and to integrate models with other applications can be long.

Predictive Model Markup Language (PMML)
  • Based on XML
Benefits of PMML
  • Open standard for Data Mining & Statistical Models
  • Not concerned with the process of creating a model
  • Provides independence from application, platform, and operating system
  • Simplifies use of data mining models by other applications (consumers of data mining models)

PMML Document Components
  • Data dictionary
  • Mining schema
  • Transformation Dictionary
  • Multiple models, including segments and ensembles.
  • Model verification, …
  • Univariate Statistics (ModelStats)
  • Optional Extensions
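
To make the component list concrete, the following Python sketch assembles a skeletal PMML document with the standard library. It is an illustration only: the field names `x` and `y` are made up, the regression table is omitted, and the output has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET


def pmml_skeleton():
    """Build a bare-bones PMML tree showing where the main components sit."""
    root = ET.Element("PMML", version="4.0")
    ET.SubElement(root, "Header", description="illustrative skeleton")

    # Data dictionary: the fields the document knows about (hypothetical names).
    dictionary = ET.SubElement(root, "DataDictionary", numberOfFields="2")
    ET.SubElement(dictionary, "DataField", name="x", optype="continuous", dataType="double")
    ET.SubElement(dictionary, "DataField", name="y", optype="continuous", dataType="double")

    # Transformation dictionary: left empty in this sketch.
    ET.SubElement(root, "TransformationDictionary")

    # One model with its mining schema; the model body itself is omitted.
    model = ET.SubElement(root, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", name="x")
    ET.SubElement(schema, "MiningField", name="y", usageType="predicted")
    return root


doc = pmml_skeleton()
xml_text = ET.tostring(doc, encoding="unicode")
```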

PMML Models
  • polynomial regression
  • logistic regression
  • general regression
  • center based clusters
  • density based clusters
  • trees
  • associations
  • neural nets
  • naïve Bayes
  • sequences
  • text models
  • support vector machines
  • ruleset

PMML Producer & Consumers
[Diagram labels: Modeling Environment (Data, Data Pre-processing, Model Producer, PMML Model); Deployment Environment (PMML Model, Model Consumer, Post Processing); data, scores, actions, rules]

Part 3. Sawmill

  • Step 1: Preprocess data using MapReduce or UDF.
  • Step 2: Invoke R on each segment/bucket and build PMML model.
  • Step 3: Gather the models together to form a multiple model PMML file.


Sawmill - Integrating R and Large Data Clouds

To build models:
  • Step 1: Preprocess data using MapReduce or UDF.
  • Step 2: Build separate model in each segment using R.
To score data:
  • Step 1: Preprocess data using MapReduce or UDF.
  • Step 2: Score data in each segment using R.

Sawmill Summary
  • Use Hadoop MapReduce or Sector UDFs to preprocess the data.
  • Use Hadoop Map or Sector buckets to segment the data to gain parallelism.
  • Build separate statistical model for each segment using R & Hadoop / Sector Streams.
  • Use multiple models specification in PMML version 4.0 to specify segmentation.
  • Example: use Hadoop Map function to send all data for each web site to different segment (on different processor).
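
The last two points of the summary can be sketched as follows: per-site models are gathered under one PMML `MiningModel`, with one `Segment` per web site selected by a predicate. This is a hedged illustration, not the Sawmill implementation. The field name `site`, the site values and the empty per-site models are hypothetical, required children such as the mining schema are omitted, and the result is not schema-validated.

```python
import xml.etree.ElementTree as ET


def segmented_model(site_models, field="site"):
    """Wrap per-site model elements in one MiningModel, one Segment per site."""
    mining = ET.Element("MiningModel", functionName="regression")
    segmentation = ET.SubElement(mining, "Segmentation", multipleModelMethod="selectFirst")
    for index, site in enumerate(sorted(site_models), start=1):
        segment = ET.SubElement(segmentation, "Segment", id=str(index))
        # The predicate routes a record to this segment when its site matches.
        ET.SubElement(segment, "SimplePredicate", field=field, operator="equal", value=site)
        segment.append(site_models[site])
    return mining


# Hypothetical per-site models, e.g. one produced by R in each segment.
site_models = {
    site: ET.Element("RegressionModel", functionName="regression", modelName="model_" + site)
    for site in ("b.example", "a.example")
}
combined = segmented_model(site_models)
segments = combined.findall("Segmentation/Segment")
```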

Small Example: Scoring Engine written in R
  • R processed a typical segment in 20 minutes.
  • Using R to score 2 segments concatenated together = 60 minutes
  • Using R to score 3 segments concatenated together = 140 minutes

With Sawmill Framework
  • 1 month of data, about 50 GB, hundreds of segments
  • 300 mapper keys / segments
  • Mapping and Reducing < 2 minutes
  • Scoring: 20 minutes * max of segments per reducer
  • Had anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer.
  • Often ran in under 2 hours.

Reducer R Process?
There are at least three ways to tie the MapReduce process to the R process.
  • MACHINE: One instance of the R process on each data node (or n per node)
  • REDUCER: One instance of the R process bound to each reducer
  • SEGMENT: Instances can be launched by the reducers as necessary (when keys are reduced)

Tradeoffs
In order to prevent bottlenecks, you need to have a general idea of
  • how long the records for a key take to be reduced
  • how long the application takes to process the segment
  • how many keys are seen per reducer

Thank You!
www.opendatagroup.com
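
As a rough check on the timing rule from the Sawmill example (scoring time is about 20 minutes times the largest number of segments on any reducer, plus under 2 minutes for mapping and reducing), the estimate can be written as a small Python function. The even spread of segments across reducers and the reducer count of 60 are assumptions for illustration, not figures from the talk.

```python
import math


def estimate_minutes(n_segments, n_reducers, minutes_per_segment=20, map_reduce_minutes=2):
    """Estimate wall-clock minutes, assuming segments spread evenly over reducers."""
    max_segments_per_reducer = math.ceil(n_segments / n_reducers)
    return map_reduce_minutes + minutes_per_segment * max_segments_per_reducer


# 300 segments on a hypothetical 60 reducers gives 5 segments per reducer.
estimate = estimate_minutes(n_segments=300, n_reducers=60)

# A reducer that receives 8 segments dominates the run time instead.
worst_case = estimate_minutes(n_segments=8, n_reducers=1)
```

Under these assumptions the job finishes in under two hours only when no reducer sees more than five segments, which is why the number of keys per reducer matters in the tradeoffs above.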