Sawmill - Integrating R and Large Data Clouds

Sawmill
Some Lessons Learned Running R in Large Data Clouds
Robert Grossman
Open Data Group

What Do You Do if Your Data Is Too Big for a Database?
- Give up and invoke sampling.
- Buy a proprietary system and ask for a raise.
- Begin to build a custom system and explain why it is not yet done.
- Use Hadoop.
- Use an alternative large data cloud (e.g. Sector).

Basic Idea
- Turn it into a pleasantly parallel problem.
- Use a large data cloud to manage and prepare the data.
- Use a Map/Bucket function to split the job.
- Run R on each piece using Reduce/UDF or streams.
- Use PMML multiple models to glue the pieces together.

Why Listen?
- This approach allows you to scale R relatively easily to hundreds of TB to PB.
- The approach is easy. (A plus: it may look hard to your colleagues, boss or clients.)
- There is at least an order of magnitude of performance to be gained with the right design.

The Google Data Stack
- The Google File System (2003)
- MapReduce: Simplified Data Processing… (2004)
- BigTable: A Distributed Storage System… (2006)

Map-Reduce Example
- Input is a file with one document per record.
- The user specifies the map function:
  - key = document URL
  - value = terms that the document contains
- Example: map applied to (“doc cdickens”, “it was the best of times”) emits (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1), and so on.

Example (cont’d)
- The MapReduce library gathers together all pairs with the same key value (shuffle/sort phase).
- The user-defined reduce function combines all the values associated with the same key:
  - key = “it”, values = 1, 1 gives (“it”, 2)
  - key = “was”, values = 1, 1 gives (“was”, 2)
  - key = “best”, values = 1 gives (“best”, 1)
  - key = “worst”, values = 1 gives (“worst”, 1)
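The word-count example on these two slides can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle/sort and reduce phases, not Google's or Hadoop's API. The function names are hypothetical, and the input sentence is extended with "it was the worst of times" as an assumption, so that the counts match those shown on the slide.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit a (term, 1) pair for every term in one document."""
    for term in text.lower().split():
        yield term, 1

def shuffle(pairs):
    """Shuffle/sort: gather all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_fn(key, values):
    """Reduce: combine all values associated with one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc_id, text in documents for pair in map_fn(doc_id, text)]
    return dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())

# Assumed input: the slide's sentence plus its second half.
documents = [("doc cdickens", "it was the best of times it was the worst of times")]
counts = word_count(documents)
```

Here `counts["it"]` and `counts["was"]` are 2, while `counts["best"]` and `counts["worst"]` are 1, as on the slide.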

Applying MapReduce to the Data in Storage Cloud
[Figure: map stage followed by shuffle/reduce stage, applied to data held in the storage cloud]

Google’s Large Data Cloud (Google’s Stack)
- Applications
- Compute Services: Google’s MapReduce
- Data Services: Google’s BigTable
- Storage Services: Google File System (GFS)

Hadoop’s Large Data Cloud (Hadoop’s Stack)
- Applications
- Compute Services: Hadoop’s MapReduce
- Data Services: NoSQL Databases
- Storage Services: Hadoop Distributed File System (HDFS)

Amazon Style Data Cloud
- Load Balancer
- Simple Queue Service
- SDB
- EC2 Instances
- S3 Storage Services

Sector’s Large Data Cloud (Sector’s Stack)
- Applications
- Compute Services: Sphere’s UDFs
- Data Services
- Storage Services: Sector’s Distributed File System (SDFS)
- Routing & Transport Services: UDP-based Data Transport Protocol (UDT)

Apply User Defined Functions (UDF) to Files in Storage Cloud
[Figure: UDFs applied in place of the map and shuffle/reduce stages]
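As a rough local analogy for applying a UDF to files, the sketch below runs an arbitrary function over each file independently and collects the results. It is not Sphere's API: `apply_udf`, `count_records` and the `part-*.txt` file names are all hypothetical, and a thread pool on one machine stands in for the nodes of a storage cloud.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile

def count_records(path):
    """Hypothetical UDF: count the non-empty lines (records) in one file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())

def apply_udf(udf, paths, workers=4):
    """Apply udf to every file independently; return {file name: result}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(udf, paths))
    return {Path(p).name: r for p, r in zip(paths, results)}

def demo():
    """Write three small files to a temporary directory and apply the UDF."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(3):
            Path(tmp, f"part-{i}.txt").write_text("a\nb\n" * (i + 1), encoding="utf-8")
        return apply_udf(count_records, sorted(Path(tmp).glob("part-*.txt")))

udf_results = demo()
```

Because each file is processed on its own, the UDF does not have to be expressed as a map or a reduce step.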

Folklore
- MapReduce is great.
- But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
- And often it is easier to use Hadoop streams, Sector streams, etc.
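To illustrate the streams style, here is a minimal sketch of a mapper and reducer written in the Hadoop Streaming convention: each stage reads text lines and writes tab-separated key/value lines, and the reducer receives its input sorted by key. The record layout (`site,amount`) and the function names are assumptions made for illustration. In a real streaming job each function would read `sys.stdin` and print its output; here a local `sorted` call stands in for the framework's shuffle.

```python
from itertools import groupby

def stream_mapper(lines):
    """Mapper: turn 'site,amount' records into 'site<TAB>amount' lines."""
    for line in lines:
        line = line.strip()
        if line:
            site, amount = line.split(",", 1)
            yield f"{site}\t{amount}"

def stream_reducer(sorted_lines):
    """Reducer: input is sorted by key; emit 'key<TAB>mean of values'."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield f"{key}\t{sum(values) / len(values):.2f}"

def run_local(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(stream_reducer(sorted(stream_mapper(lines))))

stream_output = run_local(["a,1.0", "b,4.0", "a,3.0"])
```

The reducer is the natural place to hand each key's records to an external program such as R, since all records for one key arrive together.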

Terasort Benchmark
Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication on Phase 2 of Open Cloud Testbed with co-located racks.

MalStone
[Figure: entities and sites over time, with time windows labeled dk-2, dk-1, dk]

MalStone
Sector/Sphere 1.20, Hadoop 0.18.3 with no replication on Phase 1 of Open Cloud Testbed in a single rack. The data was spread across 20 nodes, with 500 million 100-byte records per node.

Part 2
Predictive Model Markup Language

Problems Deploying Models
- Models are deployed in proprietary formats.
- Models are application dependent.
- Models are system dependent.
- Models are architecture dependent.
- Time required to deploy models and to integrate models with other applications can be long.

Predictive Model Markup Language (PMML)
- Based on XML
- Benefits of PMML:
  - Open standard for data mining and statistical models
  - Not concerned with the process of creating a model
  - Provides independence from application, platform, and operating system
  - Simplifies use of data mining models by other applications (consumers of data mining models)

PMML Document Components
- Data dictionary
- Mining schema
- Transformation dictionary
- Multiple models, including segments and ensembles
- Model verification, …
- Univariate statistics (ModelStats)
- Optional extensions

PMML Models
- polynomial regression
- logistic regression
- general regression
- center based clusters
- density based clusters
- trees
- ruleset

PMML Producer & Consumers
[Figure: in the modeling environment, data passes through data pre-processing to a model producer, which writes a PMML model; in the deployment environment, a model consumer reads the PMML model, takes in data, and passes results to post processing, which yields scores, actions and rules]

- Step 1: Preprocess data using MapReduce or UDF.
- Step 2: Invoke R on each segment/bucket and build a PMML model.
- Step 3: Gather the models together to form a multiple model PMML file.
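The three steps can be sketched end to end on a single machine. This is an illustration under stated assumptions, not the Sawmill implementation: a small least-squares fit in Python stands in for invoking R, the record layout `(segment, x, y)` and all function names are hypothetical, and the generated XML is a simplified multiple-model document that uses real PMML element names but omits parts a valid file needs (such as the data dictionary and mining schema). The version attribute "4.4" is also an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def bucket_records(records):
    """Step 1: group (segment, x, y) records by their segment key."""
    buckets = defaultdict(list)
    for segment, x, y in records:
        buckets[segment].append((x, y))
    return dict(buckets)

def fit_line(points):
    """Step 2 (stand-in for R): least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def build_pmml(models):
    """Step 3: gather per-segment models into one simplified multiple-model document."""
    pmml = ET.Element("PMML", version="4.4")  # version is an assumption
    mining = ET.SubElement(pmml, "MiningModel", functionName="regression")
    segmentation = ET.SubElement(
        mining, "Segmentation", multipleModelMethod="selectFirstApplicable")
    for segment, (intercept, slope) in sorted(models.items()):
        seg = ET.SubElement(segmentation, "Segment", id=str(segment))
        ET.SubElement(seg, "SimplePredicate",
                      field="segment", operator="equal", value=str(segment))
        reg = ET.SubElement(seg, "RegressionModel", functionName="regression")
        table = ET.SubElement(reg, "RegressionTable", intercept=repr(intercept))
        ET.SubElement(table, "NumericPredictor", name="x", coefficient=repr(slope))
    return ET.tostring(pmml, encoding="unicode")

# Made-up records for illustration only.
records = [("east", 0, 1), ("east", 1, 3), ("east", 2, 5), ("west", 0, 0), ("west", 2, 1)]
models = {seg: fit_line(points) for seg, points in bucket_records(records).items()}
pmml_text = build_pmml(models)
```

Each segment carries its own predicate and its own model, so a consumer can select the model whose predicate matches an incoming record. In a large data cloud, step 2 would run once per bucket in a reducer, UDF or stream.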
