R Statistics with MongoDB

Dr. Markus Schmidberger
October 14th, 2013 Munich, Germany
Email: markus@mongosoup.de
Twitter: @cloudHPC


Outline

Introduction to Big Data, MongoSoup and R
R statistics with MongoDB and Examples
Summary & Questions


Big Data
Wikipedia: … a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools or traditional data processing. …
storing
processing

Storing: NoSQL - MongoDB

databases using looser consistency models to store data
German MongoDB as a Service: MongoSoup
cloudControl Add-On
currently running on AWS EU-Region (Ireland)
all features available: shared / dedicated hosting, replica set, sharding
24/7 support available


MongoSoup in < 5 min

go to cloudControl: www.cloudcontrol.com
add an account and a billing address
create a new app, e.g. “rmongodb”
install cloudControl command line tools: cctrlapp
enable your preferred MongoSoup hosting: cctrlapp rmongodb/default addon.add mongosoup.medium
go to the cloudControl Web-Console-AddOns and get your credentials: https://www.cloudcontrol.com/console/app/rmongodb

Processing: Analyzing with R and Hadoop

backward-looking analysis is outdated
today: quasi real-time analysis
tomorrow: forward-looking predictive analysis
more complex methods, more data available, more processing time required
Check my Strata London Tutorial “Big Data Analyses with R”


Introduction to R

R is a free software environment for statistical computing and graphics
offers tools to manage and analyze data
standard statistical methods are implemented
compiles and runs on many operating systems
support via a huge community

www.r-project.org

huge online library with > 5000 R packages: http://cran.r-project.org
possibility to write personalized code and to contribute new packages
really famous since January 6, 2009: The New York Times, “Data Analysts Captivated by R's Power”


RStudio IDE

http://www.rstudio.com


R as calculator

(5+5) - 1 * 3
[1] 7
x <- 3
x
[1] 3
x^2 + 4
[1] 13


y <- c(1,2,3)
y
[1] 1 2 3
x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
x < 5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE


x[3:7]

[1] 3 4 5 6 7
mean(x)
[1] 5.5
help("mean")
?mean
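
The comparison and the indexing shown above can be combined. A minimal sketch in base R; the name `small` is only illustrative and does not appear on the slides:

```r
x <- 1:10

# A logical vector can be used as an index: keep the elements where x < 5.
small <- x[x < 5]
small

# Functions such as mean() work on the selected subset directly.
mean(small)

# sum() on a logical vector counts the TRUE values.
sum(x < 5)
```

Logical indexing is the base R counterpart of a filter condition in a database query.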

Many Statistical Functions

kmeans(dat, 4)
K-means clustering with 4 clusters of sizes 21, 18, 30, 31
Cluster means:
     [,1]    [,2]
1  0.7755  0.8509
2 -0.1557 -0.2305
3  1.2299  1.1472
4  0.1510  0.1507
Clustering vector:
[1] 4 2 4 4 2 4 4 ... (100 cluster assignments, one per data point)
Within cluster sum of squares by cluster:
[1] 3.318 1.166 4.019 3.195
 (between_SS / total_SS =  83.0 %)
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"


cl <- kmeans(dat, 4)  # store the fit so that its components can be plotted
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
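
The slides do not show how `dat` was created. The sketch below assumes two simulated Gaussian groups of 50 points each (an assumption, in the style of the `kmeans` help page) so that the clustering call can be run end to end; the seed and `nstart` value are arbitrary choices:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Assumed stand-in for 'dat': 100 two-dimensional points in two groups.
dat <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Fit 4 clusters; nstart repeats the random initialisation several times.
cl <- kmeans(dat, centers = 4, nstart = 10)

cl$size                       # number of points per cluster
cl$centers                    # one row of coordinates per cluster
cl$betweenss / cl$totss       # share of variance explained by the clustering
```

The result is a list, so each entry under "Available components" can be read with `$`.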

R Shiny - easy web application

developed by RStudio
turns R analyses into interactive web applications that anyone can use
let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
easily incorporate any number of outputs like plots, tables, and summaries
no HTML or JavaScript knowledge is necessary, only R
http://www.rstudio.com/shiny/


R and Databases
SQL provides a standard language to filter, aggregate, group, and sort data
SQL in new places: Hive, Impala, …
ODBC provides an SQL interface to non-database data (Excel, CSV, text files)
R stores relational data in data.frames (extended lists)


data(iris)
head(iris, n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
class(iris)
[1] "data.frame"
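
The SQL operations named earlier (filter, aggregate, group, sort) have direct base R counterparts on a data.frame. A minimal sketch on `iris`; the names `long_sepals` and `means` are illustrative:

```r
data(iris)

# A data.frame is a list of equally long columns.
is.list(iris)
nrow(iris)

# filter (SQL: WHERE Sepal.Length > 7)
long_sepals <- subset(iris, Sepal.Length > 7)

# group and aggregate (SQL: GROUP BY Species with AVG)
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# sort (SQL: ORDER BY ... DESC)
means[order(-means$Sepal.Length), ]
```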


R package: sqldf

running SQL statements on R data frames
library(sqldf)
sqldf("select * from iris limit 2")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sqldf("select count(*) from iris")
  count(*)
1      150

Other relational R packages

RMySQL package provides an interface to MySQL
RPostgreSQL package provides an interface to PostgreSQL
ROracle package provides an interface to Oracle
RJDBC package provides access to databases through a JDBC interface
RSQLite package provides access to SQLite (SQLite engine is included)
One big problem: all packages read the full result into R memory


R and MongoDB

on CRAN there are two packages to connect R with MongoDB
rmongodb (supported by MongoDB, Inc.): powerful for big data; difficult to use due to BSON objects
RMongo: easy to use; limited functionality; reads full results into R memory; does not work on Mac OS X


R package: RMongo

library(RMongo)  # package names are case-sensitive
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
dbShowCollections(mongo)
dbGetQuery(mongo, "zips", "{'state':'AL'}")
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')
dbDisconnect(mongo)


R package: rmongodb

developed on top of the C driver supported by MongoDB
library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb",
  username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
mongo
[1] 0
attr(,"mongo")
<pointer: 0x105a1de80>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkXXXXX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0


mongo.get.database.collections(mongo, "cc_JwQcDLJSYQJb")
[1] "cc_JwQcDLJSYQJb.zips" "cc_JwQcDLJSYQJb.ccp"  "cc_JwQcDLJSYQJb.test"
mongo <- mongo.disconnect(mongo)


buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "state", "AL")
[1] TRUE
query <- mongo.bson.from.buffer(buf)
query
state : 2   AL

res <- mongo.find.one(mongo, "cc_JwQcDLJSYQJb.zips", query)
res
city : 2   ACMAR
loc : 4
  0 : 1   -86.515570
  1 : 1   33.584132
pop : 16   6055
state : 2   AL
_id : 2   35004

out <- mongo.bson.to.list(res)
out$loc
[1] -86.52  33.58

typeof(out$loc)
[1] "double"
out$pop
[1] 6055
out$state
[1] "AL"


cursor <- mongo.find(mongo, "cc_JwQcDLJSYQJb.zips", query)
res <- NULL
while (mongo.cursor.next(cursor)){
  value <- mongo.cursor.value(cursor)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(cursor)
head(res, n=4)
       city         loc       pop   state _id
Rvalue "ACMAR"      Numeric,2 6055  "AL"  "35004"
Rvalue "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
Rvalue "ADGER"      Numeric,2 3205  "AL"  "35006"
Rvalue "KEYSTONE"   Numeric,2 14218 "AL"  "35007"

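
The cursor loop above yields a matrix of list cells (note the `Numeric,2` entries for `loc`). One possible way to get a plain data.frame is sketched below without a database: `docs` is a hand-written stand-in for the lists that `mongo.bson.to.list()` returns, the coordinates of the second document are illustrative placeholders, and `flatten_doc`, `lon`, `lat` and `id` are hypothetical names:

```r
# Stand-ins for the per-document lists produced inside the cursor loop.
docs <- list(
  list(city = "ACMAR", loc = c(-86.51557, 33.584132),
       pop = 6055, state = "AL", id = "35004"),
  list(city = "ADAMSVILLE", loc = c(-86.95, 33.59),   # placeholder coordinates
       pop = 10616, state = "AL", id = "35005")
)

# Turn one document into a one-row data.frame, splitting 'loc' into two columns.
flatten_doc <- function(d) {
  data.frame(city = d$city, lon = d$loc[1], lat = d$loc[2],
             pop = d$pop, state = d$state, id = d$id,
             stringsAsFactors = FALSE)
}

# Collect all rows first and bind them once instead of growing 'res' per step.
rows <- lapply(docs, flatten_doc)
zips <- do.call(rbind, rows)

str(zips)
mean(zips$pop)
```

Binding once at the end avoids copying the growing result on every iteration, which matters for large cursors.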
It is all about creating BSON query or field objects


b <- mongo.bson.from.list(
list(name="Fred", age=29, city="Boston"))
b
name : 2   Fred
age : 1   29.000000
city : 2   Boston

mongo.bson.to.list(b)
$name
[1] "Fred"
$age
[1] 29
$city
[1] "Boston"


?mongo.bson
?mongo.bson.buffer.append
?mongo.bson.buffer.start.array
?mongo.bson.buffer.start.object
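buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "aggregate", "zips")
mongo.bson.buffer.start.array(buf, "pipeline")
# stage 0: total population per state
mongo.bson.buffer.start.object(buf, "0")
mongo.bson.buffer.start.object(buf, "$group")
mongo.bson.buffer.append(buf, "_id", "$state")
mongo.bson.buffer.start.object(buf, "totalPop")
mongo.bson.buffer.append(buf, "$sum", "$pop")
mongo.bson.buffer.finish.object(buf)  # totalPop
mongo.bson.buffer.finish.object(buf)  # $group
mongo.bson.buffer.finish.object(buf)  # stage 0
# stage 1: keep states with a total population of at least 10000
mongo.bson.buffer.start.object(buf, "1")
mongo.bson.buffer.start.object(buf, "$match")
mongo.bson.buffer.start.object(buf, "totalPop")
mongo.bson.buffer.append(buf, "$gte", 10000)  # numeric, not the string "10000"
mongo.bson.buffer.finish.object(buf)  # totalPop
mongo.bson.buffer.finish.object(buf)  # $match
mongo.bson.buffer.finish.object(buf)  # stage 1
mongo.bson.buffer.finish.object(buf)  # pipeline array
query <- mongo.bson.from.buffer(buf)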
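
To make the structure of the aggregation command easier to see, the pipeline can be written as a nested R list, and the same computation can be done locally. A sketch in base R only; the sample rows reuse populations from the Alabama output above plus one made-up state code "XX", and `pipeline`, `totals` and `big` are illustrative names:

```r
# The two pipeline stages as a nested list: group by state, then filter.
pipeline <- list(
  list(`$group` = list(`_id` = "$state", totalPop = list(`$sum` = "$pop"))),
  list(`$match` = list(totalPop = list(`$gte` = 10000)))
)
str(pipeline)

# The same logic on a small local data.frame ("XX" is a made-up state code).
zips <- data.frame(state = c("AL", "AL", "XX"),
                   pop   = c(6055, 10616, 3205),
                   stringsAsFactors = FALSE)

# $group with $sum: one row per state with the summed population.
totals <- aggregate(pop ~ state, data = zips, FUN = sum)
names(totals)[2] <- "totalPop"

# $match with $gte: keep only the large totals.
big <- totals[totals$totalPop >= 10000, ]
big
```

Each stage is its own list element, which corresponds to one object inside the "pipeline" array of the BSON command.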

CCP Web Analytics Challenge

buf <- mongo.bson.buffer.create()
query <- mongo.bson.from.buffer(buf)  # empty query: match all documents
buf <- mongo.bson.buffer.create()
err <- mongo.bson.buffer.append(buf, "user", 1)
err <- mongo.bson.buffer.append(buf, "type", 1)
field <- mongo.bson.from.buffer(buf)  # return only the fields user and type
out <- mongo.find(mongo, "cc_JwQcDLJSYQJb.ccp", query, fields=field, limit=1000)
res <- NULL
while (mongo.cursor.next(out)){
  value <- mongo.cursor.value(out)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}


boxplot(as.integer(table(unlist(res[,2]))), cex=4, horizontal=TRUE, main="Number of actions per user")
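
The counting step inside the boxplot call can be tried without a database. In this sketch the vector `users` is a made-up stand-in for `unlist(res[,2])`, and `actions_per_user` is an illustrative name:

```r
# Made-up user ids, one entry per logged action.
users <- c("u1", "u2", "u1", "u3", "u1", "u2")

# table() counts how often each user occurs; as.integer() drops the names.
actions_per_user <- as.integer(table(users))
actions_per_user

# The five-number summary is what the boxplot draws.
fivenum(actions_per_user)
```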


Shiny Mongo
R based MongoDB User Interface
R packages shiny and rmongodb
less than 200 lines of code
DEMO: http://localhost:8100

https://github.com/comsysto/ShinyMongo


Summary
R is a powerful statistical tool to analyse many different kinds of data
R can access databases
MongoDB and rmongodb are ready for Big Data
start playing around with R, Big Data and MongoDB
http://www.r-project.org
http://www.mongodb.org
http://www.mongosoup.de


See you soon

thanks a lot for your attention
there are R trainings in December 2013 in Munich
http://comsysto.com/events.html#r
we are hosting many events and meetups
meet you at the MongoSoup booth

Email: markus@mongosoup.de
Twitter: @cloudHPC

