Big Data Step-by-Step
                              Boston Predictive Analytics
                                 Big Data Workshop
                                Microsoft New England Research &
                               Development Center, Cambridge, MA
                                    Saturday, March 10, 2012



                                                            by Jeffrey Breen

                                                        President and Co-Founder
         http://atms.gr/bigdata0310                   Atmosphere Research Group
                                                      email: jeffrey@atmosgrp.com
                                                             Twitter: @JeffreyBreen

                   just need a little more RAM
                   Big Data Infrastructure
                           Part 2: Running R + RStudio on Amazon EC2




    Code & more on github:
    https://github.com/jeffreybreen/tutorial-201203-big-data
Overview

                    • Sometimes you just need a little more
                           RAM, CPU, or disk space than you have
                    • Let’s try launching an instance on Amazon
                           EC2 and conguring it to do some work
                    • We’ll install R and RStudio and call it a day


Some details we’ll skip
                    • Signing up (it’s not that hard)
                           http://aws.amazon.com/ec2/

                    • Pricing (it keeps dropping)
                           http://aws.amazon.com/ec2/pricing/

                    • The alphabet soup of services (we care
                           about EC2 computing and S3 storage)



Just look for the biggest button on the page...




Select an Amazon Machine Image
                ami-7385461a is a good, recent 64-bit CentOS
                image published by RightScale




Only use EBS images
                • Instance-storage machines lose their data
                      upon shutdown (termination)
                • EBS instances can be stopped and restarted,
                      or terminated when you’re done forever




Pick a size
                See http://aws.amazon.com/ec2/instance-types/




                                          Already out of date! Amazon introduced a
                                          new “m1.medium” instance type this week.




Avoid Premature Termination
                Set Termination Protection + Shutdown Behavior




Name your instance




Create a key pair
                Don’t forget to download it (and keep it safe!)




Create a Security Group
                All TCP, UDP and ICMP from your IP address




Don’t know your IP address?
                Don’t ask me. Ask Google!




                (simply append “/32” when entering into firewall rules)
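As a minimal sketch of the "/32" step: the helper below checks that its input looks like a dotted-quad IPv4 address and prints it in the CIDR form a security group rule expects. The name to_cidr is hypothetical, and 203.0.113.7 is a documentation-only address standing in for your own.

```shell
#!/bin/sh
# to_cidr is a hypothetical helper, not part of any AWS tool.
# It validates a dotted-quad IPv4 address and appends "/32" (a single host).
to_cidr() {
    ip="$1"
    # Reject empty input and anything other than digits and dots.
    case "$ip" in
        ""|*[!0-9.]*) echo "not an IPv4 address: $ip" >&2; return 1 ;;
    esac
    # Split on dots into the positional parameters.
    old_ifs="$IFS"; IFS=.
    set -- $ip
    IFS="$old_ifs"
    [ "$#" -eq 4 ] || { echo "not an IPv4 address: $ip" >&2; return 1; }
    # Each field must be non-empty and no larger than 255.
    for octet in "$@"; do
        { [ -n "$octet" ] && [ "$octet" -le 255 ]; } || {
            echo "not an IPv4 address: $ip" >&2; return 1
        }
    done
    printf '%s/32\n' "$ip"
}

# 203.0.113.7 is a placeholder address from the documentation range.
to_cidr 203.0.113.7
```

The last line prints 203.0.113.7/32, which is the value to paste into the rule's source field.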



3... 2... 1...




State = running
                Up and running at specied domain name




Time to get all command line
                    • You’ll need an ssh client and the key pair we
                           generated in order to connect with your
                           instance
                    • We’ll use the Cloudera VM to control versions,
                           options, etc.
                    • ssh won’t use your key pair if its file permissions
                           are too lax
                           $ chmod og-rwx rstudio-ec2.pem


                    • Log in as root to your domain name
                           $ ssh -i rstudio-ec2.pem root@YOURDOMAINHERE.amazonaws.com
                                                        (from previous slide)
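To see what the chmod step does before pointing ssh at a real key, here is a small sketch that applies the same `chmod og-rwx` to a throwaway file and confirms that group and other access are gone. The helper name secure_key is hypothetical, and the demo file is a temporary stand-in for rstudio-ec2.pem.

```shell
#!/bin/sh
# secure_key is a hypothetical helper: it removes group/other access from a
# key file (the slide's `chmod og-rwx`) and verifies the result.
secure_key() {
    key="$1"
    [ -f "$key" ] || { echo "no such key file: $key" >&2; return 1; }
    chmod og-rwx "$key" || return 1
    # Characters 5-10 of `ls -l` output are the group and other permission bits.
    perms=$(ls -l "$key" | cut -c5-10)
    [ "$perms" = "------" ]
}

# Demonstrate on a temporary file rather than a real private key.
demo_key=$(mktemp)
chmod 644 "$demo_key"          # deliberately too lax, like a fresh download
secure_key "$demo_key" && echo "permissions tightened"
rm -f "$demo_key"
```

For the real key, `secure_key rstudio-ec2.pem` would be run once before the ssh command above.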



Install R and RStudio
                • Create a user login for yourself (RStudio needs this)
                      # useradd jbreen
                      # passwd jbreen


                • EPEL is already installed, so R is easy
                      # yum -y install R


                • Follow RStudio’s download instructions
                      http://www.rstudio.org/download/server
                      # wget http://download2.rstudio.org/rstudio-server-0.95.262-x86_64.rpm

                      # rpm -Uvh rstudio-server-0.95.262-x86_64.rpm


                • Browse to port 8787 and use the login and password
                      e.g., http://ec2-107-22-109-130.compute-1.amazonaws.com:8787/
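A small sketch for the last step, assuming RStudio Server is listening on its default port 8787 as on this slide: the hypothetical helper rstudio_url turns your instance's public domain name into the address to open in a browser.

```shell
#!/bin/sh
# rstudio_url is a hypothetical helper that builds the browser address for
# RStudio Server from a host name and an optional port (default 8787).
rstudio_url() {
    host="$1"
    port="${2:-8787}"
    [ -n "$host" ] || { echo "usage: rstudio_url HOST [PORT]" >&2; return 1; }
    printf 'http://%s:%s/\n' "$host" "$port"
}

# Host name taken from the example on this slide; substitute your own.
rstudio_url ec2-107-22-109-130.compute-1.amazonaws.com
```

Log in there with the user name and password created with useradd and passwd.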



Success!




The meter’s running
                    • Amazon charges by the hour (or fraction
                           thereof). So when you’re done, you should
                           probably shut down
                    • via command line
                           $ sudo shutdown -h now


                    • or with the “Stop” Instance Action in the
                           AWS Management Console
                    • (use “Terminate” if you never want to use it
                           again)
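To make "by the hour (or fraction thereof)" concrete, here is a rough sketch of the rounding: any partial hour counts as a full one. The helper billable_hours is hypothetical, and RATE_CENTS is a made-up price for illustration, not an actual EC2 rate.

```shell
#!/bin/sh
# billable_hours is a hypothetical helper: it rounds a runtime in minutes up
# to whole hours, so a fraction of an hour is charged as a full hour.
billable_hours() {
    minutes="$1"
    echo $(( ($minutes + 59) / 60 ))
}

# RATE_CENTS is an assumed hourly price in cents, chosen only for the example.
RATE_CENTS=10
hours=$(billable_hours 61)
echo "61 minutes -> $hours billable hours, $(( hours * RATE_CENTS )) cents"
```

Running it prints "61 minutes -> 2 billable hours, 20 cents": one minute past the hour costs as much as a second full hour.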


Next up:
                           How to launch Hadoop
                            clusters in the cloud
                            without really trying



