© Cloudera, Inc. All rights reserved.
Analyzing Hadoop Data with Sparklyr
Accessing and working with data in Cloudera Enterprise through the popular RStudio IDE.
Your Speakers
Nathan Stephens, Director of Solutions Engineering (RStudio)
Sean Anderson, Data Science and Engineering (Cloudera)
What Does a Data Scientist Do?
Data Preparation
Data Modeling
Model Deployment (maybe)
Challenges in the data science process
Stages: Data Engineering, Data Science (Exploratory), Production (Operational)
Activities: Acquisition, Processing, Data Wrangling, Visualization and Analysis, Experiments, Model Training & Testing, Production Model Preparation, Batch Scoring, Online Scoring, Serving, Model Quality & Performance
Dev Tools: IDEs/Notebooks, Collaboration
Ops Tools: Versioning, Scheduling, Workflow, Publishing
Data Governance
1. Data scientists cannot easily access Hadoop data or compute using their favorite languages and frameworks. “Laptop data science” silos persist.
2. For IT, duplicate environments are costly and inconsistent, with limited security and governance. IT wants to move to shared data and infrastructure.
Common Limitations
Access
Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.
Scale
Notebook environments seldom have enough storage for medium-sized data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.
Developer Experience
Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to handle. Once a model is built, there is no easy path from model development to production.
Cloudera’s Data Science Solution
IDE Integration: Familiar tools for distributed data science
Search: Interactive search and immediate exploration
Navigator: Audit, lineage, encryption, key management, & policy lifecycles
Cloud Deployment: Easy deployment and flexible scaling
Spark: Modern real-time analytics engine
Hive-on-Spark: Large-scale ETL & batch processing engine
Multi-Storage, Multi-Environment
Hadoop as a Data Science Platform
• Leverage Big Data: Hadoop
• Enable real-time use cases: Kafka, Spark Streaming, Kudu
• Provide sufficient toolset for the Data Analysts: Spark, Hive, Impala, Hue
• Provide sufficient toolset for the Data Scientists + Data Engineers: Partners
• Provide standard data governance capabilities: Navigator + Partners
• Provide standard security across the stack: Kerberos, Sentry, Record Service, KMS/KTS
• Provide flexible deployment options: Cloudera Director
• Integrate with partner tools: Rich Ecosystem
• Provide management tools that make it easy for IT to deploy/maintain: Cloudera Manager/Director
Our Goal: Bring more data science users to Hadoop
Help more data scientists use the power of Hadoop
Use a powerful, familiar environment with direct access to Hadoop data and compute
Audience: Data Scientist, Data Engineer
Make it easy and secure to add new users and use cases
Offer secure self-service analytics and a faster path to production on common, affordable infrastructure
Audience: Enterprise Architect, Hadoop Admin
RStudio
• Integrated Development Environment for R (i.e., like Eclipse or Visual Studio, but for R)
• Runs on the desktop (Windows, OS X, Linux) or over the web as a server to enable shared resources and collaboration
• Released in 2012 and now the most popular front end for R users, with thousands of downloads per day
RStudio
Using R with Spark
What’s new with sparklyr 0.5
R for data science toolchain
“You’ll learn how to get your data into R [with Spark], get it into the
most useful structure, transform it, visualize it and model it.”
Spark and R
Understand
Transform: dplyr + sparklyr + SparkSQL
Visualize: ggplot2
Model: sparklyr + ML
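The transform, visualize, and model steps named on this slide could look roughly like the sketch below. It is an illustration rather than the speakers' demo code: it assumes sparklyr, dplyr, DBI, and ggplot2 are installed, that a local Spark is available (for example, installed once with sparklyr::spark_install()), and it uses the built-in mtcars data set as a stand-in for Hadoop data. The exact ml_linear_regression argument names have changed across sparklyr versions, so the formula is passed positionally here.

```r
# Illustrative sketch only; package availability and a local Spark install are assumptions.
library(sparklyr)
library(dplyr)
library(ggplot2)

# Connect to a local Spark instance; on a cluster the master would differ.
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark as a stand-in for Hadoop data.
cars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Transform: dplyr verbs are translated to Spark SQL and executed inside Spark.
# collect() brings only the small summary back into the R session.
by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

# The same table can also be queried with Spark SQL through the DBI interface.
sql_counts <- DBI::dbGetQuery(
  sc,
  "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl ORDER BY cyl"
)

# Visualize: plot the collected summary with ggplot2.
p <- ggplot(by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average mpg")

# Model: fit a linear regression with Spark ML through sparklyr.
fit <- cars_tbl %>% ml_linear_regression(mpg ~ wt + cyl)

spark_disconnect(sc)
```

The design point is that aggregation and model fitting run in Spark, while R only receives small results, which is what makes the same pattern usable on data far larger than the R session's memory.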
Communicate
R Markdown
Notebooks
R Markdown
Spark Deployment Modes
Modes: Local, Standalone Cluster, Hadoop YARN, Apache Mesos, EC2, Livy
Diagram components: RStudio Desktop, RStudio Server, Driver Node, Livy, YARN, Spark Executors
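As a rough sketch of how three of these deployment modes map onto sparklyr connections, the helper below picks a connection style by name. The function connect_spark, the Livy URL, and the executor memory value are hypothetical and chosen only for illustration; the YARN branch further assumes the R session runs on a node with Spark and the Hadoop client configuration available, and newer Spark versions may expect master = "yarn" instead of "yarn-client".

```r
# Illustrative sketch only; connect_spark and its defaults are hypothetical.
library(sparklyr)

connect_spark <- function(mode = c("local", "yarn", "livy"),
                          livy_url = "http://livy.example.com:8998") {
  # Fail early on an unsupported mode name.
  mode <- match.arg(mode)

  if (mode == "local") {
    # Spark runs on the same machine as the R session.
    return(spark_connect(master = "local"))
  }

  if (mode == "yarn") {
    # RStudio Server on a cluster node; the driver runs alongside the R session.
    conf <- spark_config()
    conf$spark.executor.memory <- "4g"  # assumed value, tune per cluster
    return(spark_connect(master = "yarn-client", config = conf))
  }

  # RStudio Desktop reaching the cluster remotely through a Livy endpoint.
  spark_connect(master = livy_url, method = "livy")
}
```

For example, connect_spark("local") would be the natural choice on a laptop, while connect_spark("livy", livy_url = ...) covers the case where the desktop IDE sits outside the cluster.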
spark.rstudio.com
Demo
Analyzing 1 billion records with Spark and R
http://colorado.rstudio.com:3939/content/262/
Questions?

