© Cloudera, Inc. All rights reserved.
dplyr Interfaces to Large-Scale Data
Ian Cook
@ianmcook
ian@cloudera.com
Context
Cloudera's mission: provide a platform for data analysts and data scientists to efficiently query, analyze, and model large-scale data in clusters and cloud storage
• By distributing Apache Spark, Apache Impala, and other tools
• By enabling productive use of these tools
Python and R users often have difficulty moving from smaller data to large-scale distributed data
• Familiar packages and methods don't work the same way on distributed data
Poll question
[Diagram: tools labeled SQL, PySpark, and SparkR, annotated with the interface each offers ("SQL" or "SQL or DataFrame API")]
Poll question
[Diagram: SQL, PySpark, and SparkR, with the labels "SQL" and "dplyr"]
dplyr
dplyr provides a set of verbs that perform common data manipulation steps
• select() to select columns
• filter() to filter rows
• arrange() to order rows
• mutate() to create new columns
• summarise() to aggregate
• group_by() to perform operations by group
dplyr works on local data and with remote data sources
• For remote sources, dplyr commands are translated into SQL
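As a minimal sketch of how these verbs chain together on local data, the pipeline below uses the built-in `mtcars` data frame. The dataset, the column choices, and the unit conversion are assumptions made for illustration and are not taken from the talk.

```r
library(dplyr)

# mtcars ships with R; it stands in for real data in this sketch.
result <- mtcars %>%
  select(mpg, cyl, wt) %>%               # select columns
  filter(wt < 4) %>%                     # filter rows
  mutate(wt_kg = wt * 453.6) %>%         # create a new column (wt is in 1000 lbs)
  group_by(cyl) %>%                      # perform operations by group
  summarise(
    mean_mpg   = mean(mpg),              # aggregate within each group
    mean_wt_kg = mean(wt_kg)
  ) %>%
  arrange(desc(mean_mpg))                # order rows

print(result)
```

The same verbs can be applied to a remote table, in which case dplyr builds a SQL query instead of computing the result in R.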
Poll question
Demonstration
Example code at
github.com/ianmcook/dplyr-examples
dplyr SQL backends
dplyr
↕
dbplyr
↕
dplyr SQL backend package*
↕
DBI
↕
DBI-compatible interface package
↕
database driver or connector
↕
database/engine
* optional
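The layers above can be sketched with an in-memory SQLite database. SQLite and the RSQLite package are assumptions chosen so the example runs without a cluster; with Impala or Spark, the interface package, driver, and engine would differ, and a backend package would sit between dbplyr and DBI.

```r
library(DBI)
library(dplyr)   # dbplyr must also be installed; dplyr uses it for remote tables

# DBI-compatible interface package and driver: RSQLite (assumed for this sketch)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Put a small table in the database so there is something to query
DBI::dbWriteTable(con, "cars", mtcars)

# tbl() returns a lazy reference to the remote table; no rows are fetched yet
cars_tbl <- tbl(con, "cars")

# dplyr verbs are translated to SQL by dbplyr and sent through DBI to the engine
by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(by_cyl)

DBI::dbDisconnect(con)
```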
sparklyr
• Provides a SQL backend to dplyr for Spark
• Also exposes the MLlib API and a subset of the Spark DataFrames API
• Developed by RStudio
spark.rstudio.com
implyr
• Provides a SQL backend to dplyr for Impala
• Uses ODBC or JDBC to connect to Impala
• Developed at Cloudera
tiny.cloudera.com/implyr
Five tips for using dplyr with SQL data sources
1. Use show_query()
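A minimal sketch of this tip, assuming an in-memory SQLite table created with `dbplyr::memdb_frame()` (which requires RSQLite). The table and column names are hypothetical.

```r
library(dplyr)
library(dbplyr)

# memdb_frame() copies a small local table into a temporary in-memory
# SQLite database and returns a lazy reference to it.
flights_tbl <- memdb_frame(
  carrier   = c("AA", "AA", "UA"),
  dep_delay = c(5, 20, -3)
)

query <- flights_tbl %>%
  filter(dep_delay > 0) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

# Print the SQL that dplyr would send to the database, without running it
show_query(query)
```

Reading the generated SQL is a quick way to confirm that a pipeline does what was intended before it runs against a large table.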
2. filter() early, arrange() late
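One way to read this tip, sketched on a hypothetical in-memory SQLite table (an assumption; requires RSQLite): reduce the rows first so later steps touch less data, and sort only at the end, since ordering applied in the middle of a pipeline may not be preserved by the SQL engine.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table for illustration
flights_tbl <- memdb_frame(
  carrier   = c("AA", "UA", "AA", "DL", "UA"),
  dep_delay = c(12, -4, 45, 30, 7)
)

delayed <- flights_tbl %>%
  filter(dep_delay > 0) %>%          # filter early: drop rows before other work
  select(carrier, dep_delay) %>%
  arrange(desc(dep_delay)) %>%       # arrange late: sort as the final remote step
  collect()

print(delayed)
```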
3. Check your data types
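A minimal sketch of one way to check types, assuming SQLite via RSQLite as the engine: write a small data frame to the database, read it back, and compare column classes. SQLite has no native logical type, so classes can change on the round trip; other engines have their own type mappings. The table name `types_demo` is hypothetical.

```r
library(DBI)
library(dplyr)   # dbplyr must also be installed

local_df <- data.frame(
  id     = 1:3,
  flag   = c(TRUE, FALSE, TRUE),
  amount = c(1.5, 2, 3.25),
  label  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "types_demo", local_df)

# Read the table back through dplyr and bring it into R
round_trip <- tbl(con, "types_demo") %>% collect()

# Compare the column classes before and after the round trip
print(sapply(local_df, class))
print(sapply(round_trip, class))

DBI::dbDisconnect(con)
```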
4. Know your SQL engine
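The sketch below shows that the same pipeline can be rendered as different SQL for different engines. It uses dbplyr's simulated connections, so no database is needed. The choice of engines to compare (Impala and Microsoft SQL Server) and the helper name `build_query` are assumptions for illustration.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() builds a table reference with no real database behind it,
# so only the SQL translation can be inspected, not the results.
build_query <- function(con) {
  lazy_frame(carrier = "AA", dep_delay = 1, con = con) %>%
    filter(dep_delay > 0) %>%
    head(5)
}

sql_impala <- sql_render(build_query(simulate_impala()))
sql_mssql  <- sql_render(build_query(simulate_mssql()))

# Compare the two renderings of the same dplyr pipeline
print(sql_impala)
print(sql_mssql)
```

Functions and syntax that one engine supports may be missing or behave differently in another, so it helps to know which dialect the generated SQL targets.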
5. Know when to collect()
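A minimal sketch of the difference between a lazy remote query and a collected result, assuming an in-memory SQLite table (requires RSQLite). The idea illustrated is to aggregate in the database first and call `collect()` only on the small result.

```r
library(dplyr)
library(dbplyr)

# Hypothetical remote table with 1000 rows
remote <- memdb_frame(x = 1:1000, g = rep(c("a", "b"), 500))

# Still lazy: this only describes a query; nothing has been pulled into R
summary_remote <- remote %>%
  group_by(g) %>%
  summarise(total = sum(x, na.rm = TRUE))

print(inherits(summary_remote, "tbl_lazy"))

# collect() runs the query and returns the (small) result as a local tibble
local_summary <- collect(summary_remote)
print(local_summary)
```

Collecting before aggregating would move every row into R's memory, which is often impractical at large scale.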
Questions?
Ian Cook
@ianmcook
ian@cloudera.com
Cloudera Data Science Workbench
More information
tiny.cloudera.com/cdsw
OnDemand training
tiny.cloudera.com/cdsw-training

Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces to Large-scale Data
