Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces to Large-scale Data


Cloudera aims to empower data analysts and scientists to efficiently work with large-scale distributed data using tools like Apache Spark and Impala. The dplyr package facilitates common data manipulation tasks and translates commands for remote data sources into SQL, making it usable for both local and distributed environments. Key tips for effective use with SQL data sources include using show_query(), filtering early, checking data types, understanding your SQL engine, and knowing when to collect data.


© Cloudera, Inc. All rights reserved.
dplyr Interfaces to Large-Scale Data
Ian Cook
@ianmcook
ian@cloudera.com

Context
Mission for Cloudera: Provide a platform for data analysts and data scientists to efficiently query, analyze, and model large-scale data in clusters and cloud storage
• By distributing Apache Spark, Apache Impala, and other tools
• By enabling productive use of these tools
Python and R users often have difficulty moving from smaller data to large-scale distributed data
• Familiar packages and methods don’t work the same way on distributed data

Poll question

[Figure: diagram with the labels SQL, PySpark, and SparkR, each paired with "SQL or DataFrame API"]

Poll question

[Figure: diagram with the labels SQL, PySpark, SparkR, and dplyr]

dplyr
dplyr provides a set of verbs that perform common data manipulation steps
• select() to select columns
• filter() to filter rows
• arrange() to order rows
• mutate() to create new columns
• summarise() to aggregate
• group_by() to perform operations by group
dplyr works on local data and with remote data sources
• For remote sources, dplyr commands are translated into SQL
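The verbs listed on this slide can be chained into a single pipeline. The following is a minimal sketch on local data, not code from the talk; it assumes the built-in mtcars data frame as stand-in data, and the column name hp_per_mpg is made up for illustration.

```r
library(dplyr)

# mtcars ships with base R; it is used here only as stand-in local data
result <- mtcars %>%
  select(cyl, mpg, hp) %>%                       # select columns
  filter(hp > 100) %>%                           # filter rows
  mutate(hp_per_mpg = hp / mpg) %>%              # create a new column (hypothetical name)
  group_by(cyl) %>%                              # perform operations by group
  summarise(n = n(), mean_mpg = mean(mpg)) %>%   # aggregate
  arrange(desc(mean_mpg))                        # order rows

print(result)
```

The same pipeline style carries over to remote tables, where dplyr generates SQL instead of computing in R.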

Poll question

Demonstration
Example code at
github.com/ianmcook/dplyr-examples

dplyr SQL backends
dplyr
↕
dbplyr
↕
dplyr SQL backend package*
↕
DBI
↕
DBI-compatible interface package
↕
database driver or connector
↕
database/engine
* optional
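To make the layers above concrete, here is a hedged sketch that walks the stack with an in-memory SQLite database standing in for a cluster engine. It is an assumption for illustration only: RSQLite plays the role of the DBI-compatible interface package, no separate backend package is used (the optional layer), and the flights table and its columns are invented.

```r
library(dplyr)   # verbs; dbplyr is used behind the scenes for database tables
library(DBI)     # generic database interface

# DBI-compatible interface package (RSQLite) plus engine (in-memory SQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical sample table, written through DBI
dbWriteTable(con, "flights",
             data.frame(carrier = c("AA", "UA", "AA"),
                        dep_delay = c(5, 20, -3)))

# Lazy reference to the remote table; no rows are pulled into R yet
flights <- tbl(con, "flights")

# dplyr verbs are translated to SQL by dbplyr and run by the engine
delays <- flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
print(delays)
```

With Spark or Impala, the connection step would differ, but the dplyr code after tbl() would follow the same pattern.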

sparklyr
• Provides a SQL backend to dplyr for Spark
• Also exposes the MLlib API and a subset of the Spark DataFrames API
• Developed by RStudio
spark.rstudio.com

implyr
• Provides a SQL backend to dplyr for Impala
• Uses ODBC or JDBC to connect to Impala
• Developed at Cloudera
tiny.cloudera.com/implyr

Five tips for using dplyr
with SQL data sources

1. Use show_query()
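As an illustration of this tip, the sketch below inspects the SQL that dplyr would send before anything is executed. It assumes dbplyr's memdb_frame() helper, which copies a small data frame into a temporary in-memory SQLite database; the columns g and x are hypothetical, and the exact SQL text may vary by dbplyr version and engine.

```r
library(dplyr)
library(dbplyr)

# Small hypothetical table in a temporary in-memory SQLite database
remote <- memdb_frame(g = c("a", "a", "b"), x = c(1, 2, 3))

query <- remote %>%
  filter(x > 1) %>%
  group_by(g) %>%
  summarise(total = sum(x, na.rm = TRUE))

# Prints the generated SQL; the query itself is not run
show_query(query)

# The same SQL as a character string, for programmatic inspection
sql_text <- as.character(remote_query(query))
```

Reading the generated SQL is a quick way to confirm that an R expression was translated as intended.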

2. filter() early, arrange() late
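One plausible reading of this tip, sketched below, is that filtering first reduces the rows every later step must process, while sorting is deferred until only the small final result remains. The table and column names are invented, and memdb_frame() with in-memory SQLite is an assumption standing in for a real cluster engine.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table in a temporary in-memory SQLite database
remote <- memdb_frame(carrier = c("AA", "UA", "AA", "DL"),
                      dep_delay = c(5, 20, -3, 12))

top <- remote %>%
  filter(dep_delay > 0) %>%                                 # early: drop rows first
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))                                 # late: sort the final result

show_query(top)          # inspect where WHERE and ORDER BY land in the SQL
result <- collect(top)
print(result)
```

Placing arrange() last also avoids relying on row order surviving intermediate steps, which SQL engines generally do not guarantee.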

3. Check your data types

4. Know your SQL engine

5. Know when to collect()
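A short sketch of what collect() changes, under the same assumptions as before (dbplyr's memdb_frame() with in-memory SQLite as a stand-in engine, hypothetical columns g and x): the pipeline stays a lazy query until collect() runs it and returns a local data frame, so it usually makes sense to collect only after the data has been reduced.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table in a temporary in-memory SQLite database
remote <- memdb_frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# Still a lazy query: no result has been brought into R
summary_tbl <- remote %>%
  group_by(g) %>%
  summarise(n = n())
is_lazy <- inherits(summary_tbl, "tbl_lazy")

# collect() executes the query and returns a local tibble
local_df <- collect(summary_tbl)
print(local_df)
```

Once the result is local, functions that only work on in-memory data (plotting, modeling) can be applied to it.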

Questions?
Ian Cook
@ianmcook
ian@cloudera.com

Cloudera Data Science Workbench
More information: tiny.cloudera.com/cdsw
OnDemand training: tiny.cloudera.com/cdsw-training
