Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa from Osaka
Many thanks to Holden Karau for the discussion we had about this talk.
Agenda
Who am I?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I?
Software engineer working for Sky, from architecture design to troubleshooting in the field
Translator working with O’Reilly Japan
‘Learning Spark’ is my 27th book
Received the Rakuten tech award Silver 2010 for translating ‘Hadoop: The Definitive Guide’
A bed for 6 cats
Works of 2015
Available
Jan, 2016 ?
Works of past
Motivation for
today’s talk
I want to deal with my ‘Big’ data, WITH PYTHON!!
Apache Spark
Apache Spark
You may already have heard a lot about it
Fast, distributed data processing framework with high-level APIs
Written in Scala, runs on the JVM
[Diagram: the Hadoop ecosystem stack. OS; HDFS; YARN; HBase; MapReduce (Hive etc.); Impala etc. (in-memory SQL engines); Spark (Spark Streaming, MLlib, GraphX, Spark SQL).]
Why it’s fast
No need to write temporary data to storage every time
No need to invoke a JVM process every time
[Diagram: MapReduce vs. Spark. MapReduce: every map and reduce step involves a JVM invocation and I/O against HDFS. Spark: the executor (JVM) is invoked once; f1 reads data from HDFS into an RDD, and f2 through f7 access RDDs in memory, with I/O only where needed, e.g. f4 (persist to storage) and f5 (does shuffle).]
Apache Spark and non-JVM languages
Spark supports non-JVM languages
Shells
PySpark, for Python users
SparkR, for R users
GUI environments: Jupyter, RStudio
You can write application code in these languages
The Web UI tells us a lot
http://<address>:4040
Performance problems with those languages
Data processing with those languages may be several times slower than with JVM languages
The reason lies in the architecture: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more CPU cores to make up the performance gap
DataFrame APIs come to the rescue!
DataFrame
Tabular data with a schema, built on top of RDDs
Successor of SchemaRDD (renamed DataFrame in Spark 1.3; available from SparkR since 1.4)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
Do it within the JVM
When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime
As a result, performance is almost the same as with JVM languages
Only code goes through
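DataFrame APIs compared to RDD APIs by Examples
[Diagram: RDD API. The cached DataFrame lives in the executor JVM. The predicate
    lambda items: items[0] == 'abc'
runs in Python, so the data is transferred from the JVM to Python and the result is transferred back before it reaches the driver.]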
DataFrame APIs compared to RDD APIs by Examples
[Diagram: DataFrame API. The expression
    filter(df["_1"] == "abc")
is evaluated inside the executor JVM on the cached DataFrame; only the result DataFrame is transferred to the driver.]
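The two diagrams above can be sketched in PySpark roughly as follows. This is only a minimal sketch under assumptions: it uses the `SparkSession` entry point (Spark 2.x or later, which postdates this talk), a made-up three-row DataFrame, and the hypothetical names `is_abc` and `compare_filters`. pyspark is imported inside the function so that the plain-Python predicate can be tried without a Spark installation.

```python
def is_abc(items):
    """Row predicate for the RDD version; it runs in a Python worker."""
    return items[0] == "abc"


def compare_filters():
    """Filter the same data through the RDD API and the DataFrame API."""
    # Imported here so the predicate above is usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("filter-demo")
             .getOrCreate())

    # Hypothetical sample data; columns are named _1 and _2 as on the slide.
    df = spark.createDataFrame(
        [("abc", 1), ("def", 2), ("abc", 3)], ["_1", "_2"]
    ).cache()

    # RDD API: rows are serialized from the JVM to a Python worker,
    # is_abc runs there, and the matching rows are sent back.
    rdd_rows = df.rdd.filter(is_abc).collect()

    # DataFrame API: only the expression is sent to the JVM; rows are
    # filtered there and just the result reaches the Python driver.
    df_rows = df.filter(df["_1"] == "abc").collect()

    spark.stop()
    return rdd_rows, df_rows
```

Calling `compare_filters()` on a machine with Spark available should return the same rows from both paths; the difference is where the filtering work happens, which the Web UI mentioned earlier can help confirm.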
Watch out for UDFs
You can write UDFs in Python
You can use lambdas in Python, too
Once you use them, data flows between the two worlds
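    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Python UDF: the values of df.name are sent to a Python worker to compute len()
    slen = udf(lambda s: len(s), IntegerType())
    df.select(slen(df.name)).collect()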
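For this particular UDF there is a way to stay inside the JVM: `pyspark.sql.functions.length` is a built-in column function. The sketch below puts the two side by side. It assumes Spark 2.x or later for `SparkSession`, uses a made-up one-column DataFrame, and the names `safe_len` and `name_lengths` are hypothetical. Unlike the slide's lambda, `safe_len` tolerates missing values, since `len(None)` would raise an error.

```python
def safe_len(s):
    """Length of a string, or None for a missing value."""
    return None if s is None else len(s)


def name_lengths():
    """Compute string lengths with a Python UDF and with a built-in function."""
    # Imported here so safe_len can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import length, udf
    from pyspark.sql.types import IntegerType

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("udf-demo")
             .getOrCreate())

    # Hypothetical sample data with a single 'name' column.
    df = spark.createDataFrame([("tamagawa",), ("karau",)], ["name"])

    # Python UDF: values cross from the JVM to a Python worker and back.
    slen = udf(safe_len, IntegerType())
    via_udf = df.select(slen(df["name"]).alias("n")).collect()

    # Built-in function: the expression is evaluated inside the JVM.
    via_builtin = df.select(length(df["name"]).alias("n")).collect()

    spark.stop()
    return via_udf, via_builtin
```

When a built-in function covers the need, it is usually the first thing to try; a Python UDF remains the fallback for logic the DataFrame API cannot express.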
Make it small first, then use UDFs
Filter or sample your ‘big’ data with DataFrame APIs
Then use UDFs
The SQL optimizer does not take the cost of UDFs into account when making plans (so far)
[Diagram: ‘BIG’ data in DataFrame -> filtering with ‘native APIs’ -> ‘Small’ data in DataFrame -> whatever operation with UDFs]
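    # sqc is assumed to be a SQLContext; a UDF must be registered to be callable from SQL
    sqc.registerFunction("slen", lambda s: len(s), IntegerType())
    sqc.sql("""
        select…
        from df
        where fname like "tama%"
        and slen(name)
    """).collect()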
processed first !
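The "make it small first" advice can be sketched as two explicit steps. This is a minimal sketch, not the speaker's code: the function names `name_length` and `long_tama_names`, the `min_len` threshold, and the choice to cache the intermediate result are all assumptions (the slide's own predicate on `slen(name)` is elided). Caching and materializing the filtered DataFrame is one way to make sure the UDF only sees the small data; how the optimizer orders the two predicates without it may depend on the Spark version.

```python
def name_length(s):
    """Plain-Python logic behind the UDF; a missing value stays None."""
    return None if s is None else len(s)


def long_tama_names(df, min_len=5):
    """Rows whose fname starts with 'tama' and whose name exceeds min_len.

    df is assumed to be a Spark DataFrame with string columns
    'fname' and 'name'.
    """
    # Imported here so name_length can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    slen = udf(name_length, IntegerType())

    # Step 1: shrink the data with a native expression, evaluated in the JVM.
    small = df.filter(df["fname"].like("tama%")).cache()
    small.count()  # materialize the cached, filtered rows

    # Step 2: apply the Python UDF to the small, cached DataFrame only.
    return small.filter(slen(small["name"]) > min_len)
```

The extra `count()` costs one pass over the data, so this trade-off makes sense mainly when the native filter removes most of the rows.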
Ingesting Data
Handling files such as CSVs through a non-JVM driver is slow
So convert raw data to a ‘DataFrame-native’ format such as Parquet first
Such files can be processed directly by the JVM processes (executors), even when you use a non-JVM language
[Diagram: local data on the driver machine passes through the non-JVM driver and Py4J into a DataFrame in the executor JVM, and is stored in HDFS as Parquet.]
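Ingesting Data
[Diagram: once the data is in HDFS as Parquet, only code passes from the driver on the driver machine through Py4J to the executor JVM; the executors read the Parquet files directly.]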
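A one-off conversion from CSV to Parquet might look like the sketch below. It rests on assumptions: `spark` is an existing `SparkSession`, `spark.read.csv` is the built-in CSV reader of Spark 2.0 and later (around the time of this talk a separate CSV package was typically needed), the CSV has a header row, and the helper names and the output naming convention are hypothetical.

```python
import os


def parquet_path_for(csv_path):
    """Derive an output location next to the input: x.csv -> x.parquet."""
    root, _ext = os.path.splitext(csv_path)
    return root + ".parquet"


def csv_to_parquet(spark, csv_path):
    """Read a CSV once, write it as Parquet, and return the Parquet-backed DataFrame.

    spark is assumed to be an active SparkSession.
    """
    # The executors parse the CSV; header and schema inference are assumptions
    # about the input file.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    out_path = parquet_path_for(csv_path)
    df.write.mode("overwrite").parquet(out_path)

    # Later jobs start from the Parquet copy and never touch the CSV again.
    return spark.read.parquet(out_path)
```

For example, `csv_to_parquet(spark, "hdfs:///raw/events.csv")` would write `hdfs:///raw/events.parquet` (both paths are made up) and return a DataFrame that the executors read directly, with only code passing through the Python driver.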
Appendix: Parquet
Parquet: a general-purpose file format for analytic workloads
Columnar storage: reduces I/O significantly
High compression rate
Projection pushdown
Today, workloads are becoming CPU-intensive: Parquet offers very fast reads and is aware of CPU internals
