Sparkling Pandas
Scaling Pandas beyond a single machine
(or letting Pandas Roam)
With Special thanks to Juliet Hougland :)
Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Engineer at Alpine Data Labs
  ○ previously Databricks, Google, Foursquare, Amazon
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
What is Pandas?
user_id  panda_type
01234    giant
12345    red
23456    giant
34567    giant
45678    red
56789    giant
● DataFrames: indexed, tabular data structures
● Easy slicing, indexing, subsetting/filtering
● Excellent support for time series data
● Data alignment and reshaping
http://pandas.pydata.org/
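The table and the first two bullets can be tried out directly. This is a minimal sketch that assumes pandas is installed; the variable names are illustrative and do not come from the talk.

```python
import pandas as pd

# The example table from the slide. user_id is kept as a string so the
# leading zero in "01234" survives.
pandas_df = pd.DataFrame({
    "user_id": ["01234", "12345", "23456", "34567", "45678", "56789"],
    "panda_type": ["giant", "red", "giant", "giant", "red", "giant"],
})

# Indexed: use user_id as the row index.
by_user = pandas_df.set_index("user_id")

# Label-based lookup through the index.
first_type = by_user.loc["01234", "panda_type"]

# Boolean filtering keeps only the matching rows.
red_pandas = by_user[by_user["panda_type"] == "red"]

# Count how many rows fall into each category.
type_counts = by_user["panda_type"].value_counts()
```

The index is the part that Spark DataFrames lack, which is why it comes up again later in the talk.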
What is Spark?
A fast, general engine for in-memory data processing.
tl;dr - 100x faster than Hadoop MapReduce*
The different pieces of Spark
[Diagram: components of Apache Spark]
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML & MLLib
● Community Packages
Some Spark terms
Spark Context (aka sc)
● The window to the world of Spark
sqlContext
● The window to the world of DataFrames
Transformation
● Takes an RDD (or DataFrame) and returns a new RDD or DataFrame
Action
● Causes an RDD to be evaluated (often storing the result)
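The split between transformations and actions can be illustrated without Spark. The sketch below uses a toy class, LazyDataset, which is hypothetical and not part of any Spark API: transformations only record what should happen, and the action runs the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD; not part of Spark."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: recorded the same way, nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: run every recorded step and return the results.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)  # lets us observe when the work actually happens
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
pipeline = numbers.map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: no transformation has run
result = pipeline.collect()       # the action triggers evaluation
```

Real Spark adds partitioning, scheduling and fault tolerance on top of this idea; the sketch only shows the deferred evaluation.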
DataFrames between Spark & Pandas
Spark
● Fast
● Distributed
● Limited API
● Some ML
● I/O Options
● Not indexed
Pandas
● Fast
● Single Machine
● Full Feature API
● Integration with ML
● Different I/O Options
● Indexed
● Easy to visualize
Panda image by Peter Beardsley
Simple Spark SQL Example
input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
# Note the trailing space after retweetCount so the two string pieces join correctly
topTweets = sqlContext.sql("SELECT text, retweetCount " +
    "FROM tweets ORDER BY retweetCount LIMIT 10")
local = topTweets.collect()
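The query above gives ORDER BY no direction, and SQL sorts ascending by default, so LIMIT 10 returns the least retweeted rows. The sketch below shows the difference using Python's built-in sqlite3 module as a stand-in for Spark SQL. The rows are made up for illustration only.

```python
import sqlite3

# In-memory SQLite table standing in for the registered "tweets" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, retweetCount INTEGER)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("a", 5), ("b", 50), ("c", 1), ("d", 20)],  # made-up rows
)

# Default ordering is ascending: the least retweeted rows come first.
ascending = conn.execute(
    "SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 2"
).fetchall()

# DESC puts the most retweeted rows first.
descending = conn.execute(
    "SELECT text, retweetCount FROM tweets ORDER BY retweetCount DESC LIMIT 2"
).fetchall()
conn.close()
```

If the goal is the most retweeted tweets, adding DESC to the Spark SQL query should have the same effect.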
Convert a Spark DataFrame to Pandas
import pandas
...
ddf = sqlContext.read.json("hdfs://...")
# Some Spark transformations
transformedDdf = ddf.filter(ddf['age'] > 21)
# toPandas() collects every row to the driver, so filter first
localDf = transformedDdf.toPandas()
Convert a Pandas DataFrame to Spark
import pandas
...
df = pandas.DataFrame(...)
...
ddf = sqlContext.createDataFrame(df)
Let’s combine the two
● Spark DataFrames already provide some of what we need
  ○ Add UDFs / UDAFs
  ○ Use bits of Pandas code
● http://spark-packages.org - an excellent place to get libraries
So where does the PB&J go?
[Diagram labels: Spark DataFrame; Sparkling Pandas API; Custom UDFs; Pandas Code; Sparkling Pandas Scala Code; PySpark RDDs; Pandas Code; Internal State]
Extending Spark - adding index support
self._index_names

def collect(self):
    """Collect the elements in a DataFrame
    and concatenate the partitions."""
    df = self._schema_rdd.toPandas()
    df = _update_index_on_df(df, self._index_names)
    return df
Extending Spark - adding index support
def _update_index_on_df(df, index_names):
    if index_names:
        df = df.set_index(index_names)
        # Remove names from unnamed indexes
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df
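To see what the helper does to a collected frame, here is a self-contained sketch that assumes pandas is installed. The slides do not show _denormalize_names, so the version below, along with the "_unnamed_index_" placeholder convention, is a hypothetical stand-in and not the project's actual implementation.

```python
import pandas as pd

# Hypothetical placeholder prefix for indexes that had no name; the real
# convention used by Sparkling Pandas is not shown in the slides.
UNNAMED_PREFIX = "_unnamed_index_"


def _denormalize_names(index_names):
    # Stand-in helper: map placeholder names back to None.
    return [None if name.startswith(UNNAMED_PREFIX) else name
            for name in index_names]


def _update_index_on_df(df, index_names):
    # Same logic as on the slide.
    if index_names:
        df = df.set_index(index_names)
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df


# A flat frame, as it might look after toPandas(): the index values
# arrive as an ordinary column.
flat = pd.DataFrame({
    "_unnamed_index_0": [10, 11],
    "panda_type": ["giant", "red"],
})
restored = _update_index_on_df(flat, ["_unnamed_index_0"])
untouched = _update_index_on_df(flat, [])
```

With no index names recorded, the frame is returned unchanged; otherwise the index columns are moved back into the index and any placeholder name is cleared.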
Adding a UDF in Python
from pyspark.sql.types import IntegerType

sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
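One thing to watch with a UDF like this: Spark hands SQL NULL values to Python UDFs as None, and len(None) raises a TypeError. A null-safe variant might look like the sketch below. The name str_len_or_none is hypothetical, and the registration call is left as a comment because it needs a live sqlContext.

```python
def str_len_or_none(value):
    """Return the length of a string, or None for a missing value."""
    if value is None:
        return None
    return len(value)


# It could be registered the same way as the lambda on the slide, e.g.
# sqlContext.registerFunction("strLenPython", str_len_or_none, IntegerType())

lengths = [str_len_or_none(v) for v in ["panda", "", None]]
```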
Extending Spark SQL w/Scala for fun & profit
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column =
    new Column(Kurtosis(EvilSqlTools.getExpr(e)))
  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}
Extending Spark SQL w/Scala for fun & profit
from pyspark import SparkContext
from pyspark.sql import Column

def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        # Look up the Scala function by name on the JVM-side functions object
        f = sc._jvm.com.sparklingpandas.functions
        jc = getattr(f, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}
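The pattern here is a small function factory: one Python wrapper is generated per entry in _functions and dispatches by name. The sketch below shows the same pattern without a JVM. fake_jvm_functions is a hypothetical stand-in for sc._jvm.com.sparklingpandas.functions, and setting __name__ and __doc__ on the wrapper is an addition that is not on the slide.

```python
from types import SimpleNamespace

# Stand-in for the Py4J proxy to the Scala `functions` object; it just
# builds a string so the dispatch can be observed.
fake_jvm_functions = SimpleNamespace(
    kurtosis=lambda col: "kurtosis(%s)" % col,
)


def _create_function(name, doc=""):
    def _(col):
        # Look the function up by name at call time, then invoke it.
        return getattr(fake_jvm_functions, name)(col)
    _.__name__ = name  # assumption: makes the wrapper easier to inspect
    _.__doc__ = doc
    return _


_functions = {
    "kurtosis": "Calculate the kurtosis, maybe!",
}

# Build one wrapper per entry, keyed by function name.
wrappers = {name: _create_function(name, doc)
            for name, doc in _functions.items()}
kurtosis = wrappers["kurtosis"]
expression = kurtosis("x")
```

Adding another Scala function to the Python API then only needs a new entry in the dictionary.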
Simple graphing with Sparkling Pandas
import matplotlib.pyplot as plt
plot = speaker_pronouns["pronoun"].plot()
plot.get_figure().savefig("/tmp/fig")
(Not yet merged in)
Why is SparklingPandas fast*?
Keep stuff in the JVM as much as possible.
Lazy operations
Distributed
*For really flexible versions of the word fast
Coffee by eltpics
Panda image by Stéfan
Panda image by cactusroot
Supported operations:
DataFrames
● to_spark_sql
● applymap
● groupby
● collect
● stats
● query
● axes
● ftype
● dtype
Context
● simple
● read_csv
● from_data_frame
● parquetFile
● read_json
● stop
GroupBy
● groups
● indices
● first
● median
● mean
● sum
● aggregate
Always onwards and upwards
[Graph: work done (y-axis) over time (x-axis), from "Now" to a "Hypothetical, Wonderful Future"]
Related Works
Blaze
● http://continuum.io/blog/blaze
Adatao’s Distributed DataFrame
● http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
Numba
● http://numba.pydata.org/
Using Sparkling Pandas
You can get Sparkling Pandas from:
● Website: http://www.sparklingpandas.com
● Code: https://github.com/sparklingpandas/sparklingpandas
● Mailing list: https://groups.google.com/d/forum/sparklingpandas
Getting Sparkling Pandas friends
The examples from this talk will get merged into master.
Pandas
● http://pandas.pydata.org/ (or pip)
Spark
● http://spark.apache.org/
many pandas by David Goehring
Any questions?

Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
