Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Sparkling Pandas
Scaling Pandas beyond a single machine
(or letting Pandas Roam)
With Special thanks to Juliet Hougland :)

Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Engineer at Alpine Data Labs
○ previously Databricks, Google, Foursquare, Amazon
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau

What is Pandas?
user_id  panda_type
01234    giant
12345    red
23456    giant
34567    giant
45678    red
56789    giant
● DataFrames: indexed, tabular data structures
● Easy slicing, indexing, subsetting/filtering
● Excellent support for time series data
● Data alignment and reshaping
http://pandas.pydata.org/
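
As a minimal sketch of the features listed on this slide, the sample table can be built and queried in plain Pandas. The variable names are illustrative, and user_id is kept as a string so that the leading zero in "01234" survives.

```python
import pandas as pd

# The sample table from the slide; user_id is a string to keep "01234" intact.
pandas_df = pd.DataFrame({
    "user_id": ["01234", "12345", "23456", "34567", "45678", "56789"],
    "panda_type": ["giant", "red", "giant", "giant", "red", "giant"],
})

# Indexing: make user_id the index so rows can be looked up by label.
indexed = pandas_df.set_index("user_id")
first_type = indexed.loc["01234", "panda_type"]

# Filtering: keep only the red pandas.
red = indexed[indexed["panda_type"] == "red"]

# Aggregation: count pandas per type.
counts = indexed["panda_type"].value_counts().to_dict()
```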

What is Spark?
A fast, general engine for in-memory data processing.
tl;dr - up to 100x faster than Hadoop MapReduce*

The different pieces of Spark
Apache Spark
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML & MLlib
● Community Packages

Some Spark terms
Spark Context (aka sc)
● The window to the world of Spark
sqlContext
● The window to the world of DataFrames
Transformation
● Takes an RDD (or DataFrame) and returns a new RDD or DataFrame
Action
● Causes an RDD to be evaluated (often storing the result)
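
The transformation/action split can be illustrated without a cluster. The toy class below is not Spark's API: `LazyDataset` and its methods are hypothetical names that only mimic the idea that transformations record work and an action triggers it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, does no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset, does no work yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded steps and materialize the result.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def double(x):
    calls.append(x)  # track when the function actually runs
    return x * 2

numbers = LazyDataset(range(5))
pipeline = numbers.map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has been evaluated yet
result = pipeline.collect()       # evaluation happens here
```

In this sketch `double` is not called until `collect()` runs, which mirrors how an action forces the evaluation of the transformations before it.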

Dataframes between Spark & Pandas
Spark
● Fast
● Distributed
● Limited API
● Some ML
● I/O Options
● Not indexed
Pandas
● Fast
● Single Machine
● Full-featured API
● Integration with ML
● Different I/O Options
● Indexed
● Easy to visualize

Simple Spark SQL Example
input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
# Note the trailing space so the two string pieces do not run together
topTweets = sqlContext.sql("SELECT text, retweetCount " +
                           "FROM tweets ORDER BY retweetCount LIMIT 10")
local = topTweets.collect()

Convert a Spark DataFrame to Pandas
import pandas
...
ddf = sqlContext.read.json("hdfs://...")
# Some Spark transformations
transformedDdf = ddf.filter(ddf['age'] > 21)
return transformedDdf.toPandas()

Convert a Pandas DataFrame to Spark
import pandas
...
df = pandas.DataFrame(...)
...
ddf = sqlContext.createDataFrame(df)

Let’s combine the two
● Spark DataFrames already provide some of what we need
○ Add UDFs / UDAFs
○ Use bits of Pandas code
● http://spark-packages.org - excellent place to get libraries

So where does the PB&J go?
[Diagram components: Spark DataFrame; Sparkling Pandas API; Custom UDFs; Pandas Code; Sparkling Pandas Scala Code; PySpark RDDs; Pandas Code; Internal State]

Extending Spark - adding index support
# The index names are kept in self._index_names
def collect(self):
    """Collect the elements in a DataFrame
    and concatenate the partitions."""
    df = self._schema_rdd.toPandas()
    df = _update_index_on_df(df, self._index_names)
    return df

Extending Spark - adding index support
def _update_index_on_df(df, index_names):
    if index_names:
        df = df.set_index(index_names)
        # Remove names from unnamed indexes
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df
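
Because Spark DataFrames are not indexed, one way to read the two slides above is that the index travels as ordinary columns and is restored on collect. The sketch below shows that round trip in plain Pandas only. `flatten_index` and `restore_index` are hypothetical helpers, not Sparkling Pandas functions, and the sketch assumes every index level has a name.

```python
import pandas as pd

def flatten_index(df):
    """Move the index into ordinary columns and remember its names."""
    index_names = list(df.index.names)
    return df.reset_index(), index_names

def restore_index(df, index_names):
    """Rebuild the index from the remembered column names."""
    if index_names:
        df = df.set_index(index_names)
    return df

original = pd.DataFrame(
    {"panda_type": ["giant", "red"]},
    index=pd.Index(["01234", "12345"], name="user_id"),
)
flat, names = flatten_index(original)   # what a non-indexed store would hold
restored = restore_index(flat, names)   # what a collect() would hand back
```

Unnamed indexes need extra care, since `reset_index` gives them a placeholder column name; that is presumably the case the `_denormalize_names` call on the slide handles.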

Adding a UDF in Python
from pyspark.sql.types import IntegerType
sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())

Extending Spark SQL w/Scala for fun & profit
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column =
    new Column(Kurtosis(EvilSqlTools.getExpr(e)))
  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}

Extending Spark SQL w/Scala for fun & profit
def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        # Look up the Scala function by name on the JVM-side object
        f = sc._jvm.com.sparklingpandas.functions
        jc = getattr(f, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}
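
The wrapper pattern above, which generates one Python function per entry in a name-to-docstring dict, can be tried without a JVM. In this sketch `FakeBackend` is a hypothetical stand-in for the `sc._jvm.com.sparklingpandas.functions` object, and its `kurtosis` is a plain-Python population kurtosis used for illustration only.

```python
class FakeBackend:
    """Hypothetical stand-in for the JVM-side functions object."""

    @staticmethod
    def kurtosis(values):
        # Population kurtosis: fourth central moment over squared variance.
        n = len(values)
        mean = sum(values) / n
        m2 = sum((v - mean) ** 2 for v in values) / n
        m4 = sum((v - mean) ** 4 for v in values) / n
        return m4 / (m2 ** 2)


def _create_function(name, doc=""):
    def _(col):
        # Look the function up by name on the backend, then call it.
        return getattr(FakeBackend, name)(col)
    _.__name__ = name
    _.__doc__ = doc
    return _


_functions = {
    "kurtosis": "Calculate the kurtosis, maybe!",
}

# Build one wrapper per entry, then expose the one we want by name.
generated = {name: _create_function(name, doc) for name, doc in _functions.items()}
kurtosis = generated["kurtosis"]

value = kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
```

Adding another function is then a matter of adding one dict entry, provided the backend object exposes an attribute with the same name.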

Simple graphing with Sparkling Pandas
import matplotlib.pyplot as plt
plot = speaker_pronouns["pronoun"].plot()
plot.get_figure().savefig("/tmp/fig")
(Not yet merged in)

Why is SparklingPandas fast*?
● Keep stuff in the JVM as much as possible
● Lazy operations
● Distributed
*For really flexible versions of the word fast
Coffee by eltpics
Panda image by Stéfan
Panda image by cactusroot

Supported operations:
DataFrames
● to_spark_sql
● applymap
● groupby
● collect
● stats
● query
● axes
● ftype
● dtype
Context
● simple
● read_csv
● from_data_frame
● parquetFile
● read_json
● stop
GroupBy
● groups
● indices
● first
● median
● mean
● sum
● aggregate
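
Several of the GroupBy operations listed above share their names with plain Pandas methods. As a point of reference, this is how they behave in Pandas itself on a small frame; the `weight_kg` values are invented for illustration. Whether Sparkling Pandas matches each result exactly is not something the slides state.

```python
import pandas as pd

df = pd.DataFrame({
    "panda_type": ["giant", "red", "giant", "giant", "red"],
    "weight_kg": [100.0, 5.0, 110.0, 120.0, 7.0],
})

# groups maps each group label to the row labels that belong to it.
group_rows = {k: list(v) for k, v in df.groupby("panda_type").groups.items()}

grouped = df.groupby("panda_type")["weight_kg"]
means = grouped.mean().to_dict()
medians = grouped.median().to_dict()
sums = grouped.sum().to_dict()
firsts = grouped.first().to_dict()

# aggregate accepts a list of function names and returns one column per function.
summary = grouped.aggregate(["min", "max"])
```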

Always onwards and upwards
[Figure: work done (y-axis) versus time (x-axis), with points labeled "Now" and "Hypothetical, Wonderful Future"]

Related Works
Blaze
● http://continuum.io/blog/blaze
Adatao's Distributed DataFrame
● http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
Numba
● http://numba.pydata.org/

Using Sparkling Pandas
You can get Sparkling Pandas from
● Website: http://www.sparklingpandas.com
● Code: https://github.com/sparklingpandas/sparklingpandas
● Mailing List: https://groups.google.com/d/forum/sparklingpandas

Getting Sparkling Pandas friends
The examples from this talk will be merged into master.
Pandas
● http://pandas.pydata.org/ (or pip)
Spark
● http://spark.apache.org/

many pandas by David Goehring
Any questions?
