Big Data Beyond the JVM
With a lot of a Spark focus
Rachel
● My name is Rachel
● Preferred pronouns are she/her
● Data scientist at Salesforce Einstein
● previously at Alpine
● co-author of High Performance Spark
● @warre_n_peace
● Slide share http://www.slideshare.net/rachelbwarren
● Linkedin https://www.linkedin.com/in/rachelbwarren
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC :)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Big data beyond the JVM -  DDTX 2018
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Might know some Apache Spark - might not
● Possibly know some Python or R
● Or are tired of Scala/Java
Lori Erickson
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when problems become too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
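For folks who haven't touched it yet, here's a minimal local PySpark sketch (assumes `pip install pyspark`; the app name and data are purely illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")              # local mode: all cores on this machine
         .appName("beyond-the-jvm")
         .getOrCreate())
sc = spark.sparkContext                   # the classic entry point for RDDs

print(sc.parallelize(range(1000)).sum())  # distributed-ish, even on a laptop
The same code runs unchanged against a real cluster - you just point master at YARN, Mesos, k8s, etc.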
Spark specific terms in this talk
● RDD
○ Resilient Distributed Dataset - like a distributed collection. Supports many of the same operations as Seqs in Scala, but automatically distributed and fault tolerant. Lazily evaluated, and handles faults by recompute. Any* Java or Kryo serializable object.
● DataFrame
○ Spark DataFrame - not a Pandas or R DataFrame. Distributed, supports a limited set of operations. Columnar structured, runtime schema information only. Limited* data types. SQL(ish) API
● Dataset
○ Compile time typed version of DataFrame (generic)
skdevitt
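A tiny sketch of the practical difference (reusing the sc / spark from the sketch above; the data is made up):
rdd = sc.parallelize(["any", "pickle-able", "python", "object"])            # RDD: opaque objects
df = spark.createDataFrame([("pandas", 1), ("arrow", 2)], ["word", "cnt"])  # DataFrame: columns + runtime schema
df.printSchema()
# root
#  |-- word: string (nullable = true)
#  |-- cnt: long (nullable = true)
Datasets - the compile-time typed flavour - only exist on the JVM side, which matters later.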
What will be covered?
● A more detailed look at the current state of PySpark
● Why it isn’t good enough
● Why things are finally changing
● A brief tour of options for non-JVM languages in the Big Data space
● My even less subtle attempts to get you to buy my new book
● Pictures of cats & stuffed animals
● tl;dr - We’ve* made some bad** choices historically, and projects like Arrow &
friends can save us from some of these (yay!)
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
PySpark:
● The Python interface to Spark
● Same general technique used as the basis for the C#, R, Julia, etc. interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less of a Pythonrific API
● Has some serious performance hurdles from the design
Yes, we have wordcount! :p
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile(output)
No data is read or processed until the saveAsTextFile line - that is an "action" which forces Spark to evaluate the RDD.
The map and reduceByKey steps are still combined and executed in one Python executor.
Trish Hamme
A quick detour into PySpark’s internals
Py4J + pickling + JSON
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ Py4J in the driver
○ Pipes to start the Python process from a Java exec
○ cloudpickle to serialize data between the JVM and Python executors (transmitted via sockets)
○ JSON for DataFrame schema
● Data from the Spark worker is serialized and piped to the Python worker --> then piped back to the JVM
○ Multiple iterator-to-iterator transformations are still pipelined :)
○ So serialization happens only once per stage
● Spark SQL (and DataFrames) avoid some of this
kristin klein
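A rough sketch of the cloudpickle half of that story (cloudpickle is on PyPI; the function is made up):
import pickle
import cloudpickle

multiplier = 3
fn = lambda x: x * multiplier      # closures and their captured state survive the trip

payload = cloudpickle.dumps(fn)    # roughly what gets shipped to a Python worker
restored = pickle.loads(payload)   # the worker side only needs plain pickle to load it
print(restored(14))                # 42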
So what does that look like?
Diagram: the Driver (via Py4J) talks to the JVM workers (Worker 1 … Worker K); each worker pipes data to and from its own Python worker process.
So how does that impact PySpark?
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● Spark Features aren’t automatically exposed, but
exposing them is normally simple
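A hedged sketch of the usual workaround for the memory issue: give the container extra non-JVM headroom. The exact key depends on your Spark version (spark.yarn.executor.memoryOverhead before 2.3, spark.executor.memoryOverhead after); the values here are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")
         .config("spark.executor.memoryOverhead", "2g")   # room for the Python workers & buffers
         .getOrCreate())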
Our saviour from serialization: DataFrames
● For the most part keeps data in the JVM
○ Notable exception is UDFs written in Python
● Takes our Python calls and turns them into a query plan if we need more than the native operations in Spark’s DataFrames
● Be wary of Distributed Systems bringing claims of usability….
Andy Blackledge
So what are Spark DataFrames?
● More than SQL tables
● Not Pandas or R DataFrames
● Semi-structured (have schema information)
● Tabular
● Work on expressions as well as lambdas
○ e.g. df.filter(df["happy"] == True) instead of rdd.filter(lambda x: x.happy == True)
● Not a subset of Spark “Datasets” - since the Dataset API isn’t exposed in Python yet :(
Quinn Dombrowski
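A minimal sketch of the expression-vs-lambda split (reusing the spark session from earlier; the "happy" column is made up):
df = spark.createDataFrame([(True,), (False,)], ["happy"])

happy_expr = df.filter(df["happy"] == True)          # expression: stays as a JVM query plan
happy_expr.explain()                                 # Catalyst plan, no Python in the loop

happy_lambda = df.rdd.filter(lambda row: row.happy)  # lambda: rows get pickled out to Python workers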
Word count w/Dataframes
df = sqlCtx.read.load(src)
# Gives us back an RDD (in Spark 2.x you need .rdd to drop out of the DataFrame)
words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double serialization here :(
We can see the difference easily:
Andrew Skudder
*Vendor benchmark. Trust but verify.
*For a small price of your fun libraries. Proof-of-Concept.
That was a bad idea, buuut…..
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● Very early work going on to use Jython for simple UDFs
(e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
Big data beyond the JVM -  DDTX 2018
The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more
improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs
and ways to improve.
○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bugs early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Trust but verify.
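Concretely, the Spark 2.3 Arrow-backed ("vectorized") UDFs from the post above look roughly like this - a hedged sketch, assuming pyarrow is installed and a SparkSession named spark; the data and column names are made up:
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.range(0, 1000).withColumnRenamed("id", "x")   # toy data

@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(v):
    # v is a whole Arrow batch as a pandas Series, not one pickled row at a time
    return v + 1

df.withColumn("x_plus_one", plus_one(df["x"])).show(3)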
Beyond wordcount: dependencies?
● Your machines probably already have pandas
○ But maybe an old version
● But they might not have “special_business_logic”
○ Very special business logic, no one wants to change Fortran code*.
● Option 1: Talk to your vendor**
● Option 2: Try some sketchy open source software from
a hack day
● We’re going to focus on option 2!
*Because it’s perfect, it is Fortran after all.
** I don’t like this option because the vendor I work for doesn’t have an answer.
coffee_boat to the rescue*
# You can tell it's alpha because we're installing from GitHub
!pip install --upgrade git+https://github.com/nteract/coffee_boat.git
# Use the coffee boat
from coffee_boat import Captain
captain = Captain(accept_conda_license=True)
captain.add_pip_packages("pyarrow", "edtf")
captain.launch_ship()
sc = SparkContext(master="yarn")
# You can now use pyarrow & edtf
captain.add_pip_packages("yourmagic")
# You can now use yourmagic in transformations!
Hadoop “streaming” (Python/R)
● Unix pipes!
● Involves a data copy, formats get sad
● But the overhead of a Map/Reduce task is pretty high anyways...
Lisa Larsson
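For reference, the whole mapper contract is just stdin/stdout - a hedged sketch (the matching reducer would read the sorted "word<TAB>1" lines and sum them; the invocation is only roughly right):
#!/usr/bin/env python
# mapper.py - Hadoop Streaming mapper: lines in on stdin, tab-separated
# key/value pairs out on stdout. Ship it with something roughly like:
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper mapper.py -reducer reducer.py -files mapper.py,reducer.py
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))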
Kafka: re-implement all the things
● Multiple options for connecting to Kafka from outside of the JVM (yay!)
● They implement the protocol to talk to Kafka (yay!)
● This involves duplicated client work, and sometimes the clients can be slow
(solution, FFI bindings to C instead of Java)
● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams)
and features depend on client libraries implementing them (easy to slip below
parity)
Smokey Combs
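For example, kafka-python (one of the pure-Python clients) speaks the wire protocol directly, while confluent-kafka takes the FFI route via librdkafka. A hedged sketch - the topic, broker, and group names are made up:
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         group_id="beyond-the-jvm")
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)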
Dask: a new beginning?
● Pure* python implementation
● Provides real enough DataFrame interface for distributed data
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends might make this better with time too, buuut….
● See https://dask.pydata.org/en/latest/ &
http://dask.pydata.org/en/latest/spark.html
Lisa Zins
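A hedged sketch of dask's pandas-ish API (`pip install dask[dataframe]`; the file names and columns are made up):
import dask.dataframe as dd

df = dd.read_csv("logs-*.csv")              # lazily partitioned, a bit like Spark
counts = df.groupby("word")["cnt"].sum()    # builds a task graph, nothing runs yet
print(counts.compute())                     # .compute() actually executes it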
BEAM Beyond the JVM
● Non-JVM BEAM doesn’t work outside of Google’s environment yet, so I’m going to skip the details.
● tl;dr : uses grpc / protobuf
● But exciting new plans to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer!
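For flavour, a hedged word count sketch with the Beam Python SDK (Python 2 only at the time of this talk; the paths are made up, and it runs on the local DirectRunner by default):
import apache_beam as beam   # pip install apache-beam

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText("input.txt")
     | beam.FlatMap(lambda line: line.split())
     | beam.combiners.Count.PerElement()
     | beam.Map(lambda kv: "%s: %d" % kv)
     | beam.io.WriteToText("counts"))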
Why now?
● There’s been better formats/options for a long time
● JVM devs want to use libraries in other languages with lots of data
○ e.g. startup + Deep Learning + ? => profit
● Arrow has solved the chicken-egg problem by building not just the chicken &
the egg, but also a hen house
Andrew Mager
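A hedged sketch of the core idea: one columnar in-memory format that pandas, Spark, and friends can all point at instead of re-serializing row by row (`pip install pyarrow`; the data is made up):
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"word": ["big", "data"], "cnt": [2, 3]})
table = pa.Table.from_pandas(pdf)   # columnar Arrow buffers, cheap to hand across languages
print(table.schema)
round_tripped = table.to_pandas()   # and cheap to get back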
References
● Apache Arrow: https://arrow.apache.org/
● Brian (IBM) on initial Spark + Arrow
https://arrow.apache.org/blog/2017/07/26/spark-arrow/
● Li Jin (Two Sigma) https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
● Bill Maimone
https://blogs.nvidia.com/blog/2017/06/27/gpu-computation-visualization/
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
You can buy it today!
Only one chapter on non-JVM stuff, I’m sorry.
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
And some upcoming talks:
● Feb
○ FOSDEM - One on testing, one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
○ I disappear for a week and pretend computers work
● March
○ Strata San Jose - Big Data Beyond the JVM
■ You can give us feedback so that one rocks more
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Bonus Slides
Maybe you ask a question and we go here :)
We can do that w/Kafka streams..
● Why bother learning from our mistakes?
● Or more seriously, the mistakes weren’t that bad...
Our “special” business logic
def transform(input):
"""
Transforms the supplied input.
"""
return str(len(input))
Pargon
Let’s pretend all the world is a string:
override def transform(value: String): String = {
// WARNING: This may summon Cthulhu
dataOut.writeInt(value.getBytes.size)
dataOut.write(value.getBytes)
dataOut.flush()
val resultSize = dataIn.readInt()
val result = new Array[Byte](resultSize)
dataIn.readFully(result)
// Assume UTF8, what could go wrong? :p
new String(result)
}
From https://github.com/holdenk/kafka-streams-python-cthulhu
Then make an instance to use it...
val testFuncFile =
"kafka_streams_python_cthulhu/strlen.py"
stream.transformValues(
PythonStringValueTransformerSupplier(testFuncFile))
// Or we could wrap this in the bridge but thats effort.
From https://github.com/holdenk/kafka-streams-python-cthulhu
Let’s pretend all the world is a string:
def main(socket):
while (True):
input_length = _read_int(socket)
data = socket.read(input_length)
result = transform(data)
resultBytes = result.encode()
_write_int(len(resultBytes), socket)
socket.write(resultBytes)
socket.flush()
From https://github.com/holdenk/kafka-streams-python-cthulhu
What does that let us do?
● You can add a map stage with your data scientists
Python code in the middle
● You’re limited to strings*
● Still missing the “driver side” integration (e.g. the
interface requires someone to make a Scala class at
some point)
What about things other than strings?
Use another system
● Like Spark! (oh wait) or BEAM* or FLINK*?
Write it in a format Python can understand:
● Pickling (from Java)
● JSON
● XML
Purely Python solutions
● Currently roll-your-own (but not that bad)
*These are also JVM based solutions calling into Python. I’m not saying they will also summon Cuthulhu, I’m just saying hang onto

What's hot (20)

PDF
Contributing to Apache Spark 3
Holden Karau
 
PDF
Spark Autotuning Talk - Strata New York
Holden Karau
 
PDF
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Holden Karau
 
PDF
Powering tensor flow with big data using apache beam, flink, and spark cern...
Holden Karau
 
PDF
PySpark on Kubernetes @ Python Barcelona March Meetup
Holden Karau
 
PDF
Using Spark ML on Spark Errors - What do the clusters tell us?
Holden Karau
 
PDF
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Holden Karau
 
PDF
Inside the JVM - Follow the white rabbit!
Sylvain Wallez
 
KEY
Mashups with Drupal and QueryPath
Matt Butcher
 
PPTX
Mastering Java Bytecode - JAX.de 2012
Anton Arhipov
 
PDF
Unbreaking Your Django Application
OSCON Byrum
 
PDF
The things we don't see – stories of Software, Scala and Akka
Konrad Malawski
 
PDF
Not Only Streams for Akademia JLabs
Konrad Malawski
 
PDF
Async await...oh wait!
Thomas Pierrain
 
PDF
Need for Async: Hot pursuit for scalable applications
Konrad Malawski
 
PDF
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
Konrad Malawski
 
PDF
Great Tools Heavily Used In Japan, You Don't Know.
Junichi Ishida
 
PDF
Dart Workshop
Dmitry Buzdin
 
PDF
Realtime Apps with Django
Renyi Khor
 
PPTX
Burp plugin development for java n00bs (44 con)
Marc Wickenden
 
Contributing to Apache Spark 3
Holden Karau
 
Spark Autotuning Talk - Strata New York
Holden Karau
 
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Holden Karau
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Holden Karau
 
PySpark on Kubernetes @ Python Barcelona March Meetup
Holden Karau
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Holden Karau
 
Inside the JVM - Follow the white rabbit!
Sylvain Wallez
 
Mashups with Drupal and QueryPath
Matt Butcher
 
Mastering Java Bytecode - JAX.de 2012
Anton Arhipov
 
Unbreaking Your Django Application
OSCON Byrum
 
The things we don't see – stories of Software, Scala and Akka
Konrad Malawski
 
Not Only Streams for Akademia JLabs
Konrad Malawski
 
Async await...oh wait!
Thomas Pierrain
 
Need for Async: Hot pursuit for scalable applications
Konrad Malawski
 
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
Konrad Malawski
 
Great Tools Heavily Used In Japan, You Don't Know.
Junichi Ishida
 
Dart Workshop
Dmitry Buzdin
 
Realtime Apps with Django
Renyi Khor
 
Burp plugin development for java n00bs (44 con)
Marc Wickenden
 

Similar to Big data beyond the JVM - DDTX 2018 (20)

PDF
Accelerating Big Data beyond the JVM - Fosdem 2018
Holden Karau
 
PDF
Making the big data ecosystem work together with python apache arrow, spark,...
Holden Karau
 
PDF
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Holden Karau
 
PDF
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
PDF
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Holden Karau
 
PDF
Are general purpose big data systems eating the world?
Holden Karau
 
PDF
Getting The Best Performance With PySpark
Spark Summit
 
PDF
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Holden Karau
 
PDF
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
PDF
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
confluent
 
PDF
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
PDF
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Holden Karau
 
PDF
Introduction to Spark with Python
Gokhan Atil
 
PDF
Getting the best performance with PySpark - Spark Summit West 2016
Holden Karau
 
PDF
Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa
 
PPTX
Simplifying training deep and serving learning models with big data in python...
Holden Karau
 
PPTX
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
Holden Karau
 
PDF
Apache Spark Tutorial
Ahmet Bulut
 
PDF
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
PDF
How does that PySpark thing work? And why Arrow makes it faster?
Rubén Berenguel
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Holden Karau
 
Making the big data ecosystem work together with python apache arrow, spark,...
Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Holden Karau
 
Are general purpose big data systems eating the world?
Holden Karau
 
Getting The Best Performance With PySpark
Spark Summit
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Holden Karau
 
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
confluent
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Holden Karau
 
Introduction to Spark with Python
Gokhan Atil
 
Getting the best performance with PySpark - Spark Summit West 2016
Holden Karau
 
Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa
 
Simplifying training deep and serving learning models with big data in python...
Holden Karau
 
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
Holden Karau
 
Apache Spark Tutorial
Ahmet Bulut
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
How does that PySpark thing work? And why Arrow makes it faster?
Rubén Berenguel
 
Ad

Recently uploaded (20)

PPT
introductio to computers by arthur janry
RamananMuthukrishnan
 
PPTX
Random Presentation By Fuhran Khalil uio
maniieiish
 
PPTX
Cost_of_Quality_Presentation_Software_Engineering.pptx
farispalayi
 
PPT
Computer Securityyyyyyyy - Chapter 2.ppt
SolomonSB
 
PPTX
ipv6 very very very very vvoverview.pptx
eyala75
 
PPTX
PE introd.pptxfrgfgfdgfdgfgrtretrt44t444
nepmithibai2024
 
PPTX
internet básico presentacion es una red global
70965857
 
PDF
Internet Governance and its role in Global economy presentation By Shreedeep ...
Shreedeep Rayamajhi
 
PPTX
Research Design - Report on seminar in thesis writing. PPTX
arvielobos1
 
PDF
Pas45789-Energs-Efficient-Craigg1ing.pdf
lafinedelcinghiale
 
PDF
DevOps Design for different deployment options
henrymails
 
PDF
𝐁𝐔𝐊𝐓𝐈 𝐊𝐄𝐌𝐄𝐍𝐀𝐍𝐆𝐀𝐍 𝐊𝐈𝐏𝐄𝐑𝟒𝐃 𝐇𝐀𝐑𝐈 𝐈𝐍𝐈 𝟐𝟎𝟐𝟓
hokimamad0
 
PPTX
ONLINE BIRTH CERTIFICATE APPLICATION SYSYTEM PPT.pptx
ShyamasreeDutta
 
PDF
The-Hidden-Dangers-of-Skipping-Penetration-Testing.pdf.pdf
naksh4thra
 
PDF
123546568reb2024-Linux-remote-logging.pdf
lafinedelcinghiale
 
PPTX
英国学位证(RCM毕业证书)皇家音乐学院毕业证书如何办理
Taqyea
 
PDF
Digital Security in 2025 with Adut Angelina
The ClarityDesk
 
PPTX
ZARA-Case.pptx djdkkdjnddkdoodkdxjidjdnhdjjdjx
RonnelPineda2
 
PPT
Agilent Optoelectronic Solutions for Mobile Application
andreashenniger2
 
PDF
Apple_Environmental_Progress_Report_2025.pdf
yiukwong
 
introductio to computers by arthur janry
RamananMuthukrishnan
 
Random Presentation By Fuhran Khalil uio
maniieiish
 
Cost_of_Quality_Presentation_Software_Engineering.pptx
farispalayi
 
Computer Securityyyyyyyy - Chapter 2.ppt
SolomonSB
 
ipv6 very very very very vvoverview.pptx
eyala75
 
PE introd.pptxfrgfgfdgfdgfgrtretrt44t444
nepmithibai2024
 
internet básico presentacion es una red global
70965857
 
Internet Governance and its role in Global economy presentation By Shreedeep ...
Shreedeep Rayamajhi
 
Research Design - Report on seminar in thesis writing. PPTX
arvielobos1
 
Pas45789-Energs-Efficient-Craigg1ing.pdf
lafinedelcinghiale
 
DevOps Design for different deployment options
henrymails
 
𝐁𝐔𝐊𝐓𝐈 𝐊𝐄𝐌𝐄𝐍𝐀𝐍𝐆𝐀𝐍 𝐊𝐈𝐏𝐄𝐑𝟒𝐃 𝐇𝐀𝐑𝐈 𝐈𝐍𝐈 𝟐𝟎𝟐𝟓
hokimamad0
 
ONLINE BIRTH CERTIFICATE APPLICATION SYSYTEM PPT.pptx
ShyamasreeDutta
 
The-Hidden-Dangers-of-Skipping-Penetration-Testing.pdf.pdf
naksh4thra
 
123546568reb2024-Linux-remote-logging.pdf
lafinedelcinghiale
 
英国学位证(RCM毕业证书)皇家音乐学院毕业证书如何办理
Taqyea
 
Digital Security in 2025 with Adut Angelina
The ClarityDesk
 
ZARA-Case.pptx djdkkdjnddkdoodkdxjidjdnhdjjdjx
RonnelPineda2
 
Agilent Optoelectronic Solutions for Mobile Application
andreashenniger2
 
Apple_Environmental_Progress_Report_2025.pdf
yiukwong
 
Ad

Big data beyond the JVM - DDTX 2018

  • 1. Big Data Beyond the JVM With a lot of a Spark focus
  • 2. Rachel ● My name is Rachel ● Prefered pronouns are she/her ● Data scientist at Salesforce Einstein ● previously at Alpine ● co-author of High Performance Spark ● @warre_n_peace ● Slide share http://www.slideshare.net/rachelbwarren ● Linkedin https://www.linkedin.com/in/rachelbwarren
  • 3. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC :) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 5. Who is Boo? ● Boo uses she/her pronouns (as I told the Texas house committee) ● Best doge ● Lot’s of experience barking at computers to make them go faster ● Author of “Learning to Bark” & “High Performance Barking” ● On twitter @BooProgrammer
  • 6. Who I think you wonderful humans are? ● Nice enough people ● Don’t mind pictures of cats ● Might know some Apache Spark - might not ● Possibly know some Python or R ● Or are tired of scala/ java Lori Erickson
  • 7. What is Spark? ● General purpose distributed system ○ With a really nice API including Python :) ● Apache project (one of the most active) ● Must faster than Hadoop Map/Reduce ● Good when problems become too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 8. Spark specific terms in this talk ● RDD ○ Resilient Distributed Dataset - Like a distributed collection. Supports many of the same operations as Seq’s in Scala but automatically distributed and fault tolerant. Lazily evaluated, and handles faults by recompute. Any* Java or Kyro serializable object. ● DataFrame ○ Spark DataFrame - not a Pandas or R DataFrame. Distributed, supports a limited set of operations. Columnar structured, runtime schema information only. Limited* data types. Sql (ish) API ● Dataset ○ Compile time typed version of DataFrame (generic) skdevitt
  • 9. What will be covered? ● A more detailed look at the current state of PySpark ● Why it isn’t good enough ● Why things are finally changing ● A brief tour of options for non-JVM languages in the Big Data space ● My even less subtle attempts to get you to buy my new book ● Pictures of cats & stuffed animals ● tl;dr - We’ve* made some bad** choices historically, and projects like Arrow & friends can save us from some of these (yay!)
  • 10. What’s the state of non-JVM big data? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  • 11. PySpark: ● The Python interface to Spark ● Same general technique used as the bases for the C#, R, Julia, etc. interfaces to Spark ● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API ● Has some serious performance hurdles from the design
  • 12. Yes, we have wordcount! :p lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile(output) No data is read or processed until after this line This is an “action” which forces spark to evaluate the RDD These are still combined and executed in one python executor Trish Hamme
  • 13. A quick detour into PySpark’s internals + + JSON
  • 14. Spark in Scala, how does PySpark work? ● Py4J + pickling + JSON and magic ○ Py4j in the driver ○ Pipes to start python process from java exec ○ cloudPickle to serialize data between JVM and python executors (transmitted via sockets) ○ Json for dataframe schema ● Data from Spark worker serialized and piped to Python worker --> then piped back to jvm ○ Multiple iterator-to-iterator transformations are still pipelined :) ○ So serialization happens only once per stage ● Spark SQL (and DataFrames) avoid some of this kristin klein
  • 15. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 16. So how does that impact PySpark? ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● Spark Features aren’t automatically exposed, but exposing them is normally simple
  • 17. Our saviour from serialization: DataFrames ● For the most part keeps data in the JVM ○ Notable exception is UDFs written in Python ● Takes our python calls and turns it into a query plan if we need more than the native operations in Spark’s DataFrames ● be wary of Distributed Systems bringing claims of usability…. Andy Blackledge
  • 18. So what are Spark DataFrames? ● More than SQL tables ● Not Pandas or R DataFrames ● Semi-structured (have schema information) ● tabular ● work on expression as well as lambdas ○ e.g. df.filter(df.col(“happy”) == true) instead of rdd.filter(lambda x: x.happy == true)) ● Not a subset of Spark “Datasets” - since Dataset API isn’t exposed in Python yet :( Quinn Dombrowski
  • 19. Word count w/Dataframes df = sqlCtx.read.load(src) # Returns an RDD words = df.select("text").flatMap(lambda x: x.text.split(" ")) words_df = words.map( lambda x: Row(word=x, cnt=1)).toDF() word_count = words_df.groupBy("word").sum() word_count.write.format("parquet").save("wc.parquet") Still have the double serialization here :(
  • 20. We can see the difference easily: Andrew Skudder * *Vendor benchmark. Trust but verify.
  • 21. *For a small price of your fun libraries. Proof-of-Concept.
  • 22. That was a bad idea, buuut….. ● Work going on in Scala land to translate simple Scala into SQL expressions - need the Dataset API ○ Maybe we can try similar approaches with Python? ● Very early work going on to use Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369 ○ Early benchmarking w/word count 5% slower than native Scala UDF, close to 2x faster than regular Python ● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF *The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so its ok!
  • 24. The “future”*: faster interchange ● By future I mean availability starting in the next 3-6 months (with more improvements after). ○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs and ways to improve. ○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bug early! ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 25. Andrew Skudder *Arrow: likely the future. I really hope so. Spark 2.3 and beyond! * *
  • 26. What does the future look like?* *Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html. *Vendor benchmark. Trust but verify.
  • 27. Beyond wordcount: depencies? ● Your machines probably already have pandas ○ But maybe an old version ● But they might not have “special_business_logic” ○ Very special business logic, no one wants change fortran code*. ● Option 1: Talk to your vendor** ● Option 2: Try some sketchy open source software from a hack day ● We’re going to focus on option 2! *Because it’s perfect, it is fortran after all. ** I don’t like this option because the vendor I work for doesn’t have an answer.
  • 28. coffee_boat to the rescue* # You can tell it's alpha cause were installing from github !pip install --upgrade git+https://github.com/nteract/coffee_boat.git # Use the coffee boat from coffee_boat import Captain captain = Captain(accept_conda_license=True) captain.add_pip_packages("pyarrow", "edtf") captain.launch_ship() sc = SparkContext(master="yarn") # You can now use pyarrow & edtf captain.add_pip_packages("yourmagic") # You can now use yourmagic in transformations!
  • 29. Hadoop “streaming” (Python/R) ● Unix pipes! ● Involves a data copy, formats get sad ● But the overhead of a Map/Reduce task is pretty high anyways... Lisa Larsson
  • 30. Kafka: re-implement all the things ● Multiple options for connecting to Kafka from outside of the JVM (yay!) ● They implement the protocol to talk to Kafka (yay!) ● This involves duplicated client work, and sometimes the clients can be slow (solution, FFI bindings to C instead of Java) ● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams) and features depend on client libraries implementing them (easy to slip below parity) Smokey Combs
  • 31. Dask: a new beginning? ● Pure* python implementation ● Provides real enough DataFrame interface for distributed data ● Also your standard-ish distributed collections ● Multiple backends ● Primary challenge: interacting with the rest of the big data ecosystem ○ Arrow & friends might make this better with time too, buuut…. ● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html Lisa Zins
  • 32. BEAM Beyond the JVM ● Non JVM BEAM doesn’t work outside of Google’s environment yet, so I’m going to skip the details. ● tl;dr : uses grpc / protobuf ● But exciting new plans to unify the runners and ease the support of different languages (called SDKS) ○ See https://beam.apache.org/contribute/portability/ ● If this is exciting, you can come join me on making BEAM work in Python3 ○ Yes we still don’t have that :( ○ But we're getting closer!
  • 33. Why now? ● There’s been better formats/options for a long time ● JVM devs want to use libraries in other languages with lots of data ○ e.g. startup + Deep Learning + ? => profit ● Arrow has solved the chicken-egg problem by building not just the chicken & the egg, but also a hen house Andrew Mager
  • 34. References ● Apache Arrow: https://arrow.apache.org/ ● Brian (IBM) on initial Spark + Arrow https://arrow.apache.org/blog/2017/07/26/spark-arrow/ ● Li Jin (two sigma) https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspar k.html ● Bill Maimone https://blogs.nvidia.com/blog/2017/06/27/gpu-computation-visualization/
  • 35. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance SparkLearning PySpark
  • 36. High Performance Spark! You can buy it today! Only one chapter on non-JVM stuff, I’m sorry. Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 37. Spark Videos ● Apache Spark Youtube Channel ● My Spark videos on YouTube - ○ http://bit.ly/holdenSparkVideos ● Spark Summit 2014 training ● Paco’s Introduction to Apache Spark
  • 38. And some upcoming talks: ● Feb ○ FOSDEM - One on testing one on scaling ○ JFokus in Stockholm - Adding deep learning to Spark ○ I disappear for a week and pretend computers work ● March ○ Strata San Jose - Big Data Beyond the JVM ■ You can give us feedback so that one rocks more
  • 39. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark I need to give a testing talk next month, help a “friend” out. Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF Pssst: Have feedback on the presentation? Give me a shout ([email protected]) if you feel comfortable doing so :)
  • 40. Bonus Slides Maybe you ask a question and we go here :)
  • 41. We can do that w/Kafka streams.. ● Why bother learning from our mistakes? ● Or more seriously, the mistakes weren’t that bad...
  • 42. Our “special” business logic def transform(input): """ Transforms the supplied input. """ return str(len(input)) Pargon
  • 43. Let’s pretend all the world is a string: override def transform(value: String): String = { // WARNING: This may summon cuthuluhu dataOut.writeInt(value.getBytes.size) dataOut.write(value.getBytes) dataOut.flush() val resultSize = dataIn.readInt() val result = new Array[Byte](resultSize) dataIn.readFully(result) // Assume UTF8, what could go wrong? :p new String(result) } From https://github.com/holdenk/kafka-streams-python-cthulhu
  • 44. Then make an instance to use it... val testFuncFile = "kafka_streams_python_cthulhu/strlen.py" stream.transformValues( PythonStringValueTransformerSupplier(testFuncFile)) // Or we could wrap this in the bridge but thats effort. From https://github.com/holdenk/kafka-streams-python-cthulhu
  • 45. Let’s pretend all the world is a string: def main(socket): while (True): input_length = _read_int(socket) data = socket.read(input_length) result = transform(data) resultBytes = result.encode() _write_int(len(resultBytes), socket) socket.write(resultBytes) socket.flush() From https://github.com/holdenk/kafka-streams-python-cthulhu
  • 46. What does that let us do? ● You can add a map stage with your data scientists Python code in the middle ● You’re limited to strings* ● Still missing the “driver side” integration (e.g. the interface requires someone to make a Scala class at some point)
  • 47. What about things other than strings? Use another system ● Like Spark! (oh wait) or BEAM* or FLINK*? Write it in a format Python can understand: ● Pickling (from Java) ● JSON ● XML Purely Python solutions ● Currently roll-your-own (but not that bad) *These are also JVM based solutions calling into Python. I’m not saying they will also summon Cuthulhu, I’m just saying hang onto