Open Data Science Conference 2015 – Douglas Eisenstein of Advanti
May, 2015
Douglas Eisenstein - Advanti
Stanislav Seltser - Advanti
BOSTON 2015
@opendatasci
OPEN DATA SCIENCE CONFERENCE
Spark, Python, and Parquet
Learn How to Use Spark, Python, and Parquet for
Loading and Transforming Data in 45 Minutes
Agenda
Use Case Background
What are Spark and Parquet?
Demo: code + data = fun part
Questions
TAKEAWAYS: WHAT WILL YOU LEARN TODAY?
Learn new technologies and be motivated to start using them
In about the same time it will take you to mow your lawn…
Takeaways: you will learn …
1.  What are Spark and Parquet, and why consider them?
2.  A live demo showing you 5 useful transforms in Spark
3.  Instructions for DIY cluster: Spark, Hadoop, Hive, Parquet, and C*
4.  “Take home” fund holdings open dataset from Morningstar
USE CASE: DATASET, TRADEOFFS, SOLUTION
What’s our use case? Why choose Spark?
Holdings Data
“The contents of an investment portfolio held by an individual or entity such as a mutual
fund or pension fund.” – Investopedia 2015
Sources:
o  3rd Parties: Morningstar, Lipper, FactSet, Bloomberg, Thomson Reuters
o  Internal: Specialized portfolio accounting systems per asset type
o  Public: SEC 13F filings when AUM is >= $100M, includes hedge funds
Use Case: Challenges
1.  Millions of files: reshape data, pre-aggregate holdings, disambiguate securities, unstack/stack data
2.  Create wide tables (think columnar) for storing time series data about 1M instruments over 12 months for 381k funds
3.  Rules are abstracted to work on “holdings”, making them vendor-agnostic and reusable (13Fs, 3rd parties, proprietary, etc.)
4.  All sorts of “data usability” issues: missing positions, missing identifiers, sparse derivative data, irregular fund reporting, etc.
5.  Need to create your own “ownership views” by asset class and period (e.g. Fixed Income Ownership View)
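The reshaping named in challenges 1 and 2 can be pictured with a small sketch in plain Python (standard library only, not Spark). The field names `fund_id`, `period`, `security_id`, and `market_value` and the sample rows are hypothetical and are not taken from the Morningstar dataset. The sketch first pre-aggregates duplicate positions, then unstacks periods into columns to form a wide table.

```python
from collections import defaultdict

# Hypothetical long-format holdings rows:
# (fund_id, period, security_id, market_value)
holdings = [
    ("F1", "2014-01", "AAPL", 100.0),
    ("F1", "2014-01", "AAPL", 50.0),   # duplicate position to pre-aggregate
    ("F1", "2014-02", "AAPL", 175.0),
    ("F2", "2014-01", "AAPL", 20.0),
    ("F2", "2014-02", "MSFT", 40.0),
]

def pre_aggregate(rows):
    """Sum market value per (fund, period, security)."""
    totals = defaultdict(float)
    for fund, period, security, value in rows:
        totals[(fund, period, security)] += value
    return dict(totals)

def unstack_by_period(totals):
    """Pivot to a wide table: one row per (fund, security), one column per period."""
    periods = sorted({period for _, period, _ in totals})
    # None marks a period in which the fund reported no position.
    wide = defaultdict(lambda: dict.fromkeys(periods))
    for (fund, period, security), value in totals.items():
        wide[(fund, security)][period] = value
    return periods, dict(wide)

periods, wide = unstack_by_period(pre_aggregate(holdings))
```

The `None` cells in the wide table are where the “data usability” issues from challenge 4 (missing positions, irregular reporting) would surface.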
Open Dataset Description
Key Points
o  Holdings are electively contributed for fresher ranks
o  Long and Short positions included, unique…
o  Datasets: free open data, trial data, paid licensed data
o  Our dataset starts in 2014-01 across all fund types
Descriptive Statistics
o  History starts in 2000+
o  Fund types: Open, Closed, ETFs, SMAs
o  381k funds since 2014
o  1M unique securities since 2014
o  82 legal types, e.g. Corporate Bond or Equity Swap
Open Dataset: open-ended funds for report period 2014-06-10
Common Usages
1.  Ownership: Top N largest long/short positions per asset class
2.  Flow: Compute holdings flow across various financial assets
3.  Comparison: Peer group analysis through holdings attribution
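As an illustration of the first usage (top N largest long/short positions per asset class), here is a minimal sketch in plain Python. The tuple layout, the sample securities, and the convention that a negative market value means a short position are all assumptions made for the example, not properties of the dataset.

```python
import heapq
from collections import defaultdict

# Hypothetical positions: (asset_class, security_id, market_value).
# Assumption: a negative market value denotes a short position.
positions = [
    ("Equity", "AAPL", 500.0),
    ("Equity", "MSFT", 300.0),
    ("Equity", "TSLA", -200.0),
    ("Fixed Income", "UST10Y", 900.0),
    ("Fixed Income", "CORP_A", 150.0),
    ("Fixed Income", "CORP_B", -400.0),
]

def top_n_by_asset_class(rows, n, side="long"):
    """Return the n largest long (or short) positions within each asset class."""
    by_class = defaultdict(list)
    for asset_class, security, value in rows:
        if side == "long" and value > 0:
            by_class[asset_class].append((security, value))
        elif side == "short" and value < 0:
            by_class[asset_class].append((security, value))
    # Largest longs are the biggest values; largest shorts are the most negative.
    pick = heapq.nlargest if side == "long" else heapq.nsmallest
    return {
        asset_class: pick(n, items, key=lambda item: item[1])
        for asset_class, items in by_class.items()
    }

top_longs = top_n_by_asset_class(positions, n=1, side="long")
top_shorts = top_n_by_asset_class(positions, n=1, side="short")
```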
What’s the best part about starting a new project? You get to “play”
with new tech toys, right? But which ones…?
Data Processing Landscape
Too many choices
Modern General Purpose Data Pipeline
Why:
1.  Spark for in-memory transforms/analytics and fast exploration
2.  Parquet for fast columnar data retrieval
3.  Python as the glue layer and to re-use data transforms
Data Pipeline:
Granular Options
Remember… Technology moves quickly!
SPARK: WHAT IS IT?
Everyone hears about it, but what is Spark?
If you do any Googling, don’t search for Spark 1.4
What is Spark?
Highlights
o  General purpose data processing engine
o  In-memory data persistence
o  Interactive shells in Scala and Python
Key Concepts
o  RDD (Resilient Distributed Dataset): collections of objects stored in RAM or on disk
o  Transforms: map(), filter(), reduceByKey(), join()
o  Actions: count(), collect(), save()
Spark SQL
o  Runs SQL or HQL queries
o  In-line SQL UDF
o  Distributed DataFrame
o  Parquet, Hive, Cassandra
Graphic sourced: https://goo.gl/fpLrSs
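To make the transform/action split concrete, the following sketch mimics the semantics in plain Python; it is not Spark code and runs on a single machine. Generators stand in for lazy transforms such as `map()` and `filter()`, and a hypothetical helper `reduce_by_key` stands in for `reduceByKey()`. The sample records are invented.

```python
# Hypothetical (asset_class, market_value) records.
records = [
    ("Equity", 500.0),
    ("Fixed Income", 900.0),
    ("Equity", -200.0),
    ("Equity", 300.0),
]

# "Transforms" are lazy: these generators only describe the work.
mapped = ((asset_class, abs(value)) for asset_class, value in records)  # like map()
filtered = (pair for pair in mapped if pair[1] >= 300.0)                # like filter()

def reduce_by_key(pairs, func):
    """Combine values that share a key, in the spirit of reduceByKey().

    In Spark, reduceByKey() is itself lazy; this helper is eager for brevity,
    so here it also plays the role of the action that forces evaluation.
    """
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

# Nothing above has touched the data until this call consumes the generators.
totals = reduce_by_key(filtered, lambda a, b: a + b)
```

In PySpark the analogous chain would use `rdd.map(...)`, `rdd.filter(...)`, and `rdd.reduceByKey(...)`, with an action such as `collect()` triggering the computation.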
  
What is a Spark DataFrame?
What’s a DataFrame?
•  A collection of rows organized into named columns
•  Expressive language for wrangling data into something usable
•  Ability to select, filter, and aggregate structured data
What’s a Spark DataFrame?
•  Distributed collection of data organized into named columns
•  Conceptually equivalent to a table in a relational database
•  Relational data processing: project, filter, aggregate, join, etc
•  Operations: groupBy(), join(), sql(), distinct()
•  Create UDFs and push them into Spark SQL
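The relational operations listed above (project, filter, join) can be sketched on ordinary Python lists of dicts. This is only a single-machine picture of what a DataFrame expresses; the helper names `project`, `where`, and `inner_join`, the column names, and the sample rows are all hypothetical.

```python
# Hypothetical tables represented as lists of row dicts.
funds = [
    {"fund_id": "F1", "fund_type": "Open"},
    {"fund_id": "F2", "fund_type": "ETF"},
]
holdings = [
    {"fund_id": "F1", "security_id": "AAPL", "market_value": 150.0},
    {"fund_id": "F1", "security_id": "MSFT", "market_value": 80.0},
    {"fund_id": "F2", "security_id": "AAPL", "market_value": 20.0},
]

def project(rows, columns):
    """Keep only the named columns (relational projection)."""
    return [{c: row[c] for c in columns} for row in rows]

def where(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]

def inner_join(left, right, key):
    """Hash join: index the right side by key, then probe it with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

joined = inner_join(holdings, funds, "fund_id")
large = where(joined, lambda row: row["market_value"] >= 50.0)
result = project(large, ["fund_type", "security_id"])
```

A Spark DataFrame expresses the same steps declaratively and lets the engine distribute and optimize them.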
Distributed Data Aggregation
o  Distributed high-performance data aggregation, once available only through expensive appliances: Netezza, Greenplum, Exadata, Teradata, etc.
o  A one-line SUM() compared between Pandas (standalone) and Spark (distributed)
[Figure: Pandas DataFrame vs. Spark DataFrame]
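The one-liners from the slide graphic are not recoverable from the text, so the sketch below is a stand-in rather than the presenter's code. It shows, in plain Python, why a grouped SUM() distributes well: each partition computes partial sums independently and the partials are then merged. The column names, the sample rows, and the round-robin partitioning are assumptions. For orientation, the pandas form of such a sum is `df.groupby("asset_class")["market_value"].sum()` and the Spark DataFrame form is `df.groupBy("asset_class").sum("market_value")`.

```python
from collections import defaultdict

# Hypothetical (asset_class, market_value) rows.
rows = [
    ("Equity", 500.0),
    ("Fixed Income", 900.0),
    ("Equity", 300.0),
    ("Fixed Income", 150.0),
    ("Equity", 200.0),
]

def partition(data, n):
    """Split rows round-robin into n partitions (a stand-in for cluster nodes)."""
    return [data[i::n] for i in range(n)]

def partial_sum(part):
    """Grouped sum over a single partition."""
    sums = defaultdict(float)
    for key, value in part:
        sums[key] += value
    return dict(sums)

def merge(partials):
    """Combine per-partition sums into the final grouped sum."""
    total = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)

standalone = partial_sum(rows)                                   # one machine, one pass
distributed = merge(partial_sum(p) for p in partition(rows, 2))  # partials, then merge
```

Because addition is associative, both paths give the same answer regardless of how the rows are partitioned.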
PARQUET: WHAT IS IT?
No, it’s not your Grandma’s flooring…
What is Parquet?
“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model or programming language.” –
parquet.apache.org
Features
•  Columnar File Format
•  Supports Nested Data Structures
•  Not tied to any commercial framework
•  Accessible by Hive, Spark, Pig, Drill, MapReduce
•  Read/write in HDFS or local file system
•  Gaining Strong Usage…
[Figures: Parquet File Format; Parquet in HDFS]
What is Columnar Data?
•  Limits I/O to the data needed
•  Columnar compresses better
•  Type specific encodings available
•  Enables vectorized execution engines
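A small plain-Python sketch can show why a columnar layout limits I/O and compresses well. It is an illustration only and does not read or write the Parquet format; the column names, the rows, and the helper `run_length_encode` are hypothetical. Run-length encoding is used here as one simple example of an encoding that benefits from storing a column's values together.

```python
# Hypothetical row-oriented records.
rows = [
    ("F1", "Equity", 150.0),
    ("F1", "Equity", 80.0),
    ("F2", "Equity", 20.0),
    ("F2", "Bond", 40.0),
]
column_names = ("fund_id", "asset_class", "market_value")

# Row layout -> column layout: one contiguous list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}

def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]

# Repetitive columns shrink once their values sit next to each other.
rle_asset_class = run_length_encode(columns["asset_class"])

# An aggregate over one column touches only that column's values.
total_value = sum(columns["market_value"])
```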
How is Columnar Data Read?
[Figure: reading a table 50 columns wide; the projection selects 9 columns and the predicate selects rows]
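The idea of reading only the projected columns and only the rows that match a predicate can be sketched in plain Python. The structure below (row groups that carry per-column min/max statistics) is a simplified, hypothetical model used for illustration; it is not the Parquet file layout itself, and `make_row_group` and `scan` are invented helpers.

```python
def make_row_group(columns):
    """Store a block of rows column by column, with min/max stats per column."""
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return {"columns": columns, "stats": stats}

# Hypothetical data split into two row groups.
row_groups = [
    make_row_group({
        "period": ["2014-01", "2014-02"],
        "fund_id": ["F1", "F2"],
        "market_value": [150.0, 20.0],
    }),
    make_row_group({
        "period": ["2014-06", "2014-06"],
        "fund_id": ["F1", "F3"],
        "market_value": [175.0, 60.0],
    }),
]

def scan(groups, projection, predicate_column, wanted):
    """Return projected columns for rows where predicate_column == wanted."""
    out = []
    groups_read = 0
    for group in groups:
        low, high = group["stats"][predicate_column]
        if not (low <= wanted <= high):
            continue  # predicate: skip the whole row group using stats alone
        groups_read += 1
        matches = [value == wanted for value in group["columns"][predicate_column]]
        for i, keep in enumerate(matches):
            if keep:
                # projection: only the requested columns are ever touched
                out.append(tuple(group["columns"][c][i] for c in projection))
    return out, groups_read

result, groups_read = scan(row_groups, ("fund_id", "market_value"), "period", "2014-06")
```

Here the first row group is skipped without reading any of its column values, which is the saving the slide's projection/predicate picture points at.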
Spark DataFrame → Parquet
DEMO: HIVE, PARQUET, SPARK SQL, DATAFRAME
Prepare CSVs in Hive, persist them in Parquet, and show Spark SQL and
DataFrame transforms using the interactive PySpark shell
Spark SQL is Experimental
Demo Outline
OBSTACLES: NOTHING IS EASY
Expect workarounds and time spent hacking, but consider it
a learning opportunity…
Nothing is easy – unless you’re Ricky Bobby – “Shake and Bake”
Obstacles
Spark
•  Don’t expect Spark DataFrames to be as convenient as Pandas (not yet, at least); e.g. no upsampling, backfilling, etc.
•  RDDs are powerful, but you’ll need to get your hands dirty writing lower-level code (not a bad thing)
•  Allocate enough memory on all of your nodes; Spark is a memory hog!
Parquet
•  Date and binary support are pending, although timestamp, decimal, char, and varchar are now supported in Hive 0.14.0
•  You won’t get Vertica- or Cassandra-level response times (maybe in the future; that’s my speculation)
Getting Started
•  Learn by doing: reading is useful to gain the basics, but just “jump right in and do it”
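As an example of the lower-level code the Spark obstacles refer to, here is a minimal hand-written upsample and backfill in plain Python. It is a sketch under stated assumptions: periods are "YYYY-MM" strings, `month_range` and `backfill` are hypothetical helpers, and the reported values are invented. In a real job, logic like this would have to be applied per fund inside an RDD or DataFrame operation.

```python
def month_range(start, end):
    """List every 'YYYY-MM' period from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months

def backfill(values):
    """Replace each None with the next non-missing value, scanning from the end."""
    filled = list(values)
    next_valid = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_valid
        else:
            next_valid = filled[i]
    return filled

# Hypothetical fund that reported irregularly.
reported = {"2014-11": 100.0, "2015-02": 130.0}

months = month_range("2014-11", "2015-02")
upsampled = [reported.get(m) for m in months]  # None where nothing was reported
backfilled = backfill(upsampled)
```

A trailing gap with no later observation stays `None`, which is the same behavior as a backward fill in Pandas.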
QUESTIONS
Douglas Eisenstein
@dougeisenstein
doug.eisenstein@advantisolutions.com
Helping to create fast, reliable, and transparent modern data pipelines for financial analytics
Talk feedback: https://goo.gl/Y0UuWy
APPENDIX
Using Spark, Python, and Cassandra for Loading and
Transformations at Scale
Resources
http://training.databricks.com/workshop/itas_workshop.pdf (Intro to Spark PDF)
https://databricks-training.s3.amazonaws.com/slides/SparkSQLTraining.Summit.July2014.pdf (Spark SQL Training Module)
http://training.databricks.com/workshop/itas_workshop.pdf (Spark 101)
http://spark-summit.org/2014/training (General Spark Videos)
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html (Spark SQL and DataFrames, awesome)
http://www.slideshare.net/EvanChan2/2014-07olapcassspark (OLAP with Cassandra and Spark)
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html (Latest Spark/DataFrame/Parquet)
https://spark.apache.org/docs/latest/programming-guide.html (Basics of Spark Development)
https://spark.apache.org/docs/latest/configuration.html (Spark Configuration)
https://academy.datastax.com/demos/datastax-enterprise-joining-tables-apache-spark (Joining Cassandra Tables in Spark)
http://www.infoobjects.com/author/rishi/ (Spark / Parquet Integration)
https://parquet.incubator.apache.org/presentations/ (Parquet Videos/Slides, awesome)
http://www.slideshare.net/databricks/spark-sqlsse2015public (All about DataFrame by the author)
http://www.infoobjects.com/category/spark_cookbook/ (Good walkthrough of Spark Demos)
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html (Spark / Cassandra Integration from scratch)
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/ (Spark by Data Engineer)
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ (Tuning Spark)
https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-21-Sandy-Ryza.pdf
https://www.youtube.com/watch?v=0OM68k3np0E&list=PL-x35fyliRwiiYSXHyI61RXdHlYR3QjZ1&index=5
http://www.slideshare.net/dataera/parquet-format
About Me
o  My vision is to make data preparation fast and reliable
o  I help financial firms with data-intensive processes
o  In my spare time: CrossFit, Baseball, No Gardening!
@dougeisenstein
