Apache Spark Usage in the
Open Source Ecosystem
Hossein Falaki
@mhfalaki
About me
• Software Engineer / part-time Data Scientist at Databricks
• I have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and R notebooks at Databricks
Stack Overflow 2016 trending tech
Apache Spark Philosophy
1. Unified engine: support end-to-end applications
2. High-level APIs: easy to use, rich optimizations
3. Integrate broadly: storage systems, libraries, etc.
[Diagram: SQL, Streaming, ML, Graph, …]
Databricks Community Edition
• In February, Databricks launched a free version of its cloud-based
platform in beta
• Since then, more than 8,000 users have registered
• Users have created over 61,000 notebooks in different languages
• This is an analysis of the third-party libraries that our beta users
imported to complement Apache Spark in Scala, Python, and R
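The slides do not say how imports were extracted from the notebooks. As a rough sketch for the Python case only, one way to collect the top-level packages a cell imports is to walk its syntax tree with the standard `ast` module. The function name `imported_packages` and the sample cell are hypothetical, and this assumes the cell parses as Python 3; notebooks written for Python 2 would need a more tolerant approach.

```python
import ast

def imported_packages(source):
    """Return the set of top-level package names imported by Python source."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes "a"
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" contributes "a"; relative imports are skipped
            packages.add(node.module.split(".")[0])
    return packages

# Hypothetical notebook cell, used only for illustration
cell = """
import re
import numpy as np
from pyspark.sql import functions as F
from sklearn.linear_model import LogisticRegression
"""
print(sorted(imported_packages(cell)))  # ['numpy', 'pyspark', 're', 'sklearn']
```

Collapsing to the top-level name means `pyspark.sql` and `pyspark.ml` count as one package, which may or may not match the grouping used in the talk.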
What % of users use other libraries?
Language   % users importing external libs   Average # libs   Median # libs
Python     75%                               9                2
Scala      55%                               3                1
R          57%                               6                1
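The gap between the average and the median in this table suggests that a small number of users import many libraries. A minimal sketch of how such summary figures could be computed is below. The input mapping `libs_by_user` and its values are made up for illustration, and it is an assumption here that the average and median are taken over users who import at least one library; the slides do not specify this.

```python
from statistics import mean, median

# Hypothetical input: user id -> set of libraries that user imported
libs_by_user = {
    "u1": {"pandas", "numpy", "matplotlib", "sklearn", "scipy", "seaborn"},
    "u2": {"pandas"},
    "u3": set(),
    "u4": {"numpy", "matplotlib"},
}

def summarize(libs_by_user):
    """Share of users importing any library, plus average and median library counts."""
    counts = [len(libs) for libs in libs_by_user.values()]
    # Assumption: averages are computed only over users with at least one import
    importing = [c for c in counts if c > 0]
    return {
        "pct_importing": 100 * len(importing) / len(counts),
        "average_libs": mean(importing),
        "median_libs": median(importing),
    }

print(summarize(libs_by_user))
```

In this toy data one heavy user pulls the average up to 3 while the median stays at 2, the same kind of skew the table shows.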
Installing libraries is easy
Python Packages
Most popular Python packages
What is test_helper?
What are these?
ETL
• re
• datetime
• pandas
• json
• csv
• string
• math / operator
• urllib / urllib2
Visualization
• matplotlib
• ggplot
• seaborn
Advanced analytics
• numpy
• sklearn
• graphframes
• tensorflow
• scipy
Other
• test_helper
• os
• md5
Python package categories
What packages go together?
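The figure for this slide is not available in the text, and the talk does not describe how it was built. One simple way to ask which packages go together is to count how often each pair of packages appears in the same notebook. The sketch below assumes each notebook has already been reduced to a set of package names; `co_occurrence` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(package_sets):
    """Count how many notebooks contain each unordered pair of packages."""
    pairs = Counter()
    for packages in package_sets:
        # Sorting makes ("numpy", "pandas") and ("pandas", "numpy") the same key
        pairs.update(combinations(sorted(packages), 2))
    return pairs

# Hypothetical notebooks, each reduced to the set of packages it imports
notebooks = [
    {"numpy", "pandas", "matplotlib"},
    {"numpy", "pandas"},
    {"re", "json"},
]

for pair, count in co_occurrence(notebooks).most_common(3):
    print(pair, count)
```

Raw pair counts favor packages that are popular overall, so a real analysis would likely normalize them in some way before drawing conclusions.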
Scala Packages
Most popular Scala libraries
What are these?
ETL
• java.util / scala.util
• scala.collection
• scala.math
• java.{io, nio}
• java.text
• org.apache.commons
• kafka
• twitter4j
Visualization
• ?
Advanced analytics
• spark.ml
• graphframes
Other
• java.net
• scala.sys
Scala package categories
What libraries go together?
R Packages
Most popular R packages
What are these?
ETL
• dplyr
• plyr
• reshape2
• jsonlite
• tidyr
• lubridate
• httr
• data.table
Visualization
• ggplot2
• beanplot
• plotly
• ...
Advanced analytics
• SparkR
• h2o
• caret
• e1071
Other
• devtools
• magrittr
R package categories
Comparing Python, Scala & R
Languages have unique features
[Figure labels: Scala / Python / R; R / Python; Scala / Python / R]
• 25% of users use multiple languages
• 3% of notebooks mix different languages
Summary
• Spark users extensively mix it with other packages in different languages
– One of the goals of the Spark project is to work well with other projects
• ETL-related libraries are the most popular category
– Opportunities for new data sources
• Notebooks are being used for “small data” as well as “big data.”
• Languages and their ecosystems have diverse capabilities. Users seem to
be mixing languages to their advantage
– Scala is missing visualization libraries
Try your favorite library in Databricks
http://databricks.com/ce
Try the latest version of Apache Spark and a preview of Spark 2.0
Thank you!
What packages are used together?
