The BDAS Open Source Community
Ion Stoica
UC Berkeley and Databricks
Growing Beyond AMPLab 
As software matures and becomes successful, 
more and more contributors come from outside AMPLab 
New startups have anchored development 
» Databricks (Spark Stack) 
» Mesosphere (Mesos) 
» … 
Enables AMPLab to focus more resources on 
future systems instead of software maintenance
Apache Spark 
[Figure: BDAS stack diagram. Components shown: Cancer Genomics, Energy Debugging, Smart Buildings; MLBase, SparkR, Velox Model Serving, SampleClean, Spark Streaming, Spark SQL, BlinkDB, GraphX, MLlib; Apache Spark (core); Tachyon; HDFS, S3, …; Apache Mesos, YARN]
Apache Spark 
Open Source: end of 2010 
Apache Project: 2013 
Over time has grown to include key libraries 
» Spark Streaming, Spark SQL, MLlib, GraphX 
Becoming a platform for Big Data apps
Apache Spark Today 
[Figure: two bar charts, Commits and Lines of Code Changed, comparing MapReduce, YARN, HDFS, Storm, and Spark; activity in past 6 months]
2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …
Meetups Around the World
Monthly Contributors 
[Figure: monthly contributors by year, 2011 to 2014, with the founding of Databricks marked]
370+ contributors for last 12 months
Spark Stack (2013) 
[Figure: Spark stack diagram as of 2013. Components shown: Cancer Genomics, Energy Debugging, Smart Buildings; BlinkDB, Spark Streaming, MLlib, MLBase, SampleClean, Shark; Apache Spark (core); Tachyon; HDFS, S3, …; Apache Mesos, YARN]
Last Year Developments 
[Figure: stack diagram after the last year's developments. Components shown: Cancer Genomics, Energy Debugging, Smart Buildings; BlinkDB, MLBase, SparkR, Shark, Spark SQL, GraphX, MLlib, Spark Streaming, SampleClean, Velox Model Serving; Apache Spark (core); Tachyon; HDFS, S3, …; Apache Mesos, YARN]
Wide Adoption 
All major Hadoop distributions include Spark 
Beyond Hadoop
Databricks: spurred Spark’s enterprise growth
Apache Mesos 
[Figure: BDAS stack diagram. Components shown: Cancer Genomics, Energy Debugging, Smart Buildings; MLBase, SparkR, Velox Model Serving, SampleClean, Spark Streaming, Spark SQL, BlinkDB, GraphX, MLlib; Apache Spark; Tachyon; HDFS, S3, …; Apache Mesos, YARN]
Apache Mesos 
Open Source: 2010 
Apache Project: 2012 
Used in production at Twitter for past 2.5 years 
» 10,000+ machines 
» 500+ engineers using it 
Most development moved outside Berkeley 
starting in 2012
Monthly Contributors 
[Figure: monthly contributors over time, with the founding of Mesosphere marked]
65 contributors for last 12 months
BDAS Stack 
[Figure: BDAS stack diagram. Components shown: Cancer Genomics, Energy Debugging, Smart Buildings; MLBase, SparkR, Velox Model Serving, SampleClean, Spark Streaming, Spark SQL, BlinkDB, GraphX, MLlib; Apache Spark; Tachyon; HDFS, S3, …; Apache Mesos, YARN]
Release Growth 
Tachyon 0.1 (Dec '12): 1 contributor
Tachyon 0.2 (Apr '13): 3 contributors
Tachyon 0.3 (Oct '13): 15 contributors
Tachyon 0.4 (Feb '14): 30 contributors
Tachyon 0.5 (July '14): 46 contributors
Fast Growing Community 
[Figure: Berkeley contributors vs. non-Berkeley contributors (20+ companies)]
~80% of contributors are already outside AMPLab
Reaching Tipping Point 
Research to Real-World Impact 
[Figure: projects shown between Research and Real-world Impact, by AMPLab/Berkeley vs. non-Berkeley committers / commits: MLlib, Spark Streaming, Spark SQL, Apache Spark (core), Apache Mesos, GraphX, Tachyon, Succinct, Velox, ADAM, BlinkDB]
Impact on AMPLab 
Created a blueprint and ecosystem for other 
BDAS components to succeed 
» MLlib, GraphX, Tachyon, … 
Enabled AMPLab to increase focus on new 
research projects 
» Velox, ADAM, Succinct, …

