Spark
Spark 
- Summit 
- News 
- Basics 
- Advanced 
- Subprojects 
- Use Cases 
- Resources
Summit 
- 1,164 participants from over 453 companies 
attended 
- Spark Training sold out at 300 participants 
- 31 organizations sponsored the event 
- 12 keynotes and 52 community presentations 
were given
News 
- Project 
- Databricks
Project 
- 1.0.0 release 
- Graduated from the Apache Incubator
- Very active community
Very active community 
- Top three Apache projects 
- Most active Big Data project 
- > 50 companies 
- > 250 contributors 
- > 175,000 LOC
Databricks 
- Certification 
- Cloud
Certification 
- Every certified app will 
run on every certified 
distribution 
- Distribution Partners 
- App Partners
Distribution Partners 
- Cloudera 
- MapR 
- Hortonworks 
- Pivotal 
- IBM 
- Amazon Web Services 
- SAP
App Partners 
- Alteryx 
- DataStax
- 0xdata 
- Typesafe 
- Zoomdata
Cloud 
- Vision: Make Big Data Easy! 
- Product: Badass 
- Hosted Platform 
- Cluster Management 
- Interactive Workspace
Interactive Workspace 
- Notebooks 
- Dashboards 
- Jobs
Dashboards 
- WYSIWYG Builder 
- Interactive plots 
- One-click publishing
Spark Basics 
- Execution 
- RDDs 
- Caching 
- Broadcast 
- Languages
Execution 
- Apply Functional Operators 
across Distributed Collections 
- Master / Worker 
- Lazy 
- Parallelize with Threads first
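
The execution model on this slide can be illustrated with a small plain-Python sketch. It is not the Spark API: `LazyCollection` and its methods are hypothetical names chosen for illustration. Transformations only record work, an action triggers it, and each partition is processed on a thread.

```python
import functools
from concurrent.futures import ThreadPoolExecutor


class LazyCollection:
    """Toy stand-in for a distributed collection (hypothetical, not Spark)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per "worker"
        self.ops = ops                # recorded transformations, not yet run

    def map(self, fn):
        # Transformations only record work; nothing executes here.
        return LazyCollection(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self.partitions, self.ops + (("filter", pred),))

    def _run_partition(self, part):
        items = iter(part)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def reduce(self, fn):
        # An action: run every recorded transformation, one thread per task.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(self._run_partition, self.partitions))
        return functools.reduce(fn, [x for part in parts for x in part])


calls = []


def square(x):
    calls.append(x)  # records when the function actually runs
    return x * x


data = LazyCollection([[1, 2, 3], [4, 5, 6]])
pipeline = data.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)             # nothing has run yet
total = pipeline.reduce(lambda a, b: a + b)  # 4 + 16 + 36
```

Until `reduce` is called, `square` is never invoked, which is the "Lazy" bullet in miniature.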
RDDs 
- Interface for dataset 
- Backed by anything 
- Any InputFormat class 
- HDFS default
Caching 
- Store intermediate 
results in memory 
- Partition-locality 
- Significant speed-up for 
iterative algorithms
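
Why caching helps iterative algorithms can be shown with a toy example in plain Python. The `Dataset` class and `iterate_mean` function are hypothetical illustrations, not Spark classes: without caching, the source is recomputed on every pass; with caching, it is computed once.

```python
class Dataset:
    """Toy dataset that is recomputed on every read unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0  # how many times the source was recomputed

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


def parse():
    # Stands in for an expensive load-and-parse step.
    return [float(x) for x in "1 2 3 4 10".split()]


def iterate_mean(ds, steps=5, rate=0.5):
    # A simple iterative algorithm: gradient descent towards the mean.
    est = 0.0
    for _ in range(steps):
        points = ds.collect()  # every iteration reads the full dataset
        grad = sum(est - p for p in points) / len(points)
        est -= rate * grad
    return est


uncached = Dataset(parse)
cached = Dataset(parse).cache()
est_uncached = iterate_mean(uncached)
est_cached = iterate_mean(cached)
```

Both runs return the same estimate, but `uncached.compute_count` grows with the number of iterations while `cached.compute_count` stays at one.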
Broadcast 
- Send immutable object 
to all workers 
- Similar to 
DistributedCache in 
MapReduce
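
A rough plain-Python sketch of the broadcast idea follows. `Worker`, `broadcast`, and the `country_names` table are hypothetical names for illustration, not Spark's API: a read-only object is shipped to each worker once, and every task on that worker reuses it.

```python
from types import MappingProxyType


class Worker:
    """Toy worker that keeps broadcast values between tasks."""

    def __init__(self):
        self.broadcast_vars = {}
        self.received = 0  # number of times a value was shipped here

    def receive(self, name, value):
        self.received += 1
        self.broadcast_vars[name] = value

    def run_task(self, records):
        names = self.broadcast_vars["country_names"]
        return [names.get(code, "unknown") for code in records]


def broadcast(workers, name, value):
    frozen = MappingProxyType(dict(value))  # read-only view of a private copy
    for worker in workers:
        worker.receive(name, frozen)        # one transfer per worker
    return frozen


workers = [Worker(), Worker()]
frozen = broadcast(workers, "country_names",
                   {"US": "United States", "DE": "Germany"})

tasks = [["US", "DE"], ["DE", "FR"], ["US"], ["FR"]]
results = [workers[i % 2].run_task(t) for i, t in enumerate(tasks)]

try:
    frozen["FR"] = "France"
    mutation_blocked = False
except TypeError:
    mutation_blocked = True  # the broadcast value cannot be modified
```

Four tasks run, yet each worker receives the lookup table only once.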
Languages 
- Scala 
- Python 
- Java 7 
- Java 8 
- R 
- Clojure
Advanced 
- Partitioning 
- Persistence Options 
- Checkpointing 
- Accumulators 
- Optimizations
Subprojects 
- SparkSQL 
- Tachyon 
- Spark Streaming 
- MLlib
- GraphX 
- BlinkDB 
- Spark Job Server
SparkSQL 
- Replaces Shark 
- Core 
- Catalyst 
- Libraries
Core 
- SchemaRDDs 
- Query Execution 
- Caching
Catalyst 
- Relational algebra 
- Expressions / UDFs 
- Query Planning 
- Optimizer
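
To give a feel for what an optimizer over expression trees does, here is a minimal plain-Python sketch of one common rewrite, constant folding. The node classes and function names are hypothetical; Catalyst itself is written in Scala, and its actual rules are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite rule: replace Add(Literal, Literal) with a single Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return expr


def evaluate(expr, row):
    """Interpret an expression tree against one row (a dict)."""
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Column):
        return row[expr.name]
    return evaluate(expr.left, row) + evaluate(expr.right, row)


plan = Add(Column("x"), Add(Literal(1), Literal(2)))  # x + (1 + 2)
optimized = fold_constants(plan)                      # x + 3
```

The rewritten tree gives the same result for every row but does less work per row, which is the point of optimizing before execution.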
Libraries 
- POJOs 
- JDBC 
- JSON 
- Parquet 
- Hive
Hive 
- Catalog info from Metastore 
- Helps connect UIs such as
MicroStrategy / Tableau
- Wrappers for UDFs, UDAFs,
UDTFs 
- Supports TRANSFORM 
- Supports SerDes
Tachyon 
- In Memory (Off-Heap) Distributed 
Datastore 
- Change URI from hdfs:// to tachyon:// 
- Share datasets between jobs without 
HDFS 
- Helps scaling by off-loading allocation 
responsibility and GC pauses from 
executor processes
Spark Streaming 
- Real-time streams 
- Micro-batching 
- Windowed 
Computations 
- Lambda Architecture
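
Micro-batching and windowed computations can be sketched in plain Python as below. The function names and the sample events are assumptions made for illustration, not the Spark Streaming API: events are grouped into fixed-interval batches, and a sliding window aggregates the most recent batches.

```python
from collections import Counter, deque


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-size batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Intervals with no events still produce an (empty) batch.
    return [batches.get(i, []) for i in range(first, last + 1)]


def windowed_counts(batches, window_length):
    """Count values over a sliding window of the last `window_length` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for batch in batches:
        window.append(Counter(batch))
        results.append(dict(sum(window, Counter())))
    return results


events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "a"), (2.9, "c"), (3.4, "b")]
batches = micro_batches(events, batch_interval=1.0)
counts = windowed_counts(batches, window_length=2)
```

Each entry in `counts` covers the current batch plus the one before it, so a value stops being counted once its batch slides out of the window.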
MLlib
- Summary statistics 
- Regression 
- Classification 
- Clustering 
- Collaborative Filtering 
- Optimization 
- Dimensionality Reduction
GraphX 
- Graph, VertexRDD, EdgeRDD 
objects and operations 
- Pregel API 
- mapReduceTriplets List<V,E,V> 
- Graph analytics libraries
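
The Pregel style of computation (vertices exchange messages in synchronized supersteps until nothing changes) can be sketched in plain Python. This is an illustration only: the function name is hypothetical and it does not use the GraphX API. It labels each vertex with the smallest vertex id in its connected component.

```python
def pregel_connected_components(vertices, edges, max_supersteps=20):
    """Propagate the minimum vertex id through each connected component."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    label = {v: v for v in vertices}
    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [label[n] for n in neighbors[v]] for v in vertices}

    for _ in range(max_supersteps):
        changed = {}
        for v, messages in inbox.items():
            # Vertex program: adopt the smallest label seen so far.
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                changed[v] = label[v]
        if not changed:
            break  # no vertex changed, so no more messages: halt
        # Only vertices that changed send messages in the next superstep.
        inbox = {v: [] for v in vertices}
        for v, new_label in changed.items():
            for n in neighbors[v]:
                inbox[n].append(new_label)
    return label


components = pregel_connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

Vertices 1 to 3 end up sharing one label, 4 and 5 another, and the isolated vertex 6 keeps its own.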
Graph analytics libraries 
- ConnectedComponents 
- PageRank 
- TriangleCount 
- ShortestPaths 
- SVDPlusPlus
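
As an illustration of one algorithm on this list, here is a minimal plain-Python PageRank. It uses the textbook normalized formulation with a damping factor, which may differ in detail from the GraphX implementation; the function name and the example graph are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration; ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets the "teleport" share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# "c" is linked from both "a" and "b"; "b" has no incoming links.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
```

In this small graph, `c` ranks highest, `a` benefits from being linked by `c`, and `b` is left with only the teleport share.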
BlinkDB 
- Get estimated results 
- Time bound 
- Error bound
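
The idea of answering from a sample and reporting an error bound can be sketched in plain Python. This is a generic sampling estimate with a normal-approximation confidence interval, not BlinkDB's own machinery; `approximate_mean` and its parameters are hypothetical, and only the error bound (not the time bound) is illustrated.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a random sample.

    Returns (estimate, margin); the margin is roughly a 95% bound
    under a normal approximation when z=1.96.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


population = list(range(100_000))
true_mean = statistics.mean(population)

# A smaller sample is cheaper to scan but gives a wider error bound.
mean_small, margin_small = approximate_mean(population, sample_size=100)
mean_large, margin_large = approximate_mean(population, sample_size=10_000)
```

Trading sample size against the margin is the basic knob: a tighter error bound needs more rows, and a tighter time budget allows fewer.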
Spark Job Server 
- Runs multiple jobs / contexts 
in same process 
- Allows for RDD Caching / 
Sharing between jobs 
- Job Persistence
Use Cases 
- Spotify 
- Real-time Auctions - Sharethrough
- Real-time Recommendations - Graphflow 
- Cancer Genomics - AMPLab 
- Malware Detection - F-Secure 
- Media Distribution Analytics - NBC Universal 
- Personal Fitness - Jawbone 
- Neuroscience - HHMI
Resources 
- Code 
- Event 
- Technology 
- Videos
Code 
- https://github.com/apache/spark
Event 
- spark-summit.org 
- http://arjon.es/2014/06/30/spark-summit-2014-day-1/
- https://www.crowdchat.net/chat/c3BvdF9vYmpfODc=
- https://nathanbrixius.wordpress.com/2014/07/02/spark-summit-keynote-notes/
- http://thomaswdinsmore.com/2014/07/03/spark-summit-2014-roundup/
Technology 
- Learning Spark (O'Reilly eBook) 
- www.spark-stack.org 
- ampcamp.berkeley.edu 
- https://amplab.cs.berkeley.edu/2013/10/23/got-a-minute-spin-up-a-spark-cluster-on-your-laptop-with-docker/
YouTube 
- AMPLab
https://www.youtube.com/channel/UCWudC4d9i-2yxR5tuen-Nuw
- Databricks
https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA
- Apache Spark
https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w

Austin Data Meetup 092014 - Spark
