‹#›© Cloudera, Inc. All rights reserved.
Mirko Kämpf | 2015
Apache Spark:
Next Generation Data
Processing for Hadoop
Agenda
• The Data Science Process (DSP)
- Why or when to use Spark
• The role of Apache Hadoop and Apache Spark
- History & Hadoop Ecosystem
• Apache Spark: Overview and Concepts
• Practical Tips
The Data Science Process
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
Huge Data Sets in Science
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
“Spark offers tools for Data Science
and components for Data
Products.”
—How can Apache Spark fit into my world?
Should I use Apache Spark?
• If all my data fits into Excel spreadsheets?
• If I have a special-purpose application to work with?
• If my current system is just a bit too slow?
Why not?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
• Think about additional analysis methods! Maybe they are already built into Apache Spark.
• OK, Spark will probably not speed up your existing system, but maybe you can
offload data to Hadoop, which frees up some resources.
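The first answer, exporting to CSV / JSON and joining with other data, can be tried out on a laptop. The following is a minimal sketch in plain Python (standard library only, no Spark) of what such a join computes. The file contents, the column names `sensor_id`, `location`, and `value`, and the helper `inner_join` are made up for illustration; in Spark the same step would be expressed as a DataFrame join.

```python
import csv
import io
import json

# Hypothetical exports: a spreadsheet saved as CSV and a second source as JSON lines.
csv_export = """sensor_id,location
s1,Berlin
s2,Halle
"""
json_export = """{"sensor_id": "s1", "value": 21.5}
{"sensor_id": "s2", "value": 19.0}
{"sensor_id": "s1", "value": 22.1}
"""


def inner_join(left_rows, right_rows, key):
    """Join two lists of dicts on a shared key, keeping only matching rows."""
    index = {}
    for row in left_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in right_rows:
        for match in index.get(row[key], []):
            # Merge the fields of both rows into one record.
            joined.append({**match, **row})
    return joined


locations = list(csv.DictReader(io.StringIO(csv_export)))
readings = [json.loads(line) for line in json_export.splitlines() if line.strip()]
joined = inner_join(locations, readings, "sensor_id")

for row in joined:
    print(row)
```

Each reading is enriched with the location of its sensor. Once the data no longer fits on one machine, the same logic is what a distributed DataFrame join performs across the cluster.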
“Spark offers fast in-memory processing on
huge distributed and even heterogeneous
datasets.”
—What type of data fits into Spark?
History of Spark
Spark is really young, but has a very
active community!
Timeline: Spark Adoption
Apache Spark:
Overview & Concepts
Hadoop Ecosystem incl. Apache Spark
Spark can be an entry point to your Big Data world …
“Apache Spark is distributed on top of Hadoop
and brings parallel processing
to powerful workstations.”
—Do I need a Hadoop cluster to work with Apache Spark?
Spark vs. MapReduce
How to interact with Spark?
Spark Components
MLlib:
• Basic statistics
- summary statistics, correlations, stratified sampling,
hypothesis testing, random data generation
• Classification and regression
- linear models (SVMs, logistic / linear regression)
- naive Bayes, decision trees
- ensembles of trees (Random Forests / Gradient-Boosted Trees)
- isotonic regression
• Collaborative filtering
- alternating least squares (ALS)
• Clustering
- k-means, Gaussian mixture, power iteration clustering (PIC)
- latent Dirichlet allocation (LDA), streaming k-means
• Dimensionality reduction
- singular value decomposition (SVD)
- principal component analysis (PCA)
• …
GraphX:
• PageRank
• Connected Components
• Triangle Counting
• Pregel API
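To give a feeling for what one of the listed graph algorithms computes, here is a small PageRank sketch in plain Python. It is not the GraphX API and does not run on Spark; the function `pagerank`, the damping factor of 0.85, the iteration count, and the four-node example graph are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Made-up example graph: a cycle a -> b -> c -> a, plus d pointing at c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)

for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

Node `c` collects rank from two sources and ends up on top, while `d` has no incoming links and keeps only the teleport share. A distributed implementation applies the same update rule, but partitions the edges across the cluster.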
How to use your code in Spark?
A. Interactively, by loading it into the spark-shell.
B. Contribute to existing Spark projects.
C. Create your module and use it in a spark-shell session.
D. Build a data-product which uses Apache Spark.
For simple and reliable usage of Java classes
and complete third-party libraries, we define
a Spark Module as a self-contained artifact
created by Maven. This module can easily
be shared by multiple users via repositories.
http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
Apache Spark:
Overview & Concepts
Spark Context
RDDs and DataFrames
Creation of RDDs
Datatypes in RDDs
Spark in a Cluster
DStream: The heart of Spark Streaming
“Efficient hardware utilization, caching,
simple APIs, and access to a variety of data
in Hadoop are key to success.”
—What makes Spark so different, compared to core MapReduce?
Practical Tips
Development Techniques
• Build your tools and analysis procedures in small cycles.
• Test all phases of your work and document carefully.
• Document what you expect! => Requirements management …
• Collect what you get! => Operational logs …
• Reuse well-tested components and modularize your analysis scripts.
• Learn state-of-the-art tools and share your work!
Data Management
• Think about typical access patterns:
- random access to each record or field?
- access to entire groups of records?
- variable-size or fixed-size sets?
- full table scan?
• OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN!
• Select efficient storage formats: Avro, Parquet
• Index your data in Solr for random access and data exploration
- Indexing can be done with just a few clicks in Hue …
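Why the dominant access pattern matters for the storage format can be shown with a toy model. The sketch below, in plain Python, contrasts a row-oriented layout (the idea behind Avro) with a column-oriented layout (the idea behind Parquet) and counts how many stored values a scan over a single field has to touch. The records, field names, and the two helper functions are invented for illustration and ignore compression, encoding, and file structure.

```python
# Made-up sample records.
records = [
    {"id": 1, "station": "A", "temp": 20.5},
    {"id": 2, "station": "B", "temp": 18.0},
    {"id": 3, "station": "A", "temp": 21.5},
]

# Row-oriented layout: one complete record after another.
row_store = [tuple(record.values()) for record in records]

# Column-oriented layout: one list of values per field.
column_store = {field: [record[field] for record in records] for field in records[0]}


def scan_field_in_rows(store, field_index):
    """Read one field from a row layout; every value of every record is touched."""
    touched = 0
    result = []
    for row in store:
        touched += len(row)
        result.append(row[field_index])
    return result, touched


def scan_field_in_columns(store, field):
    """Read one field from a column layout; only that column is touched."""
    column = store[field]
    return list(column), len(column)


row_temps, row_cost = scan_field_in_rows(row_store, 2)
col_temps, col_cost = scan_field_in_columns(column_store, "temp")
mean_temp = sum(col_temps) / len(col_temps)

print("row layout touched", row_cost, "values")
print("column layout touched", col_cost, "values")
print("mean temperature:", mean_temp)
```

For an analytical scan over one field, the column layout touches a third of the values in this example. Fetching one complete record is the opposite case: the row layout returns it in a single lookup, while the column layout needs one lookup per field.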
Collecting Sensor Data with Spark Streaming …
• Spark Streaming works on fixed time slices only (in the current version, 1.5)
• Use the original time stamp?
- Requires additional storage and bandwidth
- Original system clock defines resolution
• Use "Spark-Time" or a local time reference:
- You may lose information!
- You have a limited resolution, defined by the batch interval.
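The trade-off between the two time references can be illustrated without Spark. The plain-Python sketch below groups sensor events into fixed time slices and compares keeping the original time stamp with stamping each reading with the start of its slice. The 2-second interval, the event values, and the helper `to_batches` are assumptions for illustration; the sketch also simplifies by treating the sensor time stamp as the arrival time, whereas Spark Streaming assigns records to batches by arrival.

```python
from collections import defaultdict

BATCH_INTERVAL = 2.0  # seconds; assumed value for this sketch

# Made-up events: (original sensor time stamp in seconds, reading).
events = [(0.2, 1.0), (0.9, 1.2), (1.7, 1.1), (2.4, 3.0), (3.9, 2.8), (4.1, 0.5)]


def to_batches(events, interval):
    """Assign every event to the fixed time slice that contains its time stamp."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = int(timestamp // interval) * interval
        batches[batch_start].append((timestamp, value))
    return dict(batches)


batches = to_batches(events, BATCH_INTERVAL)

# Option 1: carry the original time stamp along with each reading.
# Full resolution, but one extra field to store and transfer per event.
with_original_time = [
    (timestamp, value) for batch in batches.values() for timestamp, value in batch
]

# Option 2: stamp each reading with the start of its batch.
# No extra field, but all events in a slice share one time value.
with_batch_time = [
    (start, value) for start, batch in sorted(batches.items()) for _, value in batch
]

distinct_original = len({timestamp for timestamp, _ in with_original_time})
distinct_batch = len({timestamp for timestamp, _ in with_batch_time})

print("distinct time values with original stamps:", distinct_original)
print("distinct time values with batch time:", distinct_batch)
```

With batch time, the six events collapse onto three time values, so the ordering and spacing of events inside a slice are lost. Shortening the batch interval improves the resolution at the cost of more, smaller batches.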
Thank you!
Enjoy Apache Spark and all your data …

Apache Spark in Scientific Applications
