CONFIDENTIAL - RESTRICTED
Introduction to Spark
Scala SB Meetup
December 18th 2014
Maxime Dumas
Systems Engineer, Cloudera
Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
2
What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
3
4
Quick and dirty, for context.
The Apache Hadoop Ecosystem
Why Hadoop?
©2014 Cloudera, Inc. All rights reserved.
• Scalability
• Scales simply by adding nodes
• Local processing to avoid network bottlenecks
• Efficiency
• Cost efficiency (<$1k/TB) on commodity hardware
• Unified storage, metadata, security (no duplication or synchronization)
• Flexibility
• All kinds of data (blobs, documents, records, etc.)
• In all forms (structured, semi-structured, unstructured)
• Store anything, then later analyze what you need
Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
6
HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
7
Lots of Commodity Machines
8
Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
9
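The paradigm can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop's API: `map_reduce`, `tokenize`, and `total` are hypothetical names, and an in-memory dictionary stands in for the distributed shuffle.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce phases over an in-memory list."""
    # Map phase: each record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle phase: group values by key (done over the network in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}

def tokenize(line):
    # Hypothetical mapper: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def total(key, values):
    # Hypothetical reducer: sum the counts for one word.
    return sum(values)

counts = map_reduce(["to be or", "not to be"], tokenize, total)
```

Everything has to be phrased as a mapper and a reducer, which is the constraint the later "Can we improve on MR?" slides refer to.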
Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
10
Apache Hive Metastore
• Maps HDFS files to DB-like resources
• Databases
• Tables
• Column/field names, data types
• Roles/users
• InputFormat/OutputFormat
11
CDH: the App Store for Hadoop
12
[Diagram: CDH component stack. Labels: Integration; Storage; Resource Management; Metadata; NoSQL DBMS; Analytic MPP DBMS; Search Engine; In-Memory; Batch Processing; Machine Learning; MapReduce; …; System Management; Data Management; Support; Security]
13
Introduction to Apache Spark
Credits:
• Ben White
• Todd Lipcon
• Ted Malaska
• Jairam Ranganathan
• Jayant Shekhar
• Sandy Ryza
Can we improve on MR?
• Problems with MR:
• Very low-level: requires a lot of code to do simple things
• Very constrained: everything must be described as “map” and “reduce”. Powerful, but sometimes difficult to think in these terms.
14
Can we improve on MR?
• Two approaches to improve on MapReduce:
1. Special-purpose systems that solve one problem domain well.
• Giraph / GraphLab (graph processing)
• Storm (stream processing)
• Impala (real-time SQL)
2. Generalize the capabilities of MapReduce to provide a richer foundation for solving problems.
• Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on the problem!
15
What is Apache Spark?
Spark is a general purpose computational framework
Retains the advantages of MapReduce:
• Linear scalability
• Fault-tolerance
• Data Locality based computations
…but offers so much more:
• Leverages distributed memory for better performance
• Supports iterative algorithms that are impractical in MR
• Improved developer experience
• Full Directed Graph expressions for data parallel computations
• Comes with libraries for machine learning, graph analysis, etc.
16
What is Apache Spark?
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
One of the largest open source projects in big data:
• 170+ developers contributing
• 30+ companies contributing
• 400+ discussions per month on the mailing list
17
Popular project
18
Getting started with Spark
• Java API
• Interactive shells:
• Scala (spark-shell)
• Python (pyspark)
19
Execution modes
20
Execution modes
• Standalone Mode
• Dedicated master and worker daemons
• YARN Client Mode
• Launches a YARN application with the driver program running locally
• YARN Cluster Mode
• Launches a YARN application with the driver program running in the YARN ApplicationMaster
21
• YARN modes: dynamic resource management between Spark, MR, Impala…
• Standalone mode: dedicated Spark runtime with static resource limits
Spark Concepts
22
RDD – Resilient Distributed Dataset
• Collections of objects partitioned across a cluster
• Stored in RAM or on Disk
• You can control persistence and partitioning
• Created by:
• Distributing local collection objects
• Transformation of data in storage
• Transformation of RDDs
• Automatically rebuilt on failure (resilient)
• Contains lineage to compute from storage
• Lazy materialization
23
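As a rough picture of "collections of objects partitioned across a cluster", the sketch below splits a local list into partitions and transforms each one independently. It is plain Python rather than Spark; `split_into_partitions` and `map_partitions` are hypothetical helpers, and nested lists stand in for partitions held on different workers.

```python
def split_into_partitions(items, num_partitions):
    """Split a local collection into contiguous partitions."""
    items = list(items)
    n = len(items)
    # Partition i covers [i*n/p, (i+1)*n/p), so sizes differ by at most one.
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

def map_partitions(partitions, func):
    # Each partition is processed on its own, as a separate worker would do.
    return [[func(x) for x in part] for part in partitions]

parts = split_into_partitions(range(10), 3)
squared = map_partitions(parts, lambda x: x * x)
```

Because no partition needs data from another, the work parallelizes cleanly, and the partition count is the knob that controls how finely it is divided.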
RDD transformations
24
Operations on RDDs
Transformations lazily transform an RDD into a new RDD
• map
• flatMap
• filter
• sample
• join
• sort
• reduceByKey
• …
Actions run the computation and return a value
• collect
• reduce(func)
• foreach(func)
• count
• first, take(n)
• saveAs
• …
25
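The split between lazy transformations and eager actions can be illustrated with a toy class. This is a minimal sketch, not Spark's implementation: `LazyDataset` and `traced_double` are hypothetical, and only `map`, `filter`, `collect`, and `count` are modeled.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self.source = source       # callable returning the input records
        self.steps = tuple(steps)  # recorded transformations, in order

    # Transformations return a new dataset and do no work.
    def map(self, func):
        return LazyDataset(self.source, self.steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self.source, self.steps + (("filter", pred),))

    def _compute(self):
        for item in self.source():
            keep = True
            for kind, func in self.steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions trigger the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

calls = []

def traced_double(x):
    calls.append(x)  # record every invocation to make laziness visible
    return 2 * x

ds = LazyDataset(lambda: range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has run yet
result = ds.collect()             # the action replays the recorded steps
```

Deferring the work until an action is called is what lets an engine see the whole chain before deciding how to execute it.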
Fault Tolerance
• RDDs contain lineage.
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
26
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(…)) → Mapped RDD
27
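Lineage-based recovery for the filter/map example above can be sketched as follows. This is plain Python, not Spark: the log lines are made-up sample data, `compute_partition` is a hypothetical helper, and a dictionary plays the role of partitions cached on worker nodes.

```python
# Source data, already split into partitions (stands in for HDFS blocks).
source = [
    ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    ["ERROR\tweb\t500", "ERROR\tdb\tdeadlock"],
]

# Lineage: the ordered transformations from the example above.
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]

def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    rows = source[index]
    for kind, func in lineage:
        if kind == "filter":
            rows = [r for r in rows if func(r)]
        else:
            rows = [func(r) for r in rows]
    return rows

# Materialize every partition, then simulate losing partition 1.
cache = {i: compute_partition(i) for i in range(len(source))}
del cache[1]

# Recovery: only the missing partition is recomputed.
missing = [i for i in range(len(source)) if i not in cache]
for i in missing:
    cache[i] = compute_partition(i)
```

No replica of the derived data is needed; the source location plus the list of transformations is enough to rebuild what was lost.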
Examples
Word Count in MapReduce
28
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Job.getInstance replaces the deprecated "new Job(conf, name)" constructor.
    Job job = Job.getInstance(conf, "wordcount");
    // Tells Hadoop which jar to ship to the cluster.
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Exit non-zero if the job fails.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Word Count in Spark
sc.textFile("words")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
29
Logistic Regression
• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
• Start with a random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
30
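The gradient descent loop described above can be sketched in plain Python. Several things here are assumptions for illustration: the eight sample points are made up, labels are encoded as -1 and +1, the separating plane passes through the origin (no bias term), and the step size and iteration count are arbitrary. In Spark, the per-point gradients would come from a `map` over the data and the sum from a `reduce`.

```python
import math
import random

# Hypothetical, linearly separable sample: (features, label), label in {-1, +1}.
points = [
    ((2.0, 3.0), 1), ((3.0, 3.0), 1), ((3.0, 4.0), 1), ((4.0, 2.0), 1),
    ((-2.0, -3.0), -1), ((-3.0, -1.0), -1), ((-1.0, -3.0), -1), ((-4.0, -2.0), -1),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def point_gradient(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) for one point."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]

def train(points, iterations=100, step=0.1, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(2)]  # start with a random W
    for _ in range(iterations):
        # Sum a function of W over the data (map, then reduce).
        grads = [point_gradient(w, x, y) for x, y in points]
        total = [sum(component) for component in zip(*grads)]
        # Move W against the gradient, which lowers the loss.
        w = [wi - step * gi for wi, gi in zip(w, total)]
    return w

def predict(w, x):
    return 1 if dot(w, x) > 0 else -1

w = train(points)
accuracy = sum(predict(w, x) == y for x, y in points) / len(points)
```

Every iteration rereads the same points, which is why keeping the dataset in memory between passes matters for this kind of algorithm.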
Intuition
31
Logistic Regression
32
Logistic Regression Performance
33
34
Spark and Hadoop:
a Framework within a Framework
35
36
[Diagram: component stack. Labels: Integration; Storage; Resource Management; Metadata; HBase; Impala; Solr; Spark; MapReduce; …; System Management; Data Management; Support; Security]
Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
• Fault-tolerant like RDDs
• Transformable like RDDs
• Adds new “rolling window” operations
• Rolling averages, etc.
• But keeps everything else!
• Regular Spark code works in Spark Streaming
• Can still access HDFS data, etc.
• Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS.
• Detecting anomalous behavior and triggering alerts.
• Continuous reporting of summary metrics for incoming data.
37
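Micro-batching and a rolling-window average can be sketched with generators. This is plain Python rather than the Spark Streaming API: `micro_batches` and `rolling_average` are hypothetical helpers, the readings are made-up data, and batches are cut by element count here for simplicity, whereas a streaming engine would cut them by time interval.

```python
from collections import deque

def micro_batches(events, batch_size):
    """Chop an event stream into small fixed-size batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def rolling_average(batches, window):
    """Average over the most recent `window` batches, emitted once per batch."""
    recent = deque(maxlen=window)  # old batches fall out automatically
    for batch in batches:
        recent.append(batch)
        values = [v for b in recent for v in b]
        yield sum(values) / len(values)

# Hypothetical readings arriving over time.
readings = [10, 20, 30, 40, 50, 60]
averages = list(rolling_average(micro_batches(readings, 2), window=2))
```

Each batch is an ordinary collection, so the per-batch logic is the same code that would run on static data; only the windowing is new.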
Micro-batching for on the fly ETL
38
What about SQL?
39
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
Fault Recovery Recap
• RDDs store their dependency graph (lineage)
• Because RDD transformations are deterministic, missing RDD partitions can be rebuilt in parallel on other nodes
• Stateful RDDs can have infinite lineage
• Periodic checkpoints to disk clear the lineage
• Faster recovery times
• Better handling of stragglers vs. row-by-row streaming
40
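Why checkpoints matter for stateful streams can be shown with a running total whose state depends on every earlier batch. This is a toy model in plain Python, not Spark's checkpointing mechanism: `StatefulCounter` is hypothetical, and an attribute stands in for state written to disk.

```python
class StatefulCounter:
    """Running total whose state depends on every earlier batch."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0  # last state saved to "disk"
        self.lineage = []    # batches applied since the checkpoint

    def update(self, batch):
        self.lineage.append(batch)
        if len(self.lineage) == self.checkpoint_every:
            # Persist the current state and drop the lineage behind it.
            self.checkpoint = self.recover()
            self.lineage = []

    def recover(self):
        # Recovery replays only the batches after the last checkpoint.
        return self.checkpoint + sum(sum(b) for b in self.lineage)

counter = StatefulCounter(checkpoint_every=3)
for batch in [[1, 2], [3], [4, 5], [6], [7, 8]]:
    counter.update(batch)
```

Without the checkpoint, the lineage list would grow with every batch and recovery would have to replay the stream from the beginning.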
Why Spark?
• Flexible like MapReduce
• High performance
• Machine learning, iterative algorithms
• Interactive data exploration
• Concise, easy API for developer productivity
41
42
Demo Time!
• Log file Analysis
• Machine Learning
• Spark Streaming
What’s Next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs
• http://tiny.cloudera.com/quickstartvm
43
44
Preferably related to the talk… or not.
Questions?
45
Thank You!
Maxime Dumas
mdumas@cloudera.com
We’re hiring.
46


Editor's Notes

  • #4: Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
  • #5: We’re going to breeze through these really quick, just to show how Search plugs in later…
  • #8: Lose a server, no problem. Lose a rack, no problem.