By Tsai Li Ming

PyData + Spark Meetup (SG) - 17 Nov 2015

http://about.me/tsailiming
Presentation and source codes are available here:

http://github.com/tsailiming/pydatasg-17Nov2015
What is Spark?
• Developed at UC Berkeley in 2009; open sourced in 2010.

• Fast and general engine for large-scale data processing.

• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

• Multi-step Directed Acyclic Graphs (DAGs): many stages, compared to Hadoop's single Map and Reduce.

• Rich Scala, Java and Python APIs. R too!

• Interactive shell.

• Active development.
What is Spark?
Spark Stack
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
Speed matters
http://hblok.net/blog/storage/
2011
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Logistic Regression Performance
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
How Spark works
Resilient Distributed Datasets (RDDs)
• Basic abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.

• RDDs can be created from the local file system, HDFS, Cassandra, HBase, Amazon S3, SequenceFiles, and any other Hadoop InputFormat.

• Different levels of caching: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc.

• Rich APIs for Transformations and Actions.

• Data locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL.
RDD Operations
map flatMap sortByKey
filter union reduce
sample join count
groupByKey distinct saveAsTextFile
reduceByKey mapValues first
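The semantics of a few of these operations can be sketched in plain Python on a local list. This is not the Spark API — just a minimal, hypothetical illustration of what flatMap, map and reduceByKey compute:

```python
# Pure-Python sketch (not the Spark API) of flatMap / map / reduceByKey
# semantics, applied to a small local dataset.
from itertools import chain
from collections import defaultdict

data = ["to be", "or not", "to be"]

# flatMap: apply a function to each element, then flatten the results
flat = list(chain.from_iterable(line.split(" ") for line in data))

# map: pair each word with a count of 1
pairs = [(word, 1) for word in flat]

# reduceByKey: combine all values that share the same key
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same pipeline runs partition-by-partition across the cluster, but the per-element logic is the same.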
Spark Example
Wordcount Example
//package org.myorg;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Hadoop MapReduce

Spark Scala
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark Python

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Spark SQL and Dataframe Example
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.read.json("people.json")

# Show the content of the DataFrame
df.show()
## age  name
## null Michael
## 30   Andy
## 19   Justin

# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
## name
## Michael
## Andy
## Justin

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
## name    (age + 1)
## Michael null
## Andy    31
## Justin  20

# Select people older than 21
df.filter(df['age'] > 21).show()
## age name
## 30  Andy

# Count people by age
df.groupBy("age").count().show()
## age  count
## null 1
## 19   1
## 30   1
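The column expressions above behave much like plain Python over a list of records, with null propagating through arithmetic and failing filters. A minimal sketch (this is plain Python, not the Spark API, using the same three records as an assumption):

```python
# Plain-Python sketch of the DataFrame operations above,
# applied to the same three records.
from collections import Counter

people = [{"age": None, "name": "Michael"},
          {"age": 30, "name": "Andy"},
          {"age": 19, "name": "Justin"}]

# df.select(df['name'], df['age'] + 1): null ages stay null
selected = [(p["name"], p["age"] + 1 if p["age"] is not None else None)
            for p in people]

# df.filter(df['age'] > 21): a null age never satisfies the predicate
older = [p for p in people if p["age"] is not None and p["age"] > 21]

# df.groupBy("age").count()
by_age = Counter(p["age"] for p in people)
```

Spark evaluates the same logic lazily and in parallel, and the Catalyst optimizer may reorder or fuse these steps before execution.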
andypetrella/spark-notebook

(forked from Scala notebook)
Apache Zeppelin
Notebooks for Spark
Actual Demo
PySpark with Jupyter
Thank You!

