An Introduction to Apache Spark
with Amazon EMR
Peter Smith, Principal Software Engineer, ACL
Overview
• What is Spark?
• A Brief Timeline of Spark
• Storing data in RDDs and DataFrames
• Distributed Processing of Distributed Data
• Loading data: CSV, Parquet, JDBC
• Queries: SQL, Scala, PySpark
• Setting up Spark with EMR (Demo)
What is Spark?
Apache Spark is a unified analytics engine
for large-scale data processing.
• Can process Terabytes of data (billions of rows)
• Click streams from a web application.
• IoT data.
• Financial trades.
• Computation performed over multiple (potentially thousands of) compute nodes.
• Has held the world record for sorting 100TB of data in 23 minutes (2014)
Usage Scenarios
• Batch Processing
– Large amounts of data, read from disk, then processed.
• Streaming Data
– Data is processed in real-time
• Machine Learning
– Predicting outcomes based on past experience
• Graph Processing
– Arbitrary data relationships, not just rows and columns
Spark versus Database
A traditional SQL database: a SQL Client sends Select, Insert, and Update statements to the SQL Interpreter / Execution Engine inside the SQL Database.
• Disk files in proprietary format (e.g. B-Trees, WALs)
• Users never look directly at data files.
• Execution engine has 100% control over file storage.
• Often the database server is a single machine (with lots of CPUs)
Spark versus Database
Spark: a Spark Driver (accepting SQL, Java, Scala, Python, or R) farms work out to multiple Spark Workers running on separate EC2 servers, reading and writing data in an S3 bucket or Amazon EFS.
• Disk formats and locations are 100% controlled by the user.
• No transactional Inserts or Updates!
• Compute is spread over multiple servers to improve scale.
A Brief Timeline of Spark
(Timeline figure.) 2003 – Google File System paper; 2004 – MapReduce paper; further milestones in 2006, 2010, 2011, and 2013. Current releases as of this talk: Hadoop 2.8.4 and Spark 2.3.2.
How is Data Stored?
Spark allows data to be read or written from disk in a range of formats:
• CSV – Possibly the simplest and most common: Fred,Jones,1998,10,Stan
• JSON – Often generated by web applications. { "first_name": "Fred", … }
• JDBC – If the source data is in a database.
• Parquet and ORC – Optimized storage for column-oriented queries.
• Others – You’re free to write your own connectors.
Data is read or written as complete files – Spark doesn't support in-place inserts or updates (unlike a transactional database, which completely controls the structure of its data).
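As a sketch of how symmetrical the read/write API is across formats (the file names here are hypothetical):
scala> val clicks = spark.read.json("clicks.json")    // one JSON object per line
scala> clicks.write.parquet("clicks.parquet")         // re-save as columnar Parquet
scala> val again = spark.read.parquet("clicks.parquet")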
RDDs - How Spark Stores Data
• RDD – Resilient Distributed Dataset
• Data is stored in RAM, partitioned across multiple servers
• Each partition operates in parallel.
Instead of using database replication for resilience, Spark re-performs lost work on a different worker.
Example: Sort people by age.
1. Divide the people into partitions of 8.
2. Within each partition, sort by age.
3. Shuffle people between partitions based on decade of birth.
4. Sort again within each partition.
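A minimal spark-shell sketch of partitioned, parallel work (the numbers are illustrative):
scala> val rdd = sc.parallelize(1 to 1000000, numSlices = 8)  // 8 partitions
scala> rdd.getNumPartitions
res0: Int = 8
scala> rdd.sortBy(x => x).take(3)  // the sort runs as one task per partition
res1: Array[Int] = Array(1, 2, 3)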
Data Frames – Rows/Columns (> Spark v1.5)
• RDD – Rows of Java Objects
• DataFrame – Rows of Typed Fields (like a Database table)
Id         First_name  Last_name  BirthYear  Shoe size  Dog's name
(Integer)  (String)    (String)   (Integer)  (Float)    (String)
---------  ----------  ---------  ---------  ---------  ----------
1          Fran        Brown      1982       10.5       Stan
2          Mary        Jones      1976       9.0        Fido
3          Brad        Pitt       1963       11.0       Barker
4          Jane        Simpson    1988       8.0        Rex
…          …           …          …          …          …
• DataFrames allow better type-safety and performance optimization.
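For illustration, a DataFrame with typed columns can also be built straight from Scala objects (a sketch; the Person case class is ours, and spark-shell auto-imports spark.implicits._):
scala> case class Person(id: Long, firstName: String, birthYear: Int)
scala> val people = Seq(Person(1, "Fran", 1982), Person(2, "Mary", 1976)).toDF
scala> people.printSchema  // id: long, firstName: string, birthYear: integer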
Example: Loading data from a CSV File
data.csv:
1,Fran,Brown,1982,10.5,Stan
2,Mary,Jones,1976,9.0,Fido
3,Brad,Pitt,1963,11.0,Barker
4,Jane,Simpson,1988,8.0,Rex
5,James,Thompson,1980,9.5,Bif
6,Paul,Wilson,1967,8.5,Ariel
7,Alice,Carlton,1984,11.5,Hank
8,Mike,Taylor,1981,9.5,Quincy
9,Shona,Smith,1975,9.0,Juneau
10,Phil,Arnold,1978,10.0,Koda
Example: Loading data from a CSV File
$ spark-shell
scala> val df = spark.read.csv("data.csv")
Notes:
• Similar methods exist for JSON, JDBC, Parquet, etc.
• You can write your own!
• Scala is a general purpose programming language (not like SQL)
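The CSV reader also takes options; for example (a sketch, assuming a hypothetical file whose first row holds column names):
scala> val df2 = spark.read.option("header", "true").csv("header_data.csv")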
Example: Examining a Data Frame
scala> df.show(5)
+---+-----+--------+----+----+------+
|_c0| _c1| _c2| _c3| _c4| _c5|
+---+-----+--------+----+----+------+
| 1| Fran| Brown|1982|10.5| Stan|
| 2| Mary| Jones|1976| 9.0| Fido|
| 3| Brad| Pitt|1963|11.0|Barker|
| 4| Jane| Simpson|1988| 8.0| Rex|
| 5|James|Thompson|1980| 9.5| Bif|
+---+-----+--------+----+----+------+
Example: Defining a Schema
Without a schema, every column comes back as a nullable string:
scala> df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
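One alternative is letting Spark sample the data and guess the types (a sketch; guessed types are convenient, but an explicit schema is safer for production):
scala> val guessed = spark.read.option("inferSchema", "true").csv("data.csv")
scala> guessed.printSchema  // _c0 becomes integer, _c4 double, etc.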
Example: Defining a Schema
scala> import org.apache.spark.sql.types._
scala> val mySchema = StructType(
  Array(
    StructField("id", LongType),
    StructField("first_name", StringType),
    StructField("last_name", StringType),
    StructField("birth_year", IntegerType),
    StructField("shoe_size", FloatType),
    StructField("dog_name", StringType)
  )
)
scala> val df = spark.read.schema(mySchema).csv("data.csv")
Example: Defining a Schema
scala> df.show(5)
+---+----------+---------+----------+---------+--------+
| id|first_name|last_name|birth_year|shoe_size|dog_name|
+---+----------+---------+----------+---------+--------+
| 1| Fran| Brown| 1982| 10.5| Stan|
| 2| Mary| Jones| 1976| 9.0| Fido|
| 3| Brad| Pitt| 1963| 11.0| Barker|
| 4| Jane| Simpson| 1988| 8.0| Rex|
| 5| James| Thompson| 1980| 9.5| Bif|
+---+----------+---------+----------+---------+--------+
scala> df.printSchema
root
|-- id: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- birth_year: integer (nullable = true)
|-- shoe_size: float (nullable = true)
|-- dog_name: string (nullable = true)
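Since Spark 2.3 (the version used here), the same schema can also be given as a DDL-style string, which is often more compact (a sketch):
scala> val df2 = spark.read
  .schema("id LONG, first_name STRING, last_name STRING, " +
          "birth_year INT, shoe_size FLOAT, dog_name STRING")
  .csv("data.csv")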
Example: Counting Records
scala> df.count()
res21: Long = 10
Now imagine counting 10 billion rows spread over 1,000 servers.
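count() scales because each partition is counted in parallel and only the per-partition totals travel back to the driver. A sketch for inspecting and changing the partition count:
scala> df.rdd.getNumPartitions         // how many partitions back this DataFrame
scala> val big = df.repartition(1000)  // redistribute rows across 1,000 partitions
scala> big.count()                     // runs as 1,000 parallel tasks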
Example: Selecting Columns
scala> import org.apache.spark.sql.functions._
scala> val df_dog = df.select(
col("first_name"),
col("dog_name"))
scala> df_dog.show(5)
+----------+--------+
|first_name|dog_name|
+----------+--------+
| Fran| Stan|
| Mary| Fido|
| Brad| Barker|
| Jane| Rex|
| James| Bif|
+----------+--------+
Example: Aggregations
scala> df.agg(
min(col("birth_year")),
avg(col("birth_year"))
).show
+---------------+---------------+
|min(birth_year)|avg(birth_year)|
+---------------+---------------+
| 1963| 1977.4|
+---------------+---------------+
Example: Filtering
scala> df.where("birth_year > 1980").show
+---+----------+---------+----------+---------+--------+
| id|first_name|last_name|birth_year|shoe_size|dog_name|
+---+----------+---------+----------+---------+--------+
| 1| Fran| Brown| 1982| 10.5| Stan|
| 4| Jane| Simpson| 1988| 8.0| Rex|
| 7| Alice| Carlton| 1984| 11.5| Hank|
| 8| Mike| Taylor| 1981| 9.5| Quincy|
+---+----------+---------+----------+---------+--------+
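The same filter can be written with column expressions instead of a SQL string (where and filter are interchangeable); a sketch:
scala> df.filter(col("birth_year") > 1980).show
scala> df.where(col("birth_year") > 1980 && col("shoe_size") < 10).show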
Example: Grouping
scala> df.groupBy(
(floor(col("birth_year") / 10) * 10) as "Decade"
).count.show
+------+-----+
|Decade|count|
+------+-----+
| 1960| 2|
| 1970| 3|
| 1980| 5|
+------+-----+
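The overview listed SQL as a query language: the same grouping can be run as literal SQL by registering the DataFrame as a temporary view (a sketch; the view name is ours):
scala> df.createOrReplaceTempView("people")
scala> spark.sql("""
  SELECT floor(birth_year / 10) * 10 AS decade, count(*) AS cnt
  FROM people
  GROUP BY floor(birth_year / 10) * 10
""").show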
Example: More Advanced
scala> df.select(
col("first_name"),
col("dog_name"),
levenshtein(
col("first_name"),
col("dog_name")
) as "Diff"
).show(5)
+----------+--------+----+
|first_name|dog_name|Diff|
+----------+--------+----+
| Fran| Stan| 2|
| Mary| Fido| 4|
| Brad| Barker| 4|
| Jane| Rex| 4|
| James| Bif| 5|
+----------+--------+----+
Shorter version (spark-shell auto-imports spark.implicits._, so columns can be written as Scala symbols):
df.select(
'first_name,
'dog_name,
levenshtein(
'first_name,
'dog_name) as "Diff"
).show(5)
Queries: User Defined Functions
def taxRateFunc(year: Int) = {
if (year >= 1984) 0.20 else 0.05
}
val taxRate = udf(taxRateFunc _)
df.select('birth_year, taxRate('birth_year)).show(5)
+----------+---------------+
|birth_year|UDF(birth_year)|
+----------+---------------+
| 1982| 0.05|
| 1976| 0.05|
| 1963| 0.05|
| 1988| 0.20|
| 1980| 0.05|
+----------+---------------+
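UDFs can also be registered for use from SQL (a sketch; the SQL function name tax_rate and the people view are ours):
scala> df.createOrReplaceTempView("people")
scala> spark.udf.register("tax_rate", taxRateFunc _)
scala> spark.sql("SELECT birth_year, tax_rate(birth_year) AS rate FROM people").show(5)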
UDAFs – for user-defined aggregate functions, see "Computing Average Dates in Spark!" at http://build.acl.com
Why is Spark better than a Database?
It looks a lot like SQL, but:
• Can read/write data in arbitrary formats.
• Can be extended with general purpose program code.
• Can be split across 1000s of compute nodes.
• Can do ML, Streaming, Graph queries.
• Can use cheap storage (such as S3)
But yeah, if you’re happy with your database, that’s OK too.
Queries: PySpark
Very similar API, but written in Python:
$ pyspark
>>> spark.read.csv("data.csv").show(5)
+---+-----+--------+----+----+------+
|_c0| _c1| _c2| _c3| _c4| _c5|
+---+-----+--------+----+----+------+
| 1| Fran| Brown|1982|10.5| Stan|
| 2| Mary| Jones|1976| 9.0| Fido|
| 3| Brad| Pitt|1963|11.0|Barker|
| 4| Jane| Simpson|1988| 8.0| Rex|
| 5|James|Thompson|1980| 9.5| Bif|
+---+-----+--------+----+----+------+
Demo Time… Using EMR
(Cluster diagram.) One Driver node (m4.2xlarge) and four Worker nodes (m4.2xlarge each); queries go in, data comes back. Demo tools: Zeppelin notebooks and the Spark History server.
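For reference, a cluster like the demo's can be created from the AWS CLI roughly as follows (a hedged sketch; the name, release label, and key pair are placeholders):
$ aws emr create-cluster \
    --name "spark-demo" \
    --release-label emr-5.19.0 \
    --applications Name=Spark Name=Zeppelin \
    --instance-type m4.2xlarge \
    --instance-count 5 \
    --ec2-attributes KeyName=my-key-pair \
    --use-default-roles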