Introduction to
Spark SQL
Bryan 2015
• Experience
Vpon Data Engineer
TWM, Keywear, Nielsen
• Bryan’s notes for data analysis
http://bryannotes.blogspot.tw
• Spark.TW
• LinkedIn
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
ABOUT ME
Agenda
• DataFrame
• Basics of sqlContext
• Welcome hiveContext
Spark Sql for Training
Optimization
Performance improvement
SqlContext
Main objects
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.package
spark-shell
• Besides sc, the shell also starts a SQL context
• Spark context available as sc.
• 15/03/22 02:09:11 INFO SparkILoop: Created sql context (with Hive support)..
• SQL context available as sqlContext.
DF from RDD
• First load the file as an RDD
scala> val data = sc.textFile("hdfs://localhost:54310/user/hadoop/ml-100k/u.data")
• Define a case class
case class Rattings(userId: Int, itemID: Int, rating: Int, timestmap:String)
• Convert to a DataFrame (the fields are tab-separated)
scala> val ratting = data.map(_.split("\t")).map(p => Rattings(p(0).trim.toInt, p(1).trim.toInt, p(2).trim.toInt, p(3))).toDF()
ratting: org.apache.spark.sql.DataFrame = [userId: int, itemID: int, rating: int, timestmap: string]
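The parsing step is where a mistyped delimiter goes unnoticed, so here is a small Spark-free sketch in plain Python of what splitting each line on the tab character and building a typed record does. The `Rating` tuple and `parse_line` helper are hypothetical names for illustration only; the two sample lines are taken from the show() output later in this deck.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """Plain-Python stand-in for the Scala case class (hypothetical name)."""
    user_id: int
    item_id: int
    rating: int
    timestamp: str


def parse_line(line: str) -> Rating:
    # Each line holds four tab-separated fields: userId, itemID, rating, timestamp.
    fields = line.split("\t")
    return Rating(
        int(fields[0].strip()),
        int(fields[1].strip()),
        int(fields[2].strip()),
        fields[3],  # the timestamp stays a string, as in the case class
    )


lines = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
ratings = [parse_line(line) for line in lines]
print(ratings[0])
```

Splitting on the letter "t" instead of the tab character would leave each line as a single field, and the integer conversion would fail on the first row.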
DF from json
• Format
{"movieID":242,"name":"test1"}
{"movieID":307,"name":"test2"}
• Load it directly
scala> val movie = sqlContext.jsonFile("hdfs://localhost:54310/user/hadoop/ml-100k/movies.json")
DataFrame Operations
• show()
userId itemID rating timestmap
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
253 465 5 891628467
• head(5)
res11: Array[org.apache.spark.sql.Row] =
Array([196,242,3,881250949], [186,302,3,891717742],
[22,377,1,878887116], [244,51,2,880606923],
[166,346,1,886397596])
printSchema()
• printSchema()
scala> ratting.printSchema()
root
|-- userId: integer (nullable = false)
|-- itemID: integer (nullable = false)
|-- rating: integer (nullable = false)
|-- timestmap: string (nullable = true)
Select
• Select Column
scala> ratting.select("userId").show()
• Condition Select
scala> ratting.select(ratting("itemID")>100).show()
(itemID > 100)
true
true
true
filter
• Filter by condition
scala> ratting.filter(ratting("rating")>3).show()
userId itemID rating timestmap
298 474 4 884182806
253 465 5 891628467
286 1014 5 879781125
200 222 5 876042340
122 387 5 879270459
291 1042 4 874834944
119 392 4
• Shorthand: pass the whole condition as a string
ratting.filter("rating > 3").show()
• Combine with select
scala> ratting.filter(ratting("rating")>3).select("userID","itemID").show()
userID itemID
298 474
286 1014
• Aggregates work too (avg, max and sum come from org.apache.spark.sql.functions)
ratting.filter("userID > 500").select(avg("rating"),max("rating"),sum("rating")).show()
GROUP BY
• count()
scala> ratting.groupBy("userId").count().show()
userId count
831 73
631 20
• agg()
scala> ratting.groupBy("userId").agg("rating"->"avg","userID" -> "count").show()
• Calls can be chained
scala> ratting.groupBy("userId").count().sort("count","userID").show()
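To make the grouping semantics concrete without a cluster, the following plain-Python sketch mimics what groupBy with agg and a chained sort compute. It is only a model of the behavior, not Spark code; the sample rows are made up for illustration, and `summary` and `by_count` are hypothetical names.

```python
from collections import defaultdict

# (userId, itemID, rating) rows; made-up sample data for illustration.
rows = [
    (196, 242, 3), (196, 302, 5),
    (186, 302, 3),
    (22, 377, 1), (22, 51, 2), (22, 465, 4),
]

# Collect the ratings of each user, like groupBy("userId").
groups = defaultdict(list)
for user_id, _item_id, rating in rows:
    groups[user_id].append(rating)

# Like agg("rating" -> "avg", "userID" -> "count"): one (avg, count) pair per user.
summary = {user: (sum(rs) / len(rs), len(rs)) for user, rs in groups.items()}

# Like count().sort("count", "userID"): ascending by count, then by user id.
by_count = sorted((len(rs), user) for user, rs in groups.items())

print(summary)
print(by_count)
```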
GROUP BY
Other aggregates
avg
max
min
mean
sum
More functions
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
UnionAll
• Combine tables that have the same columns
scala> val ratting1_3 = ratting.filter(ratting("rating")<=3)
scala> ratting1_3.count() //res79: Long = 44625
scala> val ratting4_5 = ratting.filter(ratting("rating")>3)
scala> ratting4_5.count() //res80: Long = 55375
scala> ratting1_3.unionAll(ratting4_5).count() //res81: Long = 100000
• Tables with different columns cannot be unioned
scala> ratting1_3.unionAll(test).count()
java.lang.AssertionError: assertion failed
JOIN
• Basic syntax
scala> ratting.join(movie, $"itemID" === $"movieID", "inner").show()
userId itemID rating timestmap movieID name
196 242 3 881250949 242 test1
63 242 3 875747190 242 test1
• Supported join types:
inner, outer, left_outer, right_outer, semijoin.
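The difference between the join types is easiest to see on a handful of rows. The sketch below models inner, left outer, and left semi joins in plain Python; it is an illustration of the semantics, not Spark code. The rows reuse values shown elsewhere in this deck, and the dictionary lookup assumes each movieID appears only once on the movie side.

```python
# (userId, itemID, rating) rows and a movieID -> name lookup.
ratings = [(196, 242, 3), (63, 242, 3), (186, 302, 3)]
movies = {242: "test1", 307: "test2"}

# inner: keep only rating rows whose itemID has a matching movieID.
inner = [(u, i, r, i, movies[i]) for u, i, r in ratings if i in movies]

# left_outer: keep every rating row; unmatched movie columns become None.
left_outer = [
    (u, i, r, i if i in movies else None, movies.get(i))
    for u, i, r in ratings
]

# left semi: keep matching rating rows, but only the left-hand columns.
left_semi = [(u, i, r) for u, i, r in ratings if i in movies]

print(inner)
print(left_outer)
print(left_semi)
```

Item 302 has no entry in the movie table, so it disappears from the inner and semi joins and appears with empty movie columns in the left outer join.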
A DataFrame can also be registered as a table
• Register
scala> ratting.registerTempTable("ratting_table")
• Write SQL
scala> sqlContext.sql("SELECT userID FROM ratting_table").show()
DataFrames support RDD operations
• map
scala> result.map(t => "user:" + t(0)).collect().foreach(println)
• Values read from a Row have type Any
scala> ratting.map(t => t(2)).take(5)
• Convert to String first, then to Int
scala> ratting.map(t => Array(t(0),t(2).toString.toInt * 10)).take(5)
res130: Array[Array[Any]] = Array(Array(196, 30), Array(186, 30), Array(22, 10), Array(244, 20), Array(166, 10))
SAVE DATA
• save()
ratting.select("itemID").save("hdfs://localhost:54310/test2.json","json")
• saveAsParquetFile
• saveAsTable(Hive Table)
Hive Context
http://hortonworks.com/partner/zementis/
http://www.slideshare.net/Hadoop_Summit/empower-hive-with-spark
HiveContext
• From 1.4.0 on, sqlContext is a hiveContext
• It inherits everything sqlContext can do and adds the connection to Hive
Hive setting
• copy hive-site.xml to $SPARK_HOME/conf
Write SQL
• sqlContext.sql("""
select * from ratings
""").show()
• sqlContext.sql("""
select item, avg(rating)
from ratings
group by item
""")
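The shape of the second query can be tried locally without a cluster. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for sqlContext.sql; it is not Spark and says nothing about Hive behavior. The table layout and the sample ratings are invented for illustration, and an `order by` is added only to make the printed result deterministic.

```python
import sqlite3

# In-memory database standing in for the Hive table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (userId INTEGER, item INTEGER, rating INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(196, 242, 3), (63, 242, 5), (186, 302, 3)],  # made-up sample rows
)

# Same query as on the slide, plus an ORDER BY for stable output.
rows = conn.execute("""
    select item, avg(rating)
    from ratings
    group by item
    order by item
""").fetchall()

print(rows)
```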
Mixed expression
• df = sqlContext.sql("select * from ratings")
• df.filter("rating < 5").groupBy("item").count().show()
User Defined Function
• from pyspark.sql.functions import udf
• from pyspark.sql.types import *
• sqlContext.registerFunction("hash", lambda x: hash(x), LongType())
• sqlContext.sql("select hash(item) from ratings")
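The idea behind registering a function, exposing a Python callable under a name that SQL text can call, can be sketched locally as well. Python's built-in sqlite3 module is again used only as a stand-in for sqlContext, and its `create_function` is an analogue of `registerFunction`, not the same API. The table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (item INTEGER, rating INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?)", [(242, 3), (302, 5)])

# Register a Python callable under a SQL-visible name; 1 is the argument count.
conn.create_function("hash", 1, lambda x: hash(x))

# The registered name can now be used inside the SQL string.
hashed = conn.execute("select hash(item) from ratings order by item").fetchall()

print(hashed)
```

As in the slide, the SQL engine calls back into Python once per row, which is convenient but usually slower than built-in functions.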
DataType
Numeric types
String type
Binary type
Boolean type
Datetime type
TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second.
DateType: Represents values comprising values of fields year, month, day.
Complex types
Future directions
