Introduction to Spark SQL 1.3.0
Optimization
Performance improvements
Main objects
● https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.package
spark-shell
● Besides sc, the shell also starts a SQL context
Spark context available as sc.
15/03/22 02:09:11 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
JAR
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
DF from RDD
● First load the data as an RDD
scala> val data = sc.textFile("hdfs://localhost:54310/user/hadoop/ml-100k/u.data")
● Define a case class
case class Rattings(userId: Int, itemID: Int, rating: Int, timestmap:String)
● Convert to a DataFrame
scala> val ratting = data.map(_.split("\t")).map(p => Rattings(p(0).trim.toInt,
p(1).trim.toInt, p(2).trim.toInt, p(3))).toDF()
ratting: org.apache.spark.sql.DataFrame = [userId: int, itemID: int, rating: int, timestmap:
string]
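The parsing step can be tried without a cluster. The following is a minimal plain-Scala sketch (no Spark), assuming each u.data line holds four tab-separated fields; `RatingParsing`, `Rating`, and `parseRating` are hypothetical names that do not appear in the slides.

```scala
// Hypothetical helper, not part of the original slides: parses one u.data line.
object RatingParsing {
  // Mirrors the fields of the slide's case class, with conventional spelling.
  final case class Rating(userId: Int, itemId: Int, rating: Int, timestamp: String)

  // Split on the tab character; malformed lines yield None instead of throwing.
  def parseRating(line: String): Option[Rating] =
    line.split("\t") match {
      case Array(u, i, r, ts) =>
        scala.util.Try(Rating(u.trim.toInt, i.trim.toInt, r.trim.toInt, ts.trim)).toOption
      case _ => None
    }
}
```

Returning an Option is one possible design choice: a job could then skip bad lines with flatMap rather than fail on the first malformed record.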
DF from json
● Format
{"movieID":242,"name":"test1"}
{"movieID":307,"name":"test2"}
● Load it directly
scala> val movie = sqlContext.jsonFile("hdfs://localhost:54310/user/hadoop/ml-100k/movies.json")
Dataframe Operations
● show()
userId itemID rating timestmap
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
253 465 5 891628467
● head(5)
res11: Array[org.apache.spark.sql.Row] = Array([196,242,3,881250949],
[186,302,3,891717742], [22,377,1,878887116], [244,51,2,880606923],
[166,346,1,886397596])
● printSchema() ← extremely handy
scala> ratting.printSchema()
root
|-- userId: integer (nullable = false)
|-- itemID: integer (nullable = false)
|-- rating: integer (nullable = false)
|-- timestmap: string (nullable = true)
Select
● Select Column
scala> ratting.select("userId").show()
● Select with a condition (returns a Boolean column, not filtered rows)
scala> ratting.select(ratting("itemID")>100).show()
(itemID > 100)
true
true
true
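In plain collection terms, selecting a condition behaves like a map to Booleans, while a filter keeps only the matching rows. The sketch below illustrates the difference with ordinary Scala sequences rather than DataFrames; `SelectVsFilter` and its fields are hypothetical names, and the item IDs are the five shown in the show() output earlier.

```scala
// Plain-Scala analogy (no Spark): a condition used as a projection versus as a filter.
object SelectVsFilter {
  // Item IDs taken from the first five rows of the show() output.
  val itemIds: Seq[Int] = Seq(242, 302, 377, 51, 465)

  // Like select(cond): one Boolean per input row, nothing is dropped.
  val asSelect: Seq[Boolean] = itemIds.map(_ > 100)

  // Like filter(cond): only the rows that satisfy the condition remain.
  val asFilter: Seq[Int] = itemIds.filter(_ > 100)
}
```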
filter
● Filter by condition
scala> ratting.filter(ratting("rating")>3).show()
userId itemID rating timestmap
298 474 4 884182806
253 465 5 891628467
286 1014 5 879781125
200 222 5 876042340
122 387 5 879270459
291 1042 4 874834944
119 392 4
● Shorthand using a Scala symbol
ratting.filter('rating>3).show()
● Combined usage
scala> ratting.filter(ratting("rating")>3).select("userID","itemID").show()
userID itemID
298 474
286 1014
● This also works
ratting.filter('userID>500).select(avg('rating),max('rating),sum('rating)).show()
GROUP BY
● count()
scala> ratting.groupBy("userId").count().show()
userId count
831 73
631 20
● agg()
scala> ratting.groupBy("userId").agg("rating"->"avg","userID" -> "count").show()
● Calls can be chained
scala> ratting.groupBy("userId").count().sort("count","userID").show()
GROUP BY
● Other aggregates
o avg
o max
o min
o mean
o sum
● More functions
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
● Everything combined
ratting.groupBy('userID).agg(('userID),avg('rating),max('rating),sum('rating),count('
userID AVG('rating) MAX('rating) SUM('rating) COUNT('rating)
831 3.5205479452054793 5 257 73
631 3.1 4 62 20
31 3.9166666666666665 5 141 36
431 3.380952380952381 5 71 21
231 3.6666666666666665 5 77 21
832 2.96 5 74 25
632 3.6610169491525424 5 432 118
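To make explicit what the per-user columns in the output above mean, here is a plain-Scala collections sketch (no Spark) that computes the same four aggregates per user. `GroupedStats`, `Rating`, `Stats`, and `statsByUser` are hypothetical names introduced only for illustration.

```scala
// Plain-Scala analogy (no Spark) for groupBy followed by avg, max, sum and count.
object GroupedStats {
  final case class Rating(userId: Int, itemId: Int, rating: Int)
  final case class Stats(avg: Double, max: Int, sum: Int, count: Int)

  // Group rows by user, then reduce each group's ratings to the four aggregates.
  def statsByUser(rows: Seq[Rating]): Map[Int, Stats] =
    rows.groupBy(_.userId).map { case (user, rs) =>
      val values = rs.map(_.rating)
      user -> Stats(values.sum.toDouble / values.size, values.max, values.sum, values.size)
    }
}
```

With three made-up rows such as `Rating(1, 10, 4)`, `Rating(1, 11, 2)` and `Rating(2, 10, 5)`, user 1 gets an average of 3.0 over two ratings and user 2 an average of 5.0 over one.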
UnionAll
● Combine tables that have the same columns
scala> val ratting1_3 = ratting.filter(ratting("rating")<=3)
scala> ratting1_3.count() //res79: Long = 44625
scala> val ratting4_5 = ratting.filter(ratting("rating")>3)
scala> ratting4_5.count() //res80: Long = 55375
ratting1_3.unionAll(ratting4_5).count() //res81: Long = 100000
● Tables with different columns cannot be unioned
scala> ratting1_3.unionAll(test).count()
java.lang.AssertionError: assertion failed
JOIN
● Basic syntax
scala> ratting.join(movie, $"itemID" === $"movieID", "inner").show()
userId itemID rating timestmap movieID name
196 242 3 881250949 242 test1
63 242 3 875747190 242 test1
● Supported join types: inner, outer, left_outer, right_outer, semijoin.
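The inner join semantics can be sketched with plain Scala collections (no Spark): every rating is paired with every movie whose ID matches, and ratings without a matching movie are dropped. `InnerJoinSketch`, `Rating`, `Movie`, and `innerJoin` are hypothetical names; a real DataFrame join is planned and executed by Spark, not as a nested loop like this.

```scala
// Plain-Scala analogy (no Spark) for an inner join on itemID === movieID.
object InnerJoinSketch {
  final case class Rating(userId: Int, itemId: Int, rating: Int)
  final case class Movie(movieId: Int, name: String)

  // Keep only the (rating, movie) pairs whose keys are equal.
  def innerJoin(ratings: Seq[Rating], movies: Seq[Movie]): Seq[(Rating, Movie)] =
    for {
      r <- ratings
      m <- movies
      if r.itemId == m.movieId
    } yield (r, m)
}
```

Feeding it the ratings for users 196, 186 and 63 together with the two movies from the JSON sample keeps only the rows for item 242, which matches the join output shown on the slide.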
A DataFrame can also be registered as a table
● Register
scala> ratting.registerTempTable("ratting_table")
● Write SQL
scala> sqlContext.sql("SELECT userID FROM ratting_table").show()
DataFrames support RDD operations
● MAP
scala> result.map(t => "user:" + t(0)).collect().foreach(println)
● Extracted values have type Any
scala> ratting.map(t => t(2)).take(5)
● Convert to String first, then to Int
scala> ratting.map(t => Array(t(0),t(2).toString.toInt * 10)).take(5)
res130: Array[Array[Any]] = Array(Array(196, 30), Array(186, 30), Array(22, 10),
Array(244, 20), Array(166, 10))
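A String round trip is not the only way to recover an Int from a value typed as Any. The sketch below uses a plain `Seq[Any]` as a stand-in for a row (it is not Spark's Row class) and a pattern match to get the Int back; `AnyValues`, `asInt`, and `scaled` are hypothetical names.

```scala
// Plain-Scala sketch (no Spark): recovering an Int from a value typed as Any.
object AnyValues {
  // Stand-in for a row: the elements are statically typed as Any.
  val row: Seq[Any] = Seq(196, 242, 3, "881250949")

  // A type pattern recovers the Int; anything else yields None.
  def asInt(value: Any): Option[Int] = value match {
    case i: Int => Some(i)
    case _      => None
  }

  // Same arithmetic as the slide example: the rating multiplied by 10.
  val scaled: Option[Int] = asInt(row(2)).map(_ * 10)
}
```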
SAVE DATA
● save()
ratting.select("itemID").save("hdfs://localhost:54310/test2.json","json")
● saveAsParquetFile
● saveAsTable(Hive Table)
DataType
● Numeric types
● String type
● Binary type
● Boolean type
● Datetime type
o TimestampType: Represents values comprising values of fields year, month, day, hour, minute,
and second.
o DateType: Represents values comprising values of fields year, month, day.
● Complex types
Future directions
