SlideShare a Scribd company logo
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 1
Pivoting Data with SparkSQL
Andrew Ray
Senior Data Engineer
Silicon Valley Data Science
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 2
CODE
git.io/vgy34
(github.com/silicon-valley-data-science/spark-pivot-examples)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 3
OUTLINE
• What’s a Pivot?
• Syntax
• Real world examples
• Tips and Tricks
• Implementation
• Future work
git.io/vgy34
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 4
WHAT’S A PIVOT?
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 5
WHAT’S A PIVOT?
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 6
WHAT’S A PIVOT?
Group by A, pivot on B, and sum C
A B C
G X 1
G Y 2
G X 3
H Y 4
H Z 5
A X Y Z
G 4 2
H 4 5
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 7
WHAT’S A PIVOT?
Group by A and B
Pivot on BA B C
G X 1
G Y 2
G X 3
H Y 4
H Z 5
A B C
G X 4
G Y 2
H Y 4
H Z 5
A X Y Z
G 4 2
H 4 5
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 8
WHAT’S A PIVOT?
Pivot on B (w/o agg.)
Group by AA B C
G X 1
G Y 2
G X 3
H Y 4
H Z 5
A X Y Z
G 1
G 2
G 3
H 4
H 5
A X Y Z
G 4 2
H 4 5
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 9
SYNTAX
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 10
SYNTAX
• Dataframe/table with columns A, B, C, and D.
• How to
– group by A and B
– pivot on C (with distinct values “small” and “large”)
– sum of D
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 11
SYNTAX: COMPETITION
• pandas (Python)
– pivot_table(df, values='D', index=['A', 'B'],
columns=['C'], aggfunc=np.sum)
• reshape2 (R)
– dcast(df, A + B ~ C, sum)
• Oracle 11g
– SELECT * FROM df PIVOT (sum(D) FOR C IN
('small', 'large')) p
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 12
SYNTAX: SPARKSQL
• Simple
– df.groupBy("A", "B").pivot("C").sum("D")
• Explicit pivot values
– df.groupBy("A", "B")
.pivot("C", Seq("small", "large"))
.sum("D")
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 13
PIVOT
• Added to DataFrame API in Spark 1.6
– Scala
– Java
– Python
– Not R L
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 14
REAL WORLD EXAMPLES
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 15
EXAMPLE 1: REPORTING
• Retail sales
• TPC-DS dataset
– scale factor 1
• Docker image:
docker run -it svds/spark-pivot-reporting
TPC Benchmark™ DS - Standard Specification, Version 2.1.0 Page 18 of 135
2.2.2.3 The implementation chosen by the test sponsor for a particular datatype definition shall be applied consistently
to all the instances of that datatype definition in the schema, except for identifier columns, whose datatype may
be selected to satisfy database scaling requirements.
2.2.3 NULLs
If a column definition includes an ‘N’ in the NULLs column this column is populated in every row of the table
for all scale factors. If the field is blank this column may contain NULLs.
2.2.4 Foreign Key
If the values in this column join with another column, the foreign columns name is listed in the Foreign Key
field of the column definition.
2.3 Fact Table Definitions
2.3.1 Store Sales (SS)
2.3.1.1 Store Sales ER-Diagram
2.3.1.2 Store Sales Column Definitions
Each row in this table represents a single lineitem for a sale made through the store channel and recorded in the
store_sales fact table.
Table 2-1 Store_sales Column Definitions
Column Datatype NULLs Primary Key Foreign Key
ss_sold_date_sk identifier d_date_sk
ss_sold_time_sk identifier t_time_sk
ss_item_sk (1) identifier N Y i_item_sk
ss_customer_sk identifier c_customer_sk
ss_cdemo_sk identifier cd_demo_sk
ss_hdemo_sk identifier hd_demo_sk
ss_addr_sk identifier ca_address_sk
ss_store_sk identifier s_store_sk
ss_promo_sk identifier p_promo_sk
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 16
SALES BY CATEGORY AND QUARTER
sql("""select *, concat('Q', d_qoy) as qoy
from store_sales
join date_dim on ss_sold_date_sk = d_date_sk
join item on ss_item_sk = i_item_sk""")
.groupBy("i_category")
.pivot("qoy")
.agg(round(sum("ss_sales_price")/1000000,2))
.show
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 17
SALES BY CATEGORY AND QUARTER
+-----------+----+----+----+----+
| i_category| Q1| Q2| Q3| Q4|
+-----------+----+----+----+----+
| Books|1.58|1.50|2.84|4.66|
| Women|1.41|1.36|2.54|4.16|
| Music|1.50|1.44|2.66|4.36|
| Children|1.54|1.46|2.74|4.51|
| Sports|1.47|1.40|2.62|4.30|
| Shoes|1.51|1.48|2.68|4.46|
| Jewelry|1.45|1.39|2.59|4.25|
| null|0.04|0.04|0.07|0.13|
|Electronics|1.56|1.49|2.77|4.57|
| Home|1.57|1.51|2.79|4.60|
| Men|1.60|1.54|2.86|4.71|
+-----------+----+----+----+----+
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 18
EXAMPLE 2: FEATURE GENERATION
• MovieLens 1M Dataset
– ~1M movie ratings
– 6040 users
– 3952 movies
• Predict gender based on ratings
– Using 100 most popular movies
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 19
LOAD RATINGS
val ratings_raw = sc.textFile("Downloads/ml-1m/ratings.dat")
case class Rating(user: Int, movie: Int, rating: Int)
val ratings = ratings_raw.map(_.split("::").map(_.toInt)).map(r => Rating(r(0),r(1),r(2))).toDF
ratings.show
+----+-----+------+
|user|movie|rating|
+----+-----+------+
| 11| 1753| 4|
| 11| 1682| 1|
| 11| 216| 4|
| 11| 2997| 4|
| 11| 1259| 3|
...
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 20
LOAD USERS
val users_raw = sc.textFile("Downloads/ml-1m/users.dat")
case class User(user: Int, gender: String, age: Int)
val users = users_raw.map(_.split("::")).map(u => User(u(0).toInt, u(1), u(2).toInt)).toDF
val sample_users = users.where(expr("gender = 'F' or ( rand() * 5 < 2 )"))
sample_users.groupBy("gender").count().show
+------+-----+
|gender|count|
+------+-----+
| F| 1709|
| M| 1744|
+------+-----+
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 21
PREP DATA
val popular = ratings.groupBy("movie")
.count()
.orderBy($"count".desc)
.limit(100)
.map(_.get(0)).collect
val ratings_pivot = ratings.groupBy("user")
.pivot("movie", popular.toSeq)
.agg(expr("coalesce(first(rating),3)").cast("double"))
ratings_pivot.where($"user" === 11).show
+----+----+---+----+----+---+----+---+----+----+---+...
|user|2858|260|1196|1210|480|2028|589|2571|1270|593|...
+----+----+---+----+----+---+----+---+----+----+---+...
| 11| 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| 3.0| 3.0|5.0|...
+----+----+---+----+----+---+----+---+----+----+---+...
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 22
BUILD MODEL
val data = ratings_pivot.join(sample_users, "user")
.withColumn("label", expr("if(gender = 'M', 1, 0)").cast("double"))
val assembler = new VectorAssembler()
.setInputCols(popular.map(_.toString))
.setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)
val model = pipeline.fit(training)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 23
TEST
val res = model.transform(test).select("label", "prediction")
res.groupBy("label").pivot("prediction", Seq(1.0, 0.0)).count().show
+-----+---+---+
|label|1.0|0.0|
+-----+---+---+
| 1.0|114| 74|
| 0.0| 46|146|
+-----+---+---+
Accuracy 68%
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 24
TIPS AND TRICKS
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 25
USAGE NOTES
• Specify the distinct values of the pivot column
– Otherwise it does this:
val values = df.select(pivotColumn)
.distinct()
.sort(pivotColumn)
.map(_.get(0))
.take(maxValues + 1)
.toSeq
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 26
MULTIPLE AGGREGATIONS
df.groupBy("A", "B").pivot("C").agg(sum("D"), avg("D")).show
+---+---+------------+------------+------------+------------+
| A| B|small_sum(D)|small_avg(D)|large_sum(D)|large_avg(D)|
+---+---+------------+------------+------------+------------+
|foo|two| 6| 3.0| null| null|
|bar|two| 6| 6.0| 7| 7.0|
|foo|one| 1| 1.0| 4| 2.0|
|bar|one| 5| 5.0| 4| 4.0|
+---+---+------------+------------+------------+------------+
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 27
PIVOT MULTIPLE COLUMNS
• Merge columns and pivot as usual
df.withColumn(“p”, concat($”p1”, $”p2”))
.groupBy(“a”, “b”)
.pivot(“p”)
.agg(…)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 28
MAX COLUMNS
• spark.sql.pivotMaxValues
– Default: 10,000
– When doing a pivot without specifying values for the pivot
column this is the maximum number of (distinct) values
that will be collected without error.
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 29
IMPLEMENTATION
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 30
PIVOT IMPLEMENTATION
• pivot is a method of GroupedData and returns
GroupedData with PivotType.
• New logical operator:
o.a.s.sql.catalyst.plans.logical.Pivot
• Analyzer rule:
o.a.s.sql.catalyst.analysis.Analyzer.ResolvePivot
– Currently translates logical pivot into an aggregation with
lots of if statements.
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 31
FUTURE WORK
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 32
FUTURE WORK
• Add to R API
• Add to SQL syntax
• Add support for unpivot
• Faster implementation
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 33
Pivoting Data with SparkSQL
Andrew Ray
andrew@svds.com
We’re hiring!
svds.com/careers
THANK YOU.
git.io/vgy34

More Related Content

What's hot (20)

PPTX
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드
문기 박
 
PDF
In the Wake of Kerberoast
ken_kitahara
 
PDF
On-boarding with JanusGraph Performance
Chin Huang
 
PDF
MongoDB performance
Mydbops
 
PDF
Delta Lake: Optimizing Merge
Databricks
 
PDF
Data pipelines from zero to solid
Lars Albertsson
 
PDF
分布式Key Value Store漫谈
Tim Y
 
PDF
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
PDF
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
PDF
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Chris Fregly
 
PDF
Re-Engineering PostgreSQL as a Time-Series Database
All Things Open
 
PPTX
Optimizing Apache Spark SQL Joins
Databricks
 
PDF
Improving Apache Spark Downscaling
Databricks
 
PPTX
Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva...
Monica Beckwith
 
PDF
Optimizing MariaDB for maximum performance
MariaDB plc
 
PDF
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Carol McDonald
 
PPTX
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
DataStax
 
PPTX
Garbage First Garbage Collector (G1 GC): Current and Future Adaptability and ...
Monica Beckwith
 
PDF
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
PDF
The Apache Spark File Format Ecosystem
Databricks
 
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드
문기 박
 
In the Wake of Kerberoast
ken_kitahara
 
On-boarding with JanusGraph Performance
Chin Huang
 
MongoDB performance
Mydbops
 
Delta Lake: Optimizing Merge
Databricks
 
Data pipelines from zero to solid
Lars Albertsson
 
分布式Key Value Store漫谈
Tim Y
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Chris Fregly
 
Re-Engineering PostgreSQL as a Time-Series Database
All Things Open
 
Optimizing Apache Spark SQL Joins
Databricks
 
Improving Apache Spark Downscaling
Databricks
 
Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva...
Monica Beckwith
 
Optimizing MariaDB for maximum performance
MariaDB plc
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Carol McDonald
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
DataStax
 
Garbage First Garbage Collector (G1 GC): Current and Future Adaptability and ...
Monica Beckwith
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
The Apache Spark File Format Ecosystem
Databricks
 

Similar to Pivoting Data with SparkSQL by Andrew Ray (20)

PDF
Pivoting Wide Tables with Spark
Henning Kropp
 
PDF
Spark The Definitive Guide Big Data Processing Made Simple Bill Chambers
shalikstenmo
 
PDF
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
PDF
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
PDF
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Allice Shandler
 
PDF
Data Wrangling with PySpark for Data Scientists Who Know Pandas with Andrew Ray
Databricks
 
PDF
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
PDF
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
PPTX
Beyond shuffling global big data tech conference 2015 sj
Holden Karau
 
PPTX
Working with Graphs _python.pptx
MrPrathapG
 
PDF
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Databricks
 
PDF
Data Exploration with Apache Drill: Day 1
Charles Givre
 
PDF
The International Journal of Engineering and Science (The IJES)
theijes
 
PDF
_Super_Study_Guide__Data_Science_Tools_1620233377.pdf
nielitjanarthanam
 
PPTX
Python Programming.pptx
SudhakarVenkey
 
PDF
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
PDF
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
PDF
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
PDF
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Holden Karau
 
PDF
Spark Dataframe - Mr. Jyotiska
Sigmoid
 
Pivoting Wide Tables with Spark
Henning Kropp
 
Spark The Definitive Guide Big Data Processing Made Simple Bill Chambers
shalikstenmo
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Allice Shandler
 
Data Wrangling with PySpark for Data Scientists Who Know Pandas with Andrew Ray
Databricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Beyond shuffling global big data tech conference 2015 sj
Holden Karau
 
Working with Graphs _python.pptx
MrPrathapG
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Databricks
 
Data Exploration with Apache Drill: Day 1
Charles Givre
 
The International Journal of Engineering and Science (The IJES)
theijes
 
_Super_Study_Guide__Data_Science_Tools_1620233377.pdf
nielitjanarthanam
 
Python Programming.pptx
SudhakarVenkey
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Holden Karau
 
Spark Dataframe - Mr. Jyotiska
Sigmoid
 
Ad

More from Spark Summit (20)

PDF
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
PDF
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
PDF
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
PDF
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
PDF
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
PDF
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
PDF
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
PDF
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
PDF
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
PDF
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
PDF
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
PDF
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
PDF
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
PDF
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
PDF
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
PDF
Goal Based Data Production with Sim Simeonov
Spark Summit
 
PDF
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
PDF
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
PDF
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
PDF
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Ad

Recently uploaded (20)

PPTX
recruitment Presentation.pptxhdhshhshshhehh
devraj40467
 
PDF
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
PPTX
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PPT
01 presentation finyyyal معهد معايره.ppt
eltohamym057
 
PPTX
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
PDF
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
PDF
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
PPTX
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
PDF
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
PPT
deep dive data management sharepoint apps.ppt
novaprofk
 
DOC
MATRIX_AMAN IRAWAN_20227479046.docbbbnnb
vanitafiani1
 
PPTX
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
PPTX
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
PDF
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
PDF
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
PPTX
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
PDF
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PDF
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 
recruitment Presentation.pptxhdhshhshshhehh
devraj40467
 
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
01 presentation finyyyal معهد معايره.ppt
eltohamym057
 
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
deep dive data management sharepoint apps.ppt
novaprofk
 
MATRIX_AMAN IRAWAN_20227479046.docbbbnnb
vanitafiani1
 
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 

Pivoting Data with SparkSQL by Andrew Ray

  • 1. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 1 Pivoting Data with SparkSQL Andrew Ray Senior Data Engineer Silicon Valley Data Science
  • 2. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 2 CODE git.io/vgy34 (github.com/silicon-valley-data-science/spark-pivot-examples)
  • 3. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 3 OUTLINE • What’s a Pivot? • Syntax • Real world examples • Tips and Tricks • Implementation • Future work git.io/vgy34
  • 4. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 4 WHAT’S A PIVOT?
  • 5. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 5 WHAT’S A PIVOT?
  • 6. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 6 WHAT’S A PIVOT? Group by A, pivot on B, and sum C A B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A X Y Z G 4 2 H 4 5
  • 7. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 7 WHAT’S A PIVOT? Group by A and B Pivot on BA B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A B C G X 4 G Y 2 H Y 4 H Z 5 A X Y Z G 4 2 H 4 5
  • 8. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 8 WHAT’S A PIVOT? Pivot on B (w/o agg.) Group by AA B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A X Y Z G 1 G 2 G 3 H 4 H 5 A X Y Z G 4 2 H 4 5
  • 9. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 9 SYNTAX
  • 10. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 10 SYNTAX • Dataframe/table with columns A, B, C, and D. • How to – group by A and B – pivot on C (with distinct values “small” and “large”) – sum of D
  • 11. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 11 SYNTAX: COMPETITION • pandas (Python) – pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum) • reshape2 (R) – dcast(df, A + B ~ C, sum) • Oracle 11g – SELECT * FROM df PIVOT (sum(D) FOR C IN ('small', 'large')) p
  • 12. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 12 SYNTAX: SPARKSQL • Simple – df.groupBy("A", "B").pivot("C").sum("D") • Explicit pivot values – df.groupBy("A", "B") .pivot("C", Seq("small", "large")) .sum("D")
  • 13. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 13 PIVOT • Added to DataFrame API in Spark 1.6 – Scala – Java – Python – Not R L
  • 14. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 14 REAL WORLD EXAMPLES
  • 15. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 15 EXAMPLE 1: REPORTING • Retail sales • TPC-DS dataset – scale factor 1 • Docker image: docker run -it svds/spark-pivot-reporting TPC Benchmark™ DS - Standard Specification, Version 2.1.0 Page 18 of 135 2.2.2.3 The implementation chosen by the test sponsor for a particular datatype definition shall be applied consistently to all the instances of that datatype definition in the schema, except for identifier columns, whose datatype may be selected to satisfy database scaling requirements. 2.2.3 NULLs If a column definition includes an ‘N’ in the NULLs column this column is populated in every row of the table for all scale factors. If the field is blank this column may contain NULLs. 2.2.4 Foreign Key If the values in this column join with another column, the foreign columns name is listed in the Foreign Key field of the column definition. 2.3 Fact Table Definitions 2.3.1 Store Sales (SS) 2.3.1.1 Store Sales ER-Diagram 2.3.1.2 Store Sales Column Definitions Each row in this table represents a single lineitem for a sale made through the store channel and recorded in the store_sales fact table. Table 2-1 Store_sales Column Definitions Column Datatype NULLs Primary Key Foreign Key ss_sold_date_sk identifier d_date_sk ss_sold_time_sk identifier t_time_sk ss_item_sk (1) identifier N Y i_item_sk ss_customer_sk identifier c_customer_sk ss_cdemo_sk identifier cd_demo_sk ss_hdemo_sk identifier hd_demo_sk ss_addr_sk identifier ca_address_sk ss_store_sk identifier s_store_sk ss_promo_sk identifier p_promo_sk
  • 16. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 16 SALES BY CATEGORY AND QUARTER sql("""select *, concat('Q', d_qoy) as qoy from store_sales join date_dim on ss_sold_date_sk = d_date_sk join item on ss_item_sk = i_item_sk""") .groupBy("i_category") .pivot("qoy") .agg(round(sum("ss_sales_price")/1000000,2)) .show
  • 17. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 17 SALES BY CATEGORY AND QUARTER +-----------+----+----+----+----+ | i_category| Q1| Q2| Q3| Q4| +-----------+----+----+----+----+ | Books|1.58|1.50|2.84|4.66| | Women|1.41|1.36|2.54|4.16| | Music|1.50|1.44|2.66|4.36| | Children|1.54|1.46|2.74|4.51| | Sports|1.47|1.40|2.62|4.30| | Shoes|1.51|1.48|2.68|4.46| | Jewelry|1.45|1.39|2.59|4.25| | null|0.04|0.04|0.07|0.13| |Electronics|1.56|1.49|2.77|4.57| | Home|1.57|1.51|2.79|4.60| | Men|1.60|1.54|2.86|4.71| +-----------+----+----+----+----+
  • 18. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 18 EXAMPLE 2: FEATURE GENERATION • MovieLens 1M Dataset – ~1M movie ratings – 6040 users – 3952 movies • Predict gender based on ratings – Using 100 most popular movies
  • 19. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 19 LOAD RATINGS val ratings_raw = sc.textFile("Downloads/ml-1m/ratings.dat") case class Rating(user: Int, movie: Int, rating: Int) val ratings = ratings_raw.map(_.split("::").map(_.toInt)).map(r => Rating(r(0),r(1),r(2))).toDF ratings.show +----+-----+------+ |user|movie|rating| +----+-----+------+ | 11| 1753| 4| | 11| 1682| 1| | 11| 216| 4| | 11| 2997| 4| | 11| 1259| 3| ...
  • 20. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 20 LOAD USERS val users_raw = sc.textFile("Downloads/ml-1m/users.dat") case class User(user: Int, gender: String, age: Int) val users = users_raw.map(_.split("::")).map(u => User(u(0).toInt, u(1), u(2).toInt)).toDF val sample_users = users.where(expr("gender = 'F' or ( rand() * 5 < 2 )")) sample_users.groupBy("gender").count().show +------+-----+ |gender|count| +------+-----+ | F| 1709| | M| 1744| +------+-----+
  • 21. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 21 PREP DATA val popular = ratings.groupBy("movie") .count() .orderBy($"count".desc) .limit(100) .map(_.get(0)).collect val ratings_pivot = ratings.groupBy("user") .pivot("movie", popular.toSeq) .agg(expr("coalesce(first(rating),3)").cast("double")) ratings_pivot.where($"user" === 11).show +----+----+---+----+----+---+----+---+----+----+---+... |user|2858|260|1196|1210|480|2028|589|2571|1270|593|... +----+----+---+----+----+---+----+---+----+----+---+... | 11| 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| 3.0| 3.0|5.0|... +----+----+---+----+----+---+----+---+----+----+---+...
  • 22. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 22 BUILD MODEL val data = ratings_pivot.join(sample_users, "user") .withColumn("label", expr("if(gender = 'M', 1, 0)").cast("double")) val assembler = new VectorAssembler() .setInputCols(popular.map(_.toString)) .setOutputCol("features") val lr = new LogisticRegression() val pipeline = new Pipeline().setStages(Array(assembler, lr)) val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345) val model = pipeline.fit(training)
  • 23. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 23 TEST val res = model.transform(test).select("label", "prediction") res.groupBy("label").pivot("prediction", Seq(1.0, 0.0)).count().show +-----+---+---+ |label|1.0|0.0| +-----+---+---+ | 1.0|114| 74| | 0.0| 46|146| +-----+---+---+ Accuracy 68%
  • 24. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 24 TIPS AND TRICKS
  • 25. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 25 USAGE NOTES • Specify the distinct values of the pivot column – Otherwise it does this: val values = df.select(pivotColumn) .distinct() .sort(pivotColumn) .map(_.get(0)) .take(maxValues + 1) .toSeq
  • 26. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 26 MULTIPLE AGGREGATIONS df.groupBy("A", "B").pivot("C").agg(sum("D"), avg("D")).show +---+---+------------+------------+------------+------------+ | A| B|small_sum(D)|small_avg(D)|large_sum(D)|large_avg(D)| +---+---+------------+------------+------------+------------+ |foo|two| 6| 3.0| null| null| |bar|two| 6| 6.0| 7| 7.0| |foo|one| 1| 1.0| 4| 2.0| |bar|one| 5| 5.0| 4| 4.0| +---+---+------------+------------+------------+------------+
  • 27. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 27 PIVOT MULTIPLE COLUMNS • Merge columns and pivot as usual df.withColumn(“p”, concat($”p1”, $”p2”)) .groupBy(“a”, “b”) .pivot(“p”) .agg(…)
  • 28. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 28 MAX COLUMNS • spark.sql.pivotMaxValues – Default: 10,000 – When doing a pivot without specifying values for the pivot column this is the maximum number of (distinct) values that will be collected without error.
  • 29. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 29 IMPLEMENTATION
  • 30. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 30 PIVOT IMPLEMENTATION • pivot is a method of GroupedData and returns GroupedData with PivotType. • New logical operator: o.a.s.sql.catalyst.plans.logical.Pivot • Analyzer rule: o.a.s.sql.catalyst.analysis.Analyzer.ResolvePivot – Currently translates logical pivot into an aggregation with lots of if statements.
  • 31. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 31 FUTURE WORK
  • 32. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 32 FUTURE WORK • Add to R API • Add to SQL syntax • Add support for unpivot • Faster implementation
  • 33. © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 33 Pivoting Data with SparkSQL Andrew Ray [email protected] We’re hiring! svds.com/careers THANK YOU. git.io/vgy34