Higher Order Functions
Herman van Hövell @westerflyer
2018-10-03, London Spark Summit EU 2018
About Me
- Software Engineer @Databricks
Amsterdam office
- Apache Spark Committer and
PMC member
- In a previous life: Data Engineer &
Data Analyst
Complex Data
Complex data types in Spark SQL
- Struct. For example: struct(a: Int, b: String)
- Array. For example: array(a: Int)
- Map. For example: map(key: String, value: Int)
These provide primitives to build tree-based data models (a short Scala sketch follows below)
- High expressiveness. Often alleviates the need for ‘flat-earth’ multi-table designs.
- More natural data models that better mirror reality
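A minimal Scala sketch, assuming a Spark shell with a SparkSession named spark, that builds one column of each complex type (the column names are illustrative):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(2).select(
  struct($"id".cast("int").as("a"), lit("x").as("b")).as("s"), // struct(a: Int, b: String)
  array(lit(1), lit(2), lit(3)).as("arr"),                     // array(Int)
  map(lit("key"), lit(1)).as("m")                              // map(String, Int)
)
df.printSchema()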
Complex Data - Tweet JSON
{
  "created_at": "Wed Oct 03 11:41:57 +0000 2018",
  "id_str": "994633657141813248",
  "text": "Looky nested data #spark #sseu",
  "display_text_range": [0, 140],
  "user": {
    "id_str": "12343453",
    "screen_name": "Westerflyer"
  },
  "extended_tweet": {
    "full_text": "Looky nested data #spark #sseu",
    "display_text_range": [0, 249],
    "entities": {
      "hashtags": [{
        "text": "spark",
        "indices": [211, 225]
      }, {
        "text": "sseu",
        "indices": [239, 249]
      }]
    }
  }
}
adapted from: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html
root
|-- created_at: string (nullable = true)
|-- id_str: string (nullable = true)
|-- text: string (nullable = true)
|-- user: struct (nullable = true)
| |-- id_str: string (nullable = true)
| |-- screen_name: string (nullable = true)
|-- display_text_range: array (nullable = true)
| |-- element: long (containsNull = true)
|-- extended_tweet: struct (nullable = true)
| |-- full_text: string (nullable = true)
| |-- display_text_range: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- entities: struct (nullable = true)
| | |-- hashtags: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- indices: array (nullable = true)
| | | | | |-- element: long (containsNull = true)
| | | | |-- text: string (nullable = true)
Manipulating Complex Data
Structs are easy :)
Maps/Arrays not so much...
- Easy to read single values/retrieve keys (see the sketch below)
- Hard to transform or summarize
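Reading single values really is easy. A sketch against the tweet schema above, assuming a tweets DataFrame with that schema and spark.implicits._ imported:

tweets.select($"user.screen_name")                 // struct field via dot notation
tweets.select($"display_text_range".getItem(0))    // one array element by index
tweets.select($"extended_tweet.entities.hashtags") // a nested struct/array path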
Transforming an Array
Let’s say we want to add 1 to every element of the vals field of
every row in an input table.
Input:             Output:
Id | Vals          Id | Vals
1  | [1, 2, 3]     1  | [2, 3, 4]
2  | [4, 5, 6]     2  | [5, 6, 7]
How would we do this?
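Before walking through the options, a minimal setup sketch (assuming a Spark shell with a SparkSession named spark) that creates the hypothetical input_tbl used in the examples:

import spark.implicits._

// Two rows: an id and an array of ints.
Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5, 6)))
  .toDF("id", "vals")
  .createOrReplaceTempView("input_tbl")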
Transforming an Array
Option 1 - Explode and Collect
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
Transforming an Array
Option 1 - Explode and Collect - Explode
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
1. Explode
Transforming an Array
Option 1 - Explode and Collect - Explode
Before explode:
Id | Vals
1  | [1, 2, 3]
2  | [4, 5, 6]

After explode:
Id | Val
1  | 1
1  | 2
1  | 3
2  | 4
2  | 5
2  | 6
Transforming an Array
Option 1 - Explode and Collect - Collect
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
1. Explode
2. Collect
Transforming an Array
Option 1 - Explode and Collect - Collect
Exploded, with the + 1 applied:
Id | Val
1  | 1 + 1
1  | 2 + 1
1  | 3 + 1
2  | 4 + 1
2  | 5 + 1
2  | 6 + 1

After collect_list:
Id | Vals
1  | [2, 3, 4]
2  | [5, 6, 7]
Transforming an Array
Option 1 - Explode and Collect - Complexity
== Physical Plan ==
ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Exchange hashpartitioning(id, 200)
+- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Generate explode(vals), [id], false, [val]
+- FileScan parquet default.input_tbl
Transforming an Array
Option 1 - Explode and Collect - Complexity
• Shuffles the data around, which is very expensive
• collect_list does not respect pre-existing ordering
== Physical Plan ==
ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Exchange hashpartitioning(id, 200)
+- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Generate explode(vals), [id], false, [val]
+- FileScan parquet default.input_tbl
Transforming an Array
Option 1 - Explode and Collect - Pitfalls
Keys need to be unique. Duplicate ids are merged into a single array, in an arbitrary order:

Input:            Output:
Id | Vals         Id | Vals
1  | [1, 2, 3]    1  | [5, 6, 7, 2, 3, 4]
1  | [4, 5, 6]

Values need to have data. Rows with a null array disappear from the result entirely:

Input:            Output:
Id | Vals         Id | Vals
1  | null         2  | [5, 6, 7]
2  | [4, 5, 6]

A minimal reproduction of the second pitfall follows below.
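A sketch, assuming an input_tbl that contains the rows from the second pitfall table (including the id 1 row with a null array):

spark.sql("""
  select id, collect_list(val + 1) as vals
  from (select id, explode(vals) as val from input_tbl) x
  group by id
""").show()
// A row whose vals is null produces no rows from explode(), so its id is
// simply missing from the output instead of carrying a null array.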
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int])
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Transforming an Array
Option 2 - Scala UDF
Pros
- Faster than Explode & Collect
- Does not suffer from the correctness pitfalls
Cons
- Still relatively slow: the data needs a lot of serialization
- You need to register a UDF per element type (illustrated below)
- Does not work for SQL
- Clunky
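To illustrate the per-type con: the Seq[Int] version above cannot be applied to, say, an array of longs, so each element type needs its own registration. A sketch with a hypothetical name:

// Same logic again, this time for Seq[Long]; repeat for Double, String, ...
val plusOneLong = spark.udf.register("plusOneLong",
  (values: Seq[Long]) => values.map(_ + 1L))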
When are you going to talk about Higher Order Functions?
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Higher Order Function: map takes a function as its argument
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
Can we do the same for Spark SQL?
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Higher Order Function: map
Anonymous ‘Lambda’ Function: value => value + 1
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl
- Spark SQL native code: fast & no serialization needed
- Works for SQL (a DataFrame API sketch follows below)
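The same expression is also usable from the DataFrame API. A sketch via selectExpr, assuming the input_tbl view created earlier:

val newDf = spark.table("input_tbl")
  .selectExpr("id", "transform(vals, val -> val + 1) as vals")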
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl
Higher Order Function
transform is the Higher Order Function. It takes an input array
and an expression, and applies that expression to each element in
the array.
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl

Anonymous ‘Lambda’ Function
val -> val + 1 is the lambda function. It is the operation
that is applied to each value in the array. The function is divided
into two components separated by a -> symbol:
1. The argument list.
2. The expression used to calculate the new value.
The argument list can take more than one argument where the function supports it; an example follows below.
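For instance, transform in Spark 2.4 also accepts a two-argument lambda, where the second argument is the element's index. A sketch against the input_tbl view:

// Each element is combined with its position in the array.
spark.sql("""
  select id, transform(vals, (val, i) -> val + i) as vals
  from input_tbl
""").show()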
Higher Order Functions in Spark SQL
Nesting
select id,
transform(vals, val ->
transform(val, e -> e + 1)) as vals
from nested_input_tbl
Capture
select id,
ref_value,
transform(vals, val -> ref_value + val) as vals
from nested_input_tbl
Didn’t you say these were faster?

Performance
[benchmark chart from the original slide not preserved in this transcript]
Higher Order Functions in Spark SQL
Spark 2.4 will ship with the following higher-order functions:
Array
- transform
- filter
- exists
- aggregate/reduce
- zip_with
Map
- transform_keys
- transform_values
- map_filter
- map_zip_with
A lot of new collection-based expressions were also added. A few of the higher-order functions are shown in the sketch below.
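A sketch of several of these in action, again assuming the input_tbl view:

spark.sql("""
  select id,
         filter(vals, v -> v % 2 = 0)            as evens,     -- keep even values
         exists(vals, v -> v > 2)                as has_gt_2,  -- any value > 2?
         aggregate(vals, 0, (acc, v) -> acc + v) as total      -- fold to a sum
  from input_tbl
""").show()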
Future work
Disclaimer: All of this is speculative and has not been discussed on the Dev list!
Arrays and maps have received a lot of love. However, working
with wide struct fields is still non-trivial (a lot of typing). We can
do better here:
- The following Dataset functions should work for nested fields:
- withColumn()
- withColumnRenamed()
- The following functions should be added for struct fields:
- select()
- withColumn()
- withColumnRenamed()
A sketch of today’s workaround follows below.
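As an illustration of the "a lot of typing" problem: today, renaming or changing a single nested field means rebuilding the whole struct by hand. A hypothetical sketch against the tweet schema, assuming a tweets DataFrame and spark.implicits._ imported:

import org.apache.spark.sql.functions._

// Rebuild the entire user struct just to rename one field.
val renamed = tweets.withColumn("user", struct(
  $"user.id_str".as("userId"),              // the one field we wanted to rename
  $"user.screen_name".as("screen_name")     // every other field must be re-listed manually
))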
Questions?