Type Checking Scala Spark Datasets:
Dataset Transforms
John Nestor 47 Degrees
www.47deg.com
Seattle Spark Meetup
September 22, 2016
47deg.com © Copyright 2016 47 Degrees
Outline
• Introduction
• Transforms
• Demos
• Implementation
• Getting the Code
Introduction
Spark Scala APIs
• RDD (pass closures)
• Functional programming model
• Types checked at compile time
• DataFrame (pass SQL)
• SQL programming model (can be optimized)
• Types checked at run time
• Dataset (pass closures or SQL)
• Combines best of RDDs and DataFrames
• Some (not all) types checked at compile time
Run-Time Scala Checking
• Field/column names
• Names specified as strings
• RT error if no such field
• Field/column types
• Specified via casting to expected type
• RT error if not of expected type
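The two failure modes above can be sketched without Spark. This is only an illustration: the Row alias and the field/intField helpers are hypothetical names, not Spark APIs. A row is modeled as a map from string names to untyped values, so a wrong name or a wrong cast compiles fine and only fails when the code runs.

```scala
import scala.util.Try

object RuntimeChecks {
  // Hypothetical stand-in for an untyped row: fields are looked up by string name.
  type Row = Map[String, Any]

  val row: Row = Map("a" -> 3, "b" -> "foo", "c" -> "test")

  // Field name given as a string: a typo is only detected when this runs.
  def field(r: Row, name: String): Any =
    r.getOrElse(name, throw new NoSuchElementException(s"no such field: $name"))

  // Field type given by casting: a mismatch is only detected when this runs.
  def intField(r: Row, name: String): Int =
    field(r, name) match {
      case i: Int => i
      case other  => throw new ClassCastException(s"field $name is not an Int: $other")
    }

  val ok: Int = intField(row, "a")                 // name and type both match
  val badName: Try[Int] = Try(intField(row, "x"))  // compiles, fails at run time: no such field
  val badType: Try[Int] = Try(intField(row, "b"))  // compiles, fails at run time: "b" holds a String
}
```

Both bad lookups type-check, which is the gap the transforms in this talk aim to close.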
Dataset Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

// Assumes a SparkSession named `spark` is in scope;
// its implicits provide toDS() and the $"..." column syntax.
import spark.implicits._

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking;
   but must pass a closure, which can't be optimized */
val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

/* Can be query optimized;
   but types and field names are checked only at run time */
val ds2 = ds.select($"b" as "c",
  ($"a" * 2 + $"a") as "a").as[CA]
Transforms
Goal
• Add strong typing to Scala Spark Datasets
• Check field names at compile time
• Check field types at compile time
• Each transform maps one or more Datasets to a new Dataset
• Dataset rows are compile-time types: Scala case classes
Transform Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking,
   and the query can still be optimized */
val smap = SqlMap[ABC, CA]
  .act(cols => (cols.b, cols.a * 2 + cols.a))
val ds3 = smap(ds)
Current Transforms
• Filter
• Map
• Sort
• Join (combines 2 Datasets)
• Aggregate (sum, count, max)
Demos
Demo
• Dataset example
• map
• select
• Transform examples
• Map
• Sort
• Join
• Filter
• Aggregate
Implementation
Scala Macros
• Scala code executed at compile time
• Kinds
• Black box - single result type specified
• White box - result type computed (the kind used for the transforms)
Transform Implementation
• case class Person(name: String, age: Int)
  val p = Person("Sam", 30)
• Scala macro converts
• from: an arbitrary case class type
• classOf[Person]
• to: a meta structure that encodes field names and types
• case class PersonM(name: StringCol, age: IntCol)
  val cols = PersonM(name = StringCol("name"), age = IntCol("age"))
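As a rough illustration of the target of that conversion, the meta structure can be written by hand. This is a sketch under assumptions, not the library's generated code: the Col trait, the expr field, and the select helper are hypothetical names introduced here, and the real library derives PersonM with a macro rather than by hand.

```scala
object MetaSketch {
  // Hypothetical typed column wrappers: each carries the column expression it stands for.
  sealed trait Col { def expr: String }
  final case class StringCol(expr: String) extends Col
  final case class IntCol(expr: String) extends Col

  // The row type, as on the slide.
  final case class Person(name: String, age: Int)

  // What the macro is described as deriving from Person:
  // the same field names, but with column-typed fields. Written by hand here.
  final case class PersonM(name: StringCol, age: IntCol)
  val cols: PersonM = PersonM(name = StringCol("name"), age = IntCol("age"))

  // A transform body receives `cols`, so a misspelled field such as c.agee
  // is rejected by the compiler instead of failing at run time.
  def select(f: PersonM => Seq[Col]): String =
    f(cols).map(_.expr).mkString("SELECT ", ", ", "")

  val sql: String = select(c => Seq(c.age, c.name))
}
```

Because field access goes through PersonM, name checking falls out of ordinary Scala type checking; the macro's job is only to produce PersonM for an arbitrary case class.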
Column Operations
• StrCol("A") === StrCol("B") => BoolCol("A === B")
• IntCol("A") + IntCol("B") => IntCol("A + B")
• IntCol("A").max => IntCol("A.max")
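The three rules above could be modeled roughly as follows. This is a minimal sketch, assuming each column wrapper simply carries its expression as a string in a hypothetical expr field; the library's actual column classes may be built differently.

```scala
object ColumnOps {
  // Hypothetical column wrappers; `expr` holds the expression text built so far.
  final case class BoolCol(expr: String)

  final case class StrCol(expr: String) {
    // Comparing two string columns yields a boolean column.
    def ===(that: StrCol): BoolCol = BoolCol(s"$expr === ${that.expr}")
  }

  final case class IntCol(expr: String) {
    // Adding two int columns yields an int column.
    def +(that: IntCol): IntCol = IntCol(s"$expr + ${that.expr}")
    // An aggregate over an int column is again an int column.
    def max: IntCol = IntCol(s"$expr.max")
  }

  val same: BoolCol = StrCol("A") === StrCol("B")
  val sum: IntCol = IntCol("A") + IntCol("B")
  val top: IntCol = IntCol("A").max
}
```

Each operator is defined only for matching column types, so an expression like IntCol("A") + StrCol("B") does not compile. That is how field types get checked at compile time while the result remains an expression that can be handed to the query optimizer.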
White Box Macro Restrictions
• Works fine in SBT and Eclipse
• Not supported in IntelliJ, but it can still be used
• Reports type errors
• Does not show available completions
Getting the Code
Transforms Code
• https://github.com/nestorpersist/dataset-transform
• Code
• Documentation
• Examples
• "com.persist" % "dataset-transforms_2.11" % "0.0.5"
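In an sbt build, the coordinates above would typically be added as shown below. This is a sketch that assumes an sbt project on Scala 2.11 (to match the _2.11 suffix); the exact scalaVersion is an assumption, and whether the artifact is still resolvable from your configured repositories is not stated in the slides.

```scala
// build.sbt (sketch)
scalaVersion := "2.11.8" // assumed; any 2.11.x release matches the _2.11 artifact suffix

// Coordinates exactly as given on the slide.
libraryDependencies += "com.persist" % "dataset-transforms_2.11" % "0.0.5"
```

With scalaVersion set to a 2.11.x release, writing "com.persist" %% "dataset-transforms" % "0.0.5" resolves to the same artifact, since %% appends the Scala binary version for you.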
Questions
