Type Checking Scala Spark Datasets: Dataset Transforms

John Nestor, 47 Degrees
www.47deg.com
Seattle Spark Meetup
September 22, 2016

47deg.com © Copyright 2016 47 Degrees

Outline
• Introduction
• Transforms
• Demos
• Implementation
• Getting the Code

Spark Scala APIs
• RDD (pass closures)
  • Functional programming model
  • Types checked at compile time
• DataFrame (pass SQL)
  • SQL programming model (can be optimized)
  • Types checked at run time
• Dataset (pass SQL)
  • Combines best of RDDs and DataFrames
  • Some (not all) types checked at compile time

Run-Time Scala Checking
• Field/column names
  • Names specified as strings
  • Run-time error if no such field
• Field/column types
  • Specified via casting to expected type
  • Run-time error if not of expected type
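The two failure modes above can be illustrated without Spark. This is a minimal sketch, not Spark code: the `Row` alias and the `RunTimeChecks` object are hypothetical stand-ins for an untyped row whose fields are looked up by string name and cast to an expected type.

```scala
import scala.util.Try

object RunTimeChecks {
  // Hypothetical stand-in for an untyped row: fields are looked up by string name.
  type Row = Map[String, Any]

  val row: Row = Map("a" -> 3, "b" -> "foo", "c" -> "test")

  // Correct name and expected type: succeeds.
  val ok: Try[Int] = Try(row("a").asInstanceOf[Int])

  // Misspelled field name: compiles, but fails at run time (NoSuchElementException).
  val badName: Try[Int] = Try(row("aa").asInstanceOf[Int])

  // Wrong expected type: compiles, but fails at run time (ClassCastException).
  val badType: Try[Int] = Try(row("b").asInstanceOf[Int])
}
```

Both mistakes pass the compiler because the name is just a string and the type is just a cast; nothing ties either one to the row's actual shape.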

Dataset Example
// Assumes a SparkSession named spark is in scope
import spark.implicits._

case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking,
   but must pass a closure and can't be query optimized */
val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

/* Can be query optimized,
   but field names and types are checked at run time */
val ds2 = ds.select($"b" as "c",
  ($"a" * 2 + $"a") as "a").as[CA]

Goal
• Add strong typing to Scala Spark Datasets
  • Check field names at compile time
  • Check field types at compile time
• Each transform maps one or more Datasets to a new Dataset
• Dataset rows are compile-time types: Scala case classes

Transform Example
case class ABC(a: Int, b: String, c: String) 
case class CA(c: String, a: Int)
 
val abc = ABC(3, "foo", "test") 
val abc1 = ABC(5, "xxx", "alpha") 
val abc3 = ABC(10, "aaa", "aaa") 
val abcs = Seq(abc, abc1, abc3) 
val ds = abcs.toDS()
/* Compile-time type checking,
   and can still be query optimized */
val smap = SqlMap[ABC, CA]
  .act(cols => (cols.b, cols.a * 2 + cols.a))
val ds3 = smap(ds)

Current Transforms
• Filter
• Map
• Sort
• Join (combines 2 Datasets)
• Aggregate (sum, count, max)

Demo
• Dataset example
  • map
  • select
• Transform examples
  • Map
  • Sort
  • Join
  • Filter
  • Aggregate

Scala Macros
• Scala code executed at compile time
• Kinds
  • Black box - single result type specified
  • White box* - result type computed

Transform Implementation
• case class Person(name: String, age: Int)
  val p = Person("Sam", 30)
• Scala macro converts
  • from: an arbitrary case class type
    • classOf[Person]
  • to: a meta structure that encodes field names and types
    • case class PersonM(name: StringCol, age: IntCol)
      val cols = PersonM(name = StringCol("name"), age = IntCol("age"))

Column Operations
• StrCol("A") === StrCol("B") => BoolCol("A === B")
• IntCol("A") + IntCol("B") => IntCol("A + B")
• IntCol("A").max => IntCol("A.max")
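The three operations above can be sketched in plain Scala. This is an illustration under assumptions, not the library's actual implementation: the `TypedColumns` object and the `expr` field are hypothetical, and each column simply carries the expression text it stands for, so that combining columns builds a larger expression while the Scala compiler checks the operand types.

```scala
object TypedColumns {
  // Hypothetical typed column wrappers; each holds the expression text it represents.
  sealed trait Col { def expr: String }

  final case class BoolCol(expr: String) extends Col

  final case class IntCol(expr: String) extends Col {
    // Adding two integer columns yields an integer column.
    def +(other: IntCol): IntCol = IntCol(s"$expr + ${other.expr}")
    // Aggregate: maximum of an integer column.
    def max: IntCol = IntCol(s"$expr.max")
  }

  final case class StrCol(expr: String) extends Col {
    // Comparing two string columns yields a boolean column.
    def ===(other: StrCol): BoolCol = BoolCol(s"$expr === ${other.expr}")
  }

  val eq: BoolCol = StrCol("A") === StrCol("B")
  val sum: IntCol = IntCol("A") + IntCol("B")
  val top: IntCol = IntCol("A").max

  // IntCol("A") + StrCol("B")   // would not compile: + expects an IntCol
}
```

Because a macro-generated meta structure such as `PersonM` exposes one typed field per case class field, a misspelled field name or a mismatched operand type in a transform is rejected by the compiler instead of failing at run time.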

White Box Macro Restrictions
• Works fine in SBT and Eclipse
• Not supported in IntelliJ, but can still be used
  • Reports type errors
  • Does not show available completions

Transforms Code
• https://github.com/nestorpersist/dataset-transform
  • Code
  • Documentation
  • Examples
• "com.persist" % "dataset-transforms_2.11" % "0.0.5"
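In an sbt build, the coordinates above would typically be added as a library dependency. A minimal sketch, assuming a Scala 2.11 project (the exact `scalaVersion` is an assumption) and that the artifact resolves from your configured repositories:

```scala
// build.sbt (sketch): coordinates copied from the slide; the Scala version is an assumption
scalaVersion := "2.11.8"

libraryDependencies += "com.persist" % "dataset-transforms_2.11" % "0.0.5"
```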
