Data-Defined Typed Schema Generation in Accumulo

Chad Hardin
Accumulo Summit 2017

Motivation
What do we want to do?
● Store semi-structured data in Accumulo.
● Not require any schema to be defined before storing it.
● Query or otherwise process that data with an existing schema-oriented framework like Spark SQL.
● Discover the schema instead of creating it.
©Koverse

The Problem
What keeps us from doing this?
A lack of:
• Methodologies
• Algorithms
• Experience

Approach
● Represent all data as records.
● Store records in a data set.
● Determine the schema of each record.
● Combine the schemas of all records into a data set schema.
● Apply the data set schema to all records.

What is a Record?
Think of a record as a JSON object. Records have named fields with different types of values, like numbers, strings, and booleans. Consider these two records:
{
  "name" : "Samantha",
  "age" : 13,
  "homeTown" : "Seattle",
  "favoriteColor" : "blue"
}
{
  "name" : "Jackson",
  "age" : 2,
  "favoriteColor" : false
}

What is a Data Set?
A Data Set is a named collection of such records, for example a Data Set named "Stuff".

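What is a Record Schema?
A record schema maps each field name of a record to the type of its value.
Record:
{
  "name" : "Samantha",
  "age" : 13,
  "homeTown" : "Seattle",
  "favoriteColor" : "blue"
}
Record schema:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}
Record:
{
  "name" : "Jackson",
  "age" : 2,
  "favoriteColor" : false
}
Record schema:
{
  "name" : "string",
  "age" : "integer",
  "favoriteColor" : "boolean"
}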
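Determining a record schema can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the function names `infer_type` and `record_schema` are hypothetical, and treating every value that is not a boolean, integer, or float as a string is an assumption.

```python
def infer_type(value):
    """Return the schema type name for one field value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    # Assumption: anything else is treated as the most general type.
    return "string"


def record_schema(record):
    """Map each field name of a record to the type name of its value."""
    return {field: infer_type(value) for field, value in record.items()}


samantha = {"name": "Samantha", "age": 13, "homeTown": "Seattle", "favoriteColor": "blue"}
jackson = {"name": "Jackson", "age": 2, "favoriteColor": False}

print(record_schema(samantha))
print(record_schema(jackson))
```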

The schemas?
A Data Set has as many record schemas as it has records. So what is the schema of the Data Set?

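Merging Record Schemas
For each field, collect every type that appears in any record schema.
Record schemas:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}
{
  "name" : "string",
  "age" : "integer",
  "favoriteColor" : "boolean"
}
Merged schema:
{
  "name" : [
    "string"
  ],
  "age" : [
    "integer"
  ],
  "homeTown" : [
    "string"
  ],
  "favoriteColor" : [
    "string",
    "boolean"
  ]
}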
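The merge step can be sketched as follows. The function name `merge_record_schemas` is hypothetical, and keeping the types in first-seen order is an assumption made only so the output matches the merged schema shown on the slide.

```python
def merge_record_schemas(schemas):
    """Collect, per field, every distinct type seen across the record schemas."""
    merged = {}
    for schema in schemas:
        for field, type_name in schema.items():
            types = merged.setdefault(field, [])
            if type_name not in types:
                types.append(type_name)
    return merged


samantha_schema = {"name": "string", "age": "integer",
                   "homeTown": "string", "favoriteColor": "string"}
jackson_schema = {"name": "string", "age": "integer", "favoriteColor": "boolean"}

merged = merge_record_schemas([samantha_schema, jackson_schema])
print(merged)
```

Only `favoriteColor` ends up with more than one type, so it is the only field that needs a conflict-resolution rule.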

Merging field types
How to handle conflicts of data types?
When records have different types for the same field, we have to decide what the ultimate type will be. The string type is the most general.
          string   boolean  integer  double
string    string   string   string   string
boolean   string   boolean  string   string
integer   string   string   integer  double
double    string   string   double   double
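The table above can be expressed as a small pairwise function. This is a sketch: the name `merge_types` is hypothetical, and the three rules are simply read off the table (equal types stay as they are, integer and double widen to double, every other conflict falls back to string).

```python
TYPES = ["string", "boolean", "integer", "double"]


def merge_types(a, b):
    """Resolve a conflict between two field types, following the slide's table."""
    if a == b:
        return a
    if {a, b} == {"integer", "double"}:
        return "double"
    # Every other mismatch falls back to the most general type.
    return "string"


# Regenerate the table row by row to compare it with the slide.
for row in TYPES:
    cells = [merge_types(row, col) for col in TYPES]
    print(row.ljust(9), " ".join(cell.ljust(8) for cell in cells))
```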

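Collapsing the Schema
Each field's list of types is reduced to a single type using the type-merging rules.
Merged schema:
{
  "name" : [
    "string"
  ],
  "age" : [
    "integer"
  ],
  "homeTown" : [
    "string"
  ],
  "favoriteColor" : [
    "string",
    "boolean"
  ]
}
Collapsed schema:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}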
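Collapsing can be sketched as a fold over each field's type list. Here the type-merging table is written out as a nested dictionary; `MERGE_TABLE` and `collapse_schema` are hypothetical names, and the sketch assumes every field has at least one type.

```python
from functools import reduce

# The type-merging table as a lookup: MERGE_TABLE[a][b] is the merged type.
MERGE_TABLE = {
    "string":  {"string": "string", "boolean": "string",  "integer": "string",  "double": "string"},
    "boolean": {"string": "string", "boolean": "boolean", "integer": "string",  "double": "string"},
    "integer": {"string": "string", "boolean": "string",  "integer": "integer", "double": "double"},
    "double":  {"string": "string", "boolean": "string",  "integer": "double",  "double": "double"},
}


def collapse_schema(merged):
    """Reduce each field's list of types to a single type."""
    return {
        field: reduce(lambda a, b: MERGE_TABLE[a][b], types)
        for field, types in merged.items()
    }


merged = {
    "name": ["string"],
    "age": ["integer"],
    "homeTown": ["string"],
    "favoriteColor": ["string", "boolean"],
}
collapsed = collapse_schema(merged)
print(collapsed)
```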

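Applying the Collapsed Schema
Fields missing from a record become null, and values are converted to the collapsed type.
Record:
{
  "name" : "Jackson",
  "age" : 2,
  "favoriteColor" : false
}
Collapsed schema:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}
Record with the schema applied:
{
  "name" : "Jackson",
  "age" : 2,
  "homeTown" : null,
  "favoriteColor" : "false"
}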
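Applying a collapsed schema to a record might look like the sketch below. The helper names `convert` and `apply_schema` are hypothetical. It assumes that collapsing only ever widens types, so a value only needs converting when the target type is string or double, and it renders booleans in lowercase to match the "false" shown on the slide.

```python
def convert(value, type_name):
    """Convert one field value to the collapsed type of its field."""
    if value is None:
        return None
    if type_name == "string":
        if isinstance(value, bool):
            # Lowercase form, matching the slide's "false".
            return "true" if value else "false"
        return str(value)
    if type_name == "double":
        return float(value)
    # Assumption: integer and boolean fields already hold matching values.
    return value


def apply_schema(record, schema):
    """Return a record with exactly the schema's fields; missing fields become None."""
    return {field: convert(record.get(field), type_name)
            for field, type_name in schema.items()}


collapsed = {"name": "string", "age": "integer",
             "homeTown": "string", "favoriteColor": "string"}
jackson = {"name": "Jackson", "age": 2, "favoriteColor": False}

print(apply_schema(jackson, collapsed))
```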

Record Implementation in Accumulo
Many possibilities. A basic idea for storing records in a table:
● Key
– Row ID: Data Set Identifier + Record Identifier
– Column Family: Field Name
– Column Qualifier: N/A
– Visibility: Whatever you need
– Timestamp: Write time
● Value (byte array)
– 1st byte: Field type
– Remaining bytes: Field value
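The key and value layout for records can be sketched in plain Python without an Accumulo client. Everything concrete here is an assumption made for illustration: the numeric type codes, the big-endian 8-byte encodings, the NUL separator between the data set and record identifiers, and the function names.

```python
import struct

# Hypothetical one-byte type codes; the slides do not specify them.
TYPE_CODES = {"string": 0, "boolean": 1, "integer": 2, "double": 3}
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}


def encode_value(value):
    """Encode a field value as: 1 type byte followed by the value bytes."""
    if isinstance(value, bool):
        return bytes([TYPE_CODES["boolean"], 1 if value else 0])
    if isinstance(value, int):
        return bytes([TYPE_CODES["integer"]]) + struct.pack(">q", value)
    if isinstance(value, float):
        return bytes([TYPE_CODES["double"]]) + struct.pack(">d", value)
    return bytes([TYPE_CODES["string"]]) + str(value).encode("utf-8")


def decode_value(data):
    """Return (type name, value) from an encoded byte array."""
    type_name, body = CODE_TYPES[data[0]], data[1:]
    if type_name == "boolean":
        return type_name, body[0] == 1
    if type_name == "integer":
        return type_name, struct.unpack(">q", body)[0]
    if type_name == "double":
        return type_name, struct.unpack(">d", body)[0]
    return type_name, body.decode("utf-8")


def record_to_entries(data_set_id, record_id, record):
    """Yield (row ID, column family, value) triples, one per field."""
    row_id = f"{data_set_id}\x00{record_id}"  # assumed separator
    return [(row_id, field, encode_value(value)) for field, value in record.items()]


for entry in record_to_entries("Stuff", "r2", {"name": "Jackson", "age": 2}):
    print(entry)
```

Because the type travels with every value, a reader can recover both the field type and the field value from a single entry.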

Schema Implementation in Accumulo
Write record schemas to a different table and use a Combiner Iterator (next slide).
● Key
– Row ID: Data Set Identifier
– Column Family: Field Name
– Column Qualifier: N/A
– Visibility: Whatever you need
– Timestamp: Write time
● Value: Field Type

Schema Combiner Iterator
Create a custom Combiner Iterator to reduce the schema types. Combine the field type values stored in Accumulo using this table:
          string   boolean  integer  double
string    string   string   string   string
boolean   string   boolean  string   string
integer   string   string   integer  double
double    string   string   double   double
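In Accumulo this logic would live in the `reduce(Key, Iterator<Value>)` method of a `Combiner` subclass written in Java. The Python below only simulates the effect on in-memory schema entries and is not Accumulo code. The entry layout (data set identifier, field name, field type) follows the schema table described earlier, and the function names are hypothetical.

```python
from functools import reduce
from itertools import groupby


def merge_types(a, b):
    """Pairwise rule from the table: equal stays, integer/double widens, else string."""
    if a == b:
        return a
    if {a, b} == {"integer", "double"}:
        return "double"
    return "string"


def combine(entries):
    """Reduce all type values that share a (row ID, column family) key to one type."""
    def key_of(entry):
        return entry[0]

    ordered = sorted(entries, key=key_of)  # Accumulo keeps keys sorted already
    return {key: reduce(merge_types, (type_name for _, type_name in group))
            for key, group in groupby(ordered, key=key_of)}


# One schema entry per field per record written to the data set "Stuff".
entries = [
    (("Stuff", "name"), "string"),
    (("Stuff", "age"), "integer"),
    (("Stuff", "homeTown"), "string"),
    (("Stuff", "favoriteColor"), "string"),
    (("Stuff", "name"), "string"),
    (("Stuff", "age"), "integer"),
    (("Stuff", "favoriteColor"), "boolean"),
]
print(combine(entries))
```

A combiner may see only some of the values for a key at a time, so the merge rule should give the same result in any order and grouping. The table's rule has that property: it is commutative and associative.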

Spark SQL
How to make it work?
1. Read the records of a data set into an RDD (from Accumulo).
2. Read the data set schema (from Accumulo).
3. Convert that schema into a Spark DataFrame schema.
4. Map the RDD so that every record conforms to the schema.
5. Create a Spark DataFrame using the schema and the RDD.
6. You can now use SQL queries on your schema-less records.
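Steps 3 to 5 can be sketched with PySpark as follows. Reading from Accumulo (steps 1 and 2) is left out: the sketch assumes the records and the collapsed schema are already available as Python dictionaries. The function names are hypothetical, and mapping "integer" to Spark's `LongType` is an assumption. PySpark is imported inside the functions so that the pure-Python `conform` step can be run without a Spark installation.

```python
def to_spark_schema(schema):
    """Step 3: convert a {field: type name} schema into a Spark StructType."""
    from pyspark.sql.types import (BooleanType, DoubleType, LongType,
                                   StringType, StructField, StructType)
    spark_types = {"string": StringType(), "boolean": BooleanType(),
                   "integer": LongType(), "double": DoubleType()}
    return StructType([StructField(field, spark_types[type_name], nullable=True)
                       for field, type_name in schema.items()])


def conform(record, schema):
    """Step 4: turn a record into a row tuple matching the schema's field order."""
    row = []
    for field, type_name in schema.items():
        value = record.get(field)
        if value is None:
            row.append(None)
        elif type_name == "string":
            row.append(str(value).lower() if isinstance(value, bool) else str(value))
        elif type_name == "double":
            row.append(float(value))
        else:
            row.append(value)
    return tuple(row)


def to_data_frame(spark, records, schema):
    """Step 5: build a DataFrame; `spark` is an existing SparkSession."""
    rdd = spark.sparkContext.parallelize(records).map(lambda r: conform(r, schema))
    return spark.createDataFrame(rdd, to_spark_schema(schema))


schema = {"name": "string", "age": "integer",
          "homeTown": "string", "favoriteColor": "string"}
jackson = {"name": "Jackson", "age": 2, "favoriteColor": False}
print(conform(jackson, schema))
```

With a running session, step 6 would then be along the lines of `df = to_data_frame(spark, records, schema)`, `df.createOrReplaceTempView("stuff")`, and `spark.sql("SELECT name FROM stuff WHERE age > 10")`, where the view name "stuff" is only an example.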

Spark SQL
Future Improvements
Use Spark SQL hooks:
● Column Filtering (Pruned Scans)
● Push-Down Predicates (Pruned Filtered Scans)
● Use a Spark SQL Dataset (same name as our Data Set, but a different concept)
● Use Spark SQL Catalogs (like a real SQL database!)
● Use Spark SQL Data Streams
