Data-Defined Typed Schema Generation in Accumulo
Chad Hardin
Accumulo Summit 2017

Motivation
What do we want to do?
● Store semi-structured data in Accumulo.
● Do not require any schema to be defined before storing the data.
● Query or otherwise process that data with an existing schema-oriented framework like Spark SQL.
● Discover the schema instead of creating it.
©Koverse 2

The Problem
What keeps us from doing this?
A lack of:
• Methodologies
• Algorithms
• Experience

Approach
● Represent all data as records.
● Store records in a data set.
● Determine the schema of each record.
● Combine the schemas of all records into a data set schema.
● Apply the data set schema to all records.

What is a Record?
Think of a record like a JSON object. Records have named fields with different types of values like numbers, strings, and booleans. Consider these records:
{
  "name" : "Samantha",
  "age" : 13,
  "homeTown" : "Seattle",
  "favoriteColor" : "blue"
}
{
  "name" : "Jackson",
  "age" : 2,
  "favoriteColor" : false
}

What is a Data Set?
A Data Set is filled with such records and has a name:
[Diagram: many records (R) inside a data set named "Stuff"]

What is a Record Schema?
Record:
{
  "name" : "Samantha",
  "age" : 13,
  "homeTown" : "Seattle",
  "favoriteColor" : "blue"
}
Record schema:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}
Record:
{
  "name" : "Jackson",
  "age" : 2,
  "favoriteColor" : false
}
Record schema:
{
  "name" : "string",
  "age" : "integer",
  "favoriteColor" : "boolean"
}
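A minimal sketch of how a record schema could be derived from a record, assuming records are plain Python dictionaries and only the four type names used in this talk exist. The function names `infer_type` and `record_schema` are illustrative and do not come from the slides.

```python
def infer_type(value):
    """Map a Python value to one of the four type names used on the slides."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    # Assumption: everything else (including str) is treated as a string.
    return "string"


def record_schema(record):
    """Return the schema of one record as {field name: type name}."""
    return {field: infer_type(value) for field, value in record.items()}


samantha = {"name": "Samantha", "age": 13, "homeTown": "Seattle", "favoriteColor": "blue"}
jackson = {"name": "Jackson", "age": 2, "favoriteColor": False}

print(record_schema(samantha))
print(record_schema(jackson))
```

Each record yields its own schema, so the two example records produce two different schemas for the field favoriteColor.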

The schemas?
A Data Set has as many schemas as it does records. So what is the schema of the Data Set?
[Diagram: every record (R) has its own record schema (S); the schema of the whole data set is still unknown (?)]

Merging Record Schemas
Record schemas:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}
{
  "name" : "string",
  "age" : "integer",
  "favoriteColor" : "boolean"
}
Merged schema:
{
  "name" : [ "string" ],
  "age" : [ "integer" ],
  "homeTown" : [ "string" ],
  "favoriteColor" : [ "string", "boolean" ]
}
When records have different types for the same field, we have to decide what the ultimate type will be. The string type is the most general.
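The merge step can be sketched as collecting, for every field name, the distinct types seen across all record schemas. This assumes schemas are dictionaries of field name to type name; `merge_record_schemas` is a hypothetical helper name.

```python
def merge_record_schemas(schemas):
    """Collect every distinct type seen for each field, in first-seen order."""
    merged = {}
    for schema in schemas:
        for field, field_type in schema.items():
            types = merged.setdefault(field, [])
            if field_type not in types:
                types.append(field_type)
    return merged


schema_a = {"name": "string", "age": "integer", "homeTown": "string", "favoriteColor": "string"}
schema_b = {"name": "string", "age": "integer", "favoriteColor": "boolean"}

merged = merge_record_schemas([schema_a, schema_b])
print(merged)
```

A field that is missing from some records (homeTown here) still appears in the merged schema, and a field with conflicting types keeps all of them until the conflict is resolved.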

Merging field types
How to handle conflicts of data types?
          string   boolean  integer  double
string    string   string   string   string
boolean   string   boolean  string   string
integer   string   string   integer  double
double    string   string   double   double
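The table can be written down directly as a lookup structure. The sketch below also checks two properties that matter if types are merged in arbitrary order across many records: the result should not depend on argument order or on grouping. The names `MERGE_TABLE` and `merge_types` are illustrative.

```python
from itertools import product

TYPES = ["string", "boolean", "integer", "double"]

# MERGE_TABLE[a][b] is the merged type for row a and column b of the table.
MERGE_TABLE = {
    "string":  {"string": "string", "boolean": "string",  "integer": "string",  "double": "string"},
    "boolean": {"string": "string", "boolean": "boolean", "integer": "string",  "double": "string"},
    "integer": {"string": "string", "boolean": "string",  "integer": "integer", "double": "double"},
    "double":  {"string": "string", "boolean": "string",  "integer": "double",  "double": "double"},
}


def merge_types(a, b):
    """Look up the type that can represent values of both a and b."""
    return MERGE_TABLE[a][b]


# Order of arguments does not matter.
commutative = all(
    merge_types(a, b) == merge_types(b, a) for a, b in product(TYPES, repeat=2)
)
# Grouping does not matter, so any number of types can be merged pairwise.
associative = all(
    merge_types(merge_types(a, b), c) == merge_types(a, merge_types(b, c))
    for a, b, c in product(TYPES, repeat=3)
)
print(commutative, associative)  # prints: True True
```

Because both checks hold, the merge can be applied incrementally and in any order, which is what makes it suitable for a distributed reduction.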

Collapsing the Schema
Merged schema:
{
  "name" : [ "string" ],
  "age" : [ "integer" ],
  "homeTown" : [ "string" ],
  "favoriteColor" : [ "string", "boolean" ]
}
Collapsed schema:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}
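Collapsing can be sketched as folding each field's list of types down to a single type. Here `merge_types` is a compact rule-based restatement of the type-merging table from the previous slide, and `collapse_schema` is a hypothetical helper name.

```python
from functools import reduce


def merge_types(a, b):
    """Rule-based form of the type-merging table."""
    if a == b:
        return a
    if {a, b} == {"integer", "double"}:
        return "double"
    # Every other conflict falls back to the most general type.
    return "string"


def collapse_schema(merged_schema):
    """Reduce each field's list of candidate types to one type."""
    return {
        field: reduce(merge_types, types)
        for field, types in merged_schema.items()
    }


merged = {
    "name": ["string"],
    "age": ["integer"],
    "homeTown": ["string"],
    "favoriteColor": ["string", "boolean"],
}
collapsed = collapse_schema(merged)
print(collapsed)
```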

Applying the Collapsed Schema
Original record:
{
  "name" : "Jackson",
  "age" : 2,
  "favoriteColor" : false
}
Collapsed schema:
{
  "name" : "string",
  "age" : "integer",
  "homeTown" : "string",
  "favoriteColor" : "string"
}
Record after applying the schema:
{
  "name" : "Jackson",
  "age" : 2,
  "homeTown" : null,
  "favoriteColor" : "false"
}
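A minimal sketch of applying a collapsed schema to one record: missing fields become null and values are converted to the field's final type. The conversion rules (for example rendering a boolean as the JSON text "false") are assumptions chosen to reproduce the example on this slide; `apply_schema` is a hypothetical name.

```python
import json


def to_string(value):
    """Render non-string values as JSON text, so False becomes "false"."""
    return value if isinstance(value, str) else json.dumps(value)


# One conversion function per target type.
CASTS = {"string": to_string, "boolean": bool, "integer": int, "double": float}


def apply_schema(record, schema):
    """Return a copy of the record with exactly the schema's fields and types."""
    conformed = {}
    for field, field_type in schema.items():
        value = record.get(field)  # None when the record lacks the field
        conformed[field] = None if value is None else CASTS[field_type](value)
    return conformed


jackson = {"name": "Jackson", "age": 2, "favoriteColor": False}
collapsed = {"name": "string", "age": "integer", "homeTown": "string", "favoriteColor": "string"}

conformed = apply_schema(jackson, collapsed)
print(json.dumps(conformed))
```

After this step every record in the data set has the same fields with the same types, which is what a schema-oriented framework expects.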

Record Implementation in Accumulo
Many possibilities. A basic idea for storing records in a table:
● Key
– Row ID: Data Set Identifier + Record Identifier
– Column Family: Field Name
– Column Qualifier: N/A
– Visibility: Whatever you need
– Timestamp: Write time
● Value (byte array)
– 1st byte: Field type
– Remaining bytes: Field value
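The value layout (one type byte followed by the field value) can be sketched in plain Python without any Accumulo client code. The numeric type codes, the big-endian 8-byte encodings, and the "/" separator in the row ID are all assumptions made for illustration; the slides do not specify them.

```python
import struct

# Hypothetical one-byte type codes; the slides do not give the actual values.
TYPE_CODES = {"string": 0, "boolean": 1, "integer": 2, "double": 3}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}


def encode_value(field_type, value):
    """Return the value bytes: one type byte followed by the encoded field value."""
    if field_type == "string":
        payload = value.encode("utf-8")
    elif field_type == "boolean":
        payload = b"\x01" if value else b"\x00"
    elif field_type == "integer":
        payload = struct.pack(">q", value)  # 8-byte big-endian signed integer
    elif field_type == "double":
        payload = struct.pack(">d", value)  # 8-byte big-endian IEEE 754 double
    else:
        raise ValueError(f"unknown field type: {field_type}")
    return bytes([TYPE_CODES[field_type]]) + payload


def decode_value(data):
    """Split off the type byte and decode the rest back to (type, value)."""
    field_type, payload = TYPE_NAMES[data[0]], data[1:]
    if field_type == "string":
        return field_type, payload.decode("utf-8")
    if field_type == "boolean":
        return field_type, payload == b"\x01"
    fmt = ">q" if field_type == "integer" else ">d"
    return field_type, struct.unpack(fmt, payload)[0]


def record_cells(data_set_id, record_id, record, schema):
    """Yield (row ID, column family, value bytes) for every field of a record."""
    row_id = f"{data_set_id}/{record_id}"  # hypothetical separator
    for field, value in record.items():
        yield row_id, field, encode_value(schema[field], value)


jackson = {"name": "Jackson", "age": 2, "favoriteColor": False}
jackson_schema = {"name": "string", "age": "integer", "favoriteColor": "boolean"}
cells = list(record_cells("Stuff", "r2", jackson, jackson_schema))
for row_id, family, value in cells:
    print(row_id, family, value, decode_value(value))
```

Storing the type next to the value means each cell is self-describing, so a record can be read back without consulting any schema.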

Schema Implementation in Accumulo
Write record schemas to a different table and use a Combiner Iterator (next slide).
Combine the Accumulo field values using this table.
● Key
– Row ID: Data Set Id
– Column Family: Field Name
– Column Qualifier: N/A
– Visibility: Whatever you need
– Timestamp: Write time
● Value: Field Type

Schema Combiner Iterator
Create a custom Combiner Iterator to reduce the schema types.
          string   boolean  integer  double
string    string   string   string   string
boolean   string   boolean  string   string
integer   string   string   integer  double
double    string   string   double   double
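An Accumulo Combiner reduces all values that share the same key into a single value. The following is a plain-Python simulation of that behavior for the schema table, not Accumulo code: entries are keyed by (data set ID, field name) and their field-type values are reduced with the table above. The names `combine` and `merge_types` are illustrative.

```python
from functools import reduce
from itertools import groupby


def merge_types(a, b):
    """Rule-based form of the type-merging table."""
    if a == b:
        return a
    if {a, b} == {"integer", "double"}:
        return "double"
    return "string"


def combine(entries):
    """Reduce all field-type values that share a (row ID, column family) key.

    entries: iterable of ((row_id, column_family), field_type) pairs, one per
    field of every record schema written to the schema table.
    """
    combined = {}
    # Sorting groups equal keys together, like entries in a sorted table.
    for key, group in groupby(sorted(entries), key=lambda entry: entry[0]):
        combined[key] = reduce(merge_types, (field_type for _, field_type in group))
    return combined


entries = [
    # Schema of the first record
    (("Stuff", "name"), "string"),
    (("Stuff", "age"), "integer"),
    (("Stuff", "homeTown"), "string"),
    (("Stuff", "favoriteColor"), "string"),
    # Schema of the second record
    (("Stuff", "name"), "string"),
    (("Stuff", "age"), "integer"),
    (("Stuff", "favoriteColor"), "boolean"),
]
data_set_schema = combine(entries)
print(data_set_schema)
```

With this arrangement the data set schema is simply what a scan of the schema table returns, and it stays current as new record schemas are written.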

Spark SQL
How to make it work?
1. Read the records of a data set into an RDD (from Accumulo).
2. Read the data set schema (from Accumulo).
3. Convert that schema into a Spark Data Frame Schema.
4. Map the RDD so that every record conforms to the schema.
5. Create a Spark Data Frame using the schema and the RDD.
6. You can now use SQL queries for your schema-less records.
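Steps 3 and 4 can be sketched without a Spark installation by producing a schema string and rows in a matching column order. The mapping from the slide's type names to Spark SQL type names (in particular integer to bigint) is an assumption, as are the helper names `to_spark_ddl` and `to_rows`. With PySpark available, something like `spark.createDataFrame(rows, schema=ddl)` would presumably cover step 5, but that call is not exercised here.

```python
# Assumed mapping from the slide's type names to Spark SQL type names.
SPARK_TYPE_NAMES = {
    "string": "string",
    "boolean": "boolean",
    "integer": "bigint",
    "double": "double",
}


def to_spark_ddl(schema):
    """Step 3: render the data set schema as a DDL-style schema string."""
    # Assumption: field names are simple identifiers that need no quoting.
    return ", ".join(
        f"{field} {SPARK_TYPE_NAMES[field_type]}" for field, field_type in schema.items()
    )


def to_rows(records, schema):
    """Step 4: turn conformed records into tuples in schema column order."""
    return [tuple(record.get(field) for field in schema) for record in records]


schema = {"name": "string", "age": "integer", "homeTown": "string", "favoriteColor": "string"}
records = [
    {"name": "Samantha", "age": 13, "homeTown": "Seattle", "favoriteColor": "blue"},
    {"name": "Jackson", "age": 2, "homeTown": None, "favoriteColor": "false"},
]

ddl = to_spark_ddl(schema)
rows = to_rows(records, schema)
print(ddl)
print(rows)
```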

Spark SQL
Future Improvements
Use Spark SQL hooks:
● Column Filtering (Pruned Scans)
● Push-Down Predicates (Pruned Filtered Scans)
● Use a Spark SQL Data Set (same name but different)
● Use Spark SQL Catalogs (like a real SQL database!)
● Use Spark SQL Data Streams

Questions?
Thank You!
http://www.koverse.com/
