2015- © GMC - 1
Unified access to all your data points
with Apache MetaModel
Who am I?
Kasper Sørensen, dad, geek, guitarist …
@kaspersor
Long-time developer and
PMC member of:
Founder also of another
nice open source project:
Principal Software Engineer @
Session agenda
- Introduction to Apache MetaModel
- Use case: Query composition
- The role of Apache MetaModel in Big Data
- (New) architectural possibilities with Apache MetaModel
Introduction to Apache MetaModel
The Apache journey
2011-xx: MetaModel founded (outside of Apache)
2013-06: Incubation of Apache MetaModel starts
2014-12: Apache MetaModel is officially an Apache TLP
2015-08: Latest stable release, 4.3.6
2015-09: Latest RC, 4.4.0-RC1
- We're still a small project, just 10 committers (plus some
extra contributors), so we have room for you!
Helicopter view
You can look at Apache MetaModel by its formal description:
"… a uniform connector and query API
to many very different datastore types …"
But let's start with a problem to solve ...
A problem
How to process multiple data sources while:
- Staying agnostic to the database engine.
- Respecting the metadata of the underlying database.
- Not repeating yourself.
- Avoiding fragility towards metadata changes.
- Mutating the data structure when needed.
We used to rely on ORM frameworks to handle the bulk of these ...
ORM - Queries via domain models
public class Person {
public String getName() {...}
}
ORM.query(Person.class).where(Person::getName).eq("John Doe");
Why an ORM might not work
Requires a domain model
- While many data-oriented apps are agnostic to a domain model.
- Sometimes the data itself is the domain.
Model cannot change at runtime
- Runtime changes are often needed, especially if your app creates new tables/entities.
Type-safety through metadata assumptions
- This is quite common, yet it can be a problem: the model is statically
mapped and cannot survive changes such as column renames.
Alternative to ORM: Use JDBC
Metadata discoverable via java.sql.DatabaseMetaData.
Queries can be assembled safely with a bit of String magic:
"SELECT " + columnNames + " FROM " + tableName + …
...
Alternative to ORM: Use JDBC
Wrong!
It turns out that …
- not all databases have the same SQL "dialect".
- they also don't all implement DatabaseMetaData the same way.
- you cannot use this on much other than relational databases that use SQL.
- What about NoSQL, various file formats, web services etc.?
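To make the "dialect" point concrete, here is a minimal sketch of why plain string concatenation does not travel well between databases. The `DialectQuoting` class and its `Dialect` enum are hypothetical helpers written for this illustration (they are not part of JDBC or Apache MetaModel), and the sketch only covers identifier quoting, which is one of many dialect differences.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of JDBC or Apache MetaModel.
class DialectQuoting {

    // Three common identifier-quoting conventions.
    enum Dialect {
        ANSI("\"", "\""),      // standard SQL, e.g. PostgreSQL and Oracle
        MYSQL("`", "`"),       // MySQL in its default mode
        SQL_SERVER("[", "]");  // Microsoft SQL Server

        final String open;
        final String close;

        Dialect(String open, String close) {
            this.open = open;
            this.close = close;
        }

        // Wraps an identifier in the dialect's quote characters.
        // Escaping of quote characters inside the identifier is left out.
        String quote(String identifier) {
            return open + identifier + close;
        }
    }

    // Builds "SELECT <columns> FROM <table>" with dialect-specific quoting.
    static String select(Dialect dialect, String table, List<String> columns) {
        String columnList = columns.stream()
                .map(dialect::quote)
                .collect(Collectors.joining(", "));
        return "SELECT " + columnList + " FROM " + dialect.quote(table);
    }

    public static void main(String[] args) {
        // A column name with a space must be quoted, and each dialect quotes differently.
        List<String> columns = Arrays.asList("id", "house number");
        for (Dialect dialect : Dialect.values()) {
            System.out.println(dialect + ": " + select(dialect, "customer", columns));
        }
    }
}
```

The same logical query yields three different strings, so an application that concatenates SQL by hand has to carry this kind of knowledge for every database it supports.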
MetaModel - Queries via metadata model
DataContext dc = …
Table[] tables = dc.getDefaultSchema().getTables();
Table table = tables[0];
Column[] primaryKeys = table.getPrimaryKeys();
dc.query().from(table).selectAll().where(primaryKeys[0]).eq(42);
dc.query().from("person").selectAll().where("id").eq(42);
MetaModel updates
UpdateableDataContext dc = …
dc.executeUpdate(new UpdateScript() {
  @Override
  public void run(UpdateCallback callback) {
    // multiple updates go here - transactional characteristics (ACID, synchronization etc.)
    // as per the capabilities of the concrete DataContext implementation.
  }
});
dc.executeUpdate(new BatchUpdateScript() {
  @Override
  public void run(UpdateCallback callback) {
    // multiple "batch/bulk" (non atomic) updates go here
  }
});
// if I only want to do a single update, there are convenience classes for this
InsertInto insert = new InsertInto(table).value(nameColumn, "John Doe");
dc.executeUpdate(insert);
MetaModel - Connectivity
Sometimes we have SQL, sometimes we have another native query engine (e.g.
Lucene) and sometimes we use MetaModel's own query engine.
(A few more connectors available via the MetaModel-extras (LGPL) project)
Tables, Columns and SQL
– that doesn't sound right.
This is one of our trade-offs to make life easier!
We map NoSQL models (doc, key/value etc.) to a table-based model.
As a user you have some choices:
• Supply your own mapping (non-dynamic).
• Use schema inference provided by Apache MetaModel.
• Provide schema inference instructions
(e.g. turn certain key/value pairs into separate tables).
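As a rough illustration of what mapping a document model to a table-based model can look like, the sketch below infers columns as the union of keys across a few documents and turns each document into a row. `SchemaInferenceSketch` and its methods are hypothetical names for this example; Apache MetaModel's own schema inference may work quite differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; Apache MetaModel's own inference may work differently.
class SchemaInferenceSketch {

    // Column names are the union of all keys, in first-seen order.
    static List<String> inferColumns(List<Map<String, Object>> documents) {
        Set<String> columns = new LinkedHashSet<>();
        for (Map<String, Object> document : documents) {
            columns.addAll(document.keySet());
        }
        return new ArrayList<>(columns);
    }

    // Each document becomes one row; keys a document lacks become null cells.
    static List<List<Object>> toRows(List<Map<String, Object>> documents, List<String> columns) {
        List<List<Object>> rows = new ArrayList<>();
        for (Map<String, Object> document : documents) {
            List<Object> row = new ArrayList<>();
            for (String column : columns) {
                row.add(document.get(column));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("id", 1);
        first.put("name", "John Doe");

        Map<String, Object> second = new LinkedHashMap<>();
        second.put("id", 2);
        second.put("country", "GB");

        List<Map<String, Object>> documents = new ArrayList<>();
        documents.add(first);
        documents.add(second);

        List<String> columns = inferColumns(documents);
        System.out.println(columns);                    // [id, name, country]
        System.out.println(toRows(documents, columns)); // [[1, John Doe, null], [2, null, GB]]
    }
}
```

Inference like this is convenient but only sees the documents it samples, which is one reason a user might prefer to supply a mapping or inference instructions instead.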
Other key characteristics of MetaModel
• Non-intrusive, you can use it when you want to.
• It's just a library – no need to run a separate service.
• Easy to test, stub and mock. Inject a replacement DataContext
instead of the real one (typically PojoDataContext).
• Easy to implement a new DataContext for your curious
database or data format – reuse our abstract
implementations.
Use case: Query composition
MetaModel schema model
MetaModel query and schema model
Query representation for developers?
"SELECT * FROM foo"
Query as a string
- Easy to write.
- Easy to read.
- Prone to parsing errors.
- Prone to typos etc.
- Fails at runtime.
Query q = new Query();
q.from(table).selectAll();
Query as an object
- More involved code to
read and write.
- Lends itself to inspection
and mutation.
- Fails at compile time.
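The "inspection and mutation" point can be sketched with a miniature query object. `QueryObjectSketch` is a hypothetical class for this illustration, not the Apache MetaModel `Query` class; it still takes column names as plain strings, so it shows inspection and mutation only, not the compile-time checking that typed table and column references can add.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical miniature query object; not the Apache MetaModel Query class.
class QueryObjectSketch {
    private final String table;
    private final List<String> columns = new ArrayList<>();
    private final List<String> conditions = new ArrayList<>();

    QueryObjectSketch(String table) {
        this.table = table;
    }

    // Mutation: add a column to the select clause.
    QueryObjectSketch select(String column) {
        columns.add(column);
        return this;
    }

    // Mutation: add a condition; conditions are combined with AND.
    QueryObjectSketch where(String condition) {
        conditions.add(condition);
        return this;
    }

    // Inspection: a read-only view of the selected columns.
    List<String> selectedColumns() {
        return Collections.unmodifiableList(columns);
    }

    // The string form is produced only when a SQL backend needs it.
    String toSql() {
        String selectClause = columns.isEmpty() ? "*" : String.join(", ", columns);
        StringBuilder sql = new StringBuilder("SELECT " + selectClause + " FROM " + table);
        if (!conditions.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", conditions));
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        QueryObjectSketch query = new QueryObjectSketch("foo");
        System.out.println(query.toSql());           // SELECT * FROM foo
        query.select("id").select("name").where("id = 42");
        System.out.println(query.selectedColumns()); // [id, name]
        System.out.println(query.toSql());           // SELECT id, name FROM foo WHERE id = 42
    }
}
```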
Context-based Query optimization
You might not always need to pass data around between components …
Sometimes you can just pass the query around!
This STATUS=DELIVERED filter never actually executes; it just updates the query on the ORDERS table.
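One way to picture "passing the query around" is a pipeline whose filter component rewrites the upstream query instead of filtering rows. The sketch below is a simplified assumption of how that could look: `QueryPassingSketch`, `TableQuery`, `equalsFilter` and `runPipeline` are hypothetical names, and values are inlined into the SQL string only to keep the example short (a real implementation would use query parameters).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of components that exchange a query instead of data.
class QueryPassingSketch {

    // Minimal immutable query: a table plus AND-ed conditions.
    static final class TableQuery {
        final String table;
        final List<String> conditions;

        private TableQuery(String table, List<String> conditions) {
            this.table = table;
            this.conditions = Collections.unmodifiableList(conditions);
        }

        static TableQuery from(String table) {
            return new TableQuery(table, new ArrayList<String>());
        }

        // Returns a new query with one more condition; the original is untouched.
        TableQuery and(String condition) {
            List<String> extended = new ArrayList<>(conditions);
            extended.add(condition);
            return new TableQuery(table, extended);
        }

        String toSql() {
            String sql = "SELECT * FROM " + table;
            return conditions.isEmpty() ? sql : sql + " WHERE " + String.join(" AND ", conditions);
        }
    }

    // A "filter component" that never touches rows: it only refines the query.
    static UnaryOperator<TableQuery> equalsFilter(String column, String value) {
        return query -> query.and(column + " = '" + value + "'");
    }

    // Each component receives the query produced by the previous one.
    static TableQuery runPipeline(TableQuery source, List<UnaryOperator<TableQuery>> components) {
        TableQuery current = source;
        for (UnaryOperator<TableQuery> component : components) {
            current = component.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<TableQuery>> components = new ArrayList<>();
        components.add(equalsFilter("STATUS", "DELIVERED"));

        TableQuery finalQuery = runPipeline(TableQuery.from("ORDERS"), components);
        // Only this final query is sent to the datastore.
        System.out.println(finalQuery.toSql()); // SELECT * FROM ORDERS WHERE STATUS = 'DELIVERED'
    }
}
```

The datastore then does the filtering itself, and no undelivered orders ever leave it.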
The role of Apache MetaModel in Big Data
So how does this all
relate to Big Data?
Big Data and the need for metadata
Variety
- Not only structured data
- Social
- Sensors
- Many new sources, but
also the “old”:
• Relational
• NoSQL
• Files
• Cloud
Volume
Velocity
Variety
Veracity ?
Query language examples
SQL
SELECT * FROM customers
WHERE country_code = 'GB'
OR country_code IS NULL
MongoDB
db.customers.find({
  $or: [
    {"country.code": "GB"},
    {"country.code": {$exists: false}}
  ]
})
CSV
for (line : customers.csv) {
  values = parse(line);
  country = values[country_index];
  if (country == null || "GB".equals(country)) {
    emit(line);
  }
}
Query language examples
dataContext.query()
.from(customers)
.selectAll()
.where(countryCode).eq("GB")
.or(countryCode).isNull();
Any datastore
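The idea behind "any datastore" can be sketched without the library: adapt each source to a common row shape, then apply one filter. Everything in `UniformFilterSketch` is a hypothetical stand-in (including the naive comma split and the assumed `country.code` nesting for documents); it illustrates the principle and is not how Apache MetaModel is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: one filter, two very different sources.
class UniformFilterSketch {

    // The single filter: country code is "GB" or missing.
    static final Predicate<Map<String, String>> GB_OR_NULL = row -> {
        String country = row.get("country_code");
        return country == null || "GB".equals(country);
    };

    // Adapter 1: a CSV line becomes a row. Naive split; an empty cell means null.
    static Map<String, String> fromCsv(String[] header, String line) {
        String[] values = line.split(",", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            String value = i < values.length ? values[i] : "";
            row.put(header[i], value.isEmpty() ? null : value);
        }
        return row;
    }

    // Adapter 2: a document with a nested "country" object becomes a row.
    static Map<String, String> fromDocument(Map<String, Object> document) {
        Map<String, String> row = new LinkedHashMap<>();
        Object name = document.get("name");
        row.put("name", name == null ? null : name.toString());
        Object country = document.get("country");
        Object code = country instanceof Map ? ((Map<?, ?>) country).get("code") : null;
        row.put("country_code", code == null ? null : code.toString());
        return row;
    }

    // Applies the same filter regardless of where the rows came from.
    static List<Map<String, String>> filter(List<Map<String, String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (GB_OR_NULL.test(row)) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] header = {"name", "country_code"};
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(fromCsv(header, "Ann,GB"));
        rows.add(fromCsv(header, "Bob,DK"));
        rows.add(fromCsv(header, "Cy,"));

        Map<String, Object> document = new LinkedHashMap<>();
        document.put("name", "Dee"); // no "country" field at all
        rows.add(fromDocument(document));

        // Keeps Ann, Cy and Dee; drops Bob.
        for (Map<String, String> row : filter(rows)) {
            System.out.println(row.get("name"));
        }
    }
}
```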
Make it easy to ingest data in your lake
Metadata to enable automation
Big Data Variety and Veracity mean we will be handling:
● Different physical formats of the same data
● Different query engines
● Different quality-levels of data
How can we automate data ingestion in such a landscape?
Metadata to enable automation
Big Data Variety and Veracity mean we will be handling:
● Different physical formats of the same data
We need a uniform metamodel for all the datastores.
And enough metadata to infer the ingestion transformations needed.
● Different query engines
We need a uniform query API based on the metamodel.
● Different quality-levels of data
We need our ingestion target to be aware of the ingestion sources.
As an industry we lack more
elaborate metadata to support this!
Traditional view on metadata
user
column         SQL type       Java type          constraints
id (pk)        CHAR(32)       java.lang.String   not null
username       VARCHAR(64)    java.lang.String   not null, unique
real_name      VARCHAR(256)   java.lang.String   not null
address        VARCHAR(256)   java.lang.String   nullable
age            INT            int                nullable
customer
column         SQL type       Java type          constraints
id (pk)        BIGINT         long               not null
firstname      VARCHAR(128)   java.lang.String   not null
lastname       VARCHAR(128)   java.lang.String   not null
street         VARCHAR(64)    java.lang.String   not null
house number   INT            int                not null
A more elaborate metadata view
user
column         SQL type       Java type          constraints        semantics
id (pk)        CHAR(32)       java.lang.String   not null           32 char UUID
username       VARCHAR(64)    java.lang.String   not null, unique
real_name      VARCHAR(256)   java.lang.String   not null           Person full name
address        VARCHAR(256)   java.lang.String   nullable           Address (unstructured)
age            INT            int                nullable           Person age; Numeric ratio variable
customer
column         SQL type       Java type          constraints        semantics
id (pk)        BIGINT         long               not null
firstname      VARCHAR(128)   java.lang.String   not null           Person first name
lastname       VARCHAR(128)   java.lang.String   not null           Person last name
street         VARCHAR(64)    java.lang.String   not null           Address part - street
house number   INT            int                not null           Address part - house number; Numeric nominal variable
Same real-world entity?
Elaborate metadata and querying
SELECT (columns with header 'name') FROM (customer and user)
SELECT firstname, lastname FROM customer
SELECT real_name FROM user
SELECT EXTRACT(full name FROM (name columns))
FROM (tables containing person names)
SELECT (person name columns) FROM (customer and user)
[Scale: Static / Manual → Dynamic / Automated (or supervised)]
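The step from naming columns by hand to asking for "person name columns" can be sketched as a lookup over a semantic catalog. The catalog below is an assumption pieced together from the annotations on the metadata slide, and `SemanticMetadataSketch`, `ColumnMeta` and `queriesFor` are hypothetical names; no such service exists in Apache MetaModel according to the talk, which lists it among the ideas being prototyped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of resolving columns by semantic type instead of by name.
class SemanticMetadataSketch {

    // One catalog entry: a physical column plus a semantic description.
    static final class ColumnMeta {
        final String table;
        final String column;
        final String semanticType;

        ColumnMeta(String table, String column, String semanticType) {
            this.table = table;
            this.column = column;
            this.semanticType = semanticType;
        }
    }

    // Assumed catalog, based on the annotations of the user and customer tables.
    static List<ColumnMeta> exampleCatalog() {
        return Arrays.asList(
                new ColumnMeta("user", "real_name", "Person full name"),
                new ColumnMeta("user", "address", "Address (unstructured)"),
                new ColumnMeta("user", "age", "Person age"),
                new ColumnMeta("customer", "firstname", "Person first name"),
                new ColumnMeta("customer", "lastname", "Person last name"),
                new ColumnMeta("customer", "street", "Address part - street"));
    }

    // Finds every column whose semantic type contains the keyword
    // and emits one static query per table that has a match.
    static List<String> queriesFor(List<ColumnMeta> catalog, String keyword) {
        Map<String, List<String>> columnsByTable = new LinkedHashMap<>();
        for (ColumnMeta meta : catalog) {
            if (meta.semanticType.contains(keyword)) {
                columnsByTable.computeIfAbsent(meta.table, t -> new ArrayList<>()).add(meta.column);
            }
        }
        List<String> queries = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : columnsByTable.entrySet()) {
            queries.add("SELECT " + String.join(", ", entry.getValue()) + " FROM " + entry.getKey());
        }
        return queries;
    }

    public static void main(String[] args) {
        // Prints the two hand-written person-name queries from the slide.
        for (String query : queriesFor(exampleCatalog(), "name")) {
            System.out.println(query);
        }
    }
}
```

With the keyword `"name"`, the lookup reproduces the two manual queries (`SELECT real_name FROM user` and `SELECT firstname, lastname FROM customer`) without the caller knowing either table's column names.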
Elaborate metadata and querying
SELECT street, house_number FROM customer
SELECT address FROM user
SELECT EXTRACT(street, hno, city, zip FROM (Address part columns))
FROM (tables containing addresses)
SELECT (Address part columns) FROM (customer and user)
[Scale: Static / Manual → Dynamic / Automated (or supervised)]
Elaborate metadata and querying
Individual/specific DB connectors
JDBC
Apache MetaModel (today)
Apache MetaModel (future)
[Scale: Static / Manual → Dynamic / Automated (or supervised)]
(New) architectural possibilities with Apache MetaModel
Data Integration scenario
Copy (ETL)
Data Federation scenario
What about speed?
Performance
- Performance is important in
development of MetaModel.
- But not at the expense of uniformity.
- In some cases the metadata may
benefit performance by
automatically tweaking query
parameters (e.g. fetching
strategies).
- We usually expose the native
connector objects too.
Time to market
- MetaModel makes it easy to cover
all the data sources with the same
codebase.
- Typical 80/20 trade-off scenario.
- Avoid premature optimization.
The future of Apache MetaModel?
Stuff currently being built (version 4.4.0):
• Pluggable operators in queries
• Pluggable functions in queries
• Phase out Java 6 support
Ideas being prototyped:
• Elaborate metadata service
• Spark module (Turn a Query into an RDD or DataFrame)
• More connectors - Apache Solr, Neo4j, Couchbase, Riak
Let's see some code
Query a remote CSV file
https://github.com/kaspersorensen/ApacheBigDataMetaModelExample
Thank you!
Questions?
