Introduction to Apache Drill

Michael Hausenblas, Chief Data Engineer EMEA, MapR
6th Swiss Big Data User Group Meeting, Zurich, 2013-03-25

Kudos to http://cmx.io/

Workloads
• Batch processing (MapReduce)

• Light-weight OLTP (HBase, Cassandra, etc.)

• Stream processing (Storm, S4)

• Search (Solr, Elasticsearch)

• Interactive, ad-hoc query and analysis (?)

Interactive Query at Scale

Impala

low-latency

Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources

Use Case II
• Logistics – supplier status
• Queries
– How many shipments from supplier X?
– How many shipments in region Y?

Supplier table:

SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

Shipment documents:

{
  "shipment": 100123,
  "supplier": "ACM",
  "timestamp": "2013-02-01",
  "description": "first delivery today"
},
{
  "shipment": 100124,
  "supplier": "BAP",
  "timestamp": "2013-02-02",
  "description": "hope you enjoy it"
}
…
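
The two queries on this slide combine a relational supplier table with JSON shipment documents. A minimal Python sketch of that combination follows, using only the rows and documents shown on the slide. The function names are hypothetical, and plain in-memory dictionaries stand in for the real data sources; this is not how Drill executes such a query.

```python
from collections import Counter

# Supplier reference table, copied from the slide (in practice an RDBMS table).
suppliers = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# Shipment documents, copied from the slide (in practice a document store).
shipments = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_per_supplier(shipments):
    """How many shipments from supplier X? Count documents by supplier ID."""
    return Counter(doc["supplier"] for doc in shipments)


def shipments_per_region(shipments, suppliers):
    """How many shipments in region Y? Join each document to its supplier row."""
    return Counter(suppliers[doc["supplier"]]["region"] for doc in shipments)


by_supplier = shipments_per_supplier(shipments)
by_region = shipments_per_region(shipments, suppliers)
```

The second query is the interesting one: it needs both sources at once, which is what forces an ETL step in the approaches listed on the next slide.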

Today’s Solutions
• RDBMS-focused
– ETL data from MongoDB and Hadoop
– Query data using SQL

• MapReduce-focused
– ETL from RDBMS and MongoDB
– Use Hive, etc.

Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable

Google’s Dremel

http://research.google.com/pubs/pub36632.html

Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, hundreds involved

High-level Architecture

• Each node runs a Drillbit, to maximize data locality
• Coordination, query planning, execution, etc. are distributed
• By default, Drillbits hold all roles
• Any node can act as the endpoint for a query

[Diagram: four nodes, each running a Drillbit next to a storage process]

• ZooKeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.

[Diagram: ZooKeeper above four nodes, each holding part of the distributed cache]

• The originating Drillbit acts as foreman: it manages query execution, scheduling, locality information, etc.
• Streaming data communication that avoids SerDe (serialization/deserialization)

[Diagram: ZooKeeper above four nodes, each holding part of the distributed cache]
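
To make the locality idea concrete, here is a small Python sketch of how a foreman could hand scan work to Drillbits that sit on the node storing the data. Everything in it is an assumption for illustration: the `assign_fragments` function, the block and node names, and the least-loaded tie-break are hypothetical and are not taken from Drill's scheduler.

```python
def assign_fragments(block_locations, drillbits):
    """Map each data block to a Drillbit, preferring one on a node that stores it.

    block_locations: dict of block name -> set of node names holding that block
    drillbits: list of node names that run a Drillbit
    """
    assignment = {}
    load = {node: 0 for node in drillbits}
    for block, hosts in block_locations.items():
        # Drillbits co-located with the block can read it without network I/O.
        local = [node for node in drillbits if node in hosts]
        # Fall back to any Drillbit when no co-located one exists.
        candidates = local or drillbits
        # Among the candidates, pick the one with the least work so far.
        chosen = min(candidates, key=lambda node: load[node])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster: three Drillbits, three blocks, one block stored elsewhere.
drillbits = ["node1", "node2", "node3"]
block_locations = {
    "block-a": {"node1", "node2"},
    "block-b": {"node1"},
    "block-c": {"node9"},
}
assignment = assign_fragments(block_locations, drillbits)
```

Because any node can be the endpoint for a query, the same function could run on whichever Drillbit receives the query and becomes foreman.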

Principled Query Execution

Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution

Source query languages: SQL 2003, DrQL, MongoQL, DSL
Stage annotations: parser API, topology, scanner API

Logical plan (excerpt):

query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
      …
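
The logical plan excerpt on this slide is a sequence of a scan followed by a filter. The Python sketch below interprets a plan of that shape over in-memory records. It is a toy under stated assumptions, not Drill's reference interpreter: the `run` and `parse_condition` functions, the sample records, and the restriction of conditions to `field <op> number` are all hypothetical.

```python
import operator
import re

# Comparison operators supported by this toy condition parser (an assumption).
COMPARATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def parse_condition(text):
    """Turn a string such as "x > 3" into a predicate over a record (dict)."""
    match = re.fullmatch(r"\s*(\w+)\s*(==|>|<)\s*(-?\d+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"unsupported condition: {text!r}")
    field = match.group(1)
    compare = COMPARATORS[match.group(2)]
    literal = float(match.group(3))
    # Records that lack the field simply do not match.
    return lambda record: field in record and compare(record[field], literal)


def run(op, sources, rows=None):
    """Evaluate one logical operator; "sequence" pipes rows through its steps."""
    kind = op["op"]
    if kind == "sequence":
        for step in op["do"]:
            rows = run(step, sources, rows)
        return rows
    if kind == "scan":
        return iter(sources[op["source"]])
    if kind == "filter":
        return filter(parse_condition(op["condition"]), rows)
    raise ValueError(f"unknown operator: {kind!r}")


# The plan from the slide, limited to the two steps that are visible.
plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}
# Hypothetical contents for the "logs" source.
sources = {"logs": [{"x": 1}, {"x": 5}, {"x": 4}, {"y": 9}]}
result = list(run(plan, sources))
```

Each operator consumes and produces an iterator, so rows stream through the sequence one at a time instead of being materialized between steps.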

Drillbit Modules
[Diagram: modules inside a Drillbit]
• RPC Endpoint
• SQL, HiveQL, Pig, Mongo, Parser
• Logical Plan, Optimizer, Physical Plan
• Foreman, Scheduler, Operators
• Storage Engine Interface, DFS Engine, HBase Engine
• Distributed Cache

Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points

Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
– Datameer, Tableau, Excel, SAP Crystal Reports
– Use standard ODBC/JDBC driver

Nested Data
• Nested data becoming prevalent
– JSON/BSON, XML, ProtoBuf, Avro
– Some data sources support it natively
(MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
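
A short Python sketch of why flattening is error-prone, using the donut record that appears in the example later in this deck (only the two batters visible there). The `flatten_batters` function and the output column names are hypothetical.

```python
# Nested record, limited to the fields visible in the deck's donut example.
donut = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
        ]
    },
}


def flatten_batters(record):
    """Emit one flat row per nested batter, repeating the parent fields."""
    for batter in record["batters"]["batter"]:
        yield {
            "donut_id": record["id"],
            "donut_type": record["type"],
            "ppu": record["ppu"],
            "batter_id": batter["id"],
            "batter_type": batter["type"],
        }


rows = list(flatten_batters(donut))

# Pitfall: the parent's ppu is now repeated once per batter, so a naive
# aggregate over the flat rows counts the same donut twice.
naive_ppu_total = sum(row["ppu"] for row in rows)
```

One donut becomes two rows, and any sum or count over parent fields has to be de-duplicated by hand. Querying the nested structure directly avoids that step.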

Optional Schema
• Many data sources don’t have rigid schemas
– Schema changes rapidly
– Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
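
Schema discovery can be pictured as a scan that records which fields and value types actually occur. The Python sketch below does this for in-memory records; the `discover_schema` function and the sample records are hypothetical and do not describe Drill's own mechanism.

```python
def discover_schema(records):
    """Collect, for every field seen, the set of value type names observed."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical records whose shape differs from one record to the next.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": "100124", "region": "Europe"},
]
schema = discover_schema(records)
```

Here `shipment` shows up with two types and neither `supplier` nor `region` is present in every record, which is the situation a rigid, up-front schema cannot express.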

Extensibility Points
• Source query – parser API
• Custom operators, UDF – logical plan
• Optimizer
• Data sources and formats – scanner API

Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution

… and Hadoop?
• HDFS can be a data source

• Complementary use cases …

• … use Apache Drill
– Find record with specified condition
– Aggregation under dynamic conditions

• … use MapReduce
– Data mining with multiple iterations
– ETL

https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Example

data source: donuts.json

{
  "id": "0001",
  "type": "donut",
  "ppu": 0.55,
  "batters":
  {
    "batter":
    [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      …

logical plan: simple_plan.json

query:[ {
  op:"sequence",
  do:[
    {
      op: "scan",
      ref: "donuts",
      source: "local-logs",
      selection: {data: "activity"}
    },
    {
      op: "filter",
      expr: "donuts.ppu < 2.00"
    },
    …

result: out.json

{
  "sales" : 700.0,
  "typeCount" : 1,
  "quantity" : 700,
  "ppu" : 1.0
}
{
  "sales" : 109.71,
  "typeCount" : 2,
  "quantity" : 159,
  "ppu" : 0.69
}
{
  "sales" : 184.25,
  "typeCount" : 2,
  "quantity" : 335,
  "ppu" : 0.55
}

https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

Status
• Heavy development by multiple organizations

• Available
– Logical plan (ADSP)
– Reference interpreter
– Basic SQL parser
– Basic demo
– Basic HBase back-end

Status
March/April

• Larger SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution focused on large-cluster, high-performance sort, aggregation, and join

Contributing
• Dremel-inspired columnar formats: Twitter’s Parquet and Hive’s ORC file

• Integration with Hive metastore (?)

• DRILL-13 Storage Engine: Define Java Interface

• DRILL-15 Build HBase storage engine implementation

Contributing
• DRILL-48 RPC interface for query submission and physical plan execution

• DRILL-53 Set up cluster configuration and membership management system
– ZK for coordination
– Helix for partition and resource assignment (?)

• Further schedule
– Alpha Q2
– Beta Q3

Kudos to …
• Julian Hyde, Pentaho
• Timothy Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR

Engage!
• Follow @ApacheDrill on Twitter

• Sign up for the mailing lists (user | dev)
http://incubator.apache.org/drill/mailing-lists.html

• Learn where and how to contribute
https://cwiki.apache.org/confluence/display/DRILL/Contributing

• Keep an eye on http://drill-user.org/
