Spark SQL versus Apache Drill: Different Tools with Different Rules
© 2014 MapR Technologies 1
© 2014 MapR Technologies 2
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation
Email tdunning@apache.org tdunning@maprtech.com
Twitter @ted_dunning
© 2014 MapR Technologies 5
What is Drill?
© 2014 MapR Technologies 6
A Query engine that has…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible
© 2014 MapR Technologies 7
Table Can Be an Entire Directory Tree
// On a file
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
group by errorLevel;
// On the entire data collection: all years, all months
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
group by errorLevel;
© 2014 MapR Technologies 8
Basic Process
[Diagram: ZooKeeper coordinating several Drillbits, each with a distributed cache, running over DFS/HBase storage nodes]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
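Step 1 mentions several client paths; as a minimal, hedged sketch of one of them, a Java client can hand a query to a Drillbit through Drill's JDBC driver. The ZooKeeper address below is a placeholder, and the query reuses the log example from the previous slide.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load Drill's JDBC driver explicitly (harmless if it is already registered).
        Class.forName("org.apache.drill.jdbc.Driver");

        // The driver locates Drillbits through ZooKeeper; adjust the quorum for a real cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select errorLevel, count(*) cnt " +
                     "from dfs.logs.`/AppServerLogs` group by errorLevel")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}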
© 2014 MapR Technologies 9
Stages of Query Planning
SQL query → Parser → Logical Planner (heuristic and cost based) → Physical Planner (cost based) → Foreman → plan fragments sent to Drillbits
© 2014 MapR Technologies 10
Query Execution
[Diagram: SQL, HiveQL, and Pig parsers produce a LogicalPlan; the Optimizer turns it into a PhysicalPlan that the Foreman and Scheduler run as Operators; queries arrive through JDBC, ODBC, and RPC endpoints; a Distributed Cache and a pluggable StorageInterface connect to HDFS, HBase, MongoDB, and Cassandra]
© 2014 MapR Technologies 11
Batches of Values
• Value vectors
– List of values, with same schema
– With the 4-value semantics for each value
• Shipped around in batches
– max 256k bytes in a batch
– max 64K rows in a batch
• RPC designed for multiple replies to a request
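To make the bullets concrete, here is a minimal sketch of the idea behind a fixed-width value vector: one flat array of values plus a validity bitmap, so a batch ships as a couple of contiguous buffers and a null check is a bit test. This is plain on-heap Java for brevity, not Drill's actual ValueVector classes or its off-heap buffers.

/**
 * Toy fixed-width "value vector": a flat array of values plus a validity
 * bitmap (1 bit per row, 1 = non-null). Drill's real vectors live in
 * off-heap buffers; an on-heap array keeps the sketch short.
 */
public class IntVectorSketch {
    private final int[] values;     // one fixed-width slot per row
    private final long[] validity;  // packed null flags, 64 rows per word

    public IntVectorSketch(int rowCount) {
        values = new int[rowCount];
        validity = new long[(rowCount + 63) / 64];
    }

    public void set(int row, int value) {
        values[row] = value;
        validity[row >>> 6] |= 1L << (row & 63);
    }

    public boolean isNull(int row) {
        return (validity[row >>> 6] & (1L << (row & 63))) == 0;
    }

    public int get(int row) {
        return values[row];
    }
}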
© 2014 MapR Technologies 12
Fixed Value Vectors
© 2014 MapR Technologies 13
Vectorization
• Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions
• GCC, LLVM and JVM all do various optimizations automatically
– Manually code algorithms
• Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
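As a hedged illustration of the last two bullets (plain Java rather than Drill's generated code): with a validity bitmap like the one in the earlier sketch, a scan can test the null flags of 64 rows with a single word load and keep the inner loop branch-free.

public class VectorScan {
    /**
     * Sums the non-null entries of a column stored as a flat array plus a
     * validity bitmap. A word of 64 all-null rows is skipped with one
     * comparison, and the inner loop multiplies by the validity bit instead
     * of branching on it.
     */
    public static long sumNonNull(int[] values, long[] validity, int rowCount) {
        long sum = 0;
        for (int word = 0; word < validity.length; word++) {
            long bits = validity[word];
            if (bits == 0) {
                continue;                        // 64 nulls skipped at once
            }
            int base = word << 6;
            int limit = Math.min(64, rowCount - base);
            for (int i = 0; i < limit; i++) {
                long bit = (bits >>> i) & 1L;    // 1 if non-null, 0 if null
                sum += values[base + i] * bit;   // accumulate without a branch
            }
        }
        return sum;
    }
}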
© 2014 MapR Technologies 14
Runtime Compilation is Faster
• JIT is smart, but more gains with runtime compilation
• Janino: Java-based Java compiler
From http://bit.ly/16Xk32x
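For reference, a minimal Janino usage sketch (essentially the library's standard ExpressionEvaluator example, not Drill's actual code-generation path, which merges generated byte-code into precompiled templates as shown on the next slide):

import org.codehaus.janino.ExpressionEvaluator;

public class JaninoExample {
    public static void main(String[] args) throws Exception {
        // Compile the expression "a + b" at runtime into real byte-code.
        ExpressionEvaluator ee = new ExpressionEvaluator();
        ee.setParameters(new String[] {"a", "b"}, new Class[] {int.class, int.class});
        ee.setExpressionType(int.class);
        ee.cook("a + b");

        // Evaluate the freshly compiled code; nothing is interpreted.
        Object result = ee.evaluate(new Object[] {19, 23});
        System.out.println(result);   // prints 42
    }
}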
© 2014 MapR Technologies 15
Drill compiler
CodeModel generates source code → Janino compiles it to runtime byte-code → the runtime byte-code is merged with precompiled byte-code templates → the merged class is loaded
© 2014 MapR Technologies 16
Optimistic
[Chart: "Speed vs. check-pointing" across workload types (cmd pipeline, small db, med db, large db, dw, compilation, hadoop), ranging from "no need to checkpoint" to "checkpoint frequently", with Apache Drill placed at the no-checkpoint end]
© 2014 MapR Technologies 17
Optimistic Execution
• Recovery code trivial
– Running instances discard the failed query’s intermediate state
• Pipelining possible
– Send results as soon as batch is large enough
– Requires barrier-less decomposition of query
© 2014 MapR Technologies 18
Pipelining
• Record batches are pipelined between nodes
– ~256 kB usually
• Unit of work for Drill
– Operators work on a batch
• Operator reconfiguration happens at batch boundaries
© 2014 MapR Technologies 19
Pipelining
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when lots of memory)
• Read/write to disk
– when data larger than memory
© 2014 MapR Technologies 20
Cost-based Optimization
• Using Optiq, an extensible framework
– Pluggable rules, and cost model
• Rules for distributed plan generation
– Insert Exchange operator into physical plan
– Optiq enhanced to explore parallel query plans
• Pluggable cost model
– CPU, IO, memory, network cost (data locality)
– Storage engine features (HDFS vs Hive vs HBase)
© 2014 MapR Technologies 21
What is SparkSQL?
© 2014 MapR Technologies 22
What is Spark SQL
• Essentially syntactic sugar over a limited subset of Spark
• Inherits all the virtues (and vices) of Spark
– Lambdas can serve as UDFs (has subtle issues for performance)
• Inputs have to be loaded
– Perhaps lazily, not obvious when load actually happens
• Not designed as a streaming engine, requires more memory
• Some JSON support, but not so much for large or variable
objects
• Embedded in a real language!
© 2014 MapR Technologies 23
In More Detail
• A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs)
• SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs)
• Conventional Spark programs and SparkSQL programs
interoperate nearly seamlessly
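A small, hedged illustration of that interoperation using Spark's Java API. It is written against the newer SparkSession entry point rather than the sqlContext shown elsewhere in this deck, and the input file name is a placeholder.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlInterop {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sparksql-interop")
                .master("local[*]")
                .getOrCreate();

        // Schema is inferred from the JSON input; that schema is what lets SQL run over it.
        Dataset<Row> logs = spark.read().json("logs.json");
        logs.createOrReplaceTempView("logs");

        // SQL and the programmatic API operate on the same Dataset interchangeably.
        Dataset<Row> errors =
                spark.sql("select errorLevel, count(*) as cnt from logs group by errorLevel");
        errors.filter("cnt > 10").show();

        spark.stop();
    }
}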
© 2014 MapR Technologies 24
Many Similarities
[Diagram: like Drill, Spark SQL has a SQL parser and an optimizer that produce a LogicalPlan and then a PhysicalPlan of operators (filter, group, ...), with bindings for Java, Scala, and Python]
© 2014 MapR Technologies 25
Important Differences
• Spark execution assumes RDDs are a complete representation, not a stream of row batches
• Input sources don't inject optimization rules, nor expose detailed cost models
• Most RDDs don't have a zero-copy capability
• Spark inherits the JVM memory model, with very limited use of off-heap memory
© 2014 MapR Technologies 26
scala> sqlContext.sql("select * from json.`foo.json`").show
+---+------+----+
| a| b| c|
+---+------+----+
| 3|[3, 2]| xyz|
| 7| null| wxy|
| 7| []|null|
+---+------+----+
© 2014 MapR Technologies 27
scala> sqlContext.sql(
"select a, explode(b) b_v from json.`bug.json`"
).show
+---+---------+
| a| b_v|
+---+---------+
| 3| 3|
| 3| 2|
+---+---------+
© 2014 MapR Technologies 28
First Synthesis
• Drill has a more nuanced optimizer, better code generation
– This often leads to ~2x speed advantage
• Drill has ValueVector and row batches
– This leads to much less memory pressure
• Drill has much stricter memory life-cycle
– Query is done and gone; no need for big GCs even with big memory
• Drill is all about SQL execution
© 2014 MapR Technologies 29
But …
• Spark can optimize across entire program
– This often leads to ~2x speed advantage
• Spark has much more flexible memory structures
– This can lead to much less memory pressure
• Spark has much more flexible RDD life-cycle
– RDDs can be cached, persisted, or simply recomputed as necessary
• Spark is not all about SQL execution
© 2014 MapR Technologies 30
The Really Big Differences
• Drill focuses heavily on secure, multi-tenant access to data
– Strong impersonation semantics
– Cascading rights via views
– Queries co-exist in a cluster and reserve only their momentary resource
requirements
• Spark focuses heavily on fully integrated execution models
– Any Spark function works with (almost) any RDD
– Memory residency of RDDs is the highest goal
© 2014 MapR Technologies 31
Drill security
➢ End-to-end security from BI tools to Hadoop
➢ Standards-based PAM authentication
➢ Two-level user impersonation
➢ Fine-grained row- and column-level access control with Drill views; no centralized security repository required
© 2014 MapR Technologies 32
Granular security permissions through Drill views
Raw file (/raw/cards.csv): owner Admins, permission Admins
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711

Data Scientist view (/views/maskedcards.view.drill): owner Admins, permission Data Scientists
Not a physical data copy
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111

Business Analyst view: owner Admins, permission Business Analysts
Name | City | State
Dave | San Jose | CA
John | Boulder | CO
© 2014 MapR Technologies 33
Ownership Chaining
• Combine Self Service Exploration with Data Governance
Raw file (/raw/cards.csv): owner John
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711

Data Scientist view (/views/V_Scientist): owner John, read access for Jane
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111

Analyst view (/views/V_Analyst): owner Jane, read access for Jack
Name | City | State
Dave | San Jose | CA
John | Boulder | CO

Access path when Jack queries the view V_Analyst:
1. Does Jack have access to V_Analyst? Yes. Who owns V_Analyst? Jane. Drill accesses V_Analyst as Jane (impersonation hop 1).
2. Does Jane have access to V_Scientist? Yes. Who owns V_Scientist? John. Drill accesses V_Scientist as John (impersonation hop 2).
3. Does John have permission on the raw file? Yes. John owns the raw file, so Drill accesses the source file as John (no impersonation here).
*Ownership chain length (number of hops) is configurable
© 2014 MapR Technologies 34
But was that the right
question?
© 2014 MapR Technologies 35
Unification is Feasible
• It is relatively easy to build a DrillContext in Spark
– compare to SqlContext
• Define Datasets as Drill data sources and sinks
– Drill runs at the same time as Spark
• Orchestrate transport of Spark data to/from Drill
• Cost of transport is remarkably small
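No DrillContext ships today, so the following is only a sketch of plumbing that already exists: Spark's generic JDBC data source pointed at Drill's JDBC driver. The ZooKeeper address is a placeholder, and the view name is a hypothetical one echoing the masked-cards view from the security slides.

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DrillAsSparkSource {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("drill-as-spark-source")
                .master("local[*]")
                .getOrCreate();

        // Drill's JDBC driver finds Drillbits through ZooKeeper; adjust for a real cluster.
        Properties props = new Properties();
        props.setProperty("driver", "org.apache.drill.jdbc.Driver");

        // Push the SQL (including any secure views) down to Drill,
        // then continue in Spark with the result as an ordinary Dataset.
        Dataset<Row> masked = spark.read().jdbc(
                "jdbc:drill:zk=localhost:2181",
                "dfs.views.maskedcards",
                props);
        masked.show();

        spark.stop();
    }
}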
© 2014 MapR Technologies 36
What does the Spark and Drill integration look like?
Features at a glance:
• Use Drill as an input to Spark
• Query Spark RDDs via Drill and create data pipelines
[Diagram: files on disk (DFS) on the Drill side and an RDD in memory on the Spark side, with data moving between them]
© 2014 MapR Technologies 37
Is unification
valuable?
© 2014 MapR Technologies 38
Example of Unification
[Diagram: a universe of callers and cell towers producing CDR (call detail record) data]
© 2014 MapR Technologies 39
Simple Session Protocol
• Calls started at random
intervals
• During calls, reconnection
is done periodically
• Many log events are buffered and sent to the current tower during the active state
[State diagram: states start, idle, connect, and active, with transitions labeled SETUP, HELLO, CONNECT, FAIL, TIMEOUT, and END]
© 2014 MapR Technologies 40
The Resulting Data
• Signal strength reports
– Tower, timestamp, rank, caller, caller location*, signal strength
• Tower log events: HELLO, FAIL, CONNECT, END
• Call end
• Note that data for one tower is often received by another due to caller buffering of diagnostic data
*Location isn’t quite location … poetic license applied for
© 2014 MapR Technologies 41
What can we do with it?
© 2014 MapR Technologies 42
Baby Steps
• What does signal propagation look like?
select x, y, signal from cdr_stream where tower = 3
• Plot results to get a map of signal strength around a tower
© 2014 MapR Technologies 43
Baby Steps
• What does tower coverage look like?
select x, y from cdr_stream
where tower = 3 and event_type = 'CONNECT'
• Plot results to get a map of coverage area for a tower
© 2014 MapR Technologies 44
What about anomaly detection?
© 2014 MapR Technologies 45
Detecting Tower Loss
It’s important to know if traffic is stopped or delayed
because of a problem…
But events from towers come at irregular intervals
How long after the last event should you begin to worry?
© 2014 MapR Technologies 46
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
© 2014 MapR Technologies 47
Converting Event Times to Anomaly
[Plot: interval-based anomaly scores with 99.9%-ile and 99.99%-ile threshold lines]
© 2014 MapR Technologies 48
But in the real world, event
rates often change
© 2014 MapR Technologies 49
Time Intervals Are Key to Modeling Sporadic Events
[Plot: inter-event interval dt (min) against t (days); intervals are short when the event rate is high and stretch out when the rate slows each night]
© 2014 MapR Technologies 50
Time Intervals Are Key to Modeling Sporadic Events
[Same plot as the previous slide, shown again for the talk track]
© 2014 MapR Technologies 51
After Rate Correction
[Plot: rate-corrected intervals (dt/rate) against t (days); after correction, flat 99.9%-ile and 99.99%-ile thresholds work across the whole day]
© 2014 MapR Technologies 52
Detecting Anomalies in Sporadic Events
[Diagram: incoming events feed a rate predictor λ built from the rate history; for each event the rate-corrected interval δ = λ · (t_i − t_{i−n}) is computed, a t-digest tracks the distribution of δ, and an alarm fires when δ exceeds the 99.97%-ile threshold]
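The same pipeline can be sketched in a few lines of plain Java. This is a hedged illustration only: the slides use a t-digest for the score distribution and leave the rate predictor unspecified, so the sorted-list quantile and the exponentially weighted rate below are stand-ins, and all names are invented for the example.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/**
 * Sketch of the detector in the diagram above: score each event by the
 * rate-corrected interval delta = lambda * (t_i - t_{i-n}) and alarm when the
 * score is beyond a high percentile of its own history.
 */
public class SporadicEventDetector {
    private final int n;                  // how many events back to measure the interval
    private final double quantile;        // e.g. 0.9997 for the 99.97%-ile on the slide
    private final Deque<Double> recentTimes = new ArrayDeque<>();
    private final List<Double> scoreHistory = new ArrayList<>();
    private double lambda = Double.NaN;   // predicted event rate

    public SporadicEventDetector(int n, double quantile) {
        this.n = n;
        this.quantile = quantile;
    }

    /** Feed one event timestamp; returns true if the event looks anomalously late. */
    public boolean onEvent(double t) {
        recentTimes.addLast(t);
        if (recentTimes.size() <= n) {
            return false;                                 // not enough history yet
        }
        double interval = t - recentTimes.removeFirst();  // t_i - t_{i-n}

        // Score with the rate predicted *before* this event, then update the predictor.
        double delta = Double.isNaN(lambda) ? Double.NaN : lambda * interval;
        double observedRate = n / interval;
        lambda = Double.isNaN(lambda) ? observedRate : 0.95 * lambda + 0.05 * observedRate;

        boolean alarm = !Double.isNaN(delta)
                && scoreHistory.size() >= 1000
                && delta > threshold();
        if (!Double.isNaN(delta)) {
            scoreHistory.add(delta);
        }
        return alarm;
    }

    /**
     * Rate-corrected gap since the last event, for the "how long after the last
     * event should you begin to worry" case; in practice this would be
     * thresholded against a digest of single-interval scores.
     */
    public double currentScore(double now) {
        if (recentTimes.isEmpty() || Double.isNaN(lambda)) {
            return 0.0;
        }
        return lambda * (now - recentTimes.peekLast());
    }

    private double threshold() {
        // Naive empirical quantile; a t-digest avoids re-sorting on every call.
        List<Double> sorted = new ArrayList<>(scoreHistory);
        Collections.sort(sorted);
        int index = (int) Math.min(sorted.size() - 1, Math.floor(quantile * sorted.size()));
        return sorted.get(index);
    }
}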
© 2014 MapR Technologies 53
Propagation Anomalies
• What happens when something shadows part of the coverage
field?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)
© 2014 MapR Technologies 54
© 2014 MapR Technologies 55
© 2014 MapR Technologies 56
Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter,
received signal is dominated by
noise. This makes subtraction of
average value a bad algorithm.
© 2014 MapR Technologies 57
Other Issues
• Finding anomalies in the coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher
than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs
© 2014 MapR Technologies 58
Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
© 2014 MapR Technologies 59
Coverage Areas
© 2014 MapR Technologies 60
Just One Tower
© 2014 MapR Technologies 61
Cluster Reports for That Tower
© 2014 MapR Technologies 62
Cluster Reports for That Tower
[Figure: the reports for that tower grouped into nine numbered clusters (1 through 9)]
© 2014 MapR Technologies 63
General Dataflow
Pipeline stages: group by tower and filter data (SQL); k-means clustering (MLlib); mark each report with its cluster (MLlib); split data (SQL); location model (Java); rate detection per cluster
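A hedged sketch of the clustering steps in the middle of that pipeline, using Spark's Java MLlib API. The input path, the comma-separated field layout, and the choice of k = 50 are assumptions made for the example; the "group by tower, filter" step is assumed to have happened upstream, for instance in Drill SQL.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class ClusterSignalReports {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("cluster-signal-reports").setMaster("local[*]"));

        // Hypothetical input: "x,y" location columns for a single tower's reports.
        JavaRDD<Vector> locations = sc.textFile("tower3_reports.csv")
                .map(line -> {
                    String[] f = line.split(",");
                    return Vectors.dense(Double.parseDouble(f[0]), Double.parseDouble(f[1]));
                });
        locations.cache();

        // "Large k": many small clusters so each sees a roughly homogeneous patch of coverage.
        int k = 50;
        int iterations = 20;
        KMeansModel model = KMeans.train(locations.rdd(), k, iterations);

        // Mark each report with its cluster id; per-cluster rate and signal models come later.
        JavaRDD<Integer> clusterIds = locations.map(model::predict);
        clusterIds.take(10).forEach(System.out::println);

        sc.stop();
    }
}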
© 2014 MapR Technologies 64
Summary
• Drill and Spark provide healthy competition in Apache
• Over time, they have converged in many respects
– But important distinctions remain
• Projects can work together to share key technology
– Apache Arrow … started as off-shoot of Drill, now has >12 major
projects as participants, including Spark
• Systems can work together even more deeply
– DrillContext makes integration first class
© 2014 MapR Technologies 65
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
© 2014 MapR Technologies 66
6 Free ebooks: read online at mapr.com/6ebooks-read or download pdfs at mapr.com/6ebooks-pdf
Streaming Architecture: New Designs Using Apache Kafka and MapR Streams, by Ted Dunning & Ellen Friedman
© 2014 MapR Technologies 67
Thank you for coming today!
© 2014 MapR Technologies 68
…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course
discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache
Hadoop and Spark professionals
community.mapr.com


Editor's Notes

  • #46 ELLEN: set up
  • #48 Talk track: This is what it looks like to have events, such as those on a website, that come in at randomized times (people come when they want to) while the underlying average rate stays constant, in other words a fairly steady stream of traffic. This looks a lot like the first signal we talked about: a randomized but even signal. We can use t-digest on it to set thresholds, and everything works just grand (like the clicks of a Geiger counter measuring radioactivity).
  • #50 Talk track: (Describe figure) The horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if it is set low enough for the daytime, we get an alert every night even though we expect a longer interval then; if we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model and multiply the modeled rate by the interval, which gives a number we can threshold accurately.
  • #51 Talk track: (same as #50 above)
  • #53 Talk track: You need a rate predictor. Ellen: sometimes simple is good enough.